=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_31
|storemode=property
|title=Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract
|pdfUrl=https://ceur-ws.org/Vol-2882/paper31.pdf
|volume=Vol-2882
|authors=Syed Muhammad Faraz Ali,Muhammad Taha Khan,Syed Unaiz Haider,Talha Ahmed,Zeshan Khan,Muhammad Atif Tahir
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AliKHAKT20
}}
==Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract==
Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract
Syed Muhammad Faraz Ali, Muhammad Taha Khan, Syed Unaiz Haider, Talha Ahmed, Zeshan Khan, Muhammad Atif Tahir
{k190861,k173656,k173667,k173721,zeshan.khan,atif.tahir}@nu.edu.pk
National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
ABSTRACT
Identification of polyps in endoscopic images is critical for the diagnosis of colon cancer. Finding the exact shape and size of polyps requires the segmentation of endoscopic images. This research explores the advantage of using depth-wise separable convolution in the atrous convolution of the ResUNet++ architecture. Deep atrous spatial pyramid pooling was also implemented on the ResUNet++ architecture. The results show that the architecture with separable convolution has a smaller size and fewer Giga-Floating-Point Operations (GFLOPs) without degrading the performance too much.

This research work was funded by Higher Education Commission (HEC) Pakistan under NRPU Project 10225/2017.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'20, December 14-15 2020, Online

1 INTRODUCTION
Wireless capsule endoscopy (WCE) has been used for diagnosis for nearly 10 years now. WCE images provide diagnostic capability for many diseases such as colon cancer, ulcers, and polyps. With the advent of deep learning in computer vision, this diagnosis task can be automated.

2 RELATED WORK
The gastrointestinal tract has been an active area of research, and the benefit that can be achieved through computer-aided diagnosis is significant. Jha et al. [9] studied the semantic segmentation of polyps in the GI tract. That research utilized the well-accepted U-Net architecture and a modified U-Net architecture, also called ResUNet, for segmentation. Further research was conducted to introduce a novel architecture named ResUNet++.

3 APPROACH
The approach follows the method used by Jha et al. [9]. The ResUNet++ architecture was employed, which uses an encoder-decoder structure for semantic segmentation. Pyramid pooling was used as a bridge between the encoder and decoder blocks. The encoder block contains residual units that take advantage of skip connections in a neural network. Skip connections allow training a deep neural network without degrading the performance. Squeeze-and-excitation blocks were used, which adaptively re-weight the channel output features [5]. An attention mechanism is used in the decoder block. The attention mechanism is useful in making pixel-wise predictions. This approach is popular in natural language processing (NLP), where attention is given to each word of a sentence. In semantic segmentation, an attention mechanism is used to attend to each pixel of an image, which can then be used to make a prediction at the pixel level [6].

Figure 1: Process Flow (Input → Encoder → Bridge → Decoder → Output)

A bridge of pyramid pooling is used between the encoder and decoder blocks [2] [1]. Atrous convolution is used in this bridge, through which the output of the encoder is viewed at various receptive fields. This block convolves the features with kernels of different dilation rates, and the final output is the concatenation of all the convolutions. This way, the contextual information in the features is captured at various scales.

This Atrous Spatial Pyramid Pooling (ASPP) block in ResUNet++ was implemented using depth-wise separable convolution, and in a separate experiment it was replaced with the Deep Atrous Spatial Pyramid Pooling (DASPP) module from [4]. Depth-wise separable convolution is implemented by applying a kernel to the input at the channel level; the output is then passed through a pointwise convolution with a 1x1 kernel [3]. The application of depth-wise convolution results in fewer GFLOPs and parameters. DASPP was implemented to see whether going deeper in the network improves performance on polyp segmentation. The three modified architectures are:

(1) sepv_conv_resunet++: ASPP module from ResUNet++ [9] replaced with depth-wise separable convolution.
(2) dsapp_resunet++: ASPP module replaced with the DASPP module from [4].
(3) dsapp_relu_resunet++: (2) implemented with ReLU activation.

Semantic segmentation, unlike object detection, can be treated as a pixel-wise classification problem. The output of semantic segmentation for a pixel is a mask identifying the class to which the pixel belongs. For the polyp segmentation problem [7], this mask is either 0 or 1. The evaluation metrics used in semantic segmentation are accuracy, precision, recall, mean Intersection over Union (mIoU), and the Dice coefficient. All of these except accuracy were used to assess model performance. A custom loss function for mIoU was implemented, and all model architectures were trained on this custom loss.
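The depth-wise separable atrous convolution described above can be sketched in NumPy: one per-channel (depth-wise) kernel sampled at a dilation rate, followed by a 1x1 pointwise convolution that mixes channels, plus the parameter counts that explain the savings. This is an illustrative sketch, not the authors' implementation; the function names, the ASPP-style wrapper, and the dilation rates shown are assumptions.

```python
import numpy as np

def depthwise_atrous_conv(x, dw_kernels, rate):
    """Depth-wise convolution: one k x k kernel per input channel,
    sampled with the given dilation rate ('same' padding, stride 1).
    x: (h, w, c) feature map; dw_kernels: (k, k, c)."""
    h, w, c = x.shape
    k = dw_kernels.shape[0]
    pad = (k // 2) * rate
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c), dtype=float)
    for i in range(h):
        for j in range(w):
            # gather the dilated k x k neighbourhood for every channel
            patch = xp[i : i + k * rate : rate, j : j + k * rate : rate, :]
            out[i, j, :] = (patch * dw_kernels).sum(axis=(0, 1))
    return out

def pointwise_conv(x, pw_kernels):
    """1x1 convolution mixing channels: pw_kernels has shape (c_in, c_out)."""
    return x @ pw_kernels

def aspp_separable(x, dw_kernels, pw_kernels, rates=(1, 2, 3)):
    """ASPP-style bridge: separable atrous convolutions at several
    dilation rates, concatenated along the channel axis."""
    outs = [pointwise_conv(depthwise_atrous_conv(x, dw_kernels, r), pw_kernels)
            for r in rates]
    return np.concatenate(outs, axis=-1)

# Parameter counts (ignoring biases) behind the GFLOP/parameter savings:
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out
```

For a 3x3 kernel with 64 input and 64 output channels, a standard convolution needs 36,864 weights while the separable version needs 4,672, roughly an 8x reduction; this is the source of the parameter and GFLOP savings reported in the results.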
4 DATASET
The experiments were performed on the Kvasir-SEG dataset [8]. This dataset consists of one thousand polyp images. The ground truth for each of these images is provided as an image mask in a separate folder.
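The 80/10/10 train/validation/test split of these one thousand images used in the experiments (Section 5) can be sketched as below; the `split_dataset` helper and the fixed seed are hypothetical, not the authors' code.

```python
import random

def split_dataset(ids, seed=0):
    """Shuffle the image ids and split them 80/10/10 into
    training, validation, and testing sets."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for this sketch
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train : n_train + n_val]
    test = ids[n_train + n_val :]
    return train, val, test
```

With the 1000 Kvasir-SEG images this yields 800 training, 100 validation, and 100 testing images.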
5 RESULTS AND ANALYSIS
All the experiments were performed on Google Colab, which provides sessions of up to 12 hours. A 12-hour session is not enough to fully train a deep learning model, so to make a fair comparison the number of epochs was kept the same for all the experiments. The data was split into training, validation, and testing sets with a ratio of 80, 10, and 10 percent respectively. With this split, 800 images were selected for model training. These 800 images are not enough to train a deep learning model, so data augmentation was applied to the training set. The validation and testing sets were not modified, and thus each contains 100 images. Thirty different augmentations were applied to the training set, after which its size grew to 24,800 images. The augmentations were also applied to the provided masks so that the target variable is transformed in the same way as the input image.

The optimizer used for training was the NAdam optimizer with a learning rate of 0.0001 and a batch size of 8. The training and validation loss were recorded for each epoch; the resulting learning curve provides insight into model convergence.

Figure 2 shows the learning curve for each architecture. The architecture with the DASPP bridge appears to have converged within 10 epochs, as its validation error started increasing. However, ResUNet++ and the separably convolved ResUNet++ show that these models could be trained for a few more epochs, as both training and validation error are still decreasing. For U-Net, the learning curve is also still decreasing at the 10th epoch; however, its loss is higher than that of the ResUNet++ architecture.

Figure 2: Learning Curve

Model                  Recall   Precision  Dice     mIoU
Unet                   75.23%   84.52%     71.91%   59.53%
resunet++              64.97%   89.81%     78.35%   69.48%
sepv_conv_resunet++    60.55%   93.31%     77.25%   67.56%
dsapp_resunet++        69.72%   82.62%     76.66%   66.71%
dsapp_relu_resunet++   61.54%   92.33%     74.63%   66.03%
Table 1: Test Data Results

Table 1 gives the performance of each model on the testing data. The performance of ResUNet++ on the Dice coefficient and mIoU is better than that of the other models. The model with separable convolution has comparable results on the Dice and mIoU metrics. However, the model with the DASPP bridge did not perform well. This shows that increasing the depth any further did not improve performance. The size of the model, measured by the number of parameters and Giga-Floating-Point Operations (GFLOPs), is best for the model with separable convolution. These results are compiled in Table 2. Fewer parameters mean that the model is smaller and may be easier to move into a production environment.

Model                  Params      GFLOPs
Unet                   3,588,997   7,165,148
resunet++              4,371,265   8,718,068
sepv_conv_resunet++    3,047,265   6,070,057
dsapp_resunet++        5,024,705   10,024,898
dsapp_relu_resunet++   5,024,705   10,024,898
Table 2: Model Size

6 CONCLUSION AND FUTURE WORK
This research gives empirical evidence of the advantage of using depth-wise separable convolution, which resulted in a smaller model size without significantly affecting the performance. It has also been shown that increasing the depth further may not improve performance and can result in overfitting of the model. Hyper-parameter tuning and a larger number of epochs would give a better understanding of the performance.
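The Dice coefficient and IoU used for Table 1 can be computed for binary masks as in the sketch below. The exact form of the custom mIoU loss is not given in the paper, so the `iou_loss` shown (1 − IoU) is an assumption, not the authors' implementation.

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    """Intersection over Union for binary (0/1) masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

def dice(pred, target, eps=1e-7):
    """Dice coefficient for binary (0/1) masks."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou_loss(pred, target):
    """Assumed custom loss: minimizing 1 - IoU maximizes mask overlap."""
    return 1.0 - iou(pred, target)
```

The small `eps` keeps the ratios defined when both masks are empty; mIoU is then the mean of `iou` over the test images.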
REFERENCES
[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and
Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolu-
tional nets, atrous convolution, and fully connected crfs. IEEE transactions on
pattern analysis and machine intelligence 40, 4 (2017), 834–848.
[2] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig
Adam. 2018. Encoder-decoder with atrous separable convolution for semantic
image segmentation. In Proceedings of the European conference on computer vision
(ECCV). 801–818.
[3] François Chollet. 2017. Xception: Deep learning with depthwise separable con-
volutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition. 1251–1258.
[4] Taha Emara, Hossam E Abd El Munim, and Hazem M Abbas. 2019. LiteSeg: A
Novel Lightweight ConvNet for Semantic Segmentation. In 2019 Digital Image
Computing: Techniques and Applications (DICTA). IEEE, 1–7.
[5] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
[6] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and
Wenyu Liu. 2019. Ccnet: Criss-cross attention for semantic segmentation. In
Proceedings of the IEEE International Conference on Computer Vision. 603–612.
[7] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard Johansen, Dag Johansen,
Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico Multimedia
Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of the MediaEval
2020 Workshop.
[8] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange,
Dag Johansen, and Håvard D Johansen. 2020. Kvasir-seg: A segmented polyp
dataset. In International Conference on Multimedia Modeling. Springer, 451–462.
[9] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Dag Johansen, Thomas De Lange,
Pål Halvorsen, and Håvard D Johansen. 2019. Resunet++: An advanced architec-
ture for medical image segmentation. In 2019 IEEE International Symposium on
Multimedia (ISM). IEEE, 225–2255.