=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_31
|storemode=property
|title=Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract
|pdfUrl=https://ceur-ws.org/Vol-2882/paper31.pdf
|volume=Vol-2882
|authors=Syed Muhammad Faraz Ali,Muhammad Taha Khan,Syed Unaiz Haider,Talha Ahmed,Zeshan Khan,Muhammad Atif Tahir
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AliKHAKT20
}}
==Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract==
Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract
Syed Muhammad Faraz Ali, Muhammad Taha Khan, Syed Unaiz Haider, Talha Ahmed, Zeshan Khan, Muhammad Atif Tahir
{k190861,k173656,k173667,k173721,zeshan.khan,atif.tahir}@nu.edu.pk
National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
ABSTRACT
Identification of polyps in endoscopic images is critical for the diagnosis of colon cancer. Finding the exact shape and size of polyps requires the segmentation of endoscopic images. This research explores the advantage of using depth-wise separable convolution in the atrous convolution of the ResUNet++ architecture. Deep atrous spatial pyramid pooling was also implemented on the ResUNet++ architecture. The results show that the architecture with separable convolution has a smaller size and fewer Giga-Floating-Point Operations (GFLOPs) without degrading the performance too much.

This research work was funded by Higher Education Commission (HEC) Pakistan under NRPU Project 10225/2017.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'20, December 14-15 2020, Online

1 INTRODUCTION
Wireless capsule endoscopy (WCE) has been used for diagnosis for nearly 10 years now. WCE images provide diagnostic capability for many diseases such as colon cancer, ulcers, and polyps. With the advent of deep learning in computer vision, this diagnosis task can be automated.

2 RELATED WORK
The gastrointestinal tract has been an active area of research, and the benefit that can be achieved through computer-aided diagnosis is significant. Jha et al. [9] studied the semantic segmentation of polyps in the GI tract. That research utilized the well-accepted U-Net architecture and a modified U-Net architecture, also called ResUNet, for segmentation. Further research was conducted to introduce a novel architecture named ResUNet++.

3 APPROACH
The approach follows the method used by Jha et al. [9]. The ResUNet++ architecture was employed, which uses an encoder-decoder structure for semantic segmentation. Pyramid pooling was used as a bridge between the encoder and decoder blocks. The encoder block contains residual units that take advantage of skip connections in a neural network. Skip connections allow training a deep neural network without degrading the performance. Squeeze-and-excitation blocks were used, which adaptively re-weight the channel output features [5]. An attention mechanism is used in the decoder block. The attention mechanism is useful in making pixel-wise predictions. This approach is popular in natural language processing (NLP), where attention is given to each word of a sentence. In semantic segmentation, an attention mechanism is used to attend to each pixel of an image, which can then be used to make a prediction at the pixel level [6].

Figure 1: Process Flow (Input → Encoder → Bridge → Decoder → Output)

A bridge of pyramid pooling is used between the encoder and decoder blocks [2] [1]. Atrous convolution is used in this bridge, through which the output of the encoder is viewed at various receptive fields. This block convolves the features with kernels of different dilation rates, and the final output is the concatenation of all the convolutions. This way, the contextual information in the features is captured at various scales.

This Atrous Spatial Pyramid Pooling (ASPP) block in ResUNet++ was implemented using depth-wise separable convolution, and in a separate experiment it was replaced with the Deep Atrous Spatial Pyramid Pooling (DASPP) module from [4]. Depth-wise separable convolution is implemented by applying a kernel to the input at the channel level; the output is then passed through a pointwise convolution with a 1x1 kernel [3]. The application of depth-wise convolution results in fewer GFLOPs and parameters. DASPP was implemented to see whether going deeper in the network improves performance on polyp segmentation. The three modified architectures are:

(1) sepv_conv_resunet++: ASPP module from ResUNet++ [9] replaced with depth-wise separable convolution.
(2) dsapp_resunet++: ASPP module replaced with the DASPP module from [4].
(3) dsapp_relu_resunet++: (2) implemented with ReLU activation.

Semantic segmentation, unlike object detection, can be treated as a pixel-wise classification problem. The output of semantic segmentation for a pixel is a mask identifying the class to which the pixel belongs. For the polyp segmentation problem [7], this mask is either 0 or 1. The evaluation metrics used in semantic segmentation are accuracy, precision, recall, mean Intersection over Union (mIoU), and the Dice coefficient. All of these except accuracy were used to assess model performance. A custom loss function for mIoU was implemented, and all model architectures were trained on this custom loss.
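The depth-wise separable atrous convolution described above can be sketched in NumPy: one per-channel (depth-wise) kernel sampled at a dilation rate, followed by a 1x1 pointwise convolution that mixes channels, plus the parameter counts that explain the savings. This is an illustrative sketch, not the authors' implementation; the function names, the ASPP-style wrapper, and the dilation rates shown are assumptions.

```python
import numpy as np

def depthwise_atrous_conv(x, dw_kernels, rate):
    """Depth-wise convolution: one k x k kernel per input channel,
    sampled with the given dilation rate ('same' padding, stride 1).
    x: (h, w, c) feature map; dw_kernels: (k, k, c)."""
    h, w, c = x.shape
    k = dw_kernels.shape[0]
    pad = (k // 2) * rate
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c), dtype=float)
    for i in range(h):
        for j in range(w):
            # gather the dilated k x k neighbourhood for every channel
            patch = xp[i : i + k * rate : rate, j : j + k * rate : rate, :]
            out[i, j, :] = (patch * dw_kernels).sum(axis=(0, 1))
    return out

def pointwise_conv(x, pw_kernels):
    """1x1 convolution mixing channels: pw_kernels has shape (c_in, c_out)."""
    return x @ pw_kernels

def aspp_separable(x, dw_kernels, pw_kernels, rates=(1, 2, 3)):
    """ASPP-style bridge: separable atrous convolutions at several
    dilation rates, concatenated along the channel axis."""
    outs = [pointwise_conv(depthwise_atrous_conv(x, dw_kernels, r), pw_kernels)
            for r in rates]
    return np.concatenate(outs, axis=-1)

# Parameter counts (ignoring biases) behind the GFLOP/parameter savings:
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out
```

For a 3x3 kernel with 64 input and 64 output channels, a standard convolution needs 36,864 weights while the separable version needs 4,672, roughly an 8x reduction; this is the source of the parameter and GFLOP savings reported in the results.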
4 DATASET
The experiments were performed on the Kvasir-SEG dataset [8]. This dataset consists of one thousand polyp images. The ground truth for each of these images is provided as an image mask in a separate folder.
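The 80/10/10 train/validation/test split of these one thousand images used in the experiments (Section 5) can be sketched as below; the `split_dataset` helper and the fixed seed are hypothetical, not the authors' code.

```python
import random

def split_dataset(ids, seed=0):
    """Shuffle the image ids and split them 80/10/10 into
    training, validation, and testing sets."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for this sketch
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train : n_train + n_val]
    test = ids[n_train + n_val :]
    return train, val, test
```

With the 1000 Kvasir-SEG images this yields 800 training, 100 validation, and 100 testing images.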
5 RESULTS AND ANALYSIS
All the experiments were performed on Google Colab, which provides sessions of up to 12 hours. A 12-hour session is not enough to fully train a deep learning model, so to make a fair comparison the number of epochs was kept the same for all the experiments. The data was split into training, validation, and testing sets with a ratio of 80, 10, and 10 percent respectively. With this split, 800 images were selected for model training. These 800 images are not enough to train a deep learning model, so data augmentation was applied to the training set. The validation and testing sets were not modified, and thus each contains 100 images. Thirty different augmentations were applied to the training set, after which its size grew to 24,800 images. The augmentations were also applied to the provided masks so that the target variable is transformed in the same way as the input image.

The optimizer used for training was the NAdam optimizer with a learning rate of 0.0001 and a batch size of 8. The training and validation loss were recorded for each epoch; the resulting learning curve provides insight into model convergence.

Figure 2 shows the learning curve for each architecture. The architecture with the DASPP bridge appears to have converged within 10 epochs, as its validation error started increasing. However, ResUNet++ and the separably convolved ResUNet++ show that these models could be trained for a few more epochs, as both training and validation error are still decreasing. For U-Net, the learning curve is also still decreasing at the 10th epoch; however, its loss is higher than that of the ResUNet++ architecture.

Figure 2: Learning Curve

Model                  Recall   Precision  Dice     mIoU
Unet                   75.23%   84.52%     71.91%   59.53%
resunet++              64.97%   89.81%     78.35%   69.48%
sepv_conv_resunet++    60.55%   93.31%     77.25%   67.56%
dsapp_resunet++        69.72%   82.62%     76.66%   66.71%
dsapp_relu_resunet++   61.54%   92.33%     74.63%   66.03%
Table 1: Test Data Results

Table 1 gives the performance of each model on the testing data. The performance of ResUNet++ on the Dice coefficient and mIoU is better than that of the other models. The model with separable convolution has comparable results on the Dice and mIoU metrics. However, the model with the DASPP bridge did not perform well. This shows that increasing the depth any further did not improve performance. The size of the model, measured by the number of parameters and Giga-Floating-Point Operations (GFLOPs), is best for the model with separable convolution. These results are compiled in Table 2. Fewer parameters mean that the model is smaller and may be easier to move into a production environment.

Model                  Params      GFLOPs
Unet                   3,588,997   7,165,148
resunet++              4,371,265   8,718,068
sepv_conv_resunet++    3,047,265   6,070,057
dsapp_resunet++        5,024,705   10,024,898
dsapp_relu_resunet++   5,024,705   10,024,898
Table 2: Model Size

6 CONCLUSION AND FUTURE WORK
This research gives empirical evidence of the advantage of using depth-wise separable convolution, which resulted in a smaller model size without significantly affecting the performance. It has also been shown that increasing the depth further may not improve performance and can result in overfitting of the model. Hyper-parameter tuning and a larger number of epochs would give a better understanding of the performance.
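The Dice coefficient and IoU used for Table 1 can be computed for binary masks as in the sketch below. The exact form of the custom mIoU loss is not given in the paper, so the `iou_loss` shown (1 − IoU) is an assumption, not the authors' implementation.

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    """Intersection over Union for binary (0/1) masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)

def dice(pred, target, eps=1e-7):
    """Dice coefficient for binary (0/1) masks."""
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou_loss(pred, target):
    """Assumed custom loss: minimizing 1 - IoU maximizes mask overlap."""
    return 1.0 - iou(pred, target)
```

The small `eps` keeps the ratios defined when both masks are empty; mIoU is then the mean of `iou` over the test images.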
REFERENCES
[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and
Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolu-
tional nets, atrous convolution, and fully connected crfs. IEEE transactions on
pattern analysis and machine intelligence 40, 4 (2017), 834–848.
[2] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig
Adam. 2018. Encoder-decoder with atrous separable convolution for semantic
image segmentation. In Proceedings of the European conference on computer vision
(ECCV). 801–818.
[3] François Chollet. 2017. Xception: Deep learning with depthwise separable con-
volutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition. 1251–1258.
[4] Taha Emara, Hossam E Abd El Munim, and Hazem M Abbas. 2019. LiteSeg: A
Novel Lightweight ConvNet for Semantic Segmentation. In 2019 Digital Image
Computing: Techniques and Applications (DICTA). IEEE, 1–7.
[5] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
[6] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and
Wenyu Liu. 2019. Ccnet: Criss-cross attention for semantic segmentation. In
Proceedings of the IEEE International Conference on Computer Vision. 603–612.
[7] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard Johansen, Dag Johansen,
Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico Multimedia
Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of the MediaEval
2020 Workshop.
[8] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange,
Dag Johansen, and Håvard D Johansen. 2020. Kvasir-seg: A segmented polyp
dataset. In International Conference on Multimedia Modeling. Springer, 451–462.
[9] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Dag Johansen, Thomas De Lange,
Pål Halvorsen, and Håvard D Johansen. 2019. Resunet++: An advanced architec-
ture for medical image segmentation. In 2019 IEEE International Symposium on
Multimedia (ISM). IEEE, 225–2255.