=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_75
|storemode=property
|title=Automatic Polyp Segmentation Using U-Net-ResNet50
|pdfUrl=https://ceur-ws.org/Vol-2882/paper75.pdf
|volume=Vol-2882
|authors=Saruar Alam,Nikhil Kumar Tomar,Aarati Thakur,Debesh Jha,Ashish Rauniyar
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AlamTTJR20
}}
==Automatic Polyp Segmentation Using U-Net-ResNet50==
<pdf width="1500px">https://ceur-ws.org/Vol-2882/paper75.pdf</pdf>
<pre>
Automatic Polyp Segmentation Using U-Net-ResNet50

Saruar Alam¹, Nikhil Kumar Tomar², Aarati Thakur³, Debesh Jha²,⁴, Ashish Rauniyar⁵,⁶

¹ University of Bergen, Norway
² SimulaMet, Norway
³ Nepal Medical College, Kathmandu University, Nepal
⁴ UiT The Arctic University of Norway
⁵ University of Oslo, Norway
⁶ Oslo Metropolitan University, Norway

saruar.alam@uib.no, nikhilroxtomar@gmail.com, absolute2iti@gmail.com, debesh@simula.no,
ashish@oslomet.no

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, 14-15 December 2020, Online

ABSTRACT
Polyps are precursors to colorectal cancer, which is considered one of the leading
causes of cancer-related deaths worldwide. Colonoscopy is the standard procedure for the
identification, localization, and removal of colorectal polyps. Due to variability in
shape, size, and surrounding tissue similarity, colorectal polyps are often missed by
clinicians during colonoscopy. With the use of an automatic, accurate, and fast polyp
segmentation method during colonoscopy, many colorectal polyps can be easily detected
and removed. The “Medico automatic polyp segmentation challenge” provides an opportunity
to study polyp segmentation and build an efficient and accurate segmentation algorithm.
We use the U-Net with a pre-trained ResNet50 as the encoder for polyp segmentation. The
model is trained on the Kvasir-SEG dataset provided for the challenge and tested on the
organizer’s dataset, where it achieves a Dice coefficient of 0.8154, Jaccard of 0.7396,
recall of 0.8533, precision of 0.8532, accuracy of 0.9506, and F2 score of 0.8272,
demonstrating the generalization ability of our model.

1 INTRODUCTION
Identification and removal of polyps during colonoscopy have become a standard
procedure. It is often challenging to detect polyps, as they are often hard to
differentiate from surrounding normal tissue. These polyps are usually covered with
stool, mucosa, and other materials that can obscure the correct diagnosis. This is
especially true for the small, flat, and sessile polyps that are typically not visible
during colonoscopy. Moreover, this increases the miss rate of polyps up to 25% [8] and
increases the risk of colorectal cancer in the affected patient. A 1% increase in the
adenoma detection rate leads to a 3% decrease in the risk of colorectal cancer [3].
Recently, deep learning techniques have been developed to overcome these challenges and
improve polyp detection accuracy during colonoscopy. Deep learning based polyp
segmentation methods have been successfully applied to automatic polyp detection in
real time.
   Automatic polyp segmentation plays an important role in the identification and
localization of polyps in the affected regions. It helps in analyzing images or even
video frames and classifying each pixel into the polyp or non-polyp class. This allows
the clinician to identify the polyp in the affected region easily, quickly, and more
accurately. Automated polyp segmentation can help in the development of a
Computer-Aided Diagnosis (CADx) system that is specially designed for colonoscopy
procedures.
   The “Medico Automatic Polyp Segmentation Challenge” [6] consists of two tasks. The
first task is the “Polyp segmentation task” and the second is the “Algorithm efficiency
task”. We have submitted our model to task 1 only.

2 RELATED WORKS
For the semantic segmentation task, encoder-decoder networks like FCN [9], U-Net [10],
etc. are mostly preferred over other approaches. U-Net and its variants are used for
both natural image segmentation and biomedical image segmentation. In general, the
encoder uses multiple convolutions to learn and capture the essential semantic features
ranging from low-level to high-level. The decoder upscales these features, and the
upscaled features are then concatenated with the features from the encoder using the
skip connections, followed by convolution layers that generate the final output in the
form of a binary mask.
   The encoder acts as a feature extractor, while the decoder uses the features
extracted from the input to produce the desired segmentation mask. The encoder can be
replaced by a pre-trained network such as VGG16 [12], VGG19 [12], etc. These pre-trained
networks are already trained on the ImageNet [11] dataset and have the necessary feature
extraction capabilities. Architectures like SegNet [2] and TernausNet [5] use
pre-trained VGG16 and VGG11, respectively, for the segmentation task.
   With the success of the residual network [4], ResNet50 is one of the commonly used
architectures for any transfer learning task. The basic residual block uses two 3 × 3
convolutional layers and an identity mapping. Each convolution layer is followed by a
batch normalization layer and a Rectified Linear Unit (ReLU) activation function. The
identity mapping is the shortcut connection connecting the input and output of the
convolutional layers. The identity mapping helps in building a deeper neural network by
alleviating the problem of vanishing gradients and exploding gradients.
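
To make the role of the identity mapping concrete, the residual block described in this
section can be written out as below. This is only a sketch in the spirit of [4]: the
symbols x (block input), y (block output before the final activation), W_1 and W_2 (the
two 3 × 3 convolution kernels), and the loss L are introduced here for illustration, and
the final ReLU is left out of the gradient for simplicity.

```latex
\begin{align}
  \mathcal{F}(x) &= \mathrm{BN}\bigl(W_2 * \mathrm{ReLU}(\mathrm{BN}(W_1 * x))\bigr), \\
  y &= \mathcal{F}(x) + x, \\
  \frac{\partial \mathcal{L}}{\partial x}
    &= \frac{\partial \mathcal{L}}{\partial y}
       \left( I + \frac{\partial \mathcal{F}}{\partial x} \right).
\end{align}
```

The identity term I passes the gradient from y back to x unchanged, even when the
Jacobian of the convolutional branch is small, which is why the shortcut makes deep
stacks of such blocks easier to train.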


                                          Figure 1: The proposed U-Net-ResNet50 architecture
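
As a rough guide to the architecture in Figure 1, the sketch below traces feature-map
shapes through a ResNet50 encoder and a U-Net style decoder, following the description
in Section 3 (transpose convolution, concatenation with the matching encoder feature
map, then two 3 × 3 convolutions). The encoder channel counts and strides are those of
the standard ResNet50. The 256 × 256 input size, the choice of skip connections, and
the decoder widths are assumptions made for illustration; they are not stated in the
paper and need not match what the FastAI library builds internally.

```python
# Shape walk-through for a U-Net with a ResNet50 encoder (illustrative only).
# ASSUMPTIONS: input size, skip placement, and decoder widths are hypothetical.

def encoder_shapes(size):
    """Return (name, channels, spatial size) for the standard ResNet50 stages."""
    # (stage name, output channels, total downsampling factor)
    stages = [("stem", 64, 2), ("layer1", 256, 4), ("layer2", 512, 8),
              ("layer3", 1024, 16), ("layer4", 2048, 32)]
    return [(name, ch, size // stride) for name, ch, stride in stages]


def decoder_shapes(enc, widths=(512, 256, 128, 64, 32)):
    """Return (spatial size, channels after concat, channels after convs) per block."""
    _, _, size = enc[-1]                      # start from the deepest encoder output
    skips = list(reversed(enc[:-1])) + [None]  # last block has no skip (assumption)
    trace = []
    for width, skip in zip(widths, skips):
        size *= 2                              # transpose convolution doubles resolution
        skip_ch = 0
        if skip is not None:
            assert skip[2] == size, "skip feature map must match the upscaled size"
            skip_ch = skip[1]
        # concatenation adds the skip channels; two 3x3 convs then reduce to `width`
        trace.append((size, width + skip_ch, width))
    return trace


if __name__ == "__main__":
    enc = encoder_shapes(256)
    for name, ch, s in enc:
        print(f"encoder {name:7s}: {ch:4d} x {s} x {s}")
    for s, cat_ch, out_ch in decoder_shapes(enc):
        print(f"decoder @ {s:3d}: concat {cat_ch:4d} -> conv {out_ch}")
    # A final 1x1 convolution followed by a sigmoid would map the last decoder
    # output to a single-channel mask at the input resolution.
```

With these assumed settings, the decoder returns to the input resolution after five
upscaling steps, so the final 1 × 1 convolution and sigmoid can emit one mask value per
input pixel.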


3 APPROACH
Figure 1 shows an overview of the proposed U-Net-ResNet50 architecture. It is an
encoder-decoder based architecture, where ResNet50 trained on the ImageNet dataset [11]
is used as the encoder. The use of a pre-trained encoder helps the model to converge
easily. The input image is fed into the pre-trained ResNet50 encoder, which consists of
a series of residual blocks as its main component. These residual blocks help the
encoder extract the important features from the input image, which are then passed to
the decoder. The decoder starts with a transpose convolution that upscales the incoming
feature maps into the desired shape. Next, these upscaled feature maps are concatenated
with the feature maps of matching shape from the pre-trained encoder via skip
connections. These skip connections help the model to get all the low-level semantic
information from the encoder, which allows the decoder to generate the desired feature
maps. After that, it is followed by two 3 × 3 convolution layers, where each layer is
followed by a batch normalization layer and a ReLU non-linearity. The last decoder
block’s output is passed to a 1 × 1 convolution layer, which is further passed to a
sigmoid activation function, finally generating the desired binary mask.
   The FastAI (version 2.0) library [1] is used to train and evaluate our model. We
have employed resizing, flipping, rotating, zooming, lighting, warping, and normalizing
intensity based on the ImageNet dataset to augment the input images for training. The
model uses the Adam optimizer with an initial learning rate of 10^-2, and cross-entropy
loss as its loss function. We have employed the one-cycle policy, where the learning
rate changes during training and achieves super-convergence [13]. We trained for just
50 epochs, by which point the model had converged.

4 RESULTS AND ANALYSIS
The Medico Automatic Polyp Segmentation challenge [6] provides an opportunity to study
the potential and challenges of automated polyp segmentation. This study aims at
building a model that performs well on the organizer’s dataset while training on a
separate Kvasir-SEG dataset [7].
   Table 1 shows the overall results of the U-Net-ResNet50 architecture on the
Kvasir-SEG test dataset and the organizer’s test dataset provided for the final
evaluation of the model. For the evaluation of the model, the Jaccard index,
Sørensen-Dice coefficient (DSC), recall, precision (Prec.), accuracy (Acc.), and the F2
score are used as the evaluation metrics. Our trained U-Net-ResNet50 model achieved a
Dice coefficient of 0.8154, Jaccard of 0.7396, recall of 0.8533, precision of 0.8532,
accuracy of 0.9506, and F2 score of 0.8272 on the organizer’s test dataset, as can be
seen in Table 1. These results demonstrate the generalization ability of our model.
Moreover, Table 1 also shows that the recall on the organizer’s test dataset is 1.00
percentage point higher than on the Kvasir-SEG test dataset (0.8533 versus 0.8433).
This indicates that the model is not overfitting.

Table 1: Quantitative Results on Kvasir-SEG and Test Set (Challenge) Dataset for Task 1.

Dataset      Jaccard   DSC      Recall   Prec.    Acc.     F2
Kvasir-SEG   0.7871    0.8926   0.8433   0.9207   0.9639   0.8585
Test Set     0.7396    0.8154   0.8533   0.8532   0.9506   0.8272

5 CONCLUSION & FUTURE WORK
With our U-Net-ResNet50, we achieved competitive performance on the organizer’s dataset
with a Dice coefficient of 0.8154. By replacing the U-Net encoder with a pre-trained
ResNet50 and employing a one-cycle policy during training, we are able to converge the
model in a short time. This reduces the training time, as the encoder weights are not
initialized from scratch. This is an important step towards faster convergence, which
would be useful when the availability of high-performance computing resources is
limited.
   In the future, we would like to experiment with more than one pre-trained encoder by
fusing their feature maps and using them for training our model.

ACKNOWLEDGMENTS
The computations in this paper were performed on the equipment provided by the
Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is
financially supported by the Research Council of Norway under the contract 270053.
   The authors would also like to thank the machine learning group of Mohn Medical
Imaging and Visualization (MMIV) Centre, Norway, for providing the computing
infrastructure for the experiments.
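
The one-cycle policy mentioned in Section 3 can be illustrated with a small,
library-free schedule. This is a sketch under stated assumptions, not the authors'
training code: it treats 10^-2 as the peak learning rate, and the parameters `div`,
`div_final`, and `pct_start` (with the values used below) are illustrative choices
modelled on common one-cycle settings; the paper does not report them.

```python
import math


def one_cycle_lr(step, total_steps, lr_max=1e-2, div=25.0,
                 div_final=1e5, pct_start=0.25):
    """Cosine one-cycle schedule (illustrative; all defaults are assumptions).

    The rate warms up from lr_max/div to lr_max over the first `pct_start`
    fraction of training, then anneals down to lr_max/div_final.
    """
    def cos_anneal(start, end, frac):
        # frac = 0 gives `start`, frac = 1 gives `end`
        return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

    warm = int(total_steps * pct_start)
    if step < warm:
        return cos_anneal(lr_max / div, lr_max, step / warm)
    remaining = max(1, total_steps - warm - 1)
    return cos_anneal(lr_max, lr_max / div_final, (step - warm) / remaining)


if __name__ == "__main__":
    total = 100
    for s in (0, 10, 25, 50, 99):
        print(f"step {s:3d}: lr = {one_cycle_lr(s, total):.2e}")
```

The rate rises to its peak early and then decays for most of training, which is the
behaviour that [13] associates with super-convergence.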
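
The evaluation metrics named in Section 4 can be computed from pixel-level confusion
counts as sketched below. The formulas are the standard definitions; the function name
and the flat 0/1 input format are hypothetical, and how the challenge organizers
aggregate scores over images is not stated here, so this sketch is not expected to
reproduce Table 1.

```python
def segmentation_metrics(pred, target):
    """Compute binary segmentation metrics from flat sequences of 0/1 labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    # Pixel-level confusion counts
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "jaccard": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "recall": recall,
        "precision": precision,
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        # F-beta with beta = 2 weights recall more heavily than precision
        "f2": ratio(5 * precision * recall, 4 * precision + recall),
    }


if __name__ == "__main__":
    scores = segmentation_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
    for name, value in scores.items():
        print(f"{name}: {value:.4f}")
```

For a single mask, the Dice coefficient and the Jaccard index J are tied by
Dice = 2J / (1 + J), so the two always rank individual predictions in the same order.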


REFERENCES
[1] 2020. FastAI Library. (2020). https://docs.fast.ai/.
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. Segnet: A deep
    convolutional encoder-decoder architecture for image segmentation. IEEE transactions
    on pattern analysis and machine intelligence 39, 12 (2017), 2481–2495.
[3] Douglas A Corley, Christopher D Jensen, Amy R Marks, Wei K Zhao, Jeffrey K Lee,
    Chyke A Doubeni, Ann G Zauber, Jolanda de Boer, Bruce H Fireman, Joanne E
    Schottinger, and others. 2014. Adenoma detection rate and risk of colorectal cancer
    and death. New england journal of medicine 370, 14 (2014), 1298–1306.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning
    for image recognition. In Proceedings of the IEEE conference on computer vision and
    pattern recognition. 770–778.
[5] Vladimir Iglovikov and Alexey Shvets. 2018. Ternausnet: U-net with vgg11 encoder
    pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746
    (2018).
[6] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard D. Johansen, Dag Johansen,
    Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. Medico Multimedia Task at
    MediaEval 2020: Automatic Polyp Segmentation.
[7] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag
    Johansen, and Håvard D Johansen. 2020. Kvasir-SEG: A Segmented Polyp Dataset. In
    Proc. of International Conference on Multimedia Modeling (MMM). 451–462.
[8] Sheila Kumar, Nirav Thosani, Uri Ladabaum, Shai Friedland, Ann M Chen, Rajan Kochar,
    and Subhas Banerjee. 2017. Adenoma miss rates associated with a 3-minute versus
    6-minute colonoscopy withdrawal time: a prospective, randomized trial.
    Gastrointestinal endoscopy 85, 6 (2017), 1273–1280.
[9] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional
    networks for semantic segmentation. In Proceedings of the IEEE conference on
    computer vision and pattern recognition. 3431–3440.
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional
    networks for biomedical image segmentation. In International Conference on Medical
    image computing and computer-assisted intervention. Springer, 234–241.
[11] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
    Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, and others. 2015.
    Imagenet large scale visual recognition challenge. International journal of computer
    vision 115, 3 (2015), 211–252.
[12] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for
    large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[13] Leslie N Smith and Nicholay Topin. 2019. Super-convergence: Very fast training of
    neural networks using large learning rates. In Artificial Intelligence and Machine
    Learning for Multi-Domain Operations Applications, Vol. 11006. International Society
    for Optics and Photonics, 1100612.

</pre>