<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Polyp Segmentation Using U-Net-ResNet50</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saruar Alam</string-name>
          <email>saruar.alam@uib.no</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikhil Kumar Tomar</string-name>
          <email>nikhilroxtomar@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aarati Thakur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debesh Jha</string-name>
          <email>debesh@simula.no</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashish Rauniyar</string-name>
          <email>ashish@oslomet.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nepal Medical College, Kathmandu University</institution>
          ,
          <country country="NP">Nepal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oslo Metropolitan University</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SimulaMet</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>UiT The Arctic University of Norway</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Bergen</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Polyps are precursors to colorectal cancer, which is one of the leading causes of cancer-related deaths worldwide. Colonoscopy is the standard procedure for the identification, localization, and removal of colorectal polyps. Due to variability in shape and size and similarity to the surrounding tissue, colorectal polyps are often missed by clinicians during colonoscopy. With an automatic, accurate, and fast polyp segmentation method available during colonoscopy, many colorectal polyps can be easily detected and removed. The “Medico automatic polyp segmentation challenge” provides an opportunity to study polyp segmentation and build an efficient and accurate segmentation algorithm. We use U-Net with a pre-trained ResNet50 as the encoder for polyp segmentation. The model is trained on the Kvasir-SEG dataset provided for the challenge and tested on the organizer's dataset, achieving a dice coefficient of 0.8154, Jaccard of 0.7396, recall of 0.8533, precision of 0.8532, accuracy of 0.9506, and F2 score of 0.8272, demonstrating the generalization ability of our model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Identification and removal of polyps during colonoscopy have
become a standard procedure. Detecting polyps is often challenging,
as they can be hard to differentiate from the surrounding normal
tissue. These polyps are frequently covered with stool, mucosa, and
other materials that can obscure a correct diagnosis. This is
especially true for small, flat, and sessile polyps, which are often hard to
see during colonoscopy. As a result, the miss-rate
of polyps is up to 25% [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and the risk of colorectal cancer
in the affected patient increases. A 1% increase in the adenoma detection
rate leads to a 3% decrease in the risk of colorectal cancer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Recently, deep learning techniques have been developed to overcome
these challenges and improve polyp detection accuracy during
colonoscopy. Deep-learning-based polyp segmentation methods
have been successfully applied for automatic polyp detection in
real time.
      </p>
      <p>Automatic polyp segmentation plays an important role in the
identification and localization of polyps in the affected regions.
It helps in analyzing images or even video frames and classifying
each pixel as a polyp or non-polyp instance. This allows the
clinician to identify polyps in the affected region easily, quickly,
and more accurately. Automated polyp segmentation can also support
the development of a Computer-Aided Diagnosis (CADx) system
specially designed for colonoscopy procedures.</p>
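      <p>To make the pixel-wise formulation concrete, the following minimal sketch (our illustration, not part of any published pipeline) thresholds per-pixel probabilities, such as sigmoid outputs, into a polyp/non-polyp mask:</p>

```python
def binarize(probs, threshold=0.5):
    """Turn a 2-D grid of per-pixel polyp probabilities (e.g. sigmoid
    outputs) into a binary mask: 1 = polyp, 0 = non-polyp."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]

# Example: a 2x3 probability map.
mask = binarize([[0.9, 0.4, 0.7],
                 [0.2, 0.5, 0.1]])
# mask == [[1, 0, 1], [0, 1, 0]]
```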
      <p>
        The “Medico Automatic Polyp Segmentation Challenge” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
consists of two tasks. The first is the “Polyp segmentation task” and
the second is the “Algorithm efficiency task”. We submitted our
model to task 1 only.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORKS</title>
      <p>
        For semantic segmentation task, encoder-decoder networks like
FCN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], U-Net [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], etc. are generally preferred over other approaches.
U-Net and its variants are used for both natural image segmentation
and biomedical image segmentation. In general, the encoder uses
multiple convolutions to learn and capture the essential semantic
features ranging from low-level to high-level, and the decoder then
upscales them. These upscaled features are concatenated with the
corresponding features from the encoder via skip connections and
passed through convolution layers to generate the final output in
the form of a binary mask.
      </p>
      <p>
        The encoder acts as a feature extractor, and the decoder uses the
features extracted from the input to produce the desired
segmentation mask. The encoder can be replaced by a pre-trained network
such as VGG16 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], VGG19 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], etc. These pre-trained networks
are already trained on the ImageNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] dataset and have the
necessary feature extraction capabilities. Architectures like SegNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and TernausNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] use pre-trained VGG16 and VGG11 encoders, respectively,
for the segmentation task.
      </p>
      <p>
        With the success of the residual network [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], ResNet50 is one
of the most commonly used architectures for transfer learning tasks.
The residual network uses two 3 × 3 convolutional layers and an
identity mapping. Each convolution layer is followed by a batch
normalization layer and a Rectified Linear Unit (ReLU) activation
function. The identity mapping is the shortcut connection
connecting the input and output of the convolutional layer. The identity
mapping helps in building a deeper neural network by eliminating
the problem of vanishing gradients and exploding gradients.
Figure 1 shows an overview of the proposed U-Net-ResNet50
architecture. It is an encoder-decoder based architecture, where ResNet50
trained on ImageNet dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is used. The use of a pre-trained
encoder helps the model converge more easily. The input image is
fed into the pre-trained ResNet50 encoder, which consists of a series of
residual blocks as its main component. These residual blocks help
the encoder extract the important features from the input image,
which are then passed to the decoder. Each decoder block starts with a
transpose convolution that upscales the incoming feature maps to the
desired shape. Next, these upscaled feature maps are concatenated
with the feature maps of matching shape from the pre-trained encoder
via skip connections. These skip connections provide the decoder with
the low-level semantic information from the encoder, which
allows the decoder to generate the desired feature maps. This is
followed by two 3 × 3 convolution layers, each
followed by a batch normalization layer and a ReLU non-linearity.
The last decoder block’s output is passed to a 1×1 convolution layer,
which is further passed to a sigmoid activation function, finally
generating the desired binary mask.
      </p>
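      <p>The encoder-decoder shape bookkeeping described above can be sketched as follows. This is our illustration, not the authors’ code: the encoder channel widths are those of the standard ResNet50 stages, while the decoder widths are hypothetical choices.</p>

```python
# Feature-map shapes through a U-Net with a ResNet50 encoder.
ENC_CHANNELS = [64, 256, 512, 1024, 2048]  # conv1, layer1..layer4 outputs

def encoder_shapes(input_size=256):
    """(channels, height, width) after each ResNet50 stage; each stage
    halves the spatial resolution of a square input."""
    shapes = []
    size = input_size
    for ch in ENC_CHANNELS:
        size //= 2
        shapes.append((ch, size, size))
    return shapes

def decoder_shapes(input_size=256, dec_channels=(1024, 512, 256, 64)):
    """Each decoder block: a transpose convolution doubles the resolution,
    then the result is concatenated (along channels) with the matching
    encoder skip feature map."""
    enc = encoder_shapes(input_size)
    ch, size, _ = enc[-1]                  # bottleneck, e.g. (2048, 8, 8)
    out = []
    for dec_ch, (skip_ch, skip_size, _) in zip(dec_channels, reversed(enc[:-1])):
        size *= 2                          # transpose convolution upscales
        assert size == skip_size           # must match the skip's resolution
        ch = dec_ch + skip_ch              # concatenation along channels
        out.append((ch, size, size))
    return out
```

For a 256×256 input, the bottleneck is (2048, 8, 8) and each decoder block restores one factor of two in resolution, ending at (128, 128, 128) before the final 1×1 convolution and sigmoid.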
      <p>
        The FastAI (version 2.0) library [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used to train and evaluate
our model. We employed resizing, flipping, rotation,
zooming, lighting adjustment, warping, and intensity normalization based on the
ImageNet dataset to augment the input images for training. The
model uses the Adam optimizer with an initial learning rate of 10⁻²
and cross-entropy loss as its loss function. We employed the
one-cycle policy, where the learning rate changes during training
and achieves super-convergence [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We trained for only 50 epochs,
by which point the model had converged.
      </p>
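      <p>The one-cycle policy can be sketched as a cosine-interpolated warm-up toward a peak learning rate, followed by annealing well below the starting value. The following is a minimal illustration in the spirit of fastai’s fit_one_cycle; the parameter names and defaults here are our hypothetical choices, not fastai’s exact API:</p>

```python
import math

def one_cycle_lr(step, total_steps, lr_max=1e-2, pct_start=0.25,
                 div=25.0, div_final=1e4):
    """Learning rate at a given training step under a cosine one-cycle
    schedule: lr_max/div -> lr_max during warm-up, then lr_max ->
    lr_max/div_final during annealing."""
    warm = int(total_steps * pct_start)
    if step >= warm:                       # annealing phase
        t = (step - warm) / (total_steps - warm)
        lo, hi = lr_max, lr_max / div_final
    else:                                  # warm-up phase
        t = step / warm
        lo, hi = lr_max / div, lr_max
    # cosine interpolation from lo (t = 0) to hi (t = 1)
    return hi + (lo - hi) * (1 + math.cos(math.pi * t)) / 2
```

With lr_max = 10⁻² and 100 steps, the schedule starts at 4·10⁻⁴, peaks at 10⁻² a quarter of the way through, and anneals to 10⁻⁶ by the final step.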
    </sec>
    <sec id="sec-3">
      <title>RESULTS AND ANALYSIS</title>
      <p>
        The Medico Automatic Polyp Segmentation challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] provides
an opportunity to study the potential and challenges of automated
polyp segmentation. This study aims to build a model that
performs well on the organizer’s dataset while being trained on the separate
Kvasir-SEG dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Table 1 shows the overall results of the U-Net-ResNet50
architecture on the Kvasir-SEG test dataset and on the organizer’s test dataset
provided for the final evaluation of the model. The Jaccard index,
Sørensen-Dice coefficient (DSC), recall, precision (Prec.),
accuracy (Acc.), and the F2 score are used as the evaluation metrics.
Our trained U-Net-ResNet50 model achieved
a dice coefficient of 0.8154, Jaccard of 0.7396, recall of 0.8533,
precision of 0.8532, accuracy of 0.9506, and F2 score of 0.8272 on the
organizer’s test dataset, as shown in Table 1. These
results demonstrate the generalization ability of our model.
Moreover, Table 1 also shows that the recall on the organizer’s test
dataset is 1.00% higher than on the Kvasir-SEG test dataset, which
suggests that the model is not overfitting.</p>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSION &amp; FUTURE WORK</title>
      <p>With our U-Net-ResNet50, we achieved competitive performance on
the organizer’s dataset, with a dice coefficient of 0.8154. By replacing
the U-Net encoder with a pre-trained ResNet50 and employing the
one-cycle policy during training, we were able to converge the model
in a short time. Because the encoder weights are not initialized from
scratch, the training time is reduced. This is an
important step towards faster convergence, which is useful
when the availability of high-performance computing resources is
limited.</p>
      <p>In the future, we would like to experiment with more than one
pre-trained encoder by fusing their feature maps and using them
for training our model.</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>The computations in this paper were performed on the equipment
provided by the Experimental Infrastructure for Exploration of
Exascale Computing (eX3), which is financially supported by the
Research Council of Norway under the contract 270053.</p>
      <p>The authors would also like to thank the machine learning group
of Mohn Medical Imaging and Visualization (MMIV) Centre,
Norway, for providing the computing infrastructure for the
experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <year>2020</year>
          . FastAI Library. https://docs.fast.ai/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Badrinarayanan</surname>
          </string-name>
          , Alex Kendall, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Cipolla</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Segnet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>39</volume>
          , 12 (
          <year>2017</year>
          ),
          <fpage>2481</fpage>
          -
          <lpage>2495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Douglas A</given-names>
            <surname>Corley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Jensen</surname>
          </string-name>
          , Amy R Marks, Wei K Zhao, Jeffrey K Lee, Chyke A Doubeni, Ann G Zauber, Jolanda de Boer, Bruce H Fireman, Joanne E Schottinger, and others.
          <year>2014</year>
          .
          <article-title>Adenoma detection rate and risk of colorectal cancer and death</article-title>
          .
          <source>New England Journal of Medicine</source>
          <volume>370</volume>
          ,
          <issue>14</issue>
          (
          <year>2014</year>
          ),
          <fpage>1298</fpage>
          -
          <lpage>1306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>770</volume>
          -
          <fpage>778</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Iglovikov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Shvets</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation</article-title>
          .
          <source>arXiv preprint arXiv:1801.05746</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          , Steven A. Hicks, Krister Emanuelsen,
          <string-name>
            <given-names>Håvard D.</given-names>
            <surname>Johansen</surname>
          </string-name>
          , Dag Johansen, Thomas de Lange, Michael A. Riegler, and
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pia H</given-names>
            <surname>Smedsrud</surname>
          </string-name>
          , Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and
          <string-name>
            <given-names>Håvard D</given-names>
            <surname>Johansen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Kvasir-SEG: A Segmented Polyp Dataset</article-title>
          .
          <source>In Proc. of International Conference on Multimedia Modeling (MMM)</source>
          .
          <volume>451</volume>
          -
          <fpage>462</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Sheila</given-names>
            <surname>Kumar</surname>
          </string-name>
          , Nirav Thosani, Uri Ladabaum, Shai Friedland, Ann M Chen,
          <string-name>
            <given-names>Rajan</given-names>
            <surname>Kochar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Subhas</given-names>
            <surname>Banerjee</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Adenoma miss rates associated with a 3-minute versus 6-minute colonoscopy withdrawal time: a prospective, randomized trial</article-title>
          .
          <source>Gastrointestinal endoscopy 85</source>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <fpage>1273</fpage>
          -
          <lpage>1280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Long</surname>
          </string-name>
          , Evan Shelhamer, and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>3431</volume>
          -
          <fpage>3440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Brox</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>U-net: Convolutional networks for biomedical image segmentation</article-title>
          .
          <source>In International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          . Springer,
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Olga</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          , Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , and others.
          <year>2015</year>
          .
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>International journal of computer vision 115</source>
          , 3 (
          <year>2015</year>
          ),
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Leslie N</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicholay</given-names>
            <surname>Topin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Super-convergence: Very fast training of neural networks using large learning rates</article-title>
          .
          <source>In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications</source>
          , Vol.
          <volume>11006</volume>
          .
          <source>International Society for Optics and Photonics</source>
          ,
          <volume>1100612</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>