=Paper= {{Paper |id=Vol-2886/paper5 |storemode=property |title=Improving Generalizability in Polyp Segmentation using Ensemble Convolutional Neural Network |pdfUrl=https://ceur-ws.org/Vol-2886/paper5.pdf |volume=Vol-2886 |authors=Nikhil Kumar Tomar,Nabil Ibtehaz,Debesh Jha,Pål Halvorsen,Sharib Ali |dblpUrl=https://dblp.org/rec/conf/isbi/TomarIJHA21 }} ==Improving Generalizability in Polyp Segmentation using Ensemble Convolutional Neural Network== https://ceur-ws.org/Vol-2886/paper5.pdf
Improving Generalizability in Polyp Segmentation
using Ensemble Convolutional Neural Network
Nikhil Kumar Tomar (a), Nabil Ibtehaz (c), Debesh Jha (a, b), Pål Halvorsen (a) and Sharib Ali (d)
(a) SimulaMet, Oslo, Norway
(b) UiT The Arctic University of Norway, Tromsø, Norway
(c) Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
(d) Department of Engineering Science, Big Data Institute, University of Oxford, Oxford, UK


Abstract
Polyp segmentation is crucial for the diagnosis of colorectal cancer. Early detection and removal
of polyps can prolong the life of patients and reduce the mortality rate. Despite the near
expert-level performance of deep learning methods in polyp segmentation tasks, the generalization
of such models in the clinical environment remains a significant challenge. Transfer learning from
a large medical dataset from the same domain is a common technique to address generalizability.
However, it is difficult to find a similar large medical dataset. In this work, we investigate the
feasibility of building a generalizable model for polyp segmentation using an ensemble of four
MultiResUNet architectures, each trained on a different combination of the center-wise datasets
provided by the challenge organizers. Our method achieved a score of 0.6172 ± 0.0778 on the
multi-center dataset. Our findings show that significant work needs to be done to design a robust
segmentation model for the development of a clinically acceptable system.

Keywords
Polyp segmentation, colonoscopy, generalization, deep learning

1. Introduction
The medical field concerned with the digestive system is currently in the midst of a rising
wave of technology adoption for automatic analysis and decision support. With the increase of
publicly available datasets, adapted methodologies such as convolutional neural networks,
improved hardware, and increased collaboration between computer scientists and medical
communities, this development is gaining more momentum than ever before. Global Cancer
Statistics 2020 (GLOBOCAN 2020) estimated colorectal cancer to be the third most frequently
diagnosed cancer. Colorectal cancer accounts for 10.0% of all cancer cases, which is only
slightly below the two most frequently diagnosed cancers, i.e., female breast cancer (11.7%)
and lung cancer (11.4%) [1]. Screening and removal of adenomatous polyps and other
precancerous anomalies is one of the most effective methods for early detection and for
reducing colorectal cancer incidence and mortality [2].
   Deep learning-based methods have gained popularity in the development of computer-aided
diagnosis (CADx) systems for the detection of colorectal polyps [3, 4, 5]. The successful
deployment of a CADx system for polyp segmentation would require a trained model that achieves
high performance on unseen datasets irrespective of hospital, cohort population, and imaging
protocol. However, deep learning algorithms are data-driven, and generalizable algorithms
require large, high-quality, and diverse datasets for training. Creating such datasets requires
expert endoscopists and computer scientists for labeling and pixel-wise annotation. In general,
there are only a few publicly available datasets. Although some studies report high performance
on a specific dataset, the dataset is not publicly released [5, 3]. Therefore, it is challenging
to develop a generalizable polyp segmentation model with a limited or single-center dataset.

3rd International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2021) in conjunction with the
18th IEEE International Symposium on Biomedical Imaging ISBI2021, April 13th, 2021, Nice, France
sharib.ali@eng.ox.ac.uk (S. Ali)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   Challenges and competitions are a good way to access and explore new datasets for
experimentation. They are also a fair way to compare and analyze methods and to improve the
results on the provided dataset. Additionally, challenges help address the lack of available
datasets and support the development of reliable and clinically applicable methods. We
participated in the EndoCV2021 challenge¹ to explore a multi-center dataset and develop a
generalizable polyp segmentation CADx system. Our goal is to investigate and develop a
generalizable model, compare our results with those of the other participants in the challenge,
and observe our model’s behavior.
   The EndoCV2021 challenge offered two different tasks, namely, the Detection generalization
challenge and the Segmentation generalization challenge. We only participated in the
Segmentation generalization challenge, for which we used an ensemble model as our solution.
The main motivation behind using an ensemble method is that such methods have produced winning
results in different challenges [6, 7]. For our solution, we built an ensemble of four
MultiResUNet [8] models. In short, the main contributions of our work are as follows:

    • We explore a convolutional neural network-based model for the generalizable polyp
      segmentation task with a multi-center dataset. In this study, the training dataset is
      collected from five different medical institutions from five different countries, and test
      data comes from independent institutions.
    • Our work reveals that the proposed deep learning model has significant difficulty with
      images that contain bleeding or adenomas, or that are covered with dye. In such
      scenarios, the model mostly over-segmented or failed completely. We highlight these
      cases, which are among the significant challenges for developing a generalizable
      algorithm for the polyp segmentation task.

   The remainder of this paper is organized into five sections. Section 2 provides a short overview
of the related work. Section 3 gives an overview of the methodology, and Section 4 describes the
experimental setup. Section 5 presents the results obtained using the challenge dataset. Finally,
we summarize and conclude the paper in Section 6.


2. Related Work
CNN-based architectures for polyp segmentation have been a common strategy for the
development of CADx systems. We briefly describe the work on polyp segmentation and
generalizability in the subsections below.

    ¹ https://endocv2021.grand-challenge.org/
2.1. Polyp segmentation
There have been several studies on colorectal polyp segmentation [5, 3, 8, 6, 9]. Most of this
work proposes architectures based on U-Net [10]. The work ranges from improving segmentation
performance on publicly available datasets [11, 4, 12] to achieving real-time performance
[13, 14, 3]. Although mostly retrospective studies have been conducted [5, 13], there has also
been work that carried out prospective randomized controlled studies [15, 16]. However, most of
the studies were conducted on a dataset from a single center. Experiments on multi-center
datasets have often been neglected.

2.2. Generalizability
In medical image analysis, generalization refers to the ability of a machine learning algorithm
trained on specific interventions in specific health centers to perform well on other
interventions or at different health centers [7]. Poor generalizability has become one of the
major obstacles to the translation of deep learning methods into clinical practice [17].
Meta-learning under a few-shot setting has gained popularity for developing generalizable deep
learning models and resolving the issue of data scarcity [18, 19].
   In our previous study [4], due to the lack of a publicly available multi-center dataset, we
trained a model on one publicly available dataset [20] and tested it on another [21] to observe
its generalization capability. Additionally, we mixed datasets from two or more institutions to
observe the model’s generalization capability. This is our first work in which we have the
opportunity to train the model on a multi-center dataset (datasets from five different centers)
and benchmark it on a completely new dataset.


3. Methodology
To address the generalizability problem in polyp segmentation, we used an ensemble of four
MultiResUNet [8] models. As each folder of the dataset contains images from a unique center, we
use a different subset of the dataset to train each of the MultiResUNet models. The MultiResUNet
is an encoder-decoder architecture that improves on the existing U-Net [10] architecture. It
builds on the strengths of U-Net and replaces some of its components with more effective ones,
namely the “MultiRes block” and the “Res path”. The MultiResUNet consists of four encoder
blocks, four decoder blocks, and a bridge connecting them. The encoder takes the input image
and extracts useful features from it. These features are then passed to the decoder, where
they are upsampled and concatenated with the feature maps from the skip connections. Finally,
these features are used to generate a segmentation mask for the input image. The additional
blocks that form the MultiResUNet are briefly described below.

3.1. MultiRes block
The MultiRes block is the major component of the MultiResUNet [8] architecture. It replaces
the convolution block used in the U-Net, i.e., two 3 × 3 convolutions. The MultiRes block is
inspired by the Inception architecture [22], which uses multiple parallel convolutions with
3 × 3, 5 × 5, and 7 × 7 kernel sizes. These parallel convolutions help in capturing objects of
different shapes and sizes. Using the bigger 5 × 5 and 7 × 7 kernels increases the memory
requirement. Therefore, these bigger kernels are factorized and replaced by multiple 3 × 3
convolutions. The MultiRes block begins with a single 3 × 3 convolution. A second 3 × 3
convolution applied on top of it gives the resultant effect of a 5 × 5 convolution, and a third
3 × 3 convolution gives the resultant effect of a 7 × 7 convolution. The outputs of these
convolutions are concatenated to obtain feature maps at different scales. A residual
connection is also used, which connects the input to the concatenated output.
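
The factorization argument above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code: the helper names are hypothetical, and the weight counts assume equal input and output channel counts with no bias terms, which is a simplification of the actual MultiRes block.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels per side) of stacked stride-1 convolutions."""
    field = 1
    for _ in range(num_layers):
        field += kernel_size - 1  # each layer widens the field by k - 1
    return field


def single_conv_weights(kernel_size, channels):
    """Weights of one k x k convolution (assumed: channels in == channels out)."""
    return kernel_size * kernel_size * channels * channels


def factorized_weights(num_layers, channels):
    """Weights of `num_layers` stacked 3 x 3 convolutions (same assumption)."""
    return num_layers * single_conv_weights(3, channels)


if __name__ == "__main__":
    channels = 32  # illustrative value, not taken from the paper
    for layers, big_kernel in [(2, 5), (3, 7)]:
        print(
            f"{layers} stacked 3x3 convs cover {stacked_receptive_field(layers)} pixels; "
            f"weights {factorized_weights(layers, channels)} vs "
            f"{single_conv_weights(big_kernel, channels)} for one {big_kernel}x{big_kernel}"
        )
```

Under these assumptions, two and three stacked 3 × 3 convolutions reach the same receptive field as a 5 × 5 and a 7 × 7 kernel with fewer weights, which is the motivation for the factorization.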

3.2. Res path
The introduction of the skip connection in the U-Net architecture proved to be a significant
contribution towards improving semantic segmentation performance. These skip connections
carry information from the encoder to the decoder that would otherwise be lost during the
pooling operations. However, the simple concatenation of encoder features with decoder features
has a drawback. For example, the first skip connection carries low-level features from the
early layers, which are fused with high-level features in the decoder. Therefore, there is a
semantic gap between the features being merged. To reduce this semantic gap, the MultiResUNet
introduces convolutional layers with shortcut connections along the skip connection, called the
“Res path”.

3.3. MultiResUNet Architecture
The MultiResUNet [8] architecture begins by feeding the input image to the first encoder, which
consists of the MultiRes block, followed by a 2 × 2 max-pooling with a stride value of 2. The
max-pooled feature maps are passed on to the next encoder, and this process is repeated four
times. In each step, the number of filters doubles, and the spatial resolution reduces by half.
The output of the MultiRes block acts as the skip-connection, which first passes through the
Res path and joins the decoder block. Inside each Res path, the number of convolution blocks
decreases from 4, 3, 2 to 1 respectively along the four Res paths. The decoder begins with a
2 × 2 transpose convolution, which doubles the feature maps’ spatial dimensions. Next, the
feature maps are concatenated with the output of the Res path. Subsequently, the MultiRes
block is used to learn the semantic representation. This is followed by three more decoder
blocks, in which the number of filters decreases and the feature map resolution increases.
Finally, a 1 × 1 convolution with sigmoid activation generates the binary segmentation
mask.
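
The encoder progression described in this subsection can be tabulated with a short script. This is a sketch under stated assumptions: the function name is hypothetical, the base filter count of 32 is illustrative and not given in the paper, and the 256-pixel input size is taken from Section 4.4.

```python
def encoder_schedule(input_size=256, base_filters=32, num_blocks=4):
    """List filters, skip resolution and Res path length for each encoder block."""
    res_path_convs = [4, 3, 2, 1]  # convolution blocks per Res path, as in the text
    schedule = []
    size, filters = input_size, base_filters
    for level in range(num_blocks):
        schedule.append(
            {
                "level": level + 1,
                "filters": filters,
                "skip_resolution": size,           # resolution entering the Res path
                "res_path_convs": res_path_convs[level],
                "pooled_resolution": size // 2,    # after 2 x 2 max-pooling, stride 2
            }
        )
        size //= 2      # spatial resolution halves
        filters *= 2    # number of filters doubles
    return schedule


if __name__ == "__main__":
    for row in encoder_schedule():
        print(row)
```

With these illustrative values, the bridge receives 16 × 16 feature maps, and the decoder reverses the progression back to the input resolution.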


4. Experiment
To evaluate the performance of the ensemble method, we have performed extensive experiments.
This section describes the dataset, evaluation metrics, training strategy, and implementation
details used in our experimentation. Figure 1 shows the block diagram of the proposed ensemble
method. As explained in Section 3, the input image is fed to the different MultiResUNet models
that produce different segmentation outputs. These predicted outputs from four distinct models
are averaged to get the final mean mask.

Figure 1: Block diagram of the proposed ensemble architecture
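
The averaging step of the ensemble can be sketched in a few lines of plain Python. The function name is hypothetical, and the 0.5 threshold used to binarize the mean mask is an assumption, since the paper does not state how the mean mask is thresholded.

```python
def ensemble_mean_mask(predictions, threshold=0.5):
    """Average per-pixel probabilities from several models and binarize the result.

    `predictions` is a list of 2D lists (one probability map per model).
    The threshold is an assumed value, not taken from the paper.
    """
    num_models = len(predictions)
    height, width = len(predictions[0]), len(predictions[0][0])
    mean_mask = [
        [sum(p[r][c] for p in predictions) / num_models for c in range(width)]
        for r in range(height)
    ]
    binary_mask = [[1 if v >= threshold else 0 for v in row] for row in mean_mask]
    return mean_mask, binary_mask


if __name__ == "__main__":
    # Four toy 2 x 2 probability maps standing in for the four model outputs.
    toy_predictions = [
        [[1.0, 0.0], [0.5, 0.0]],
        [[1.0, 0.0], [0.5, 1.0]],
        [[1.0, 0.0], [1.0, 0.0]],
        [[0.0, 0.0], [1.0, 0.0]],
    ]
    mean_mask, binary_mask = ensemble_mean_mask(toy_predictions)
    print(mean_mask)
    print(binary_mask)
```

Averaging probabilities before thresholding lets a confident majority outvote a single model that misses or over-segments a region.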

4.1. Dataset
The EndoCV2021 dataset [23] consists of both a single-frame dataset and a sequence dataset.
The data were captured at five different institutes, and the data from each center are provided
in a separate folder. The training dataset consists of 1452 single image frames. Additionally,
the dataset contains 165 negative sequence frames and 490 positive sequence frames, for a total
of 655 sequence frames. The sequence frames are taken from videos. Both positive (polyp) and
negative (normal) frames are provided. For each center, the images, masks, images with bounding
boxes, and bounding box information are provided separately. All the images and their
corresponding masks are in JPEG format.

4.2. Evaluation Metrics
The evaluation metric for the detection task is the mean average precision. Additionally, a
mean deviation is calculated. For the segmentation task, the F1-score, mean Intersection over
Union (mIoU), recall, precision, F2-score, and overall accuracy are calculated. The procedures
for the calculation can be found on GitHub², and further details on the generalisation metrics
are provided in [24]. Out-of-sample distributions from multiple centers were compared with each
other to assess the deviation in scores and provide a quantifiable generalisation score [24].
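
For reference, the segmentation metrics listed above can be computed from pixel-wise confusion counts. The sketch below uses the standard textbook definitions for a single binary mask; it is not the official challenge evaluation code, and the function name is hypothetical. The mIoU reported by the challenge would average the IoU over images.

```python
def segmentation_metrics(pred, truth):
    """Compute standard binary segmentation metrics from flattened 0/1 masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of dividing by zero.
        return num / den if den else 0.0

    def f_beta(beta):
        b2 = beta * beta
        return ratio((1 + b2) * tp, (1 + b2) * tp + b2 * fn + fp)

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": f_beta(1),
        "f2": f_beta(2),  # weights recall more heavily than precision
        "iou": ratio(tp, tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


if __name__ == "__main__":
    predicted = [1, 1, 1, 1, 0, 0, 0, 0]
    ground_truth = [1, 1, 0, 0, 1, 0, 0, 0]
    for name, value in segmentation_metrics(predicted, ground_truth).items():
        print(f"{name}: {value:.4f}")
```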

4.3. Training strategy
To train model1 (MultiResUNet1), we used the subsets from center1, center3, and center4.
Similarly, we used center2, center1, and center4 to train model2 (MultiResUNet2). Likewise, we
used center2, center3, and center1 to train model3. To train model4, we used the images from
center2, center3, and center4. Each model is thus trained on three of the first four centers,
leaving out a different one. We use the dataset from center5 as the validation set.

   ² https://github.com/sharibox/EndoCV2021-polyp_det_seg_gen

Figure 2: Qualitative results of the four ensemble MultiResUNet [8] models. The example images show
that the ensemble models produce high-quality segmentation maps for different polyp shapes and sizes.
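
The center assignment described in Section 4.3 can be written down and checked programmatically. This is a minimal sketch with hypothetical names; the center identifiers follow the text, and the script only verifies that the four stated subsets are exactly the four possible three-center combinations of center1 to center4.

```python
from itertools import combinations

TRAIN_CENTERS = ["center1", "center2", "center3", "center4"]
VALIDATION_CENTER = "center5"

# Training subsets as stated in Section 4.3.
MODEL_CENTERS = {
    "model1": {"center1", "center3", "center4"},
    "model2": {"center1", "center2", "center4"},
    "model3": {"center1", "center2", "center3"},
    "model4": {"center2", "center3", "center4"},
}


def held_out_center(model_name):
    """Return the one training center that the given model never sees."""
    (left_out,) = set(TRAIN_CENTERS) - MODEL_CENTERS[model_name]
    return left_out


def all_three_center_subsets():
    """Every way of choosing three of the four training centers."""
    return {frozenset(subset) for subset in combinations(TRAIN_CENTERS, 3)}


if __name__ == "__main__":
    for name in MODEL_CENTERS:
        print(name, "leaves out", held_out_center(name))
    stated = {frozenset(centers) for centers in MODEL_CENTERS.values()}
    print("covers all 3-of-4 combinations:", stated == all_three_center_subsets())
```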

4.4. Implementation Details
We implemented the MultiResUNet using Keras with TensorFlow as the backend. The experiments
were run on the Experimental Infrastructure for Exploration of Exascale Computing (eX3), on an
NVIDIA DGX-2 machine. All four models are trained for 100 epochs using the same set of
hyperparameters. Each model uses an image size of 256 × 256 pixels with a batch size of 8. The
Dice coefficient is used as the loss function with the Adam optimizer. The default learning
rate of 1e−3 is used to train the model. We also use the ReduceLROnPlateau callback to reduce
the learning rate further for better generalization of the model.
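
To make the loss function concrete, the Dice coefficient and the corresponding loss can be sketched in plain Python on flattened masks. The function names and the smoothing constant are assumptions added for illustration (smoothing avoids division by zero on empty masks); the paper does not specify its exact formulation.

```python
def dice_coefficient(pred, truth, smooth=1e-6):
    """Soft Dice coefficient between flattened prediction and ground-truth masks.

    `pred` holds probabilities in [0, 1]; `truth` holds 0/1 labels.
    `smooth` is an assumed stabilizing constant.
    """
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return (2.0 * intersection + smooth) / (total + smooth)


def dice_loss(pred, truth, smooth=1e-6):
    """Loss to minimize: 1 minus the Dice coefficient."""
    return 1.0 - dice_coefficient(pred, truth, smooth)


if __name__ == "__main__":
    truth = [1, 0, 0, 0]
    print(dice_loss([1, 0, 0, 0], truth))  # perfect overlap, loss near 0
    print(dice_loss([1, 1, 0, 0], truth))  # over-segmentation, loss near 1/3
    print(dice_loss([0, 1, 0, 0], truth))  # no overlap, loss near 1
```

Because the Dice coefficient is computed from overlap rather than per-pixel accuracy, the loss stays informative when polyps occupy only a small part of the frame.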


5. Results and Discussion
On the test dataset, we achieved a score of 0.6172 ± 0.0778. Here, 0.6172 is the generalization
score and 0.0778 is the generalization deviation. Figure 2 shows the qualitative results of the
ensemble MultiResUNet model. The first, second, and third columns show the input images, their
corresponding ground truth, and the predictions, respectively. From the qualitative results, we
can see that the model performs well on polyps of different shapes and sizes (i.e., small,
medium, and large-sized polyps).
   However, a detailed dissection of the validation results shows that the models over-segment
when the input images contain bleeding. The model also fails on challenging images such as
those with flat polyps, and it has problems with detection when the input images are covered
with dye. In most of these cases the models over-segment, and sometimes they fail to produce
any segmentation mask at all. A more detailed conclusion can be drawn once the qualitative
results on the test dataset can be visualized.


6. Conclusion
In this paper, we presented an ensemble MultiResUNet-based solution for addressing
generalizability in polyp segmentation. The model can automatically segment polyps. The
experimental results showed that the ensemble model obtained an evaluation score of
0.6172 ± 0.0778. These results open a wide range of research directions for building
generalizable models on new datasets. Moreover, we showed that ensemble models are not always
the best choice for biomedical data science challenges. A deep analysis of the qualitative
results showed that the model performs well on polyps of different shapes and sizes. In the
future, we plan to explore transfer learning from both large natural-image datasets and
biomedical imaging datasets (polyp or similar domain datasets) to improve the results on polyp
segmentation tasks.


Acknowledgment
D. Jha is funded by the PRIVATON project (#263248) and the Autocap project (#282315) from
the Research Council of Norway (RCN). All experiments were performed on the Experimental
Infrastructure for Exploration of Exascale Computing (eX3) system, which is financially
supported by RCN under contract 270053. S. Ali is supported by the National Institute for Health
Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of
the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.


References
 [1] H. Sung, et al., Global cancer statistics 2020: Globocan estimates of incidence and mortality
     worldwide for 36 cancers in 185 countries, CA: a cancer journal for clinicians (2021).
 [2] A. M. Wolf, E. T. Fontham, T. R. Church, C. R. Flowers, C. E. Guerra, S. J. LaMonte, R. Etzioni,
     M. T. McKenna, K. C. Oeffinger, Y.-C. T. Shih, et al., Colorectal cancer screening for average-
     risk adults: 2018 guideline update from the american cancer society, CA: a cancer journal
     for clinicians 68 (2018) 250–281.
 [3] J. Y. Lee, et al., Real-time detection of colon polyps during colonoscopy using deep learning:
     systematic validation with four independent datasets, Scientific reports 10 (2020) 1–9.
 [4] D. Jha, P. H. Smedsrud, D. Johansen, T. de Lange, H. Johansen, P. Halvorsen, M. Riegler,
     A comprehensive study on colorectal polyp segmentation with resunet++, conditional
     random field and test-time augmentation, IEEE journal of biomedical and health informatics
     (2021).
 [5] P. Wang, et al., Development and validation of a deep-learning algorithm for the detection
     of polyps during colonoscopy, Nature biomedical engineering 2 (2018) 741–748.
 [6] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo,
     Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact
     and disease instances in gastrointestinal endoscopy, Medical image analysis (2021) 102002.
 [7] T. Roß, A. Reinke, P. M. Full, M. Wagner, H. Kenngott, M. Apitz, H. Hempe, D. Mindroc-
     Filimon, P. Scholz, T. N. Tran, et al., Comparative validation of multi-instance instrument
     segmentation in endoscopy: Results of the robust-mis 2019 challenge, Medical Image
     Analysis 70 (2021) 101920.
 [8] N. Ibtehaz, M. S. Rahman, Multiresunet: Rethinking the u-net architecture for multimodal
     biomedical image segmentation, Neural Networks 121 (2020) 74–87.
 [9] Y. Guo, J. Bernal, B. J Matuszewski, Polyp segmentation with fully convolutional deep
     neural networks—extended evaluation study, Journal of Imaging 6 (2020) 69.
[10] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image
     segmentation, in: Proc. of International Conference on Medical image computing and
     computer-assisted intervention (MICCAI), 2015, pp. 234–241.
[11] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, H. D.
     Johansen, Resunet++: An advanced architecture for medical image segmentation, in: Proc.
     of International Symposium on Multimedia (ISM), 2019, pp. 225–2255.
[12] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, H. D. Johansen, Doubleu-net: A deep
     convolutional neural network for medical image segmentation, in: Proc. of International
     Symposium on Computer-Based Medical Systems (CBMS), 2020, pp. 558–564.
[13] N. K. Tomar, D. Jha, S. Ali, H. D. Johansen, D. Johansen, M. A. Riegler, P. Halvorsen,
     Ddanet: Dual decoder attention network for automatic polyp segmentation, in: Proc. of
     International Conference on Pattern Recognition (ICPR) Workshop, 2020.
[14] D. Jha, S. Ali, N. K. Tomar, H. D. Johansen, D. Johansen, J. Rittscher, M. A. Riegler,
     P. Halvorsen, Real-time polyp detection, localization and segmentation in colonoscopy
     using deep learning, IEEE Access 9 (2021) 40496–40510.
[15] P. Wang, et al., Real-time automatic detection system increases colonoscopic polyp and
     adenoma detection rates: a prospective randomised controlled study, Gut 68 (2019) 1813–
     1819.
[16] J.-R. Su, et al., Impact of a real-time automatic quality control system on colorectal
     polyp and adenoma detection: a prospective randomized controlled study (with videos),
     Gastrointestinal endoscopy 91 (2020) 415–424.
[17] K. Yasaka, O. Abe, Deep learning and artificial intelligence in radiology: Current applica-
     tions and future directions, PLoS medicine 15 (2018) e1002707.
[18] P. Zhang, J. Li, Y. Wang, J. Pan, Domain adaptation for medical image segmentation: A
     meta-learning method, Journal of Imaging 7 (2021) 31.
[19] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning (2016).
[20] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, H. D.
     Johansen, Kvasir-seg: A segmented polyp dataset, in: Proc. of International Conference
     on Multimedia Modeling (MMM), 2020, pp. 451–462.
[21] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, F. Vilariño, Wm-dova
     maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from
     physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.
[22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
     A. Rabinovich, Going deeper with convolutions, in: Proc. of IEEE conference on computer
     vision and pattern recognition (CVPR), 2015, pp. 1–9.
[23] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, M. A. Riegler, P. Halvorsen, C. Daul,
     J. Rittscher, O. E. Salem, D. Lamarque, T. de Lange, J. E. East, Polypgen: A multi-center
     polyp detection and segmentation dataset for generalisability assessment, arXiv (2021).
[24] S. Ali, F. Zhou, B. Braden, A. Bailey, S. Yang, G. Cheng, P. Zhang, X. Li, M. Kayser, R. D.
     Soberanis-Mukul, S. Albarqouni, X. Wang, C. Wang, S. Watanabe, I. Oksuz, Q. Ning, S. Yang,
     M. A. Khan, X. W. Gao, S. Realdon, M. Loshchenov, J. A. Schnabel, J. E. East, G. Wagnieres,
     V. B. Loschenov, E. Grisan, C. Daul, W. Blondel, J. Rittscher, An objective comparison
     of detection and segmentation algorithms for artefacts in clinical endoscopy, Scientific
     Reports 10 (2020) 2748. doi:10.1038/s41598-020-59413-5.