=Paper= {{Paper |id=Vol-3181/paper25 |storemode=property |title=Predictive Uncertainty Masks from Deep Ensembles in Automated Polyp Segmentation |pdfUrl=https://ceur-ws.org/Vol-3181/paper25.pdf |volume=Vol-3181 |authors=Felicia Ly Jacobsen |dblpUrl=https://dblp.org/rec/conf/mediaeval/Jacobsen21 }} ==Predictive Uncertainty Masks from Deep Ensembles in Automated Polyp Segmentation== https://ceur-ws.org/Vol-3181/paper25.pdf
             Predictive Uncertainty Masks from Deep Ensembles in
                        Automated Polyp Segmentation
                                                                     Felicia Ly Jacobsen1
                                                                       1 SimulaMet, Norway

                                                                      f.l.jacobsen@fys.uio.no

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
This paper presents the submission of team F-HOST for the Medico: Transparency in
Medical Image Segmentation task held at MediaEval 2021. We propose a U-Net-based
ensemble model for solving the automatic polyp segmentation task and interpret the
predictions using the variance across the ensemble members as an uncertainty estimate.
Our predicted segmentation masks achieve a mean Dice score of 45.01% on the test data.
The corresponding uncertainties show a systematic bias towards the training data, which
indicates overfitting.

I    INTRODUCTION
Polyps are abnormal growths inside the lining of the colon or rectum. They can
potentially become malignant, leading to colorectal cancer, and thereby act as a
precursor for cancer. Detecting and removing polyps with colonoscopic polypectomy
during or before further development will allow for more treatment options and an
overall improved prognosis [11].
    Currently, the gold standard for finding and removing polyps is a procedure called
colonoscopy. The outcome of this procedure depends on the skill, experience, and
technique of the endoscopist, and studies show that up to 28% of polyps remain
undetected [8]. Automated semantic segmentation based on deep learning frameworks can
be used as a tool to detect polyps based on images from colonoscopy examinations. Deep
Ensembles can provide an uncertainty estimate for the predicted segmentation, even for
ensembles with five trained models [7]. This method is known for being easy to implement
and scalable to different deep learning (DL) frameworks, and can additionally improve
classification error and robustness in terms of dataset shift. In this paper, the
results based on the challenge test data are presented and discussed, including their
corresponding uncertainty masks estimated from a Deep Ensemble model consisting of five
U-Net networks.

II     APPROACH
In this section, the approach to the Medico task "Transparency in Medical Image
Segmentation" of the MediaEval 2021 challenge is presented. All models were trained
using the PyTorch framework [9] on an Nvidia Tesla V100 32GB General-Purpose Graphics
Processing Unit (GPGPU).

II.1     Datasets
There is a total of 1,362 images in the development dataset [5]. We randomly select 272
for validation and the remaining 1,090 for training. The test data consists of 200
images, provided without the ground truth masks. The dataset is based on the HyperKvasir
dataset [2], but includes additional images and masks.

II.2    Experimental Setup
We used the U-Net architecture as the base model for the Deep Ensemble, with a total of
five U-Nets. The development data was resized to 256 × 256 pixels before training, due
to memory constraints and to reduce training time. The training data was split into
batches of 32 images in order to obtain greater training efficiency as opposed to a
larger batch size of, e.g., 64. Data augmentation was performed on the fly for each
training iteration in order to improve generalization. We used techniques such as
blurring, color jitter, horizontal flip, random 90° rotation, and vertical flip. Instead
of using transposed convolution in the decoder part of the network as proposed in the
original U-Net paper [10], two-dimensional bilinear upsampling is used in order to avoid
potential checkerboard artifacts. All models in the ensemble were trained using an
initial learning rate of 1 · 10^-4, with a learning rate scheduler with a minimum
learning rate of 1 · 10^-7. Each model was trained for a total of 150 training
iterations, using the Adam optimizer [6] and the Dice coefficient loss. After the last
training iteration, the model weights for each model in the Deep Ensemble were saved in
a .pt format. Hyperparameter tuning was done manually by observing the Dice loss on the
validation data as a function of training iterations, and evaluating the Dice
Coefficient (DC), Jaccard Index (JI) and Accuracy.
    When performing prediction with the Deep Ensemble, each individual model is loaded
and predicts on the input image from the test dataset. The element-wise mean of the
outputs of the models in the ensemble is then calculated, passed through a Sigmoid
activation, and thresholded into binary pixel values. The variance provided by the
ensemble is used as an approximation of the uncertainty of each prediction mask. It is
calculated by summing, over the models, the squared difference between each probability
prediction (Sigmoid output) and the mean probability prediction of the ensemble, and
dividing this sum by the number of models in the ensemble, five in this case.
    For subtask 2, "Algorithm Efficiency", the time in seconds taken by the ensemble to
make its overall mean prediction was measured for each of the test images in order to
assess the efficiency of the ensemble. A Docker image was built which, when run, writes
a .csv file containing each image name and its corresponding prediction time in seconds.
The Deep Ensemble is run on the challenge organizers’ hardware, and the organizers
report the frames per second (FPS), which is the average number of masks from the test
dataset the ensemble is able to make per second.
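The timing procedure described for subtask 2 can be sketched as follows. This is a minimal, framework-free illustration and not the code shipped in the Docker image: the function names `time_predictions`, `write_csv` and `frames_per_second`, the column names `image` and `seconds`, and the `predict` callable standing in for the ensemble are all assumptions made for this example.

```python
import csv
import io
import time


def time_predictions(predict, images):
    """Return (image_name, seconds) pairs, timing one prediction per image."""
    rows = []
    for name, image in images:
        start = time.perf_counter()
        predict(image)  # result discarded; only the elapsed time is recorded
        rows.append((name, time.perf_counter() - start))
    return rows


def write_csv(rows):
    """Serialise the timing rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["image", "seconds"])  # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


def frames_per_second(rows):
    """Average number of masks per second: image count over total time."""
    total_seconds = sum(seconds for _, seconds in rows)
    return len(rows) / total_seconds
```

Under this definition, the FPS and the total prediction time are two views of the same measurement: the total time is the number of images divided by the FPS.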

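The ensemble prediction and variance-based uncertainty described in Section II.2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the author's PyTorch implementation: the function names are hypothetical, the threshold of 0.5 is assumed (the text does not state the value), and the "mean probability prediction" is taken to be the mean of the per-model Sigmoid outputs.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw model output to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_predict(logit_maps, threshold=0.5):
    """Combine per-model output maps (lists of rows) into a mask and a variance map."""
    n_models = len(logit_maps)
    height, width = len(logit_maps[0]), len(logit_maps[0][0])
    mask = [[0] * width for _ in range(height)]
    variance = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            logits = [model_map[i][j] for model_map in logit_maps]
            # Prediction: element-wise mean of raw outputs, Sigmoid, then threshold.
            mean_logit = sum(logits) / n_models
            mask[i][j] = 1 if sigmoid(mean_logit) > threshold else 0
            # Uncertainty: sum of squared deviations of the per-model
            # probabilities from their mean, divided by the number of models.
            probs = [sigmoid(z) for z in logits]
            mean_prob = sum(probs) / n_models
            variance[i][j] = sum((p - mean_prob) ** 2 for p in probs) / n_models
    return mask, variance
```

A pixel on which all models agree gets a variance of zero, while a pixel on which the models split gets a positive variance; this is what makes disagreement visible as bright areas in an uncertainty heatmap.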

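The three evaluation metrics named in Section II.2 are not defined in the text. A minimal sketch using the standard definitions for binary masks is given below; the function names and the convention that two empty masks score 1.0 are assumptions of this example, and the official task evaluation may differ in such details.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two binary masks given as lists of rows."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice_coefficient(pred, truth):
    """DC = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denominator = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * tp / denominator if denominator else 1.0


def jaccard_index(pred, truth):
    """JI = TP / (TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denominator = tp + fp + fn
    return tp / denominator if denominator else 1.0


def accuracy(pred, truth):
    """Fraction of pixels on which prediction and ground truth agree."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return (tp + tn) / (tp + fp + fn + tn)
```

Because accuracy also counts background pixels that are correctly left unmarked, it can remain high on images where the polyp itself is segmented poorly, which is worth keeping in mind when it is read alongside the DC and the JI.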
    For subtask 3, "Transparent Machine Learning Systems", all source code is made
publicly available on GitHub (https://github.com/feliciajacobsen/MediaEval2021), which
also includes the uncertainty images for the prediction masks.

III    RESULTS AND ANALYSIS
Table 1 summarizes the results for the Medico subtask 1, including the mean DC, mean JI
and mean Accuracy for the prediction masks on the validation data and the official task
test data. These results show that the Deep Ensemble generalizes poorly to the test
data: the mean DC on the test data is approximately 55% of the mean DC on the validation
data (a relative decrease of about 45%), and the mean JI is approximately 46% of its
validation value (a relative decrease of about 54%). The DC also varies widely between
individual test images: some images obtain a DC as high as 0.8935, whereas others obtain
a DC of 0.0000. Performance may be improved by performing more hyperparameter tuning and
by training the Deep Ensemble on more training examples, including similar datasets such
as the CVC-ClinicDB dataset [4] and the CVC-ColonDB dataset [1]. Additionally,
decreasing the number of training iterations may contribute to a more generalized
ensemble model. Also, as proposed in the original paper [7], adding adversarial training
and increasing the number of models in the ensemble from 5 to 15 may decrease the
prediction error significantly.

Table 1: Results from validation data and test data by the ensemble model of five
U-Nets. The results on predicted test data were provided by the task organizers.

                          Mean Dice   Mean Jaccard Index   Mean Accuracy
      Validation data     0.8226      0.7005               0.9242
      Official test data  0.4501      0.3231               0.8831

    For the efficiency subtask, an FPS of 82.9496 was obtained. This means that
approximately 2.4111 seconds in total were used to generate the masks for the entire
test dataset of 200 images. This result indicates satisfactory model efficiency, but in
return the Deep Ensemble is both memory- and time-consuming to train.
    A set of three randomly chosen images from the test data and their corresponding
prediction masks and uncertainty heatmaps are shown in Figure 1. The brighter areas in
the heatmaps illustrate the pixels where the models in the ensemble disagree the most.
These results show that the models disagree the most along the borders of the detected
polyps. Furthermore, the first two uncertainty heatmaps from the left show the outline
of a rectangle in the bottom left corner. Many of the input images in the HyperKvasir
dataset show green rectangles located in the same area, which contain information that
is important to the medical experts. Thus, it is common to observe several images with
green rectangles in the development dataset. However, note that the input images in
Figure 1 do not contain these green rectangles. These results indicate that the ensemble
expected these rectangles, thus showing a systematic bias towards the training data.
Increasing the number of training examples, as well as performing corrections to
training images where these rectangles appear by, e.g., cropping them out, may boost
model performance.

Figure 1: Examples of the input images from the official test dataset are shown in the
top row. Their corresponding predicted masks are shown in the middle row, and their
uncertainty heatmap representations are in the bottom row. The prediction masks and
uncertainty heatmaps are calculated using the Deep Ensemble of five trained U-Net
networks.

IV     CONCLUSION AND FUTURE WORK
In this paper, we presented a method of obtaining the approximate uncertainty values for
a set of predicted segmentation masks. The uncertainty masks provide an uncertainty
measure of the performance of a U-Net based DL model trained on medical colonoscopy
images of polyps.
    A mean Dice score of 0.4501 was obtained on the test data. Compared to the Dice
score of 0.8226 on the validation data, this indicates that the Deep Ensemble model was
overfitted to the training data, and thus generalizes poorly to unseen data. Increasing
the number of training examples by including similar datasets, decreasing the number of
training iterations, increasing the number of models in the ensemble, as well as
including adversarial training may improve generalization. An average FPS of 82.9496 was
obtained on the test data, but this came at a high computational cost when training the
Deep Ensemble. In future work, we will add the aforementioned proposed extensions, as
well as experiment with and compare to alternative methods such as Masksembles [3] in
order to decrease the computational cost of obtaining an ensemble model.

REFERENCES
 [1] J. Bernal, J. Sánchez, and F. Vilariño. 2012. Towards automatic polyp detection
     with a polyp appearance model. Endoscopy (2012), 3166–3182.
 [2] Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks,
     Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin
     Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, Dag Johansen,
     Carsten Griwodz, Håkon K Stensland, Enrique Garcia-Ceja, Peter T
     Schmidt, Hugo L Hammer, Michael A Riegler, Pål Halvorsen, and
     Thomas de Lange. 2020. HyperKvasir, a comprehensive multi-class
     image and video dataset for gastrointestinal endoscopy. Scientific Data
     7, 1 (2020), 283. https://doi.org/10.1038/s41597-020-00622-y
 [3] Nikita Durasov, Timur M. Bagautdinov, Pierre Baqué, and Pascal Fua.
     2020. Masksembles for Uncertainty Estimation. CoRR abs/2012.08334
     (2020). https://arxiv.org/abs/2012.08334
 [4] G. Fernández-Esparrach, J. Bernal, M. López-Cerón, H. Córdova,
     C. Sánchez-Montes, C. Rodríguez de Miguel, and F. J. Sánchez. 2016.
     Exploring the clinical potential of an automatic colonic polyp detection
     method based on the creation of energy maps. Endoscopy (6 2016), 837–842.
 [5] Steven Hicks, Debesh Jha, Vajira Thambawita, Hugo Hammer, Thomas
     de Lange, Sravanthi Parasa, Michael Riegler, and Pål Halvorsen. 2021.
     Medico Multimedia Task at MediaEval 2021: Transparency in Medical
     Image Segmentation. In Proceedings of MediaEval 2021 CEUR Work-
     shop.
 [6] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic
     Optimization. International Conference on Learning Representations
     (12 2014).
 [7] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell.
     2017. Simple and Scalable Predictive Uncertainty Estimation using
     Deep Ensembles. (12 2017), 6405–6416. arXiv:stat.ML/1612.01474
 [8] Kim NH, Jung YS, Jeong WS, et al. 2017. Miss rate of colorectal
     neoplastic polyps and risk factors for missed polyps in consecutive
     colonoscopies. Intestinal research 15, 3 (6 2017), 411–418.
     https://doi.org/10.5217/ir.2017.15.3.411
 [9] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James
     Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia
     Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward
     Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chil-
     amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
     2019. PyTorch: An Imperative Style, High-Performance Deep Learning
     Library. In Advances in Neural Information Processing Systems 32.
     Curran Associates, Inc., 8024–8035.
     http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net:
     Convolutional Networks for Biomedical Image Segmentation. In Med-
     ical Image Computing and Computer-Assisted Intervention – MICCAI
     2015. Springer International Publishing, 234–241.
[11] Winawer SJ, Zauber AG, Ho MN, et al. 1993. Prevention of colorectal
     cancer by colonoscopic polypectomy. The New England Journal of
     Medicine 329, 27 (12 1993), 1977–1981.
     https://doi.org/10.1056/NEJM199312303292701