=Paper=
{{Paper
|id=Vol-3181/paper25
|storemode=property
|title=Predictive Uncertainty Masks from Deep Ensembles in Automated Polyp Segmentation
|pdfUrl=https://ceur-ws.org/Vol-3181/paper25.pdf
|volume=Vol-3181
|authors=Felicia Ly Jacobsen
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Jacobsen21
}}
==Predictive Uncertainty Masks from Deep Ensembles in Automated Polyp Segmentation==
Felicia Ly Jacobsen, SimulaMet, Norway (f.l.jacobsen@fys.uio.no)

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'21, December 13-15 2021, Online.

===ABSTRACT===
This paper presents the submission of team F-HOST for the Medico: Transparency in Medical Image Segmentation task held at MediaEval 2021. We propose a U-Net-based ensemble model for solving the automatic polyp segmentation task and interpret the predictions using the per-pixel variance of the ensemble as an uncertainty estimate. Our predicted segmentation masks show a mean Dice score of 45.01% on the test data. The corresponding uncertainties show systematic errors towards the training data, which indicates overfitting.

===I INTRODUCTION===
Polyps are abnormal growths inside the lining of the colon or rectum. They can become malignant and lead to colorectal cancer, and thereby act as a precursor for cancer. Detecting and removing polyps with colonoscopic polypectomy during or before further development allows for more treatment options and an overall improved prognosis [11].

Currently, the gold standard for finding and removing polyps is a procedure called colonoscopy. Its outcome depends on the skill, experience, and technique of the endoscopist, and studies show that up to 28% of polyps remain undetected [8]. Automated semantic segmentation based on deep learning frameworks can be used as a tool to detect polyps in images from colonoscopy examinations. Deep Ensembles can provide an uncertainty estimate for the predicted segmentation, even for ensembles with only five trained models [7]. The method is known for being easy to implement and scalable to different deep learning (DL) frameworks, and it can additionally reduce classification error and improve robustness to dataset shift. In this paper, the results on the challenge test data are presented and discussed, including the corresponding uncertainty masks estimated from a Deep Ensemble model consisting of five U-Net networks.

===II APPROACH===
In this section, the approach to the Medico task "Transparency in Medical Image Segmentation" of the MediaEval 2021 challenge is presented. All models were trained using the PyTorch framework [9] on an Nvidia Tesla V100 32GB General-Purpose Graphics Processing Unit (GPGPU).

====II.1 Datasets====
There is a total of 1,362 images in the development dataset [5]. We randomly selected 272 images for validation and used the remaining 1,090 for training. The test data consist of 200 images, provided without ground truth masks. The dataset is based on the HyperKvasir dataset [2], but includes additional images and masks.

====II.2 Experimental Setup====
We used the U-Net architecture as the base model for the Deep Ensemble, with a total of five U-Nets. The development data was resized to 256 × 256 pixels before training, due to memory constraints and to reduce training time. The training data was split into batches of 32 images in order to obtain greater training efficiency as opposed to a larger batch size of, e.g., 64. Data augmentation was performed on the fly for each training iteration in order to improve generalization. We used blurring, color jitter, horizontal flip, random 90° rotation, and vertical flip. Instead of using transposed convolution in the decoder part of the network as proposed in the original U-Net paper [10], two-dimensional bilinear upsampling is used in order to avoid potential checkerboard artifacts. All models in the ensemble were trained using an initial learning rate of 1 × 10⁻⁴, with a learning rate scheduler with a minimum learning rate of 1 × 10⁻⁷. Each model was trained for a total of 150 training iterations, using the Adam optimizer [6] and the Dice coefficient loss. After the last training iteration, the weights of each model in the Deep Ensemble were saved in .pt format. Hyperparameter tuning was done manually by observing the Dice loss on the validation data as a function of training iterations, and by evaluating the Dice Coefficient (DC), Jaccard Index (JI) and Accuracy.

When performing prediction with the Deep Ensemble, each individual model is loaded and predicts on the input image from the test dataset. The element-wise mean of the outputs of all models in the ensemble is calculated, passed through a Sigmoid activation, and thresholded into binary pixel values. The variance across the ensemble is used as an approximation of the uncertainty of each prediction mask. For each pixel, it is calculated by summing the squared differences between each model's probability prediction (Sigmoid output) and the mean probability prediction of the ensemble, and dividing this sum by the number of models in the ensemble, five in this case.

For subtask 2, "Algorithm Efficiency", the time in seconds taken by the ensemble to make its overall mean prediction was measured for each of the test images in order to assess the efficiency of the ensemble. A Docker image was built; running it produces a .csv file containing each image name and its corresponding prediction time in seconds. The Deep Ensemble is run on the challenge organizers' hardware, and the organizers report the frames per second (FPS), which is the average number of masks from the test dataset that the ensemble is able to produce per second.

For subtask 3, "Transparent Machine Learning Systems", all source code is made publicly available on GitHub (https://github.com/feliciajacobsen/MediaEval2021), which also includes the uncertainty images for the prediction masks.

===III RESULTS AND ANALYSIS===
Table 1 summarizes the results for the Medico subtask 1, including the mean DC, mean JI and mean Accuracy for the prediction masks on the validation data and the official task test data. These results show that the Deep Ensemble generalizes poorly to the test data: the mean DC on the test data falls to approximately 55% of its validation value (a relative decrease of about 45%), and the mean JI falls to approximately 46% of its validation value (a relative decrease of about 54%). The DC also varies strongly between individual test images: some reach a DC as high as 0.8935, whereas others get as low as 0.0000. Performance may be improved by performing more hyperparameter tuning and by training the Deep Ensemble on more training examples, including similar datasets such as the CVC-ClinicDB dataset [4] and the CVC-ColonDB dataset [1]. Additionally, decreasing the number of training iterations may contribute to a more generalized ensemble model. Also, as proposed in the original paper [7], adding adversarial training and increasing the number of models in the ensemble from 5 to 15 may decrease the prediction error significantly.

{| class="wikitable"
|+ Table 1: Results from validation data and test data by the ensemble model of five U-Nets. The results on predicted test data were provided by the task organizers.
|-
! !! Mean Dice !! Mean Jaccard Index !! Mean Accuracy
|-
| Validation data || 0.8226 || 0.7005 || 0.9242
|-
| Official test data || 0.4501 || 0.3231 || 0.8831
|}

For the efficiency subtask, an FPS of 82.9496 was obtained. With 200 test images, this corresponds to approximately 2.4111 seconds in total (200 / 82.9496) to generate the masks for the entire test dataset. This result indicates satisfactory model efficiency, but in return the Deep Ensemble is both memory- and time-consuming to train.

A set of three randomly chosen images from the test data and their corresponding prediction masks and uncertainty heatmaps are shown in Figure 1. The brighter areas in the heatmaps mark the pixels where the models in the ensemble disagree the most. These results show that the models disagree the most along the borders of the detected polyps. Furthermore, the first two uncertainty heatmaps from the left show the outline of a rectangle in the bottom left corner. Many of the input images in the HyperKvasir dataset show green rectangles located in the same area, which carry information that is important to the medical experts. Thus, it is common to observe images with green rectangles in the development dataset. However, note that these input images do not contain green rectangles. These results indicate that the ensemble expected these rectangles, thus showing a systematic bias towards the training data. Increasing the number of training examples, as well as correcting training images where these rectangles appear by, e.g., cropping them out, may boost model performance.

Figure 1: Examples of the input images from the official test dataset are shown on the top row. Their corresponding predicted masks are shown on the middle row, and their uncertainty heatmap representations are on the bottom row. The prediction masks and uncertainty heatmaps are calculated using the Deep Ensemble of five trained U-Net networks.

===IV CONCLUSION AND FUTURE WORK===
In this paper, we presented a method of obtaining approximate uncertainty values for a set of predicted segmentation masks. The uncertainty masks provide an uncertainty measure of the performance of a U-Net-based DL model trained on medical colonoscopy images of polyps.

A mean Dice score of 0.4501 was obtained on the test data. Compared to the Dice score of 0.8226 on the validation data, this indicates that the Deep Ensemble model was overfitted to the training data and thus generalizes poorly to unseen data. Increasing the number of training examples by including similar datasets, decreasing the number of training iterations, increasing the number of models in the ensemble, as well as including adversarial training may improve generalization. An average FPS of 82.9496 was obtained on the test data, but this came at a high computational cost when training the Deep Ensemble. In future work, we will add the proposed extensions mentioned above, and experiment with and compare to alternative methods such as Masksembles [3] in order to decrease the computational cost of obtaining an ensemble model.

===REFERENCES===
[1] J. Bernal, J. Sánchez, and F. Vilariño. 2012. Towards automatic polyp detection with a polyp appearance model. Endoscopy (2012), 3166–3182.

[2] Hanna Borgli, Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, Dag Johansen, Carsten Griwodz, Håkon K Stensland, Enrique Garcia-Ceja, Peter T Schmidt, Hugo L Hammer, Michael A Riegler, Pål Halvorsen, and Thomas de Lange. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific Data 7, 1 (2020), 283. https://doi.org/10.1038/s41597-020-00622-y

[3] Nikita Durasov, Timur M. Bagautdinov, Pierre Baqué, and Pascal Fua. 2020. Masksembles for Uncertainty Estimation. CoRR abs/2012.08334 (2020). https://arxiv.org/abs/2012.08334

[4] G. Fernández-Esparrach, J. Bernal, M. López-Cerón, H. Córdova, C. Sánchez-Montes, C. Rodríguez de Miguel, and F. J. Sánchez. 2016. Exploring the clinical potential of an automatic colonic polyp detection method based on the creation of energy maps. Endoscopy (6 2016), 837–842.

[5] Steven Hicks, Debesh Jha, Vajira Thambawita, Hugo Hammer, Thomas de Lange, Sravanthi Parasa, Michael Riegler, and Pål Halvorsen. 2021. Medico Multimedia Task at MediaEval 2021: Transparency in Medical Image Segmentation. In Proceedings of MediaEval 2021 CEUR Workshop.

[6] Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (12 2014).

[7] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. (12 2017), 6405–6416. arXiv:stat.ML/1612.01474

[8] Kim NH, Jung YS, Jeong WS, et al. 2017. Miss rate of colorectal neoplastic polyps and risk factors for missed polyps in consecutive colonoscopies. Intestinal Research 15, 3 (6 2017), 411–418. https://doi.org/10.5217/ir.2017.15.3.411

[9] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, 234–241.

[11] Winawer SJ, Zauber AG, Ho MN, et al. 1993. Prevention of colorectal cancer by colonoscopic polypectomy. The New England Journal of Medicine 329, 27 (12 1993), 1977–1981. https://doi.org/10.1056/NEJM199312303292701
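
As a supplementary illustration of the prediction step described in Section II.2, the following minimal sketch shows how a binary mask and a per-pixel uncertainty map could be derived from the raw outputs of the ensemble members. It is not the code from the GitHub repository: the function names `sigmoid` and `ensemble_predict`, the nested-list image representation, and the threshold value of 0.5 are assumptions made for this example (the paper does not state the threshold), and it uses only the Python standard library rather than PyTorch.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]  # 2-D grid of per-pixel values (hypothetical representation)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ensemble_predict(
    outputs: List[Grid], threshold: float = 0.5
) -> Tuple[List[List[int]], Grid]:
    """Return (binary mask, per-pixel variance) for one image.

    outputs: one raw (pre-Sigmoid) output map per ensemble member,
             all with the same height and width.
    threshold: assumed cut-off for binarization (not given in the paper).
    """
    n_models = len(outputs)
    height, width = len(outputs[0]), len(outputs[0][0])
    mask = [[0] * width for _ in range(height)]
    variance = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            raw = [member[r][c] for member in outputs]
            # Mask: element-wise mean of the raw outputs, Sigmoid, threshold.
            mean_raw = sum(raw) / n_models
            mask[r][c] = 1 if sigmoid(mean_raw) >= threshold else 0
            # Uncertainty: sum of squared deviations of each member's
            # probability from the ensemble mean, divided by the model count.
            probs = [sigmoid(z) for z in raw]
            mean_prob = sum(probs) / n_models
            variance[r][c] = sum((p - mean_prob) ** 2 for p in probs) / n_models
    return mask, variance


if __name__ == "__main__":
    # Five made-up 1x3 output maps: the members agree on the first two
    # pixels and split three against two on the third.
    demo = [[[4.0, -4.0, z]] for z in (4.0, 4.0, 4.0, -4.0, -4.0)]
    demo_mask, demo_variance = ensemble_predict(demo)
    print(demo_mask)
    print(demo_variance)
```

In the demo, the variance is essentially zero for the pixels on which all five members agree and largest for the pixel on which they split. This matches the observation in Section III that the brighter heatmap areas mark where the models disagree the most.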