=Paper=
{{Paper
|id=Vol-3148/paper3
|storemode=property
|title=Heterogeneous Model Ensemble For Automatic Polyp Segmentation In Endoscopic Video Sequences
|pdfUrl=https://ceur-ws.org/Vol-3148/paper3.pdf
|volume=Vol-3148
|authors=Thuy Nuong Tran,Fabian Isensee,Lars Krämer,Amine Yamlahi,Tim Adler,Patrick Godau,Minu Tizabi,Lena Maier-Hein
|dblpUrl=https://dblp.org/rec/conf/isbi/TranImYAGTM22
}}
==Heterogeneous Model Ensemble For Automatic Polyp Segmentation In Endoscopic Video Sequences==
Thuy Nuong Tran¹, Fabian Isensee²,³, Lars Krämer²,³, Amine Yamlahi¹, Tim Adler¹, Patrick Godau¹, Minu Tizabi¹ and Lena Maier-Hein¹

¹ Div. Intelligent Medical Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany
² Div. Medical Image Computing, DKFZ, Heidelberg, Germany
³ Applied Computer Vision Lab, Helmholtz Imaging

Contact: t.tran@dkfz-heidelberg.de (T. N. Tran)

In: 4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, ITC Royal Bengal, Kolkata, India. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The detection and segmentation of polyps during colonoscopy can substantially contribute to the prevention of colon cancer, and assisting clinicians with automated systems can mitigate the risk of human error. In this work, we present our polyp segmentation approach, submitted to the EndoCV2022 challenge. Common polyp segmentation methods are based on single-model, single-frame predictions. This work instead presents a symbiosis of three separate models, each with its own strengths, combined in a segmentation pipeline with a post-processing step designed to leverage unique predictions for more temporally coherent results.

Keywords: Polyp segmentation, Temporal coherence, High resolution, Heterogeneous ensemble

1. Introduction

Colorectal cancer is one of the most common cancer types, ranking second in females and third in males [1]. By detecting and subsequently resecting polyps during colonoscopy screenings, the risk of developing the disease can be reduced significantly. With the advance of machine learning in the medical domain, deep learning-based methods have the potential to assist in detecting and segmenting these polyps with high accuracy. The EndoCV2022 challenge [2] addresses the generalizability of such deep learning models for segmentation in endoscopic video sequences. The method presented in this paper tackles this issue with three primary design decisions: (1) the provided challenge dataset underwent a curation process that ensures annotation quality; (2) an ensemble of three networks with complementary strengths was trained for the segmentation prediction; (3) finally, a post-processing step was implemented to address false-negative frames caused by the majority vote, with a fallback mechanism that reweights the predictions of a single model in order to admit unique predictions.

2. Datasets

The dataset provided by the EndoCV2022 polyp segmentation sub-challenge [2, 3, 4] consists of 46 sequences of varied length, totalling 3,290 image frames and their corresponding polyp segmentation masks. Furthermore, three public polyp segmentation datasets were added as external data, namely CVC-ColonDB [5], CVC-ClinicDB [6] and ETIS-Larib [7], to enrich the diversity of the dataset. These account for 1,108 additional training images, resulting in 4,398 frames in total.

3. Methodology

Our challenge strategy rests on three main pillars: (1) a data pre-processing step to ensure high annotation quality, (2) a network architecture selection and training step, which yields the segmentation models, and (3) a post-processing step, which leverages model heterogeneity and uses the structural similarity [8] of consecutive frames to handle false-negative masks. An overview is depicted in Fig. 1.

Figure 1: Overview of the heterogeneous model ensemble pipeline. Data is curated. Predictions of the Efficient-UNet [9] ensemble, nnU-Net [10] and Hierarchical Multi-Scale Attention Network [11] are combined. Post-processing yields the final prediction.

3.1. Data pre-processing

Correct data annotation of the training set is crucial to the learning capabilities of any segmentation model. To ensure annotation quality, the provided challenge dataset was curated by manually removing images with implausible or temporally inconsistent annotations, to the best of our judgement. An example is shown in Fig. 2. This was done under the assumption that false annotations harm the training process more than additional training frames help it. The external datasets described in Section 2 underwent the same selection process. Including external data, the resulting training dataset amounted to 4,106 image-mask pairs.

Figure 2: Example of inconsistent annotation. The upper row depicts three consecutive frames of the provided seq23_endocv22 sequence. The lower row shows the segmentation masks. The image at position t has fewer polyps annotated than the neighboring frames, despite the polyps not being obstructed or out of sight.

3.2. Neural network architectures

To solve the polyp segmentation task, a model ensemble was designed whose parts have complementary strengths. This was realized by using an nnU-Net [10], which automatically adapts its pre-processing and training framework to different datasets and thus serves as a strong segmentation base; a Hierarchical Multi-Scale Attention Network [11], which combines predictions at multiple scales for better prediction performance; and an ensemble of Efficient-UNets [9], one of which is equipped with an internal GRU layer to process temporal information. By incorporating temporal as well as high-resolution information, we expected more knowledge to be leveraged from the provided high-resolution video sequences.

3.2.1. nnU-Net

The nnU-Net automatically determines the key design decisions needed to set up the segmentation pipeline for training, irrespective of the dataset. While it has ranked first in many 3D segmentation challenges (e.g., medicaldecathlon.com, https://kits19.grand-challenge.org, https://www.med.upenn.edu/cbica/brats2020), its self-configuring strategy can also be applied to 2D images. The nnU-Net was expected to provide a solid base prediction.

3.2.2. Hierarchical Multi-Scale Network

By treating polyp segmentation as a classic computer vision task, it is possible to use established segmentation models that perform well on complex natural images. The Hierarchical Multi-Scale Attention Network (HM-ANet) was chosen as it is a state-of-the-art architecture in semantic segmentation on Cityscapes (www.cityscapes-dataset.com). The HM-ANet operates on higher resolutions and combines predictions from different scales. This was expected to result in precise polyp segmentation, irrespective of the size of the polyp.

3.3. Efficient-UNet Ensemble

Most current segmentation models operate on a frame-by-frame basis. One approach to capturing temporal information is to add a recurrent neural network layer, such as a Gated Recurrent Unit (GRU) layer, to a standard segmentation model. The chosen base segmentation model is the Efficient-UNet (Eff-UNet), an encoder-decoder architecture with an EfficientNet backbone, which scales with model size and outperforms other ConvNet backbones. One GRU layer was added to the bottleneck of the Eff-UNet to form an Eff-GRUNet. Consecutive images are loaded in batches of size two. They are encoded, pooled, flattened, sequentially fed into the GRU layer, and then reshaped and fed to the decoder; a minimal sketch of this flow is given below. The Eff-UNet is trained separately from the Eff-GRUNet. Variants of both combined form the Eff-UNet ensemble.
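The paper does not publish code for this block; the following PyTorch sketch only illustrates the described flow (encode, pool, flatten, GRU over time, reshape, decode). The class name, channel count and pooled size are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GRUBottleneck(nn.Module):
    """Illustrative sketch of the temporal bottleneck from Sec. 3.3:
    encoder features of consecutive frames are pooled, flattened, passed
    through a GRU over the time axis, then reshaped for the decoder.
    Channel and pooling sizes are assumptions, not taken from the paper."""

    def __init__(self, channels: int = 128, pooled: int = 4):
        super().__init__()
        self.channels, self.pooled = channels, pooled
        feat_dim = channels * pooled * pooled
        self.pool = nn.AdaptiveAvgPool2d(pooled)   # pool encoder features
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W) -- T consecutive frames (the paper uses T = 2)
        t = x.shape[0]
        z = self.pool(x).flatten(1)                # (T, C * pooled * pooled)
        z, _ = self.gru(z.unsqueeze(0))            # sequence over the T frames
        return z.squeeze(0).view(t, self.channels, self.pooled, self.pooled)

# Example: encoder features of two consecutive frames
features = torch.randn(2, 128, 16, 16)
out = GRUBottleneck()(features)                    # (2, 128, 4, 4), to decoder
```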
3.3.1. Combining networks and weighting

Since the HM-ANet operates on high resolutions, it was expected to perform well on very small polyps as well as to fully capture larger polyps in their entirety. During ensembling, the HM-ANet was therefore designed to be weighted higher for small and large polyps. Since there is no standardized definition of polyp sizes, the thresholds were set empirically by observing reference labels of public polyp datasets [5, 6, 7]. An example is shown in Fig. 3.

Figure 3: Example of differently sized polyp images with their segmentation masks from the provided challenge set.

3.4. Post-processing by reweighting

To mitigate false-negative predictions, a post-processing step is added that considers empty segmentation masks and their surrounding frames. If a neighboring frame is polyp-positive and similar to the current frame, any non-empty single-model prediction for the current frame is reweighted, effectively allowing a polyp-positive prediction despite the lack of a majority; see the sketch below. The similarity measure used for this approach is the structural similarity index (SSIM), as it takes texture into account.
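As a concrete illustration of this fallback, here is a minimal NumPy/scikit-image sketch. The function and variable names are ours, not from the authors' code; it assumes grayscale frames normalized to [0, 1] and uses the SSIM threshold of 0.9 reported in Section 4.1.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def reweight_false_negatives(frames, ensemble_masks, model_masks, sim_thresh=0.9):
    """Sketch of the Sec. 3.4 fallback. frames: 2D float arrays in [0, 1];
    ensemble_masks: binary masks from the weighted ensemble; model_masks:
    per-model lists of binary masks with the same frame indexing."""
    out = [m.copy() for m in ensemble_masks]
    for t in range(len(frames)):
        if ensemble_masks[t].any():
            continue                                 # only empty ensemble masks
        # a unique, non-empty single-model prediction to fall back on
        unique = next((masks[t] for masks in model_masks if masks[t].any()), None)
        if unique is None:
            continue
        for n in (t - 1, t + 1):                     # check neighbouring frames
            if 0 <= n < len(frames) and ensemble_masks[n].any() and \
                    ssim(frames[t], frames[n], data_range=1.0) > sim_thresh:
                # raising that model's weight to 0.5 lets its positives meet
                # the 0.5 ensemble threshold, so its mask becomes the output
                out[t] = unique.copy()
                break
    return out
```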
4. Experiments and Results

The original training dataset was split into four parts using a group k-fold scheme for 4-fold cross-validation (CV) training, balancing the number of frames and sequence IDs; a sketch of the split follows below. Each fold contains 11-12 sequences with around 750 frames. The following subsections describe the implementation details and the experimental results after hyperparameter optimization.
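The paper does not name the splitting implementation; a minimal sketch using scikit-learn's GroupKFold (our assumption) shows how grouping by sequence ID keeps all frames of a video inside a single fold.

```python
from sklearn.model_selection import GroupKFold

def make_folds(frame_ids, sequence_ids, n_splits=4):
    """Group-aware k-fold split as in Sec. 4 (sketch; the paper's exact
    splitting code is not published). All frames sharing a sequence ID
    land in the same fold, which roughly balances fold sizes."""
    gkf = GroupKFold(n_splits=n_splits)
    return list(gkf.split(frame_ids, groups=sequence_ids))

# Example: four frames from two sequences, split into two folds
folds = make_folds(["f0", "f1", "f2", "f3"],
                   ["seq1", "seq1", "seq2", "seq2"], n_splits=2)
```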
4.1. Implementation details

The nnU-Net was used as a framework, and manual changes were made to its automatically generated configuration. The short edge of each image was resized to 512 px, with the other edge resized according to the aspect ratio. The patch size was set to 448 x 448. The data was then heavily augmented with operations such as rotation, intensity and gamma augmentation, scaling, mirroring and blurring.

For the HM-ANet, the data was normalized, and random scaling between [0.5, 1], random cropping to 512x1024, RGB shift, and random vertical and horizontal flipping were performed. The model was initialized with weights pre-trained on PaddleClas (paddleclas.readthedocs.io/en/latest/index.html). The training was conducted in three phases: 1) training the model on the original challenge data, 2) fine-tuning the model on challenge and external data, and 3) fine-tuning again on challenge data only.

For the Eff-UNet ensemble, the data was resized to 480x480 (Eff-UNet_480) and 256x256 (Eff-UNet_256), incorporating different resolutions. Resizing to 256x256 was chosen for the Eff-GRUNet to fit memory restrictions. Augmentations such as rotation, elastic deformation and grid deformation were used.

To combine the predictions, the segmented polyps were divided into small (≤ 0.4% of image size), large (≥ 9% of image size) and medium (the rest) polyps. If polyps were predicted as small or large, the weight of the HM-ANet was increased to 0.5, while the others were decreased to 0.25 each. If the polyp was of medium size, the models were weighted equally at 0.33. The final segmentation was formed by thresholding the weighted predictions at 0.5; the scheme is sketched below. To address false negatives resulting from an unmet majority criterion, unique single-model predictions were encouraged if neighboring images were structurally similar (SSIM > 0.9) and predicted to be polyp-positive. The single-model prediction weight was then increased to 0.5. This resolved some false-negative cases, as illustrated in Fig. 4.
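A compact sketch of this size-dependent weighting follows. The paper does not specify exactly how the polyp size is estimated before weighting; taking it from the equally weighted mean prediction is our assumption, and the function name is illustrative.

```python
import numpy as np

def size_weighted_ensemble(prob_hma, prob_nnunet, prob_effens, thresh=0.5):
    """Sketch of the weighting in Sec. 4.1: small (<= 0.4% of the image)
    and large (>= 9%) polyps upweight the HM-ANet to 0.5 (others 0.25
    each); medium polyps use equal weights of 0.33."""
    mean_prob = (prob_hma + prob_nnunet + prob_effens) / 3.0
    polyp_ratio = (mean_prob >= thresh).mean()       # predicted polyp area

    if polyp_ratio <= 0.004 or polyp_ratio >= 0.09:  # small or large polyp
        w_hma, w_nn, w_eff = 0.5, 0.25, 0.25
    else:                                            # medium polyp
        w_hma = w_nn = w_eff = 0.33

    fused = w_hma * prob_hma + w_nn * prob_nnunet + w_eff * prob_effens
    return (fused >= thresh).astype(np.uint8)        # final binary mask
```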
Figure 4: Example of post-processing reweighting. The ensemble prediction at time step t is empty. Because SSIM > 0.9 and the prediction is non-empty for at least one of the neighboring images, the single-model prediction is weighted with 0.5.

4.2. Single model experiment results

All final single-model DSC scores are reported in Table 1. The nnU-Net was trained with external data added to the training set, resulting in a mean CV DSC score of 0.67. Training only on the challenge set or only on the external dataset resulted in worse DSC scores of 0.57 and 0.55, respectively.

The HM-ANet had a mean CV DSC score across all folds of 0.70. During training and inference, predictions at scales [0.5, 1] were combined. Experiments with scales of [0.5, 1, 2] resulted in a worse performance of 0.69, with more false positives in empty images. Training in three phases as described in Subsection 4.1 yielded the best result. Other training strategies, such as training on a combined dataset or pre-training on the external dataset and fine-tuning on the official dataset, resulted in worse performance. 4-fold cross-validation was used to determine the stopping epochs for all three phases. A final inference model was then trained on the entire dataset.

The three Eff-UNet models were each trained on the combined dataset over four folds, resulting in 12 models. The mean CV DSC scores of the Eff-UNet_480, Eff-UNet_256 and Eff-GRUNet were 0.69, 0.71 and 0.62, respectively. As an alternative experiment, the Eff-UNet_480 was trained with external data for pre-training and challenge data for fine-tuning; this performed worse than training on the combined dataset, resulting in a mean CV DSC score of 0.65. To decrease inference time, two Eff-UNet_480, one Eff-UNet_256 and one Eff-GRUNet were selected for the ensemble, based on validation score and fold representation. The final prediction was determined by majority vote. The mean CV DSC score of the final ensemble was 0.70.

Table 1: Cross-validation DSC scores of all models, including the components of the Eff-UNet ensemble. (In the original paper, underlined values indicate selection for the Eff-UNet ensemble and bold values indicate components of the final heterogeneous ensemble.)

| Model        | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Mean |
|--------------|--------|--------|--------|--------|------|
| nnU-Net      | 0.65   | 0.84   | 0.70   | 0.50   | 0.67 |
| HM-ANet      | 0.67   | 0.82   | 0.69   | 0.60   | 0.70 |
| Eff-UNet_480 | 0.67   | 0.80   | 0.69   | 0.62   | 0.69 |
| Eff-UNet_256 | 0.68   | 0.80   | 0.71   | 0.65   | 0.71 |
| Eff-GRUNet   | 0.61   | 0.72   | 0.58   | 0.60   | 0.62 |
| Eff-UNet Ens | 0.67   | 0.80   | 0.71   | 0.60   | 0.70 |
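For reference, the Dice similarity coefficient (DSC) used for all scores above can be computed on binary masks as follows. This is a standard implementation; the challenge's exact handling of frames where both masks are empty is not stated in the paper, so that convention is an assumption.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """DSC = 2 * |pred ∩ gt| / (|pred| + |gt|) on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0       # both masks empty: counted as a perfect match here
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```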
4.3. Reweighting and ensembling results

To test the reweighting strategy of the HM-ANet, the proportion of small and large polyps was calculated for the validation splits. For folds 0-3, the ratios were 45%, 28%, 37% and 65%, respectively. Among the single models, fold 1 had the most medium polyps and the highest average CV score, while fold 3 had the most non-medium polyps and the lowest average CV score. However, the differences in DSC scores between models are small. Since the ratio was highest for fold 3, an experiment was conducted in which the three single models were validated on only the small- and large-polyp images of fold 3 (n = 483 out of 738 frames). The resulting DSC scores were 0.63, 0.66 and 0.70, with the HM-ANet performing best. The simple ensemble achieved a score of 0.73, and the ensemble with reweighting of the HM-ANet a score of 0.74. Adding post-processing neither decreased nor increased the score for this validation set.

5. Conclusion

Our investigation showed that the HM-ANet is favorable for small and large polyp cases, which our dedicated weighting strategy takes into account during ensembling. Notably, on a dataset with small and large polyps, the ensemble achieves a DSC score of 0.74, improving on the best-performing single model, the HM-ANet, by 0.04. The post-processing leverages self-adaptive training as well as temporal and high-resolution information by enabling unique predictions of all three heterogeneous components, resulting in fewer false-negative predictions. The resulting inference speed, determined by the slowest component (nnU-Net) plus the ensembling step, is 0.71 fps.

6. Compliance with ethical standards

This work was conducted using public datasets of human subject data made available by [2, 3, 4, 5, 6, 7].

7. Acknowledgments

This project was supported by a Twinning Grant of the German Cancer Research Center (DKFZ) and the Robert Bosch Center for Tumor Diseases (RBCT). Part of this work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.

References

[1] F. A. Haggar, R. P. Boushey, Colorectal cancer epidemiology: incidence, mortality, survival, and risk factors, Clinics in Colon and Rectal Surgery 22 (2009) 191–197.
[2] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[3] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[4] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[5] J. Bernal et al., Towards automatic polyp detection with a polyp appearance model, Pattern Recognition 45 (2012) 3166–3182.
[6] J. Bernal et al., WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.
[7] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer, International Journal of Computer Assisted Radiology and Surgery 9 (2014) 283–293.
[8] Z. Wang et al., Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (2004) 600–612.
[9] B. Baheti et al., Eff-UNet: A novel architecture for semantic segmentation in unstructured environment, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 358–359.
[10] F. Isensee et al., nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2021) 203–211.
[11] A. Tao et al., Hierarchical multi-scale attention for semantic segmentation, arXiv preprint arXiv:2005.10821 (2020).