Heterogeneous model ensemble for polyp detection and tracking in colonoscopy

Amine Yamlahi1, Patrick Godau1, Thuy Nuong Tran1, Lucas-Raphael Müller1, Tim Adler1, Minu Dietlinde Tizabi1, Michael Baumgartner2, Paul Jäger3 and Lena Maier-Hein1

1 Div. Intelligent Medical Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany
2 Div. Medical Image Computing, DKFZ, Heidelberg, Germany
3 Interactive Machine Learning Group, DKFZ, Heidelberg, Germany

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), held in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, IC Royal Bengal, Kolkata, India.
Contact: m.elyamlahi@dkfz-heidelberg.de (A. Yamlahi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Regular colonoscopy screening substantially contributes to the prevention of colon cancer, as a polyp found in early stages can safely be removed. Assisting physicians during screening with automated detection systems can potentially increase the sensitivity of polyp detection. In this work, we present our polyp detection and tracking approach, submitted to the EndoCV2022 challenge. The core of our method is a heterogeneous ensemble of YOLOv5 models, each trained with a different strategy based on external data and varying data augmentation concepts. The outputs of the ensemble members are merged with the weighted boxes fusion algorithm, and the final output bounding boxes are reduced in size. Our method yields a mean Average Precision (mAP) of 0.44 on our validation set.

Keywords: Polyp detection, model ensembling, image augmentation

1. Introduction

Colorectal cancer is one of the most common cancer types, ranking second in females and third in males [1]. By detecting and subsequently resecting polyps during colonoscopy screenings, the risk of developing the disease can be reduced significantly. With the advance of machine learning in the medical domain, deep learning-based methods have the potential to assist in detecting these polyps with high accuracy. Generalizability across diverse and heterogeneous populations, devices and hospitals is a major issue that needs to be addressed to allow for realistic clinical translation. The method presented in this paper tackles this issue by ensembling heterogeneous, complementary training strategies (see Figure 1). The remainder of this paper is structured as follows: Sec. 2 first introduces the data we use and then describes all steps of training and post-processing the outputs of the models in the ensemble. Cross-validation results, including ablations, are reported in sec. 3, followed by a brief discussion in sec. 4.

2. Methods

Our strategy for algorithm design comprised the following steps:

1. Data preparation: Identification and curation (sec. 2.1) as well as splitting (sec. 2.2) of relevant datasets.
2. Ensemble training: Development of a heterogeneous model ensemble for per-frame polyp detection (sec. 2.3).
3. Tracking: Development of a strategy for leveraging the temporal information in endoscopic video sequences (sec. 2.4).
4. Post-processing: Development of a post-processing step to avoid systematic over-segmentation (sec. 2.5).

2.1. Datasets

The dataset provided by the EndoCV2022 polyp segmentation sub-challenge [2, 3, 4] consists of 46 sequences of varied length, totalling 3290 image frames and their corresponding polyp segmentation masks. Furthermore, we identified four public polyp datasets, namely CVC-ColonDB [5] (segmentation), CVC-ClinicDB [6] (segmentation), ETIS-Larib [7] (segmentation) and CVC-ClinicVideoDB [8, 9] (detection). We converted the segmentation datasets to detection datasets by computing the tightest possible bounding box for the provided segmentation masks.
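The mask-to-box conversion is straightforward; the sketch below illustrates one way to implement it with NumPy (and SciPy for frames containing several polyps). The function names and the per-instance variant are ours for illustration and are not taken from the challenge code.

```python
import numpy as np
from scipy import ndimage  # only needed for the multi-polyp variant


def mask_to_bbox(mask: np.ndarray):
    """Tightest (x_min, y_min, x_max, y_max) box around all foreground pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # frame without a polyp: no box
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def mask_to_bboxes(mask: np.ndarray):
    """One tight box per connected component, for frames that may contain several polyps."""
    labeled, _ = ndimage.label(mask > 0)
    return [
        (xs.start, ys.start, xs.stop - 1, ys.stop - 1)
        for ys, xs in ndimage.find_objects(labeled)
    ]
```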
Figure 1: Method overview: A heterogeneous model ensemble comprises three YOLOv5 models, each trained with a different strategy based on external data and data augmentation. The outputs of the ensemble members are merged with the weighted boxes fusion algorithm and passed on to a Norfair-based tracking algorithm. The final output bounding boxes are reduced in size.

2.2. Validation strategy

We split the 46 EndoCV2022 sequences into four folds using the GroupKFold algorithm from the sklearn library [10]. The split was based on the sequence ID in order to prevent leakage, and we stratified based on the sequence length to obtain a balanced number of frames per fold. We used the validation performance on the left-out fold for selecting our model checkpoints in the ensemble. For faster training and inference, we used two out of the four folds for both training and validation. As validation metric we used the mean Average Precision (mAP) over the Intersection over Union (IoU) threshold range between 0.5 and 0.95 (mAP@[.5:.95]), as proposed by the organizers of the EndoCV2022 challenge.
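As an illustration of this splitting strategy, the sketch below sets up a grouped four-fold split over frame-level data with scikit-learn's GroupKFold. Variable names are illustrative; GroupKFold keeps all frames of a sequence in one fold and balances fold sizes by sample count, but the sketch does not reproduce the exact sequence-length stratification described above.

```python
from sklearn.model_selection import GroupKFold


def make_folds(frames, sequence_ids, n_splits=4):
    """Split frame-level data into folds while keeping each sequence in a single fold.

    frames:        list of frame file paths (one entry per image)
    sequence_ids:  the sequence each frame belongs to (same length as frames)
    """
    gkf = GroupKFold(n_splits=n_splits)
    # GroupKFold assigns whole sequences to folds, so no sequence leaks across
    # the train/validation boundary, and fold sizes (frame counts) stay balanced.
    return [(train_idx, val_idx) for train_idx, val_idx in gkf.split(frames, groups=sequence_ids)]
```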
2.3. Heterogeneous model ensemble

We based our method on YOLOv5x6 and YOLOv5l6 [11] as detection models, as we identified them as a good compromise between accuracy and speed. To build our heterogeneous model ensemble, we tested different augmentation strategies aimed at improving model generalization. We group our trained models into three categories, based on model architecture and training data. Each category comprises models trained on two of the folds.

1. Model M_H-AUGMENT: YOLOv5x6 trained with images of size 768x768 and heavy image augmentations. The augmentations applied on the first fold comprise mosaic and mixup augmentations with a probability of 1.0 and 0.5, respectively, Hue-Saturation-Value (HSV) channel enhancements with a maximal magnitude of 0.2 each, horizontal flip, vertical flip and Copy-Paste augmentation with a probability of 0.5 each, as well as a final rotation of up to 25 degrees. We will refer to this combination of augmentations as the "default augmentation pipeline". The augmentations on the second fold are almost identical, setting the HSV enhancements to more deliberate magnitudes of 0.015, 0.7 and 0.4. In addition, the Copy-Paste augmentation was omitted.

2. Model M_L-AUGMENT: YOLOv5l6 trained with images of size 768x768 and light image augmentations. On the first fold, we drastically reduced the default augmentation pipeline, omitting mixup, vertical flipping, rotation and the Copy-Paste transform. Furthermore, we again used the deliberate HSV magnitudes. The augmentations on the second fold are closer to the default augmentation pipeline; the single difference is that the mosaic probability is drastically reduced from 1.0 to 0.2. We aimed to bring diversity to the ensemble by including models trained with both light and heavy augmentations.

3. Model M_E-DATA: YOLOv5l6 trained with the resized external data described in sec. 2.1. The first fold was trained with images of size 768x768, the second fold with images of size 512x512. With the enriched training data, comprising 13,251 additional frames from external data sources, this model specifically targeted generalizability to new settings.

All models were initialized with the standard weights pretrained on the COCO dataset [12] and trained for 20 epochs; in cases of slow convergence, the training period was extended to up to 40 epochs. We used a Stochastic Gradient Descent optimizer with momentum set to 0.937, a learning rate of 0.01 and the Complete Intersection over Union (CIoU) loss [13] as the loss function. We saved the weights of the epoch with the best mAP score on the validation data of the current fold. The predicted bounding boxes of each model were post-processed using the Non-Maximum Suppression (NMS) algorithm with an IoU threshold of 0.5 to pick one bounding box out of many overlapping candidates. To ensemble the bounding box predictions of multiple models, we used the weighted boxes fusion (WBF) algorithm [14] with an IoU threshold of 0.5 and a skip-box threshold of 0.02. All models were weighted equally.
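To make the fusion step concrete, the sketch below merges per-model predictions with the WBF implementation released alongside [14] (the ensemble-boxes package). Whether the authors used this exact package is an assumption, and the helper function and everything beyond the IoU and skip-box thresholds are illustrative.

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes


def fuse_predictions(boxes_per_model, scores_per_model, labels_per_model, img_w, img_h):
    """Fuse per-model [x1, y1, x2, y2] pixel-coordinate boxes with WBF, equal model weights."""
    # weighted_boxes_fusion expects coordinates normalised to [0, 1]
    norm = np.array([img_w, img_h, img_w, img_h], dtype=float)
    boxes_list = [np.asarray(b, dtype=float).reshape(-1, 4) / norm for b in boxes_per_model]
    boxes, scores, labels = weighted_boxes_fusion(
        boxes_list,
        scores_per_model,
        labels_per_model,
        weights=None,       # all models weighted equally
        iou_thr=0.5,        # IoU threshold reported in the paper
        skip_box_thr=0.02,  # skip-box threshold reported in the paper
    )
    return boxes * norm, scores, labels
```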
2.4. Tracking

In order to leverage the temporal information in the video sequences, we added a second-stage tracker on top of the detection model to track the bounding boxes. We used Norfair [15], a multiple-object tracker, to track the polyps by calculating the Euclidean distance between an already tracked polyp and the prediction provided by the detection model. The tracker only considers bounding boxes within a set distance threshold of each other. On a 1080x1920 image, we experimented with distance thresholds in the range of 50-250 px, minimum hit inertia values in the range of 3-30, maximum hit inertia values in the range of 6-50, and initialization delay values in the range of 1-20. The best results were obtained with a distance threshold of 50 px, a minimum hit inertia of 10, a maximum hit inertia of 25 and an initialization delay of 10.

2.5. Post-processing

While the reference bounding boxes are generated from the segmentation masks and therefore fit tightly around the polyp, the predictions of object detection models tend to cover more area than the reference labels, which results in the inclusion of false-positive pixels inside the bounding box. To counter this systematic over-segmentation, we shrink all bounding boxes with a confidence score higher than 0.4 by 2% of their size.
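As a sketch of the tracking stage in sec. 2.4, the snippet below configures a Norfair tracker with the best-performing parameters reported above. The argument names follow our reading of the Norfair 0.4 interface cited in [15], and representing each box by its centre point is our simplification, not necessarily the authors' exact setup.

```python
import numpy as np
from norfair import Detection, Tracker


def to_norfair_detections(boxes, scores):
    """Represent each predicted box by its centre point, since Norfair tracks point sets."""
    detections = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        centre = np.array([[(x1 + x2) / 2.0, (y1 + y2) / 2.0]])
        detections.append(Detection(points=centre, scores=np.array([score])))
    return detections


def centre_distance(detection, tracked_object):
    """Euclidean distance between a new detection and the tracker's current estimate."""
    return float(np.linalg.norm(detection.points - tracked_object.estimate))


# Parameter values are taken from sec. 2.4 (50 px threshold on a 1080x1920 frame).
tracker = Tracker(
    distance_function=centre_distance,
    distance_threshold=50,
    hit_inertia_min=10,
    hit_inertia_max=25,
    initialization_delay=10,
)

# Per frame: convert the fused boxes to detections and update the tracker, e.g.
# tracked = tracker.update(detections=to_norfair_detections(frame_boxes, frame_scores))
```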
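The box-shrinking step of sec. 2.5 fits in a few lines. The paper does not state whether the 2% reduction is applied per side or to the total box size; the sketch below assumes the latter and shrinks each dimension symmetrically around the box centre.

```python
def shrink_box(box, score, shrink_fraction=0.02, score_threshold=0.4):
    """Shrink confident boxes symmetrically around their centre by a fixed fraction of their size."""
    if score <= score_threshold:
        return box  # low-confidence boxes are left untouched
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * shrink_fraction / 2.0
    dy = (y2 - y1) * shrink_fraction / 2.0
    return (x1 + dx, y1 + dy, x2 - dx, y2 - dy)
```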
3. Results

In the interest of a shorter inference time, we only considered the models M_L-AUGMENT, M_H-AUGMENT and M_E-DATA trained on two of the original four folds for evaluation and inference. Table 1 compares the results of the three models, averaged over two folds and validated on their respective validation fold. We ran inference with the following hyperparameter configuration: a confidence threshold of 0.01, an image size of 768x768 for the models trained without external data and an image size of 512x512 for the models trained with external data. Our best single model, M_L-AUGMENT, obtained an mAP@[.5:.95] score of 0.42 on the validation set. With the ensemble of the three different models combined with post-processing, we obtained the best performance of 0.44 mAP@[.5:.95] on the validation split, thanks to the variation in model architectures and augmentations. Adding the bounding box tracking to the pipeline did not improve performance with respect to the entire area under the precision-recall curve, as measured by mAP. However, we observed improved F2 scores at relevant working points of the curve and leave an in-depth analysis of potential benefits to future research.

Table 1: Mean Average Precision (mAP) scores of the selected models and of the ensemble with tracking and post-processing.

Model                        AP     AP50   AP75
M_L-AUGMENT                  0.42   0.55   0.46
M_H-AUGMENT                  0.37   0.56   0.45
M_E-DATA                     0.33   0.49   0.37
Ensemble                     0.43   0.59   0.49
Ensemble + tracking          0.42   0.59   0.49
Ensemble + post-processing   0.44   0.60   0.50

4. Conclusion

We presented a new approach to polyp detection in endoscopic video sequences that leverages a heterogeneous ensemble of YOLOv5 models to achieve generalization. According to our analyses, the biggest performance gains were obtained from application-specific augmentation strategies and from ensembling different architectures. Future work should aim to generate further performance gains by incorporating temporal information.

5. Compliance with ethical standards

This work was conducted using public datasets of human subject data made available by [2, 3, 4, 5, 6, 7, 8, 9].

6. Acknowledgments

This project was supported by a Twinning Grant of the German Cancer Research Center (DKFZ) and the Robert Bosch Center for Tumor Diseases (RBCT). Part of this work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.

References

[1] F. A. Haggar, R. P. Boushey, Colorectal cancer epidemiology: incidence, mortality, survival, and risk factors, Clinics in Colon and Rectal Surgery 22 (2009) 191–197.
[2] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[4] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[5] J. Bernal, J. Sánchez, F. Vilariño, Towards automatic polyp detection with a polyp appearance model, Pattern Recognition 45 (2012) 3166–3182.
[6] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, F. Vilariño, WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.
[7] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer, International Journal of Computer Assisted Radiology and Surgery 9 (2014) 283–293.
[8] Q. Angermann, J. Bernal, C. Sánchez-Montes, M. Hammami, G. Fernández-Esparrach, X. Dray, O. Romain, F. J. Sánchez, A. Histace, Towards real-time polyp detection in colonoscopy videos: Adapting still frame-based methodologies for video sequences analysis, in: Computer Assisted and Robotic Endoscopy and Clinical Image-Based Procedures, Springer, 2017, pp. 29–41.
[9] J. Bernal, A. Histace, M. Masana, Q. Angermann, G. Fernández-Esparrach, X. Dray, F. J. Sánchez, Polyp detection benchmark in colonoscopy videos using GTCreator: A novel fully configurable tool for easy and fast annotation of image databases, in: Proceedings of the 32nd CARS Conference, Berlin, Germany, 2018.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[11] G. R. Jocher, ultralytics/yolov5, 2022. URL: https://github.com/ultralytics/yolov5.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[13] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-IoU loss: Faster and better learning for bounding box regression, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 12993–13000.
[14] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image and Vision Computing 107 (2021) 104117.
[15] J. Alori, A. Descoins, KotaYuhara, David, B. Ríos, fatih, shafu, A. Castro, D. Huh, tryolabs/norfair: v0.4.0, 2022. URL: https://doi.org/10.5281/zenodo.6095785. doi:10.5281/zenodo.6095785.