Heterogeneous model ensemble for polyp detection and tracking in colonoscopy

Amine Yamlahi1, Patrick Godau1, Thuy Nuong Tran1, Lucas-Raphael Müller1, Tim Adler1, Minu Dietlinde Tizabi1, Michael Baumgartner2, Paul Jäger3 and Lena Maier-Hein1

1 Div. Intelligent Medical Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany
2 Div. Medical Image Computing, DKFZ, Heidelberg, Germany
3 Interactive Machine Learning Group, DKFZ, Heidelberg, Germany

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), held in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, IC Royal Bengal, Kolkata, India.
Contact: m.elyamlahi@dkfz-heidelberg.de (A. Yamlahi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Regular colonoscopy screening substantially contributes to the prevention of colon cancer, as a polyp found in early stages can safely be removed. Assisting physicians during screening with automated detection systems can potentially increase the sensitivity of polyp detection. In this work, we present our polyp detection and tracking approach, submitted to the EndoCV2022 challenge. The core of our method is a heterogeneous ensemble of YOLOv5 models, each trained with a different strategy based on external data and varying data augmentation concepts. The outputs of the ensemble members are merged with the weighted boxes fusion algorithm, and the final output bounding boxes are reduced in size. Our method yields a mean Average Precision (mAP) of 0.44 on our validation set.

Keywords: Polyp detection, model ensembling, image augmentation

1. Introduction

Colorectal cancer is one of the most common cancer types, ranking second in females and third in males [1]. By detecting and subsequently resecting polyps during colonoscopy screenings, the risk of developing the disease can be reduced significantly. With the advance of machine learning in the medical domain, deep learning-based methods have the potential to assist in detecting these polyps with high accuracy. Generalizability across diverse and heterogeneous populations, devices and hospitals is a major issue that needs to be addressed to allow for realistic clinical translation. The method presented in this paper tackles this issue by ensembling heterogeneous, complementary training strategies (see Figure 1). The remainder of this paper is structured as follows: Sec. 2 first introduces the data we use and then describes all steps of training and post-processing the outputs of the models in the ensemble. Cross-validation results, including ablations, are reported in sec. 3, followed by a brief discussion in sec. 4.

2. Methods

Our strategy for algorithm design comprised the following steps:

1. Data preparation: Identification and curation (sec. 2.1) as well as splitting (sec. 2.2) of relevant datasets.
2. Ensemble training: Development of a heterogeneous model ensemble for per-frame polyp detection (sec. 2.3).
3. Tracking: Development of a strategy for leveraging the temporal information in endoscopic video sequences (sec. 2.4).
4. Post-processing: Development of a post-processing step to avoid systematic over-segmentation (sec. 2.5).

2.1. Datasets

The dataset provided by the EndoCV2022 polyp segmentation sub-challenge [2, 3, 4] consists of 46 sequences of varied length, totalling 3290 image frames and their corresponding polyp segmentation masks. Furthermore, we identified four public polyp datasets, namely CVC-ColonDB [5] (segmentation), CVC-ClinicDB [6] (segmentation), ETIS-Larib [7] (segmentation) and CVC-ClinicVideoDB [8, 9] (detection). We converted the segmentation datasets to detection datasets by computing the tightest possible bounding box for the provided segmentation masks.
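The mask-to-box conversion is straightforward; the sketch below illustrates one way to implement it with NumPy (and SciPy for frames containing several polyps). The function names and the per-instance variant are ours for illustration and are not taken from the challenge code.

```python
import numpy as np
from scipy import ndimage  # only needed for the multi-polyp variant


def mask_to_bbox(mask: np.ndarray):
    """Tightest (x_min, y_min, x_max, y_max) box around all foreground pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:  # frame without a polyp: no box
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def mask_to_bboxes(mask: np.ndarray):
    """One tight box per connected component, for frames that may contain several polyps."""
    labeled, _ = ndimage.label(mask > 0)
    return [
        (xs.start, ys.start, xs.stop - 1, ys.stop - 1)
        for ys, xs in ndimage.find_objects(labeled)
    ]
```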
Figure 1: Method overview: A heterogeneous model ensemble comprises three YOLOv5 models, each trained with a different strategy based on external data and data augmentation. The outputs of the ensemble members are merged with the weighted boxes fusion algorithm and passed on to a Norfair-based tracking algorithm. The final output bounding boxes are reduced in size.

2.2. Validation strategy

We split the 46 EndoCV2022 sequences into four folds using the GroupKFold algorithm from the sklearn library [10]. The split was based on the sequence ID in order to prevent leakage, and we stratified based on the sequence length to obtain a balanced number of frames per fold. We used the validation performance on the left-out fold for selecting our model checkpoints in the ensemble. For faster training and inference, we used two out of the four folds for both training and validation. As validation metric we used the mean Average Precision (mAP) over the Intersection over Union (IoU) threshold range between 0.5 and 0.95 (mAP@[.5:.95]), as proposed by the organizers of the EndoCV2022 challenge.
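As an illustration of this splitting strategy, the sketch below sets up a grouped four-fold split over frame-level data with scikit-learn's GroupKFold. Variable names are illustrative; GroupKFold keeps all frames of a sequence in one fold and balances fold sizes by sample count, but the sketch does not reproduce the exact sequence-length stratification described above.

```python
from sklearn.model_selection import GroupKFold


def make_folds(frames, sequence_ids, n_splits=4):
    """Split frame-level data into folds while keeping each sequence in a single fold.

    frames:        list of frame file paths (one entry per image)
    sequence_ids:  the sequence each frame belongs to (same length as frames)
    """
    gkf = GroupKFold(n_splits=n_splits)
    # GroupKFold assigns whole sequences to folds, so no sequence leaks across
    # the train/validation boundary, and fold sizes (frame counts) stay balanced.
    return [(train_idx, val_idx) for train_idx, val_idx in gkf.split(frames, groups=sequence_ids)]
```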
2.3. Heterogeneous model ensemble

We based our method on YOLOv5x6 and YOLOv5l6 [11] as detection models, as we identified them as a good compromise between accuracy and speed. To build our heterogeneous model ensemble, we tested different augmentation strategies aimed at improving model generalization. We group our trained models into three categories, based on model architecture and training data. Each category comprises models trained on two of the folds.

1. Model M_H-AUGMENT: YOLOv5x6 trained with images of size 768x768 and heavy image augmentations. The augmentations applied on the first fold comprise mosaic and mixup augmentations with a probability of 1.0 and 0.5, respectively, Hue-Saturation-Value (HSV) channel enhancements with a maximal magnitude of 0.2 each, horizontal flip, vertical flip and Copy-Paste augmentation with a probability of 0.5 each, as well as a final rotation of up to 25 degrees. We will refer to this combination of augmentations as the "default augmentation pipeline". The augmentations on the second fold are almost identical, setting the HSV enhancements to more deliberate magnitudes of 0.015, 0.7 and 0.4. In addition, the Copy-Paste augmentation was omitted.

2. Model M_L-AUGMENT: YOLOv5l6 trained with images of size 768x768 and light image augmentations. On the first fold, we drastically reduced the default augmentation pipeline, omitting mixup, vertical flipping, rotation and the Copy-Paste transform. Furthermore, we again used the deliberate HSV magnitudes. The augmentations on the second fold are closer to the default augmentation pipeline; the single difference is that the mosaic probability is drastically reduced from 1.0 to 0.2. We aimed to bring diversity to the ensemble by including models trained with both light and heavy augmentations.

3. Model M_E-DATA: YOLOv5l6 trained with the resized external data described in sec. 2.1. The first fold was trained with images of size 768x768, the second fold with images of size 512x512. With the enriched training data, comprising 13,251 additional frames from external data sources, this model specifically targeted generalizability to new settings.

All models were initialized with the standard weights pretrained on the COCO dataset [12] and trained for 20 epochs; in cases of slow convergence, the training period was extended to up to 40 epochs. We used a Stochastic Gradient Descent optimizer with momentum set to 0.937, a learning rate of 0.01 and the Complete Intersection over Union (CIoU) loss [13] as the loss function. We saved the weights of the epoch with the best mAP score on the validation data of the current fold. The predicted bounding boxes of each model were post-processed using the Non-Maximum Suppression (NMS) algorithm with an IoU threshold of 0.5 to pick one bounding box out of many overlapping candidates. To ensemble the bounding box predictions of multiple models, we used the weighted boxes fusion (WBF) algorithm [14] with an IoU threshold of 0.5 and a skip-box threshold of 0.02. All models were weighted equally.
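To make the fusion step concrete, the sketch below merges per-model predictions with the WBF implementation released alongside [14] (the ensemble-boxes package). Whether the authors used this exact package is an assumption, and the helper function and everything beyond the IoU and skip-box thresholds are illustrative.

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes


def fuse_predictions(boxes_per_model, scores_per_model, labels_per_model, img_w, img_h):
    """Fuse per-model [x1, y1, x2, y2] pixel-coordinate boxes with WBF, equal model weights."""
    # weighted_boxes_fusion expects coordinates normalised to [0, 1]
    norm = np.array([img_w, img_h, img_w, img_h], dtype=float)
    boxes_list = [np.asarray(b, dtype=float).reshape(-1, 4) / norm for b in boxes_per_model]
    boxes, scores, labels = weighted_boxes_fusion(
        boxes_list,
        scores_per_model,
        labels_per_model,
        weights=None,       # all models weighted equally
        iou_thr=0.5,        # IoU threshold reported in the paper
        skip_box_thr=0.02,  # skip-box threshold reported in the paper
    )
    return boxes * norm, scores, labels
```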
2.4. Tracking

In order to leverage the temporal information in the video sequences, we added a second-stage tracker on top of the detection model to track the bounding boxes. We used Norfair [15], a multiple-object tracker, to track the polyps by calculating the Euclidean distance between an already tracked polyp and the prediction provided by the detection model. The tracker only considers bounding boxes within a set distance threshold of each other. On a 1080x1920 image, we experimented with distance thresholds in the range of 50-250 px, minimum hit inertia values in the range of 3-30, maximum hit inertia values in the range of 6-50, and initialization delay values in the range of 1-20. The best results were obtained with a distance threshold of 50 px, a minimum hit inertia of 10, a maximum hit inertia of 25 and an initialization delay of 10.

2.5. Post-processing

While the reference bounding boxes are generated from the segmentation masks and therefore fit tightly around the polyp, the predictions of object detection models tend to cover more area than the reference labels, which results in the inclusion of false-positive pixels inside the bounding box. To counter this systematic over-segmentation, we shrink all bounding boxes with a confidence score higher than 0.4 by 2% of their size.
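As a sketch of the tracking stage in sec. 2.4, the snippet below configures a Norfair tracker with the best-performing parameters reported above. The argument names follow our reading of the Norfair 0.4 interface cited in [15], and representing each box by its centre point is our simplification, not necessarily the authors' exact setup.

```python
import numpy as np
from norfair import Detection, Tracker


def to_norfair_detections(boxes, scores):
    """Represent each predicted box by its centre point, since Norfair tracks point sets."""
    detections = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        centre = np.array([[(x1 + x2) / 2.0, (y1 + y2) / 2.0]])
        detections.append(Detection(points=centre, scores=np.array([score])))
    return detections


def centre_distance(detection, tracked_object):
    """Euclidean distance between a new detection and the tracker's current estimate."""
    return float(np.linalg.norm(detection.points - tracked_object.estimate))


# Parameter values are taken from sec. 2.4 (50 px threshold on a 1080x1920 frame).
tracker = Tracker(
    distance_function=centre_distance,
    distance_threshold=50,
    hit_inertia_min=10,
    hit_inertia_max=25,
    initialization_delay=10,
)

# Per frame: convert the fused boxes to detections and update the tracker, e.g.
# tracked = tracker.update(detections=to_norfair_detections(frame_boxes, frame_scores))
```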
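The box-shrinking step of sec. 2.5 fits in a few lines. The paper does not state whether the 2% reduction is applied per side or to the total box size; the sketch below assumes the latter and shrinks each dimension symmetrically around the box centre.

```python
def shrink_box(box, score, shrink_fraction=0.02, score_threshold=0.4):
    """Shrink confident boxes symmetrically around their centre by a fixed fraction of their size."""
    if score <= score_threshold:
        return box  # low-confidence boxes are left untouched
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * shrink_fraction / 2.0
    dy = (y2 - y1) * shrink_fraction / 2.0
    return (x1 + dx, y1 + dy, x2 - dx, y2 - dy)
```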
3. Results

In the interest of a shorter inference time, we only considered the models M_L-AUGMENT, M_H-AUGMENT and M_E-DATA trained on two of the original four folds for evaluation and inference. Table 1 compares the results of the three models, averaged over two folds and validated on their respective validation fold. We ran inference with the following hyperparameter configuration: a confidence threshold of 0.01, an image size of 768x768 for the models trained without external data and an image size of 512x512 for the models trained with external data. Our best single model, M_L-AUGMENT, obtained an mAP@[.5:.95] score of 0.42 on the validation set. With the ensemble of the three different models combined with post-processing, we obtained the best performance of 0.44 mAP@[.5:.95] on the validation split, thanks to the variation in model architectures and augmentations. Adding the bounding box tracking to the pipeline did not improve performance with respect to the entire area under the precision-recall curve, as measured by mAP. However, we observed improved F2 scores at relevant working points of the curve and leave an in-depth analysis of potential benefits to future research.

Table 1: Mean Average Precision (mAP) scores of the selected models and of the ensemble with tracking and post-processing.

Model                        AP     AP50   AP75
M_L-AUGMENT                  0.42   0.55   0.46
M_H-AUGMENT                  0.37   0.56   0.45
M_E-DATA                     0.33   0.49   0.37
Ensemble                     0.43   0.59   0.49
Ensemble + tracking          0.42   0.59   0.49
Ensemble + post-processing   0.44   0.60   0.50

4. Conclusion

We presented a new approach to polyp detection in endoscopic video sequences that leverages a heterogeneous ensemble of YOLOv5 models to achieve generalization. According to our analyses, the biggest performance gains were obtained from application-specific augmentation strategies and from ensembling different architectures. Future work should aim to generate further performance gains by incorporating temporal information.

5. Compliance with ethical standards

This work was conducted using public datasets of human subject data made available by [2, 3, 4, 5, 6, 7, 8, 9].

6. Acknowledgments

This project was supported by a Twinning Grant of the German Cancer Research Center (DKFZ) and the Robert Bosch Center for Tumor Diseases (RBCT). Part of this work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.

References

[1] F. A. Haggar, R. P. Boushey, Colorectal cancer epidemiology: incidence, mortality, survival, and risk factors, Clinics in Colon and Rectal Surgery 22 (2009) 191–197.
[2] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[4] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[5] J. Bernal, J. Sánchez, F. Vilariño, Towards automatic polyp detection with a polyp appearance model, Pattern Recognition 45 (2012) 3166–3182.
[6] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, F. Vilariño, WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.
[7] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer, International Journal of Computer Assisted Radiology and Surgery 9 (2014) 283–293.
[8] Q. Angermann, J. Bernal, C. Sánchez-Montes, M. Hammami, G. Fernández-Esparrach, X. Dray, O. Romain, F. J. Sánchez, A. Histace, Towards real-time polyp detection in colonoscopy videos: Adapting still frame-based methodologies for video sequences analysis, in: Computer Assisted and Robotic Endoscopy and Clinical Image-Based Procedures, Springer, 2017, pp. 29–41.
[9] J. Bernal, A. Histace, M. Masana, Q. Angermann, G. Fernández-Esparrach, X. Dray, F. J. Sánchez, Polyp detection benchmark in colonoscopy videos using GTCreator: A novel fully configurable tool for easy and fast annotation of image databases, in: Proceedings of the 32nd CARS Conference, Berlin, Germany, 2018.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[11] G. R. Jocher, ultralytics/yolov5, 2022. URL: https://github.com/ultralytics/yolov5.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[13] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-IoU loss: Faster and better learning for bounding box regression, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 12993–13000.
[14] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image and Vision Computing 107 (2021) 104117.
[15] J. Alori, A. Descoins, KotaYuhara, David, B. Ríos, fatih, shafu, A. Castro, D. Huh, tryolabs/norfair: v0.4.0, 2022. URL: https://doi.org/10.5281/zenodo.6095785. doi:10.5281/zenodo.6095785.