<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Heterogeneous model ensemble for polyp detection and tracking in colonoscopy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amine Yamlahi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Godau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thuy Nuong Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucas-Raphael Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Adler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minu Dietlinde Tizabi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Baumgartner</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Jäger</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lena Maier-Hein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Div. Intelligent Medical Systems, German Cancer Research Center (DKFZ)</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Div. Medical Image Computing</institution>
          ,
          <addr-line>DKFZ, Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Interactive Machine Learning Group, DKFZ</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Regular colonoscopy screening substantially contributes to the prevention of colon cancer, as a polyp found in early stages can safely be removed. Assisting physicians during screening with automated detection systems can potentially increase the sensitivity of polyp detection. In this work, we present our polyp detection and tracking approach, submitted to the EndoCV2022 challenge. The core of our method is a heterogeneous ensemble of YOLOv5 models, each trained with a different strategy based on external data and varying data augmentation concepts. The output of the ensemble members is merged with the weighted boxes fusion algorithm, and the final output bounding boxes are reduced in size. Our method yields a mean Average Precision (mAP) of 0.44 on our validation test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Polyp detection</kwd>
        <kwd>model ensembling</kwd>
        <kwd>image augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Colorectal cancer is one of the most commonly found cancer types, ranking second in females and third in males [1]. By detecting and subsequently resecting polyps during colonoscopy screenings, the risk of developing the disease can be reduced significantly. With the advance of machine learning in the medical domain, deep learning-based methods have the potential to assist in detecting these polyps with high accuracy. Generalizability across diverse and heterogeneous populations, devices and hospitals is a major issue regarding these methods that needs to be addressed to allow for realistic clinical translation. The method presented in this paper tackles this issue by ensembling heterogeneous, complementary training strategies (see Figure 1). The remaining part of this paper is structured as follows: Sec. 2 first introduces the data we use and goes on to describe all steps of training and post-processing the outputs of the models in the ensemble. Cross-validation results, including ablations, are reported in sec. 3, which is followed by a brief discussion in sec. 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>Our strategy for algorithm design comprised the following steps:
1. Data preparation: Identification and curation (sec. 2.1) as well as splitting (sec. 2.2) of relevant datasets.
2. Ensemble training: Development of a heterogeneous model ensemble for per-frame polyp detection (sec. 2.3).
3. Tracking: Development of a strategy for leveraging the temporal information in endoscopic video sequences (sec. 2.4).
4. Post-processing: Development of a post-processing step to avoid systematic over-segmentation (sec. 2.5).</p>
      <sec id="sec-2-1">
        <title>2.1. Datasets</title>
        <p>The dataset provided by the EndoCV2022 polyp segmentation sub-challenge [2, 3, 4] consists of 46 sequences of varied length, totalling 3290 image frames and their corresponding polyp segmentation masks. Furthermore, we identified four public polyp datasets, namely CVC-ColonDB [5] (segmentation), CVC-ClinicDB [6] (segmentation), ETIS-Larib [7] (segmentation) and CVC-ClinicVideoDB [8, 9] (detection). We converted the segmentation datasets to detection datasets by computing the tightest possible bounding box for the provided segmentation masks.</p>
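        <p>As a minimal sketch of this mask-to-box conversion, the following Python snippet derives the tightest axis-aligned bounding box from a binary mask. The function name mask_to_box, the nested-list mask representation and the treatment of the whole mask as a single object are assumptions made for illustration; the authors' actual conversion code is not given in the paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) as inclusive pixel indices
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest axis-aligned box around all non-zero pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None  # empty mask: frame without a polyp
    return (min(cols), min(rows), max(cols), max(rows))


# Toy 4x5 mask (hypothetical data, not taken from any of the datasets).
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)
```
]]></preformat>
        <p>If a mask contains several separate polyps, a connected-component step would be needed before this function to obtain one box per polyp; that step is omitted here.</p>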
        <sec id="sec-2-1-1">
          <title>2.2. Validation strategy</title>
          <p>We split the 46 EndoCV2022 sequences into four folds using the GroupKFold algorithm from the scikit-learn library [10]. The split was based on the sequence ID in order to prevent leakage, and we stratified based on the sequence length to have a balanced number of frames per fold. We used the validation performance on the left-out fold to select the model checkpoints for the ensemble. To reduce training and inference time, we used only two of the four folds for both training and validation. As validation metric, we used the mean Average Precision (mAP) over the Intersection over Union (IoU) threshold range between 0.5 and 0.95 (mAP@[.5 : .95]), as proposed by the organizers of the EndoCV2022 challenge.</p>
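          <p>The grouping idea behind this split can be illustrated without any library. The sketch below is a stand-in for the scikit-learn splitter, not its implementation: it assigns whole sequences to folds, largest first, always into the fold that currently holds the fewest frames, so that no sequence is shared between folds and the frame counts stay roughly balanced. The function name group_folds and the sequence lengths are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List


def group_folds(frames_per_sequence: Dict[str, int], n_folds: int = 4) -> List[List[str]]:
    """Assign whole sequences to folds, largest first, into the lightest fold."""
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    load = [0] * n_folds  # number of frames currently in each fold
    ordered = sorted(frames_per_sequence.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq_id, n_frames in ordered:
        lightest = load.index(min(load))
        folds[lightest].append(seq_id)
        load[lightest] += n_frames
    return folds


# Hypothetical sequence lengths; the challenge data comprises 46 sequences.
frames = {
    "seq01": 120, "seq02": 80, "seq03": 75, "seq04": 60,
    "seq05": 40, "seq06": 35, "seq07": 30, "seq08": 20,
}
folds = group_folds(frames, n_folds=4)
fold_sizes = [sum(frames[s] for s in fold) for fold in folds]
```
]]></preformat>
          <p>Each fold then serves once as validation data while the remaining folds are used for training; because a sequence never appears in two folds, near-identical neighbouring frames cannot leak from training into validation.</p>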
        </sec>
        <sec id="sec-2-1-2">
          <title>2.3. Heterogeneous model ensemble</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>We based our method upon YOLOv5x6 and YOLOv5l6 [11] as detection models, as we identified them as a good compromise between accuracy and speed. To build our heterogeneous model ensemble, we tested different augmentation strategies aimed at improving model generalization. We group our trained models into three categories, based upon model architecture and training data. Each category comprises models trained upon two of the folds.</p>
      </sec>
      <sec id="sec-2-3">
        <p>1. Model M_H-AUGMENT: YOLOv5x6 trained with images of size 768x768 with heavy image augmentations. The augmentations applied on the first fold comprise mosaic and mixup augmentations with a probability of 1.0 and 0.5, respectively, Hue-Saturation-Value (HSV) channel enhancements with a maximal magnitude of 0.2 each, horizontal flip, vertical flip and Copy-Paste augmentation with a probability of 0.5 each, as well as a final rotation of up to 25 degrees. We will refer to this combination of augmentations as the "default augmentation pipeline". The augmentations on the second fold are almost identical, setting the HSV enhancement to more deliberate magnitudes of 0.015, 0.7 and 0.4. In addition, the Copy-Paste augmentation was omitted.</p>
        <p>2. Model M_L-AUGMENT: YOLOv5l6 trained with images of size 768x768 with light image augmentations. On the first fold, we drastically reduced the default augmentation pipeline by omitting mixup, vertical flipping, rotation and the Copy-Paste transform. Furthermore, we used the deliberate HSV magnitudes again. The augmentations on the second fold are closer to the default augmentation pipeline in terms of augmentations used; the single difference is a drastic reduction of the mosaic probability from 1.0 to 0.2. We aimed to bring diversity to the ensemble by including models trained with both light and heavy augmentations.</p>
        <p>3. Model M_E-DATA: YOLOv5l6 trained with the resized external data described in sec. 2.1. The first fold was trained with images of size 768x768, while the second fold was trained with images of size 512x512. With the enriched training data, comprising 13,251 additional frames from additional data sources, this model specifically targeted generalizability to new settings.</p>
        <p>All models were initialized with the standard weights pretrained on the COCO dataset [12] and trained for 20 epochs. In cases of slow convergence, the training period was extended up to 40 epochs. Training used a Stochastic Gradient Descent optimizer with momentum set to 0.937, a learning rate of 0.01 and the complete intersection over union (CIoU) loss [13] as the loss function. We saved the weights of the epoch with the best mAP score on the validation data of the current fold. The predicted bounding boxes of each model were post-processed using the Non-Maximum Suppression (NMS) algorithm with an IoU threshold of 0.5 to pick one bounding box out of many overlapping candidates. To ensemble the bounding box predictions of multiple models, we used the weighted boxes fusion (WBF) algorithm [14] with an IoU threshold of 0.5 and a skip box threshold of 0.02. All models were weighted equally.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <p>In the interest of a shorter inference time, we only considered the models M_L-AUGMENT, M_H-AUGMENT and M_E-DATA trained over two folds out of the original four folds for evaluation and inference. Table 1 compares the results of the three models averaged over two folds and validated on their respective validation fold. We ran inference with the following hyperparameter configuration: a confidence threshold of 0.01, an image size of 768x768 for the models without external data and an image size of 512x512 for the models with external data. Our best single model, M_L-AUGMENT, obtained an mAP@[.5 : .95] score of 0.42 on the validation set.</p>
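        <p>The per-model filtering used at inference time, a confidence threshold of 0.01 followed by NMS with an IoU threshold of 0.5, can be sketched in plain Python as follows. This is a generic greedy NMS written for illustration; the names iou and nms and the example boxes are assumptions, and the implementation that ships with YOLOv5 is not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections: List[Detection], iou_thr: float = 0.5,
        conf_thr: float = 0.01) -> List[Detection]:
    """Greedy NMS: keep the most confident box, drop boxes overlapping it too much."""
    candidates = sorted(
        (d for d in detections if d[1] >= conf_thr),
        key=lambda d: d[1],
        reverse=True,
    )
    kept: List[Detection] = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections for a single frame.
detections = [
    ((100.0, 100.0, 200.0, 200.0), 0.90),
    ((105.0, 105.0, 205.0, 205.0), 0.60),   # overlaps the first box heavily
    ((400.0, 300.0, 480.0, 380.0), 0.30),
    ((10.0, 10.0, 20.0, 20.0), 0.005),      # below the confidence threshold
]
kept = nms(detections)
```
]]></preformat>
        <p>The very low confidence threshold keeps almost all candidates, which suits an mAP evaluation that integrates over the whole precision-recall curve.</p>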
        <p>With the ensemble of the three different models combined with post-processing, we obtained the best performance of 0.44 mAP@[.5 : .95] on the validation split, which we attribute to the variation in model architectures and augmentations.</p>
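        <p>The post-processing referred to here is the box shrinking of sec. 2.5, in which boxes with a confidence score above 0.4 are reduced by 2% of their size. A minimal sketch, assuming that the 2% applies to width and height separately and that the box centre stays fixed; the paper does not spell out these details, and shrink_box is a hypothetical name:</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def shrink_box(box: Box, score: float, min_score: float = 0.4,
               ratio: float = 0.02) -> Box:
    """Shrink a confident box around its centre; leave other boxes untouched."""
    if score <= min_score:
        return box
    x_min, y_min, x_max, y_max = box
    dx = (x_max - x_min) * ratio / 2.0  # half of the width reduction per side
    dy = (y_max - y_min) * ratio / 2.0  # half of the height reduction per side
    return (x_min + dx, y_min + dy, x_max - dx, y_max - dy)


# Hypothetical boxes: one confident prediction, one below the 0.4 threshold.
confident = shrink_box((100.0, 100.0, 300.0, 200.0), score=0.9)
uncertain = shrink_box((100.0, 100.0, 300.0, 200.0), score=0.3)
```
]]></preformat>
        <p>Under these assumptions a 200x100 box becomes 196x98, while low-confidence boxes pass through unchanged.</p>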
        <p>Adding the bounding box tracking to the pipeline did
not improve performance with respect to the entire area
under the precision-recall curve, as measured by mAP.</p>
        <p>However, we observed improved F2 scores at relevant
working points of the curve and leave an in-depth
analysis of potential benefits to future research.</p>
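        <p>For reference, the F2 score is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below computes it from detection counts at a single working point; the function name f_beta and the counts are hypothetical and are not results from this work.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts at one confidence threshold.
f2 = f_beta(tp=80, fp=40, fn=20, beta=2.0)
f1 = f_beta(tp=80, fp=40, fn=20, beta=1.0)
```
]]></preformat>
        <p>Unlike mAP, which summarizes the whole precision-recall curve, this value depends on the chosen confidence threshold, so a tracker can change it at specific working points without changing mAP much.</p>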
      </sec>
      <sec id="sec-3-2">
        <title>2.5. Post-processing</title>
        <p>While the reference bounding boxes are generated from the segmentation masks and are calculated to fit tightly around the polyp, the predictions of object detection models tend to cover more surface than the reference labels, which results in the inclusion of false-positive pixels inside the bounding box. To avoid this over-segmentation, we shrink the bounding boxes with a confidence score higher than 0.4 by 2% of their size.</p>
        <sec id="sec-3-2-1">
          <title>2.4. Tracking</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <p>In order to leverage the temporal information in the video sequences, we added a second-stage tracker on top of the detection model to track the bounding boxes. We used Norfair [15], a multiple-object tracker, to track the polyps by calculating the Euclidean distance between the already tracked polyp and the prediction provided by the detection model. The tracker only considers bounding boxes within a set distance threshold of each other. On a 1080x1920 image, we experimented with several distance thresholds in the range 50px-250px, minimum hit inertia values in the range 3-30, maximum hit inertia values in the range 6-50, and initialization delay values in the range 1-20. The best results were obtained with a distance threshold of 50px, a minimum hit inertia value of 10, a maximum hit inertia value of 25 and an initialization delay of 10.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We presented a new approach to polyp detection in endoscopic video sequences that leverages a heterogeneous ensemble of YOLOv5 models to achieve generalization. According to our analyses, the biggest performance gains were obtained from application-specific augmentation strategies and the ensemble of different architectures. Future work should aim for generating substantial performance gains by incorporating temporal information.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Compliance with ethical standards</title>
      <sec id="sec-5-1">
        <p>This work was conducted using public datasets of human subject data made available by [2, 3, 4, 5, 6, 7, 8, 9].</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>This project was supported by a Twinning Grant of the German Cancer Research Center (DKFZ) and the Robert Bosch Center for Tumor Diseases (RBCT). Part of this work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.</p>
      <sec id="sec-6-1">
        <title>References</title>
        <p>[1] F. A. Haggar, R. P. Boushey, Colorectal cancer epidemiology: incidence, mortality, survival, and risk factors, Clinics in colon and rectal surgery 22 (2009) 191–197.</p>
        <p>[2] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical image analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.</p>
        <p>[3] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.</p>
        <p>[4] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.</p>
        <p>[5] J. Bernal, J. Sánchez, F. Vilarino, Towards automatic polyp detection with a polyp appearance model, Pattern Recognition 45 (2012) 3166–3182.</p>
        <p>[6] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, F. Vilariño, Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.</p>
        <p>[7] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer, International journal of computer assisted radiology and surgery 9 (2014) 283–293.</p>
        <p>[8] Q. Angermann, J. Bernal, C. Sánchez-Montes, M. Hammami, G. Fernández-Esparrach, X. Dray, O. Romain, F. J. Sánchez, A. Histace, Towards real-time polyp detection in colonoscopy videos: Adapting still frame-based methodologies for video sequences analysis, in: Computer assisted and robotic endoscopy and clinical image-based procedures, Springer, 2017, pp. 29–41.</p>
        <p>[9] J. Bernal, A. Histace, M. Masana, Q. Angermann, G., dray, x., and sanchez, j. polyp detection benchmark in colonoscopy videos using gtcreator: A novel fully configurable tool for easy and fast annotation of image databases, in: Proceedings of 32nd CARS Conference (Berlin, Germany), 2018.</p>
        <p>[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.</p>
        <p>[11] G. R. Jocher, ultralytics/yolov5, 2022. URL: https://github.com/ultralytics/yolov5.</p>
        <p>[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755.</p>
        <p>[13] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-iou loss: Faster and better learning for bounding box regression, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 12993–13000.</p>
        <p>[14] R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: Ensembling boxes from different object detection models, Image and Vision Computing 107 (2021) 104117.</p>
        <p>[15] J. Alori, A. Descoins, KotaYuhara, David, B. Ríos, fatih, shafu, A. Castro, D. Huh, tryolabs/norfair: v0.4.0, 2022. URL: https://doi.org/10.5281/zenodo.6095785. doi:10.5281/zenodo.6095785.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>