Heterogeneous model ensemble for automatic polyp
segmentation in endoscopic video sequences
Thuy Nuong Tran1 , Fabian Isensee2,3 , Lars Krämer2,3 , Amine Yamlahi1 , Tim Adler1 ,
Patrick Godau1 , Minu Tizabi1 and Lena Maier-Hein1
1 Div. Intelligent Medical Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany
2 Div. Medical Image Computing, DKFZ, Heidelberg, Germany
3 Applied Computer Vision Lab, Helmholtz Imaging


Abstract
The detection and segmentation of polyps during colonoscopy can substantially contribute to the prevention of colon cancer. Assisting clinicians with automated systems can mitigate the risk of human error. In this work, we present our polyp segmentation approach, submitted to the EndoCV2022 challenge. Common polyp segmentation methods are based on single-model, single-frame predictions. This work presents a symbiosis of three separate models, each with its own strengths, as part of a segmentation pipeline, together with a post-processing step designed to leverage unique predictions for more temporally coherent results.

Keywords
Polyp segmentation, Temporal coherence, High resolution, Heterogeneous ensemble



4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, IC Royal Bengal, Kolkata, India
t.tran@dkfz-heidelberg.de (T. N. Tran)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Colorectal cancer is one of the most commonly found cancer types, ranking second in females and third in males [1]. By detecting and subsequently resecting polyps during colonoscopy screenings, the risk of developing the disease can be reduced significantly. With the advance of machine learning in the medical domain, deep learning-based methods have the potential to assist in detecting and segmenting these polyps with high accuracy. The EndoCV2022 challenge[2] addresses the generalizability of such deep learning models for segmentation in endoscopic video sequences. The method presented in this paper tackles this issue with three primary design decisions: (1) The provided challenge dataset underwent a curation process that ensures annotation quality. (2) An ensemble of three networks with complementary strengths was trained for the segmentation prediction. (3) Finally, a post-processing step was implemented to address false-negative frames caused by majority vote; a fallback mechanism reweights the predictions of a single model in order to enable unique predictions.


2. Datasets

The dataset provided by the EndoCV2022 polyp segmentation sub-challenge[2, 3, 4] consists of 46 sequences of varied length, totalling 3,290 image frames and their corresponding polyp segmentation masks. Furthermore, three public polyp segmentation datasets were added as external data, namely CVC-ColonDB[5], CVC-ClinicDB[6] and ETIS-Larib[7], to enrich the diversity of the dataset. These account for 1,108 additional training images, resulting in 4,398 frames in total.


3. Methodology

Our challenge strategy rests on three main pillars: (1) A data pre-processing step to ensure high data annotation quality, (2) the network architecture selection and training step, which yields the segmentation models, and (3) a post-processing step, which leverages model heterogeneity and uses the structural similarity[8] of consecutive frames in order to handle false-negative masks. An overview is depicted in Fig. 1.
Figure 1: Overview of the heterogeneous model ensemble pipeline. The data is curated, the predictions of the Efficient-UNet[9] ensemble, the nnU-Net[10] and the Hierarchical Multi-Scale Attention Network[11] are combined, and the post-processing yields the final prediction.



3.1. Data pre-processing

Correct data annotation of the training set is crucial to the learning capabilities of any segmentation model. In order to ensure annotation quality, the provided challenge dataset was curated by manually removing images with implausible or temporally inconsistent annotations, to the best of our judgement. An example is shown in Fig. 2. This was conducted under the assumption that false annotations would harm the training process more than having a larger number of frames for training. The external datasets described in section 2 underwent the same selection process. Including external data, the resulting training dataset amounted to 4,106 image-mask pairs.

Figure 2: Example of inconsistent annotation. The upper row depicts three consecutive frames of the provided seq23_endocv22 sequence. The lower row shows the corresponding segmentation masks. The image at position 𝑡 has fewer polyps annotated compared to the neighboring frames, despite the polyps not being obstructed or out of sight.

3.2. Neural network architectures

In order to solve the polyp segmentation task, a model ensemble was designed that consists of parts with complementary strengths. This was realized by using an nnU-Net[10], which is configured to automatically adapt its pre-processing and training framework to different datasets and thus serves as a strong segmentation base, a Hierarchical Multi-Scale Attention Network[11], which combines predictions of multiple scales for a better prediction performance, and an ensemble of Efficient-UNets[9], one of which is equipped with an internal GRU-layer to process temporal information. By focusing on incorporating temporal as well as high-resolution information, we expected more knowledge to be leveraged from the provided high-resolution video sequences.

3.2.1. nnU-Net

The nnU-Net is able to automatically determine key decisions to set up the segmentation pipeline for training, irrespective of the dataset. While it has ranked first on many 3D-segmentation challenges¹, its self-configuring strategy can also be applied to 2D images. The nnU-Net was expected to provide a solid base prediction.

¹ medicaldecathlon.com, https://kits19.grand-challenge.org, https://www.med.upenn.edu/cbica/brats2020

3.2.2. Hierarchical Multi-Scale Network

By treating the polyp segmentation as a classic computer vision task, it is possible to use established segmentation models that perform well on complex natural images. The Hierarchical Multi-Scale Attention Network (HM-ANet) was chosen as it is a state-of-the-art architecture in semantic segmentation on Cityscapes². The HM-ANet operates on higher resolutions and combines predictions from different scales. This was expected to result in a precise polyp segmentation, irrespective of the size of the polyp.

² www.cityscapes-dataset.com
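As a simplified illustration of this scale-fusion idea (not the HM-ANet implementation itself), the following PyTorch-style sketch blends an upsampled low-scale prediction with a full-resolution prediction using a per-pixel attention map; the tensor names and the restriction to two scales are assumptions made for brevity:

```python
import torch
import torch.nn.functional as F

def fuse_two_scales(logits_low, logits_high, attn_low):
    """Simplified two-scale fusion in the spirit of hierarchical
    multi-scale attention: the prediction of the lower (0.5x) scale is
    upsampled and blended with the full-resolution prediction using a
    per-pixel attention map predicted at the lower scale.

    logits_low:  (B, C, H/2, W/2) segmentation logits at 0.5x scale
    logits_high: (B, C, H, W)     segmentation logits at 1.0x scale
    attn_low:    (B, 1, H/2, W/2) attention map with values in [0, 1]
    """
    size = logits_high.shape[-2:]
    up_logits = F.interpolate(logits_low, size=size, mode="bilinear", align_corners=False)
    up_attn = F.interpolate(attn_low, size=size, mode="bilinear", align_corners=False)
    # Pixels with high low-scale attention trust the coarse, context-rich
    # prediction; the remaining pixels trust the high-resolution prediction.
    return up_attn * up_logits + (1.0 - up_attn) * logits_high
```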
3.3. Efficient-UNet Ensemble

Most of the current segmentation models operate on a frame-by-frame basis. In order to capture temporal information, one approach is to add a recurrent neural network layer, such as a Gated Recurrent Unit (GRU) layer, to a standard segmentation model. The chosen base segmentation model is the Efficient-UNet (Eff-UNet). It is an encoder-decoder architecture with an EfficientNet as its backbone, which is able to scale with model size and outperform other ConvNet backbones. One GRU-layer was added to the bottleneck of the Eff-UNet to form an Eff-GRUNet. Consecutive images are loaded in batches of size two. They are encoded, pooled, flattened, sequentially fed into the GRU-layer and then reshaped and fed to the decoder. The Eff-UNet is trained separately from the Eff-GRUNet. Variants of both combined form the Eff-UNet ensemble.
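The following PyTorch sketch illustrates how such a GRU bottleneck could be wired for a pair of consecutive frames; the pooling size, feature dimensions and class name are illustrative assumptions rather than the exact Eff-GRUNet configuration:

```python
import torch
import torch.nn as nn

class GRUBottleneck(nn.Module):
    """Illustrative GRU bottleneck: encoder features of two consecutive
    frames are pooled, flattened, passed through a GRU as a length-2
    sequence, and reshaped back into feature maps for the decoder."""

    def __init__(self, channels: int, spatial: int = 4):
        super().__init__()
        self.channels, self.spatial = channels, spatial
        self.pool = nn.AdaptiveAvgPool2d(spatial)
        feat_dim = channels * spatial * spatial
        self.gru = nn.GRU(input_size=feat_dim, hidden_size=feat_dim, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) with T = 2 consecutive frames
        t = feats.shape[0]
        x = self.pool(feats).flatten(1)     # (T, C * spatial * spatial)
        x, _ = self.gru(x.unsqueeze(0))     # the frame pair forms one sequence
        return x.squeeze(0).view(t, self.channels, self.spatial, self.spatial)

# Example: bottleneck features of two consecutive frames
out = GRUBottleneck(channels=64)(torch.randn(2, 64, 16, 16))
print(out.shape)  # torch.Size([2, 64, 4, 4])
```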
Figure 3: Example of differently sized polyp images with their segmentation masks from the provided challenge set.



3.3.1. Combining networks and weighting

Since the HM-ANet operates on high resolutions, it was expected to perform well on very small polyps, as well as being able to fully capture larger polyps in their entirety. During ensembling, the HM-ANet was therefore designed to be weighted higher for small and large polyps. Since there is no standardized definition of polyp sizes, the thresholds were set empirically by observing reference labels of public polyp datasets[5, 6, 7]. An example is shown in Fig. 3.

3.4. Post-processing by reweighting

To mitigate the error of false-negative predictions, a post-processing step is added that considers empty segmentation masks and their surrounding frames. If a neighboring frame is polyp-positive and is similar to the current frame, then any non-empty prediction of the current frame is reweighted, effectively allowing a polyp-positive prediction despite non-majority. The similarity score used for this approach is the structural similarity score (SSIM), as it is able to take texture into account.
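A minimal sketch of this fallback, assuming grayscale uint8 frames, binary masks as numpy arrays and the SSIM threshold of 0.9 reported in subsection 4.1, could look as follows (the helper name is hypothetical):

```python
import numpy as np
from skimage.metrics import structural_similarity

def reweight_if_similar(frames, ensemble_masks, single_masks, ssim_thresh=0.9):
    """For every frame whose ensemble prediction is empty, check whether a
    neighboring frame is structurally similar and polyp-positive; if so,
    fall back to the single-model prediction for the current frame."""
    out = list(ensemble_masks)
    for t, mask in enumerate(ensemble_masks):
        if mask.any():                      # only empty predictions are revisited
            continue
        for n in (t - 1, t + 1):
            if n < 0 or n >= len(frames) or not ensemble_masks[n].any():
                continue
            score = structural_similarity(frames[t], frames[n])
            if score > ssim_thresh:
                out[t] = single_masks[t]    # reweighted single-model fallback
                break
    return out
```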
4. Experiments and Results

The original training dataset was split into four parts using GroupKFold for 4-fold cross-validation (CV) training, balancing the number of frames and sequence IDs. Each fold has 11-12 sequences with around 750 frames. The following subsections describe the implementation details and the experiment results after hyperparameter optimization.
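A simplified version of such a sequence-grouped split can be obtained with scikit-learn's GroupKFold, where the sequence IDs act as groups so that frames of one video never end up in both the training and the validation split (the arrays below are placeholders, not the challenge data):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: one entry per frame, plus the ID of the video
# sequence each frame belongs to (frames of a sequence share an ID).
frame_paths = np.array([f"frame_{i:04d}.png" for i in range(3290)])
sequence_ids = np.random.randint(0, 46, size=len(frame_paths))

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(gkf.split(frame_paths, groups=sequence_ids)):
    # No sequence appears in both splits of a fold.
    assert not set(sequence_ids[train_idx]) & set(sequence_ids[val_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val frames")
```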
4.1. Implementation details

The nnU-Net was used as a framework and manual changes were made to its automatically generated configuration. The short edge of the image was resized to 512 px, with the other edge being resized according to the aspect ratio. The patch size was set to 448 x 448. The data was then heavily augmented with operations such as rotation, intensity and gamma augmentation, scaling, mirroring and blurring.
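The aspect-preserving resize can be sketched as follows with OpenCV; the function name and the interpolation mode are illustrative assumptions:

```python
import cv2

def resize_short_edge(image, target=512):
    """Resize so the shorter image edge becomes `target` pixels while the
    longer edge is scaled by the same factor, preserving the aspect ratio."""
    h, w = image.shape[:2]
    scale = target / min(h, w)
    new_w, new_h = round(w * scale), round(h * scale)
    return cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

# 448 x 448 patches would then be cropped from the resized frames.
```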
For the HM-ANet, the data was normalized, and random scaling between [0.5, 1], random cropping to 512x1024, RGB-shift, and random vertical and horizontal flipping were performed. The model was initialized with weights pre-trained on PaddleClas³. The training was conducted in three phases: 1) training the model on the original challenge data, 2) fine-tuning the model on the challenge and external data, and 3) fine-tuning again on the challenge data only.

³ paddleclas.readthedocs.io/en/latest/index.html

For the Eff-UNet ensemble, the data was resized to 480x480 (Eff-UNet_480) and 256x256 (Eff-UNet_256), incorporating different resolutions. Resizing to 256x256 was chosen for the Eff-GRUNet to fit memory restrictions. Augmentations such as rotation, elastic and grid deformation were used.

In order to combine the predictions, the segmented polyps were divided into small (≤ 0.4% of the image size), large (≥ 9% of the image size) and medium (the rest) polyps. If polyps were predicted as small or large, the weight of the HM-ANet was increased to 0.5, while the others were decreased to 0.25 each. If the polyp was of medium size, the models were weighted equally at 0.33. The final segmentation was formed by thresholding the weighted predictions at 0.5. To address false-negatives resulting from an unmet majority criterion, unique single-model predictions were encouraged if neighboring images were structurally similar (SSIM > 0.9) and predicted to be polyp-positive. The single-model prediction weight was then increased to 0.5. This proved to solve some false-negative cases, as illustrated in Fig. 4.
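A minimal numpy sketch of this size-dependent weighting and thresholding is given below; how the polyp size is estimated before weighting is not fully specified above, so the unweighted mean prediction used here is an assumption, as are the variable names:

```python
import numpy as np

SMALL, LARGE = 0.004, 0.09   # fractions of the image area (0.4% and 9%)

def weighted_ensemble(prob_hma, prob_nnunet, prob_effunet):
    """Combine three probability maps with a weight on the HM-ANet that
    depends on the estimated polyp size, then threshold at 0.5."""
    # Estimate the polyp size from an unweighted mean prediction (assumption).
    mean_prob = (prob_hma + prob_nnunet + prob_effunet) / 3.0
    polyp_fraction = (mean_prob > 0.5).sum() / prob_hma.size

    if polyp_fraction <= SMALL or polyp_fraction >= LARGE:
        weights = (0.5, 0.25, 0.25)      # favor the HM-ANet
    else:
        weights = (1 / 3, 1 / 3, 1 / 3)  # medium polyps: equal weights

    fused = (weights[0] * prob_hma
             + weights[1] * prob_nnunet
             + weights[2] * prob_effunet)
    return fused > 0.5                   # binary segmentation mask
```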
Figure 4: Example of post-processing reweighting. The ensemble prediction at time step 𝑡 is empty. Because the SSIM is > 0.9 and the prediction is non-empty for at least one of the neighboring images, the single-model prediction is weighted with 0.5.



4.2. Single model experiment results

All final single-model DSC scores are reported in Table 1. The nnU-Net was trained with external data added to the training set, resulting in a mean CV DSC score of 0.67. Training only on the challenge set or only on the external dataset resulted in worse DSC scores of 0.57 and 0.55, respectively.
The HM-ANet had a mean CV DSC score across all folds of 0.70. During training and inference, predictions of scales [0.5, 1] were combined. Experiments with scales of [0.5, 1, 2] resulted in a worse performance of 0.69, with more false-positives in empty images. Training in three phases as described in subsection 4.1 yielded the best result. Other training strategies, such as training on a combined dataset or pre-training on the external dataset and fine-tuning on the official dataset, resulted in a worse performance. 4-fold cross-validation was used to determine the stopping epochs for all three phases. A final inference model was then trained on the entire dataset.
The three Eff-UNet models were each trained on the combined dataset over four folds, resulting in 12 models. The mean CV DSC scores of the Eff-UNet_480, Eff-UNet_256 and Eff-GRUNet were 0.69, 0.71 and 0.62, respectively. As an alternative experiment, the Eff-UNet_480 was trained with external data for pre-training and challenge data for fine-tuning. This performed worse compared to using the combined dataset, resulting in a mean CV DSC score of 0.65. In order to decrease inference time, two Eff-UNet_480, one Eff-UNet_256 and one Eff-GRUNet were selected for the ensemble, based on validation score and fold representation. The final prediction was determined by majority vote. The mean CV DSC score of the final ensemble was 0.70.

Table 1
Cross-validation scores of all models, including the components of the Eff-UNet ensemble. Underscored values indicate selection for the Eff-UNet ensemble. Bold values indicate components of the final heterogeneous ensemble.

  DSC score       Fold 0   Fold 1   Fold 2   Fold 3   Mean
  nnU-Net          0.65     0.84     0.70     0.50    0.67
  HM-ANet          0.67     0.82     0.69     0.60    0.70
  Eff-UNet_480     0.67     0.80     0.69     0.62    0.69
  Eff-UNet_256     0.68     0.80     0.71     0.65    0.71
  Eff-GRUNet       0.61     0.72     0.58     0.60    0.62
  Eff-UNet Ens     0.67     0.80     0.71     0.60    0.70

4.3. Reweighting and ensembling results

In order to test the reweighting strategy of the HM-ANet, the proportion of small and big polyps was calculated for the validation splits. For folds 0-3, the ratios were 45%, 28%, 37%, and 65%. Among the single models, fold 1 had the most medium polyps and the highest average CV score, while fold 3 had the most non-medium polyps and the lowest average CV score. However, the difference in DSC scores between models is small. Since the ratio was highest for fold 3, an experiment was conducted in which the three single models were validated on only the small and big polyp images of fold 3 (n = 483 out of 738 frames). The resulting DSC scores are 0.63, 0.66 and 0.70. The simple ensemble achieves a score of 0.73 and the ensemble with reweighting of the HM-ANet a score of 0.74. Adding post-processing did not decrease or increase the score for this validation set.
5. Conclusion

Our investigation showed that the HM-ANet was favorable for small and large polyp cases, which our dedicated weighting strategy takes into account during ensembling. Notably, on a dataset with small and big polyps, it achieves a DSC score of 0.74, improving on the best-performing single model, the HM-ANet, by 0.04. The post-processing leverages self-adaptive training as well as temporal and high-resolution information by enabling unique predictions of all three heterogeneous components, resulting in fewer false-negative predictions. The inference time, taken as the sum of the slowest component (nnU-Net) and the ensembling step, is 0.71 fps.


6. Compliance with ethical standards

This work was conducted using public datasets of human subject data made available by [2, 3, 4, 5, 6, 7].


7. Acknowledgments

This project was supported by a Twinning Grant of the German Cancer Research Center (DKFZ) and the Robert Bosch Center for Tumor Diseases (RBCT). Part of this work was funded by Helmholtz Imaging (HI), a platform of the Helmholtz Incubator on Information and Data Science.


References

 [1] F. A. Haggar, R. P. Boushey, Colorectal cancer epidemiology: incidence, mortality, survival, and risk factors, Clinics in Colon and Rectal Surgery 22 (2009) 191–197.
 [2] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
 [3] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
 [4] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
 [5] J. Bernal et al., Towards automatic polyp detection with a polyp appearance model, Pattern Recognition 45 (2012) 3166–3182.
 [6] J. Bernal et al., WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Computerized Medical Imaging and Graphics 43 (2015) 99–111.
 [7] J. Silva, A. Histace, O. Romain, X. Dray, B. Granado, Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer, International Journal of Computer Assisted Radiology and Surgery 9 (2014) 283–293.
 [8] Z. Wang et al., Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (2004) 600–612.
 [9] B. Baheti et al., Eff-UNet: A novel architecture for semantic segmentation in unstructured environment, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 358–359.
[10] F. Isensee et al., nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2021) 203–211.
[11] A. Tao et al., Hierarchical multi-scale attention for semantic segmentation, arXiv preprint arXiv:2005.10821 (2020).