ENDOSCOPIC ARTEFACT DETECTION WITH ENSEMBLE OF DEEP NEURAL NETWORKS AND FALSE POSITIVE ELIMINATION

Gorkem Polat, Deniz Sen, Alperen Inci, Alptekin Temizel
Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
{gorkem.polat, deniz.sen_01, alperen.inci, atemizel}@metu.edu.tr

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Video frames obtained through endoscopic examination can be corrupted by many artefacts. These artefacts adversely affect the diagnosis process and make examination of the underlying tissue difficult for professionals. In addition, detection of these artefacts is essential for further automated analysis of the images and for high-quality frame restoration. In this study, we propose an endoscopic artefact detection framework based on an ensemble of deep neural networks, class-agnostic non-maximum suppression, and false-positive elimination. We have used different ensemble techniques and combined both one-stage and two-stage networks to obtain a heterogeneous solution that exploits the distinctive properties of different approaches. Faster R-CNN and Cascade R-CNN, which are two-stage detectors, and RetinaNet, which is a single-stage detector, have been used as base models. The best results have been obtained using the consensus of their predictions, which were passed through class-agnostic non-maximum suppression and false-positive elimination.

Fig. 1: Flowchart of the proposed method.

Index Terms: Endoscopic artefact detection, Faster R-CNN, Feature pyramid networks, RetinaNet

1. INTRODUCTION

Endoscopic imaging is a widely used clinical procedure to inspect hollow organs and collect tissue samples for further examination. However, video frames captured during endoscopic examination are corrupted by many artefacts due to several factors such as lighting and the shape of the organ. In order to perform a detailed endoscopic procedure, these artefacts need to be detected and localized. This is also an essential process for high-quality frame restoration and for developing computer-assisted endoscopy tools.

There are many challenges in artefact detection in endoscopic images. Analysis of the dataset provided by the EAD2020 Challenge [1, 2] reveals two major problems. Firstly, there is a class imbalance problem: while artefacts such as specularity account for nearly 34% of all detections, the instrument class accounts for only 1.7%, and three classes (specularity, artifact, and bubbles) together account for 82% of all bounding boxes. Secondly, there is a scale imbalance problem: some bounding boxes cover almost the entire frame, while others span only a few pixels. Hence, the parameters of the object detection algorithms should be chosen carefully in the light of these observations to detect both small and large objects. We adopted an approach based on an ensemble of object detectors. Despite being slower, we mainly focused on two-stage networks due to their ability to detect small and very close objects and used Faster R-CNN [3] and Cascade R-CNN [4]. In addition, we used a single-stage detector, RetinaNet [5], as a complementary model in the ensemble.

2. PROPOSED METHOD

The flowchart of the proposed approach is given in Figure 1. We use three base models.
The outputs of these base models are fed independently into a class-agnostic non-maximum suppression algorithm before the results are combined through an ensemble model. Then, false-positive elimination is applied to the output of the ensemble model. In the remainder of this section, we describe these steps in more detail.

2.1. Base Models

We use two two-stage models, Faster R-CNN [3] and Cascade R-CNN [4], and one single-stage model, RetinaNet [5], as base models. Examination of previous studies in this domain reveals that feature pyramid network (FPN) [6] and ResNet [7] architectures achieve promising results [8]. Therefore, these networks have been selected as the basis for our models.

The first model is based on Faster R-CNN and uses an FPN as its backbone. Although FPNs are compute and memory intensive, they are good at extracting features at different scales. Since the dataset consists of objects in a wide variety of sizes, FPNs are an important element of the proposed network. We used a ResNet50 model with FPN as the backbone of this model. Standard convolutional and fully connected heads have been used for box predictions.

The second model is Cascade R-CNN. While it is similar to Faster R-CNN, it is claimed to alleviate the problem of overfitting at training. Cascade R-CNN consists of consecutive detectors which are trained sequentially with increasing intersection-over-union (IoU) thresholds. This architecture is reported to be more selective against close false positives. Again, we used a ResNet50 model with FPN as the backbone.

In addition to these two-stage object detectors, we trained and used RetinaNet as our third model. RetinaNet is a single-stage method and, as such, does not use a region proposal network. It has one backbone network that extracts features and two subnetworks for object classification and bounding box regression. An important difference of this network from other single-stage networks (e.g. YOLO, SSD) is the use of focal loss. Focal loss is an extension of the cross-entropy loss that puts the focus on sparse hard examples: it changes the weight of the loss according to the performance of the model on different examples.

Fig. 2: IoU histogram of each class with the other seven classes. The vertical axis is clipped to provide better visualization.

2.2. Class-Agnostic Non-Maximum Suppression (NMS)

In the original Faster R-CNN architecture, the NMS operation is performed on each class independently. However, these architectures are generally designed for non-medical datasets such as COCO [9] or PASCAL VOC [10], which have high overlap ratios among the bounding boxes of different classes. In endoscopic images, frequent overlaps between different objects are not expected. To validate this assertion, we calculated the IoU values for each class with the other classes. Figure 2 shows the IoU histogram of each class with the other seven classes. As seen in this figure, the EAD Challenge dataset does not exhibit a high number of overlaps between the bounding boxes of different classes. On the other hand, the original object detector predictions result in high IoU between classes. Therefore, we propose a class-agnostic procedure where the model predictions of all classes are passed through the NMS process together. As a consequence of this process, if an artefact is detected by multiple models with high IoU, the detections having the lower confidence scores are eliminated. A threshold of 0.4 IoU has been used for this class-agnostic NMS step.
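As a rough sketch of this step, class-agnostic suppression can be written directly on top of torchvision's class-unaware nms operator; the helper below is illustrative (it is not taken from the authors' code) and assumes each model's predictions are given as box, score, and label tensors.

```python
from torchvision.ops import nms


def class_agnostic_nms(boxes, scores, labels, iou_threshold=0.4):
    """Suppress overlapping detections across ALL classes at once.

    boxes:  (N, 4) float tensor in (x1, y1, x2, y2) format
    scores: (N,)   confidence scores
    labels: (N,)   class indices (ignored during suppression)
    """
    # torchvision.ops.nms is class-unaware by construction: it only looks at
    # box geometry and scores, which is exactly what this step requires.
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep], labels[keep]
```

For comparison, the usual per-class behaviour corresponds to torchvision.ops.batched_nms, which prevents boxes of different classes from suppressing one another; dropping the class labels from the suppression is what makes the step class-agnostic.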
2.3. Ensemble of Models

Two different ensemble methods, affirmative and consensus, have been used [11]. In the affirmative method, the outputs of the different models are merged and the NMS operation is applied to the result; it can be regarded as the union of all bounding boxes. In the consensus method, only the bounding boxes on which the majority of the models agree are kept. This method is similar to the ensemble of models in classification problems.

2.4. False-Positive (FP) Elimination

Although class-agnostic NMS discards bounding boxes that have high IoU with other bounding boxes in the detector output, its IoU threshold (0.4) might still be too high for boxes of the same class. For example, two ground-truth bubble bounding boxes have a very low probability of intersecting strongly; if the model nevertheless predicts two bubble boxes with high IoU, this implies redundancy and one of them should be removed. Therefore, we have examined the IoU histogram of each class individually and determined a class-specific threshold. When there are bounding boxes with IoU values higher than the threshold, the ones having lower confidence scores are removed. Thresholds are determined as 1.5 times the interquartile range (IQR) above the 3rd quartile. The thresholds used for elimination are given in Table 1. This process is applied after the ensemble operation. An example image demonstrating the effect of this step is shown in Figure 3.

Fig. 3: Blue: ground-truth bounding boxes. Red: bounding boxes eliminated after the FP reduction step. Green: remaining predicted boxes after elimination.

Table 1: IoU thresholds for false-positive elimination.

    Class         Threshold     Class         Threshold
    Specularity   0.13          Contrast      0.19
    Saturation    0.21          Bubbles       0.12
    Artifact      0.17          Instrument    0.24
    Blur          0.40          Blood         0.11
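To make the post-processing chain of Sections 2.3 and 2.4 concrete, the sketch below combines a simple majority-vote consensus with per-class suppression using the Table 1 thresholds, building on the class_agnostic_nms helper above. The voting IoU of 0.5, the class-name keys, and the helper names are illustrative assumptions rather than the authors' exact implementation; see [11] for the affirmative and consensus rules themselves.

```python
import torch
from torchvision.ops import box_iou, nms

# Class-specific IoU thresholds from Table 1 (1.5 IQR above the 3rd quartile).
FP_THRESHOLDS = {"specularity": 0.13, "saturation": 0.21, "artifact": 0.17,
                 "blur": 0.40, "contrast": 0.19, "bubbles": 0.12,
                 "instrument": 0.24, "blood": 0.11}


def consensus_ensemble(per_model_preds, vote_iou=0.5):
    """Keep a detection only if a majority of models produced an overlapping
    box (IoU >= vote_iou), regardless of its class.

    per_model_preds: list of (boxes, scores, labels) tuples, one per model,
    each already passed through class-agnostic NMS.
    """
    boxes = torch.cat([p[0] for p in per_model_preds])
    scores = torch.cat([p[1] for p in per_model_preds])
    labels = torch.cat([p[2] for p in per_model_preds])
    votes = torch.zeros(len(boxes))
    for model_boxes, _, _ in per_model_preds:
        if len(model_boxes) > 0:
            # A model "votes" for a merged box if any of its own boxes overlaps it.
            votes += (box_iou(boxes, model_boxes).max(dim=1).values >= vote_iou).float()
    keep = votes >= (len(per_model_preds) // 2 + 1)  # majority, e.g. 2 of 3
    return boxes[keep], scores[keep], labels[keep]


def per_class_fp_elimination(boxes, scores, labels, class_names):
    """Within each class, drop the lower-scoring box of any pair whose IoU
    exceeds that class's threshold (per-class NMS with the Table 1 values)."""
    kept = []
    for cls_id in labels.unique():
        idx = (labels == cls_id).nonzero(as_tuple=True)[0]
        thr = FP_THRESHOLDS[class_names[int(cls_id)]]
        kept.append(idx[nms(boxes[idx], scores[idx], thr)])
    kept = torch.cat(kept)
    return boxes[kept], scores[kept], labels[kept]
```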
3. EXPERIMENTAL DESIGN

We have evaluated the performance of the individual models and of their combination through affirmative and consensus ensemble models. In addition, we have evaluated the effect of adding a false-positive elimination step on the outputs of these models. We have used the EAD Challenge dataset throughout the experiments.

The dataset contains 2200 images, and 1555 of them, corresponding to 70%, have a dimension of 512x512. Therefore, we rescaled all images to that size in order to fix the input size. In order to prevent overfitting, 10% of the overall dataset (~250 images) has been set aside for validation; the rest of the images have been used for training. The training dataset has been expanded by image augmentation techniques: each image has been transformed by horizontal flipping and 90°, 180°, and 270° rotations, resulting in an eight-fold increase in the training dataset size. We have observed that augmenting the dataset results in better generalization.

For the best performance, anchor box sizes should match the object bounding boxes. For this purpose, we calculated the statistics of the ground-truth object bounding boxes. Figure 4 shows the histogram of the bounding box sizes. According to this figure, most of the bounding boxes are located in the region where they are smaller than the median area (256x512); therefore, 12, 25, and 80-pixel sizes for both width and height have been used for the smaller boxes. For the mid-size and larger bounding boxes, 256 and 384 pixels have been chosen, respectively. Each anchor box size in [12, 25, 80, 256, 384] was mapped to the corresponding feature map layer in [P2, P3, P4, P5, P6], where Pn is the nth feature map layer. Three different aspect ratios (width/height) of 0.5, 1, and 2 were used for each anchor box.

Fig. 4: Histogram of normalized bounding box sizes, where 1.0 corresponds to an area of 512x512.

The total number of iterations was 200000 for Faster R-CNN and Cascade R-CNN and 90000 for RetinaNet. Learning rate scheduling by a factor of 10 was used for all three models. Scheduling has been done at iterations 130000 and 180000 for Faster R-CNN, at iterations 150000 and 190000 for Cascade R-CNN, and at iterations 60000 and 80000 for RetinaNet.

We used PyTorch [12] and the Detectron2 API [13] to train the models on a workstation with two NVIDIA RTX2080 GPUs. The Faster R-CNN and Cascade R-CNN models took 15 hours to train, and the RetinaNet model took 11 hours, each using a single GPU; the second GPU was used to train different models in parallel. For all three models, weights of models pretrained on the COCO dataset have been used.
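For illustration, the anchor and schedule choices above map onto Detectron2 configuration fields roughly as in the sketch below, shown for the Faster R-CNN model. The base config file, the number of classes, and all other hyperparameters left at their defaults are assumptions for this sketch rather than the authors' published settings; only the anchor sizes, aspect ratios, iteration count, and step points come from the text.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Faster R-CNN with a ResNet50-FPN backbone, initialised from COCO-pretrained weights.
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 8  # the eight EAD artefact classes

# One anchor size per FPN level [P2, P3, P4, P5, P6], taken from the
# bounding-box size statistics, with three aspect ratios at every level.
cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[12], [25], [80], [256], [384]]
cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.5, 1.0, 2.0]]

# 200000 iterations, dividing the learning rate by 10 at 130000 and 180000.
cfg.SOLVER.MAX_ITER = 200000
cfg.SOLVER.STEPS = (130000, 180000)
cfg.SOLVER.GAMMA = 0.1
```

The Cascade R-CNN and RetinaNet models would use their corresponding base configs with the MAX_ITER and STEPS values given above for those detectors.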
The results are given in Table 2. In addition to the results for the individual network types, the ensemble models and their versions with class-agnostic NMS and FP elimination steps are also provided. The ensemble methods utilize all three networks.

Table 2: Experimental results.

                                                 Without Class-Agnostic NMS    With Class-Agnostic NMS
    Method                                         mAP       mIoU                mAP       mIoU
    Faster R-CNN with FPN                         45.66     40.78               44.20     42.82
    Cascade R-CNN with FPN                        45.98     32.23               44.07     35.03
    RetinaNet                                     45.09     36.44               43.91     41.22
    Ensemble (affirmative)                        47.91     26.03               47.12     30.28
    Ensemble (consensus)                          47.29     42.89               45.96     45.19
    Ensemble (affirmative) with FP elimination    46.92     32.21               46.54     34.25
    Ensemble (consensus) with FP elimination      46.86     44.65               45.71     45.91

4. EXPERIMENTAL RESULTS & DISCUSSION

According to the results in Table 2, while the individual networks have very similar mAP values, the Faster R-CNN model has a higher mIoU. The affirmative ensemble gives the highest mAP score, which is expected as some true positives missed by one model can be detected by the other models. On the other hand, it generates a higher number of false positives, which adversely affects its mIoU score. The consensus ensemble has the highest mIoU value among the methods not utilizing FP elimination. Although the class-agnostic NMS and FP reduction steps decrease the mAP values marginally, they eliminate many false positives and give higher mIoU scores, resulting in more balanced mAP and mIoU scores. For example, when FP elimination is applied to the ensemble (affirmative) result, in return for a 0.99-point decrease in mAP, there is a 6.18-point increase in mIoU. Increasing mIoU through such an elimination mechanism adversely affects mAP because, in some cases, the object detectors do not perform well: models may detect artefacts incorrectly, and boxes that have the true class but low confidence scores are suppressed by wrongly detected high-confidence boxes. It is observed that FP elimination works better when the mIoU is lower. Since there is a trade-off between mAP and mIoU, these steps can be utilized to obtain more robust object detectors. Different score metrics are used for different object detection tasks; in this work, we have used post-processing techniques to obtain balanced mAP and mIoU scores.

The highest scores are obtained using the consensus ensemble of the detectors, whose predictions were passed through class-agnostic NMS, with FP reduction as the final step.

Object detectors are generic and are not developed considering domain-specific challenges. In addition, these networks have many internal parameters that need to be tuned for the particular application. Hence, it is not sufficient to use more advanced models; a comprehensive understanding of the characteristics of the data is of the essence.

To integrate domain knowledge into the detection architecture, we have qualitatively observed that some classes such as specularity and saturation have bounding boxes overlapping with each other. While removing the one that has lower confidence seems to be a solution, this is not ideal since, in a number of cases, the one with lower confidence is the true class. Therefore, specific algorithms should be included in the detection framework to tackle this problem.

5. CONCLUSIONS

In this study, we have trained three different object detectors for endoscopic artefact detection. We have used ensemble techniques to utilize all three individual networks. Applying class-agnostic NMS to each of them independently resulted in a better trade-off between mAP and mIoU scores. As a final step, FP elimination is applied, which gave more robust results.

In this work, we have focused on using lighter networks and taken an ensemble-of-weak-classifiers approach. The use of lighter networks made hyperparameter tuning possible in feasible time periods and allowed us to experiment with various network parameters. In the future, more sophisticated networks, such as ResNeXt or ResNet152, which require more time to train and tune, could also be investigated.

6. ACKNOWLEDGEMENTS

We would like to thank MTA TI Tower AG for the donation of the workstation and GPUs used in this work.

7. REFERENCES

[1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.

[5] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[6] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[10] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[11] Jonathan Heras and Angela Casado-Garcia. Ensemble methods for object detection. In European Conference on Artificial Intelligence, 2020.

[12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.

[13] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.