ENDOSCOPIC ARTEFACT DETECTION WITH ENSEMBLE OF DEEP NEURAL NETWORKS AND FALSE POSITIVE ELIMINATION

Gorkem Polat, Deniz Sen, Alperen Inci, Alptekin Temizel
Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
{gorkem.polat, deniz.sen_01, alperen.inci, atemizel}@metu.edu.tr

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Video frames obtained through endoscopic examination can be corrupted by many artefacts. These artefacts adversely affect the diagnosis process and make examination of the underlying tissue difficult for professionals. In addition, detection of these artefacts is essential for further automated analysis of the images and for high-quality frame restoration. In this study, we propose an endoscopic artefact detection framework based on an ensemble of deep neural networks, class-agnostic non-maximum suppression, and false-positive elimination. We have used different ensemble techniques and combined both one-stage and two-stage networks to obtain a heterogeneous solution that exploits the distinctive properties of different approaches. Faster R-CNN and Cascade R-CNN, which are two-stage detectors, and RetinaNet, which is a single-stage detector, have been used as base models. The best results have been obtained using the consensus of their predictions, which were passed through class-agnostic non-maximum suppression and false-positive elimination.

Fig. 1: Flowchart of the proposed method.

Index Terms: Endoscopic artefact detection, Faster R-CNN, Feature pyramid networks, RetinaNet

1. INTRODUCTION

Endoscopic imaging is a widely used clinical procedure to inspect hollow organs and collect tissue samples for further examination. However, video frames captured during endoscopic examination are corrupted by many artefacts due to several factors such as lighting and the shape of the organ. In order to perform a detailed endoscopic procedure, these artefacts need to be detected and localized. This is also an essential process for high-quality frame restoration and for developing computer-assisted endoscopy tools.

There are many challenges in artefact detection in endoscopic images. Analysis of the dataset provided by the EAD2020 Challenge [1, 2] reveals two major problems. Firstly, there is a class imbalance problem: while artefacts such as specularity account for nearly 34% of all detections, the instrument class accounts for only 1.7%, and three classes (specularity, artifact, and bubbles) together account for 82% of all bounding boxes. Secondly, there is a scale imbalance problem: some bounding boxes cover almost the entire frame, while others span only a few pixels. Hence, the parameters of the object detection algorithms should be chosen carefully in the light of these observations to detect both small and large objects. We adopted an approach based on an ensemble of object detectors. Despite being slower, we mainly focused on two-stage networks due to their ability to detect small and very close objects and used Faster R-CNN [3] and Cascade R-CNN [4]. In addition, we used a single-stage detector, RetinaNet [5], as a complementary model in the ensemble.

2. PROPOSED METHOD

The flowchart of the proposed approach is given in Figure 1. We use three base models.
The outputs of these base models are fed independently into a class-agnostic non-maximum suppression algorithm before the results are combined through an ensemble model. Then, false-positive elimination is applied to the output of the ensemble model. In the remainder of this section, we describe these steps in more detail.

2.1. Base Models

We use two two-stage models, Faster R-CNN [3] and Cascade R-CNN [4], and one single-stage model, RetinaNet [5], as base models. Examination of previous studies in this domain reveals that feature pyramid network (FPN) [6] and ResNet [7] architectures achieve promising results [8]. Therefore, these networks have been selected as the basis for our models.

The first model is based on Faster R-CNN and uses an FPN as its backbone. Although FPNs are compute and memory intensive, they are good at extracting features at different scales. Since the dataset consists of objects in a wide variety of sizes, FPNs are an important element of the proposed network. We used a ResNet50 model with FPN as the backbone of this model. Standard convolutional and fully connected heads have been used for box predictions.

The second model is Cascade R-CNN. While it is similar to Faster R-CNN, it is claimed to alleviate the problem of overfitting at training. Cascade R-CNN consists of consecutive detectors which are trained sequentially with increasing intersection-over-union (IoU) thresholds. This architecture is reported to be more selective against close false positives. Again, we used a ResNet50 model with FPN as the backbone.

In addition to these two-stage object detectors, we trained and used RetinaNet as our third model. RetinaNet is a single-stage method and, as such, does not use a region proposal network. It has one backbone network that extracts features and two subnetworks for object classification and bounding box regression. An important difference of this network from other single-stage networks (e.g. YOLO, SSD) is the use of focal loss. Focal loss is an extension of the cross-entropy loss that puts the focus on sparse hard examples: it changes the weight of the loss according to the performance of the model on different examples.

Fig. 2: IoU histogram of each class with the other seven classes. The vertical axis is clipped to provide better visualization.

2.2. Class-Agnostic Non-Maximum Suppression (NMS)

In the original Faster R-CNN architecture, the NMS operation is performed on each class independently. However, these architectures are generally designed for non-medical datasets such as COCO [9] or PASCAL VOC [10], which have high overlap ratios among the bounding boxes of different classes. In endoscopic images, frequent overlaps between different objects are not expected. To validate this assertion, we calculated the IoU values for each class with the other classes. Figure 2 shows the IoU histogram of each class with the other seven classes. As seen in this figure, the EAD Challenge dataset does not exhibit a high number of overlaps between the bounding boxes of different classes. On the other hand, the original object detector predictions result in high IoU between classes. Therefore, we propose a class-agnostic procedure where the model predictions of all classes are passed through the NMS process together. As a consequence of this process, if an artefact is detected by multiple models with high IoU, the detections having the lower confidence scores are eliminated. A threshold of 0.4 IoU has been used for this class-agnostic NMS step.
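As a rough sketch of this step, class-agnostic suppression can be written directly on top of torchvision's class-unaware nms operator; the helper below is illustrative (it is not taken from the authors' code) and assumes each model's predictions are given as box, score, and label tensors.

```python
from torchvision.ops import nms


def class_agnostic_nms(boxes, scores, labels, iou_threshold=0.4):
    """Suppress overlapping detections across ALL classes at once.

    boxes:  (N, 4) float tensor in (x1, y1, x2, y2) format
    scores: (N,)   confidence scores
    labels: (N,)   class indices (ignored during suppression)
    """
    # torchvision.ops.nms is class-unaware by construction: it only looks at
    # box geometry and scores, which is exactly what this step requires.
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep], labels[keep]
```

For comparison, the usual per-class behaviour corresponds to torchvision.ops.batched_nms, which prevents boxes of different classes from suppressing one another; dropping the class labels from the suppression is what makes the step class-agnostic.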
2.3. Ensemble of Models

Two different ensemble methods, affirmative and consensus, have been used [11]. In the affirmative method, the outputs of the different models are merged and the NMS operation is applied to the result; it can be regarded as the union of all bounding boxes. In the consensus method, only the bounding boxes on which the majority of the models agree are kept. This method is similar to the ensemble of models in classification problems.

2.4. False-Positive (FP) Elimination

Although class-agnostic NMS discards bounding boxes that have high IoU with other bounding boxes in the detector output, its IoU threshold (0.4) might still be too high for boxes of the same class. For example, two ground-truth bubble bounding boxes have a very low probability of intersecting strongly; if the model nevertheless predicts two bubble boxes with high IoU, this implies redundancy and one of them should be removed. Therefore, we have examined the IoU histogram of each class individually and determined a class-specific threshold. When there are bounding boxes with IoU values higher than the threshold, the ones having lower confidence scores are removed. Thresholds are determined as 1.5 times the interquartile range (IQR) above the 3rd quartile. The thresholds used for elimination are given in Table 1. This process is applied after the ensemble operation. An example image demonstrating the effect of this step is shown in Figure 3.

Fig. 3: Blue: ground-truth bounding boxes. Red: bounding boxes eliminated after the FP reduction step. Green: remaining predicted boxes after elimination.

Table 1: IoU thresholds for false-positive elimination.

    Class         Threshold     Class         Threshold
    Specularity   0.13          Contrast      0.19
    Saturation    0.21          Bubbles       0.12
    Artifact      0.17          Instrument    0.24
    Blur          0.40          Blood         0.11
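To make the post-processing chain of Sections 2.3 and 2.4 concrete, the sketch below combines a simple majority-vote consensus with per-class suppression using the Table 1 thresholds, building on the class_agnostic_nms helper above. The voting IoU of 0.5, the class-name keys, and the helper names are illustrative assumptions rather than the authors' exact implementation; see [11] for the affirmative and consensus rules themselves.

```python
import torch
from torchvision.ops import box_iou, nms

# Class-specific IoU thresholds from Table 1 (1.5 IQR above the 3rd quartile).
FP_THRESHOLDS = {"specularity": 0.13, "saturation": 0.21, "artifact": 0.17,
                 "blur": 0.40, "contrast": 0.19, "bubbles": 0.12,
                 "instrument": 0.24, "blood": 0.11}


def consensus_ensemble(per_model_preds, vote_iou=0.5):
    """Keep a detection only if a majority of models produced an overlapping
    box (IoU >= vote_iou), regardless of its class.

    per_model_preds: list of (boxes, scores, labels) tuples, one per model,
    each already passed through class-agnostic NMS.
    """
    boxes = torch.cat([p[0] for p in per_model_preds])
    scores = torch.cat([p[1] for p in per_model_preds])
    labels = torch.cat([p[2] for p in per_model_preds])
    votes = torch.zeros(len(boxes))
    for model_boxes, _, _ in per_model_preds:
        if len(model_boxes) > 0:
            # A model "votes" for a merged box if any of its own boxes overlaps it.
            votes += (box_iou(boxes, model_boxes).max(dim=1).values >= vote_iou).float()
    keep = votes >= (len(per_model_preds) // 2 + 1)  # majority, e.g. 2 of 3
    return boxes[keep], scores[keep], labels[keep]


def per_class_fp_elimination(boxes, scores, labels, class_names):
    """Within each class, drop the lower-scoring box of any pair whose IoU
    exceeds that class's threshold (per-class NMS with the Table 1 values)."""
    kept = []
    for cls_id in labels.unique():
        idx = (labels == cls_id).nonzero(as_tuple=True)[0]
        thr = FP_THRESHOLDS[class_names[int(cls_id)]]
        kept.append(idx[nms(boxes[idx], scores[idx], thr)])
    kept = torch.cat(kept)
    return boxes[kept], scores[kept], labels[kept]
```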
3. EXPERIMENTAL DESIGN

We have evaluated the performance of the individual models and of their combination through affirmative and consensus ensemble models. In addition, we have evaluated the effect of adding a false-positive elimination step on the outputs of these models. We have used the EAD Challenge dataset throughout the experiments.

The dataset contains 2200 images, and 1555 of them, corresponding to 70%, have a dimension of 512x512. Therefore, we rescaled all images to that size in order to fix the input size. In order to prevent overfitting, 10% of the overall dataset (~250 images) has been set aside for validation; the rest of the images have been used for training. The training dataset has been expanded by image augmentation techniques: each image has been transformed by horizontal flipping and 90°, 180°, and 270° rotations, resulting in an eight-fold increase in the training dataset size. We have observed that augmenting the dataset results in better generalization.

For the best performance, anchor box sizes should match the object bounding boxes. For this purpose, we calculated the statistics of the ground-truth object bounding boxes. Figure 4 shows the histogram of the bounding box sizes. According to this figure, most of the bounding boxes are located in the region where they are smaller than the median area (256x512); therefore, 12, 25, and 80-pixel sizes for both width and height have been used for the smaller boxes. For the mid-size and larger bounding boxes, 256 and 384 pixels have been chosen, respectively. Each anchor box size in [12, 25, 80, 256, 384] was mapped to the corresponding feature map layer in [P2, P3, P4, P5, P6], where Pn is the nth feature map layer. Three different aspect ratios (width/height) of 0.5, 1, and 2 were used for each anchor box.

Fig. 4: Histogram of normalized bounding box sizes, where 1.0 corresponds to an area of 512x512.

The total number of iterations was 200000 for Faster R-CNN and Cascade R-CNN and 90000 for RetinaNet. Learning rate scheduling by a factor of 10 was used for all three models. Scheduling has been done at iterations 130000 and 180000 for Faster R-CNN, at iterations 150000 and 190000 for Cascade R-CNN, and at iterations 60000 and 80000 for RetinaNet.

We used PyTorch [12] and the Detectron2 API [13] to train the models on a workstation with two NVIDIA RTX2080 GPUs. The Faster R-CNN and Cascade R-CNN models took 15 hours to train, and the RetinaNet model took 11 hours, each using a single GPU; the second GPU was used to train different models in parallel. For all three models, weights of models pretrained on the COCO dataset have been used.
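For illustration, the anchor and schedule choices above map onto Detectron2 configuration fields roughly as in the sketch below, shown for the Faster R-CNN model. The base config file, the number of classes, and all other hyperparameters left at their defaults are assumptions for this sketch rather than the authors' published settings; only the anchor sizes, aspect ratios, iteration count, and step points come from the text.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Faster R-CNN with a ResNet50-FPN backbone, initialised from COCO-pretrained weights.
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 8  # the eight EAD artefact classes

# One anchor size per FPN level [P2, P3, P4, P5, P6], taken from the
# bounding-box size statistics, with three aspect ratios at every level.
cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[12], [25], [80], [256], [384]]
cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.5, 1.0, 2.0]]

# 200000 iterations, dividing the learning rate by 10 at 130000 and 180000.
cfg.SOLVER.MAX_ITER = 200000
cfg.SOLVER.STEPS = (130000, 180000)
cfg.SOLVER.GAMMA = 0.1
```

The Cascade R-CNN and RetinaNet models would use their corresponding base configs with the MAX_ITER and STEPS values given above for those detectors.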
The results are given in Table 2. In addition to the results for the individual network types, the ensemble models and their versions with class-agnostic NMS and FP elimination steps are also provided. The ensemble methods utilize all three networks.

Table 2: Experimental results.

                                                 Without Class-Agnostic NMS    With Class-Agnostic NMS
    Method                                         mAP       mIoU                mAP       mIoU
    Faster R-CNN with FPN                         45.66     40.78               44.20     42.82
    Cascade R-CNN with FPN                        45.98     32.23               44.07     35.03
    RetinaNet                                     45.09     36.44               43.91     41.22
    Ensemble (affirmative)                        47.91     26.03               47.12     30.28
    Ensemble (consensus)                          47.29     42.89               45.96     45.19
    Ensemble (affirmative) with FP elimination    46.92     32.21               46.54     34.25
    Ensemble (consensus) with FP elimination      46.86     44.65               45.71     45.91

4. EXPERIMENTAL RESULTS & DISCUSSION

According to the results in Table 2, while the individual networks have very similar mAP values, the Faster R-CNN model has a higher mIoU. The affirmative ensemble gives the highest mAP score, which is expected as some true positives missed by one model can be detected by the other models. On the other hand, it generates a higher number of false positives, which adversely affects its mIoU score. The consensus ensemble has the highest mIoU value among the methods not utilizing FP elimination. Although the class-agnostic NMS and FP reduction steps decrease the mAP values marginally, they eliminate many false positives and give higher mIoU scores, resulting in more balanced mAP and mIoU scores. For example, when FP elimination is applied to the ensemble (affirmative) result, in return for a 0.99-point decrease in mAP, there is a 6.18-point increase in mIoU. Increasing mIoU through such an elimination mechanism adversely affects mAP because, in some cases, the object detectors do not perform well: models may detect artefacts incorrectly, and boxes that have the true class but low confidence scores are suppressed by wrongly detected high-confidence boxes. It is observed that FP elimination works better when the mIoU is lower. Since there is a trade-off between mAP and mIoU, these steps can be utilized to obtain more robust object detectors. Different score metrics are used for different object detection tasks; in this work, we have used post-processing techniques to obtain balanced mAP and mIoU scores.

The highest scores are obtained using the consensus ensemble of the detectors, whose predictions were passed through class-agnostic NMS, with FP reduction as the final step.

Object detectors are generic and are not developed considering domain-specific challenges. In addition, these networks have many internal parameters that need to be tuned for the particular application. Hence, it is not sufficient to use more advanced models; a comprehensive understanding of the characteristics of the data is of the essence.

To integrate domain knowledge into the detection architecture, we have qualitatively observed that some classes such as specularity and saturation have bounding boxes overlapping with each other. While removing the one that has lower confidence seems to be a solution, this is not ideal since, in a number of cases, the one with lower confidence is the true class. Therefore, specific algorithms should be included in the detection framework to tackle this problem.

5. CONCLUSIONS

In this study, we have trained three different object detectors for endoscopic artefact detection. We have used ensemble techniques to utilize all three individual networks. Applying class-agnostic NMS to each of them independently resulted in a better trade-off between mAP and mIoU scores. As a final step, FP elimination is applied, which gave more robust results.

In this work, we have focused on using lighter networks and taken an ensemble-of-weak-classifiers approach. The use of lighter networks made hyperparameter tuning possible in feasible time periods and allowed us to experiment with various network parameters. In the future, more sophisticated networks, such as ResNeXt or ResNet152, which require more time to train and tune, could also be investigated.

6. ACKNOWLEDGEMENTS

We would like to thank MTA TI Tower AG for the donation of the workstation and GPUs used in this work.

7. REFERENCES

[1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.

[5] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[6] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[10] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[11] Jonathan Heras and Angela Casado-Garcia. Ensemble methods for object detection. In European Conference on Artificial Intelligence, 2020.

[12] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.

[13] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.