EXPLORING DEEP LEARNING BASED APPROACHES FOR ENDOSCOPIC ARTEFACT DETECTION AND SEGMENTATION

Anand Subramanian, Koushik Srivatsan
Claritrics India Pvt Ltd, Chennai

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

The Endoscopic Artefact Detection challenge comprises tasks for the detection and segmentation of artefacts found in endoscopic imaging, with a specific task for evaluating the generalization capacity of detection algorithms on external data. For the detection of artefacts, we train RetinaNet and Faster-RCNN models. To segment artefacts from the endoscopic images, we train a Deeplab v3 model and a U-Net model, and also implement post-processing techniques such as the use of an EAST text detector to capture text artefacts and a pixel-wise voting ensemble applied after test-time augmentation. We observe that the RetinaNet model with a ResNet-101 feature extractor is the best performing model across all object detection tasks, while the U-Net performs best in the segmentation tasks. We also implement a model-agnostic object tracking pipeline utilizing image correlation-based trackers to reduce the inference time of object detection models. We believe that this pipeline can enable real-time analysis of endoscopic images in systems with processing constraints.

1. INTRODUCTION

Endoscopy is an important clinical procedure that finds application in the diagnosis of medical ailments. However, the video footage captured through endoscopes may be riddled with artefacts due to variations in contrast, blur, and other distortions. Therefore, there is a requirement for algorithms that can detect, localize, and segment these artefacts. Detecting and localizing these artefacts would be of immense help in applying image restoration techniques to correct them, with the downstream benefits of reducing the impact of instrument errors on medical diagnoses and ultimately improving endoscopic imaging. The motivation of the Endoscopic Artefact Detection and Segmentation challenge is to encourage research in this direction. Our work is an attempt towards tackling these problems by implementing algorithms specific to these tasks. Our main contributions in this work are as follows:

• A detailed analysis of the performance of RetinaNet with two ResNet [1] based feature extractors (ResNet-50 and ResNet-101) for the detection tasks, and the impact of test-time augmentation.

• A detailed analysis of the performance of the Deeplab v3 model [2] and the U-Net model [3] for the segmentation task, along with the post-processing techniques applied.

• An implementation of a model-agnostic tracking pipeline, using existing image correlation trackers, for real-time inference of object detection models.

2. DATASETS

The artefact detection, sequence detection, and generalization tasks include specularity, bubbles, artefact, saturation, contrast, blood, blur, and instrument as the classes to be detected. The training set for this challenge [4, 5, 6] was released in phases, initially as a set of 2200 images with bounding box annotations. Sequence data, taken from videos, was provided as a total of 232 images in the next phase, with a set of 99 images provided in the final phase. The artefact segmentation dataset included instrument, specularity, artefact, bubbles, and saturation as artefact classes. The training data for this task comprised 474 images and their corresponding class-wise segmentation masks, with the other data releases being the same data as released for detection, with masks for only these five classes.

3. METHODS

3.1. Object detection

In this work, we train two RetinaNet models with ResNet-50 and ResNet-101 feature extractors and a Faster-RCNN [7] model with an Inception v2 [8] feature extractor for the detection task. We use a Keras implementation of RetinaNet [9] and the Tensorflow implementation of Faster-RCNN [10] for training the models. We apply on-the-fly augmentation techniques such as rotation, shear, random image flips, and contrast, brightness, saturation, and hue variations randomly during training, to improve the generalization of the object detection algorithm. When augmentation is applied on the fly during training, the model is exposed not only to the raw images but also to transformed versions of the data across iterations. Test-time augmentation builds on this by providing augmented test images to the model, on the assumption that the model will output better predictions because it has learned features from images that underwent the same transformations during training. We implement this technique at inference using an ensemble framework [11], in which the final output bounding boxes are ensembled based on their Intersection-over-Union values with a majority criterion. The augmentations applied to the image at inference are horizontal flipping and sharpening. Both RetinaNet models predict outputs for the augmented image and for the image without augmentation, and the outputs from each model are then ensembled to obtain the final prediction.
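The box-level ensembling itself is handled by the framework in [11]; the code below is only a minimal sketch of the underlying idea for a single model and a single horizontal-flip augmentation, where the majority criterion reduces to keeping boxes corroborated by both views. The `detect` function is a hypothetical stand-in for any detector that returns `(box, score, class)` triples with boxes in `[x1, y1, x2, y2]` pixel coordinates; it is not part of our released code.

```python
import cv2
import numpy as np


def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def tta_detect(image, detect, iou_thr=0.5):
    """Detect on the original image and on a horizontally flipped copy,
    map the flipped boxes back to original coordinates, and keep only
    boxes corroborated by both runs, averaging matched coordinates."""
    w = image.shape[1]
    original = detect(image)
    flipped = detect(cv2.flip(image, 1))
    # Mirror the boxes predicted on the flipped image back.
    flipped = [([w - x2, y1, w - x1, y2], s, c)
               for (x1, y1, x2, y2), s, c in flipped]

    fused = []
    for box, score, cls in original:
        matches = [list(b) for b, _, c in flipped
                   if c == cls and iou(box, b) >= iou_thr]
        if matches:
            merged = np.mean([list(box)] + matches, axis=0).tolist()
            fused.append((merged, score, cls))
    return fused
```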
3.2. Sequence Detection

The other sub-task in this challenge is the detection of artefacts in image sequences. We observe that sequence detection is primarily a video object detection problem. To this end, we implement an object tracking pipeline using the Dlib correlation tracker [12, 13] and the Discriminative Correlation Filter with Channel and Spatial Reliability (CSRT) tracker [14] in tandem with an object detection model. We design this as a baseline model-agnostic pipeline that works with any algorithm or model that outputs frame-wise bounding boxes, scores, and classes. A flowchart of the pipeline is shown in Fig. 1.

One problem with applying object detection directly to videos is model latency, since a detection has to be made for every frame. This overhead may be unacceptable for deep models with high inference times, especially in systems with processing constraints. Other issues involve identifying and associating the same object across multiple frames. This contribution is intended to explore building systems that can function even in resource-constrained setups by reducing model latency, so that model predictions are required only once every N frames instead of for every single frame. We design the object tracking pipeline such that the bounding boxes from one frame are tracked across the following frames, with minimal drops in accuracy whilst remaining robust to movement artefacts, and with a reduction in overall processing time, as the model does not have to predict over all frames. We control the rate at which the model refreshes the trackers with a parameter called the window size (W), the number of frames after which the model re-initializes the trackers. This refresh is applied for both correlation trackers.

Fig. 1: Flowchart of our tracker pipeline.
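A minimal sketch of this detect-then-track loop is given below for the Dlib correlation tracker; the CSRT variant follows the same structure with the OpenCV tracker swapped in. The `detect` function is again a hypothetical stand-in for any frame-wise detector, and frames are assumed to be RGB arrays; this is an illustration of the scheme, not our exact pipeline code.

```python
import dlib


def track_sequence(frames, detect, window_size=5):
    """Run the detector once every `window_size` frames and propagate its
    boxes with Dlib correlation trackers on the frames in between."""
    results, trackers = [], []
    for i, frame in enumerate(frames):
        if i % window_size == 0:
            # Refresh step: re-initialize one tracker per detected box.
            detections = detect(frame)
            trackers = []
            for (x1, y1, x2, y2), score, cls in detections:
                tracker = dlib.correlation_tracker()
                tracker.start_track(
                    frame, dlib.rectangle(int(x1), int(y1), int(x2), int(y2)))
                trackers.append((tracker, score, cls))
            results.append(detections)
        else:
            # Intermediate frame: update the trackers, no model call.
            boxes = []
            for tracker, score, cls in trackers:
                tracker.update(frame)
                pos = tracker.get_position()
                boxes.append(((pos.left(), pos.top(),
                               pos.right(), pos.bottom()), score, cls))
            results.append(boxes)
    return results
```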
3.3. Semantic Segmentation

For artefact segmentation, we train a Deeplab v3 model with an Xception backbone [15], as well as a U-Net model with a ResNet-50 backbone pre-trained on ImageNet [16]. We treat this as a multi-label segmentation problem because the class masks can overlap. We use Keras implementations of Deeplab [17] and U-Net [18] for this task. On analysis of the images, we find text segments in images labelled as artefacts, and use a pre-trained EAST [19] text detector to capture these regions; this is done after observing that the segmentation models were unable to capture the text regions accurately.

For image augmentation, we use scale variation, random flips, rotation, the addition of noise, cropping, blurring, hue, saturation, and contrast changes, and sharpening. These augmentation techniques are applied randomly on the fly during training, with only one transformation applied per batch of images. We perform test-time augmentation for the Deeplab and U-Net models, ensemble their outputs using pixel-wise (majority) voting, and finally add the text masks captured by the EAST detector to the result. We implement a custom pipeline that ensembles the two segmentation models using test-time augmentation and integrates the EAST text detector. We use the imgaug library [20] to build both the train-time and the test-time augmentation pipelines.
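The sketch below illustrates the pixel-wise voting and text-mask merging steps, assuming each model/augmentation run has already been resolved to a binary (H, W) mask per class and that the EAST detections have been rasterized into a binary text mask of the same shape. The function and variable names are illustrative, not taken from our pipeline.

```python
import numpy as np


def pixelwise_majority(masks, min_votes=None):
    """Pixel-wise majority vote over a list of binary (H, W) masks for one
    class, e.g. Deeplab and U-Net predictions on the original image and on
    its test-time-augmented (and mapped-back) versions."""
    stack = np.stack(masks, axis=0)
    if min_votes is None:
        min_votes = stack.shape[0] // 2 + 1  # strict majority
    return (stack.sum(axis=0) >= min_votes).astype(np.uint8)


def merge_text_mask(class_masks, text_mask, artefact_class="artefact"):
    """Fuse per-class mask lists by majority vote, then OR the rasterized
    EAST text-detector mask into the artefact class."""
    fused = {cls: pixelwise_majority(m) for cls, m in class_masks.items()}
    fused[artefact_class] = np.maximum(fused[artefact_class],
                                       text_mask.astype(np.uint8))
    return fused
```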
4. RESULTS

In the tabulation of results, the RetinaNet with a ResNet-50 feature extractor is denoted RetinaNet (R-50), and the RetinaNet with a ResNet-101 feature extractor is denoted RetinaNet (R-101). The use of test-time augmentation is abbreviated as TTA.

4.1. Results on first phase of test data

Method                    mAP       IoU
RetinaNet (R-50)          0.15754   0.24115
RetinaNet (R-50) w TTA    0.13515   0.29334
Faster-RCNN               0.06484   0.12803

Table 1: Object detection results on partially released detection data

From the results on the partial test data for object detection provided in Table 1, we find that test-time augmentation increases the overall IoU of the RetinaNet model by roughly 5 points (0.241 to 0.293) while reducing the overall mAP by roughly 2 points (0.158 to 0.135) compared to the model without this method. The Faster-RCNN performs poorly in comparison with the RetinaNet model on both metrics.

Method                            mAP       IoU       FPS
RetinaNet (R-50)                  0.20061   0.25118   0.96667
RetinaNet (R-50) w Dlib tracker   0.15386   Nil       3.68767
RetinaNet (R-50) w CSRT tracker   0.14286   Nil       2.30704

Table 2: Sequence detection results on partially released data

Table 2 contains the results of the RetinaNet (R-50) model and of the trackers used in tandem with it for the sequence detection task on the partial data. The methods are also benchmarked in terms of frames per second.

Method              mAP       deviation
RetinaNet (R-50)    0.20713   0.12230
Faster-RCNN         0.10823   0.09380

Table 3: Object detection results on partially released generalization data

Table 3 contains the results of the models on the partially released generalization data. We observe that the RetinaNet performs better than the Faster-RCNN on both metrics. This could in part be due to the better generalization of the RetinaNet as a result of the augmentation applied during training, whereas no augmentation was applied while training the Faster-RCNN model.

Method                        Segmentation score   DSC score
Deeplab+Unet+EAST Ensemble    0.49666              0.42946
Deeplab                       0.45742              0.39998

Table 4: Artefact segmentation results on partially released segmentation data

Table 4 contains the results on the partially released segmentation test data. We find that the ensemble of the Deeplab, the U-Net, and the EAST detector achieves a significantly higher segmentation score and DSC score than the Deeplab model in isolation. The U-Net is not benchmarked separately here.

4.2. Results on full test data

Object detection method             Sequence detection method          mAP      deviation
RetinaNet (R-101) w/o TTA           RetinaNet (R-101) w/o TTA          0.2151   0.0762
RetinaNet (R-101 and R-50) w TTA    RetinaNet (R-101) + CSRT tracker   0.1537   0.0419
RetinaNet (R-101 and R-50) w TTA    RetinaNet (R-101) + Dlib tracker   0.1502   0.0419

Table 5: Object detection results on the detection and sequence data

Table 5 contains the results of the models on the complete detection and sequence data. For the final test data, only the combined mAP scores of the detection and sequence tasks were provided, without the other metrics, so there is some ambiguity in analyzing the exact impact of the individual techniques. In one submission, we apply the RetinaNet with a ResNet-101 feature extractor to both tasks. In the other two submissions, we apply the ensemble of RetinaNet models for the detection task and the object tracking solutions for the sequence task. We find that the RetinaNet model with a ResNet-101 feature extractor achieves a higher mAP score than the ensemble of the RetinaNet models with ResNet-50 and ResNet-101 feature extractors using test-time augmentation. Qualitative detections of the RetinaNet (R-101) on the final test data are shown in Fig. 2.

Fig. 2: Object detection results of RetinaNet (R-101) on the final test data. Box colors denote the classes Artefact, Bubbles, Saturation, Instruments, Contrast, Blur, Specularity, and Blood.

We observe that the complete sequence detection test data was released as a single folder comprising frames from two different videos. Since our tracking pipeline takes the frames of one video as input, we group the sequence frames by the video they belong to and run inference separately on each batch of frames. We benchmark the frames-per-second performance of the RetinaNet (R-101) with both trackers, similar to the comparison in Table 2. The benchmark is run on an Intel(R) Core(TM) i7-8750H CPU with 6 cores, averaging the frames per second across five runs on the sequence data, with a window size of 5 for the tracker pipeline; inference of the RetinaNet model is also carried out on the CPU. The results are shown in Table 6.

Method                             FPS
RetinaNet (R-101) + Dlib tracker   2.57412083
RetinaNet (R-101) + CSRT tracker   1.19047612
RetinaNet (R-101) w/o tracker      0.68269294

Table 6: Frames-per-second results of the tracking and detection systems

We find that the Dlib and CSRT tracker pipelines are significantly faster than the RetinaNet model alone. This can be attributed to two reasons:

1. For the tracker pipelines, the model's inferences are taken once every five frames, as compared to once per frame for the RetinaNet model without a tracker.

2. The tracking step carried out across the intermediate frames is faster than the model's per-frame inference time.

This supports our assertion that a tracking pipeline is better suited for building endoscopic analysis systems where memory and processing resources pose constraints.
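For reference, a minimal sketch of how such a frames-per-second comparison can be made, assuming a hypothetical `run_pipeline(frames)` callable that processes the whole frame list (for example the tracker loop sketched in Section 3.2, or plain per-frame detection); the names here are illustrative rather than part of our benchmarking code.

```python
import time


def measure_fps(frames, run_pipeline, runs=5):
    """Average frames-per-second of `run_pipeline(frames)` over several runs."""
    fps_values = []
    for _ in range(runs):
        start = time.perf_counter()
        run_pipeline(frames)
        elapsed = time.perf_counter() - start
        fps_values.append(len(frames) / elapsed)
    return sum(fps_values) / len(fps_values)


# Example usage with the hypothetical callables from the earlier sketches:
# fps_tracker = measure_fps(frames, lambda f: track_sequence(f, detect, window_size=5))
# fps_detector = measure_fps(frames, lambda f: [detect(x) for x in f])
```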
Method                              mAP      deviation
RetinaNet (R-50) w/o TTA            0.2121   0.2188
RetinaNet (R-101) w/o TTA           0.2020   0.1920
RetinaNet (R-101 and R-50) w TTA    0.1481   0.1250

Table 7: Object detection results on generalization data

Table 7 contains the results of the models on the fully released test data for the generalization task. We observe that the RetinaNet (R-50) without test-time augmentation performs best on the generalization task among the three submissions.

Method                        Segmentation score   deviation
U-Net                         0.5012               0.2648
Deeplab+Unet+EAST Ensemble    0.4863               0.2751
Deeplab                       0.4314               0.2985

Table 8: Artefact segmentation results on full test data

Table 8 contains the results of the Deeplab model, the U-Net model, and the ensembled model for the segmentation task on the full test data. For the final test data, only the segmentation score is provided, without the additional metrics reported for the partial release. The U-Net model is benchmarked only on the full test data. While the ensemble model is a considerable improvement over the Deeplab model, the U-Net model outperforms both on the segmentation score. Qualitative segmentation results of the U-Net on the final test data are shown in Fig. 3.

Fig. 3: Semantic segmentation results of U-Net on the final test data. Colors denote the classes Instrument, Specularity, Artefact, Bubbles, and Saturation.

However, a key area where our ensemble model is limited is the segmentation of instruments from endoscopic images. As observed in Fig. 4, the ensembled model is unable to pick out the metal bands linking the edges of the instruments. This is observed across multiple instances and is a limitation of the predictions output by the model. Methods to tackle this could include image processing techniques involving region growing to link the edges.

Fig. 4: Images of instruments and the masks predicted by the ensemble model.

5. DISCUSSION & CONCLUSION

We present the results of the models trained for the purpose of detecting, localizing, and segmenting endoscopic artefacts. We train RetinaNet and Faster-RCNN models to detect and localize endoscopic artefacts, and Deeplab v3 and U-Net models to segment them. We implement a model-agnostic object tracking pipeline for the sequence detection task, utilizing image correlation-based trackers, and observe its impact on model inference time. We also implement post-processing techniques such as test-time augmentation, ensembling, and text detection, and present a study of their impact on the released test data. Further work could concentrate on improving the segmentation of instruments from images, as well as exploring the performance of different models for segmentation in general.

6. ACKNOWLEDGEMENT

We would like to express our sincere gratitude to the research and development team at Claritrics India for their help and guidance.
7. REFERENCES

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[2] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.

[3] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241, 2015.

[4] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[5] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[6] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[7] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.

[8] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. CoRR, abs/1512.00567, 2015.

[9] keras-retinanet. GitHub repository. https://github.com/fizyr/keras-retinanet.

[10] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7310–7311, 2017.

[11] A. Casado-García and J. Heras. Ensemble methods for object detection, 2019. https://github.com/ancasag/ensembleObjectDetection.

[12] Martin Danelljan, Gustav Häger, Fahad Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference, pages 65.1–65.11, 2014.

[13] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.

[14] Alan Lukežič, Tomáš Vojíř, Luka Čehovin Zajc, Jiri Matas, and Matej Kristan. Discriminative correlation filter with channel and spatial reliability. International Journal of Computer Vision, 126, 2016.

[15] François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.

[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[17] bonlime. keras-deeplab-v3-plus. GitHub repository. https://github.com/bonlime/keras-deeplab-v3-plus.

[18] Pavel Yakubovskiy. Segmentation models, 2019.

[19] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. CoRR, abs/1704.03155, 2017.

[20] Alexander B. Jung, Kentaro Wada, Jon Crall, Satoshi Tanaka, Jake Graving, Christoph Reinders, Sarthak Yadav, Joy Banerjee, Gábor Vecsei, Adam Kraft, Zheng Rui, Jirka Borovec, Christian Vallentin, Semen Zhydenko, Kilian Pfeiffer, Ben Cook, Ismael Fernández, François-Michel De Rainville, Chi-Hung Weng, Abner Ayala-Acevedo, Raphael Meudec, Matias Laporte, et al. imgaug. https://github.com/aleju/imgaug, 2020. Online; accessed 01-Feb-2020.