ENDOSCOPIC ARTEFACT DETECTION AND SEGMENTATION WITH DEEP CONVOLUTIONAL NEURAL NETWORK

Suhui Yang, Guanju Cheng
Ping An Technology (Shenzhen) Co. Ltd., Shenzhen, China

ABSTRACT

The Endoscopic Artefact Detection challenge (EAD2019 [8, 9]) comprises three tasks: (1) multi-class artefact detection: localization of bounding boxes and class labels for seven artefact classes in given frames (specularity, saturation, artefact, blur, contrast, bubbles and instrument); (2) region segmentation: precise boundary delineation of detected artefacts (instrument, specularity, artefact, bubbles and saturation); (3) detection generalization: detection performance independent of specific data type and source. We participated in all three tasks of EAD2019, and this manuscript summarizes our deep-learning-based solution for each task. In short, for task 1 we apply an improved Cascade R-CNN [1] combined with a feature pyramid network (FPN) [2] for multi-class artefact detection; for task 2 we apply a Deeplab v3+ [3] architecture with different backbones (ResNet101 [4] and MobileNet [5]) to segment multi-class artefact regions; for task 3 we use Cycle-GAN [6] to perform image translation between the training and testing datasets, improving the generalization of the multi-class artefact detector. In addition, we apply unsupervised t-SNE [7] to visualize the data distribution and perform targeted data reduction and augmentation before training the detection and segmentation models; finally, model fusion and post-processing strategies are used to obtain the final results.

Index Terms— Endoscopic artefact detection challenge, t-SNE, Cascade R-CNN, generalization

1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers (e.g., nasopharyngeal, oesophageal adenocarcinoma, gastric, colorectal and bladder cancers), for therapeutic procedures and for minimally invasive surgery (e.g., laparoscopy). The EAD2019 challenge [8, 9] aims to address the following key problems inherent in all video endoscopy. 1) Multi-class artefact detection: existing endoscopy workflows detect only one artefact class, which is insufficient to obtain high-quality frame restoration. In general, the same video frame can be corrupted by multiple artefacts; for example, motion blur, specular reflections and low contrast can be present in the same frame. Further, not all artefact types contaminate the frame equally. So, unless the multiple artefacts present in a frame are known together with their precise spatial locations, clinically relevant frame restoration quality cannot be guaranteed. Another advantage of such detection is that frame quality assessment can be guided to minimise the number of frames discarded during automated video analysis. 2) Multi-class artefact region segmentation: frame artefacts typically have irregular, non-rectangular shapes and are consequently over-estimated by the detected bounding boxes. Accurate semantic segmentation methods that precisely delineate the boundaries of each detected frame artefact will enable optimized restoration of video frames without sacrificing information. 3) Multi-class artefact generalisation: it is important for algorithms to avoid biases induced by specific training data sets. It is also well known that expert annotation is time consuming and can be infeasible for many institutions. The challenge therefore encourages participants to develop machine learning algorithms that can be used across different endoscopic datasets worldwide, based on a large combined dataset from six different institutions.

2. MATERIALS AND METHODS

2.1. Task 1: Multi-class artefact detection

EAD2019 [8, 9] provides two batches of training data for multi-class artefact detection: the first batch contains 886 endoscopic images labeled with 9352 bounding boxes, and the second batch contains 1306 endoscopic images labeled with 8466 bounding boxes.
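For reference, per-class bounding-box counts such as those reported in Table 1 below can be tallied directly from the annotation files. The following is a minimal, hedged sketch that assumes COCO-style JSON annotations purely for illustration; the actual EAD2019 annotation format may differ.

```python
# Hypothetical sketch: tally per-class bounding-box counts for a training batch.
# Assumes COCO-style JSON annotations for illustration only; the real EAD2019
# annotation format may be different.
import json
from collections import Counter

def class_counts(annotation_file):
    with open(annotation_file) as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    total = sum(counts.values())
    # Return absolute counts and rounded percentages per class.
    return {name: (n, round(100.0 * n / total)) for name, n in counts.items()}

# Example call (hypothetical file name):
# print(class_counts("ead2019_batch1.json"))
```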
After checking the training data, we noticed two main difficulties in this task. One is the unbalanced class distribution; the other is the wide variation in image size and in the size/aspect ratio of the detection targets. As shown in Table 1, there are 4074 specularity boxes but only 327 blur boxes in training data 1, and 3487 artefact boxes but only 46 instrument boxes in training data 2.

Table 1. Statistics of the two batches of training data.

Class       | First batch (886 images) | Second batch (1306 images)
specularity | 4074 (44%)               | 1761 (21%)
saturation  | 511 (5%)                 | 611 (7%)
artefact    | 1609 (17%)               | 3487 (41%)
blur        | 327 (4%)                 | 348 (4%)
contrast    | 686 (7%)                 | 872 (10%)
bubbles     | 1738 (19%)               | 1341 (16%)
instrument  | 407 (4%)                 | 46 (1%)
total       | 9352                     | 8466

Based on this observation, we propose an improved Cascade R-CNN [1] as our detection model (Figure 1). Compared with the original Cascade R-CNN, we add an FPN [2] module during feature extraction. As shown in Figure 1, there are two main sub-modules: multi-scale feature representation and multi-stage object detection with cascade structures.

Fig. 1. The flowchart of our detection network, an improved Cascade R-CNN with an added FPN module.

The multi-scale feature representation module consists of a bottom-up pathway and a top-down pathway [2]. Using ResNet101 [4] as the backbone, the input image is processed through the bottom-up pathway with a series of residual blocks. We denote the feature activation outputs of the last residual blocks as {C2, C3, C4, C5}; the output of the first residual block is not included due to memory constraints. In the top-down pathway, each feature map is constructed by merging the corresponding bottom-up map with the map upsampled by a factor of 2 from the coarser-resolution level. The final feature maps are denoted {P2, P3, P4, P5}, with spatial sizes corresponding to {C2, C3, C4, C5}. With the FPN we obtain multi-scale feature representations, which improve the detection of small objects (e.g. specularity) by combining low-level features with high-level semantic information.
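To make the top-down merging just described concrete, the sketch below shows a minimal PyTorch FPN neck. The channel widths, the nearest-neighbour upsampling and the 3*3 smoothing convolutions are illustrative choices and need not match our exact implementation.

```python
# Minimal sketch of the FPN top-down pathway: 1x1 lateral convs reduce {C2..C5}
# to a common width, each coarser map is upsampled by 2x and added to the finer
# one, and 3x3 convs produce {P2..P5}. Sizes are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [conv(f) for conv, f in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down pathway: upsample the coarser map and add it to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # 3x3 smoothing convs give the final pyramid maps {P2..P5}.
        return [conv(x) for conv, x in zip(self.smooth, laterals)]
```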
In general, IoU and mAP are a pair of mutually conflicting indices; for example, an object detector tuned for higher IoU typically produces noisier detections and thus lower mAP. We therefore apply three cascaded stages of detection heads (R-CNN) to improve performance [1]. This structure prevents mAP from dropping sharply when a high IoU is required between the predicted and ground-truth boxes.

Besides the network architecture, we also pay attention to the data distribution. We apply an unsupervised nonlinear dimensionality reduction method, t-SNE [7], to visualize all of the data, including training data 1, training data 2, and the validation data for detection and generalization (Figure 2). We find that the two training datasets and the validation data follow different distributions. We therefore delete some outliers and consecutive frames from training data 2, and then perform data augmentation for categories with fewer samples (e.g. saturation and blur), as illustrated in Figure 3. Similar operations are also carried out for task 2 and task 3.

Fig. 2. Data visualization with t-SNE for training data 1, training data 2, and the validation data for detection (task 1) and generalization (task 3).

Fig. 3. According to the distribution of the testing data for the detection and generalization tasks, the training data within the shaded area were selected to feed the model, ignoring noisy outliers that contribute little to model training.

2.2. Task 2: Region segmentation

We select the Deeplab v3+ [3] network for multi-class artefact segmentation, with different backbones (ResNet101 [4] and MobileNet [5]). After the backbone network, we add five parallel convolution layers as feature extraction layers: one 1*1 convolutional layer, three 3*3 dilated convolutional layers with dilation rates of 6, 12 and 18, and one global pooling layer. These feature maps are then merged and upsampled to produce the region segmentation.

2.3. Task 3: Detection generalization

For detection generalization, we translate the training data into the style of the validation data with Cycle-GAN [6]. We replace the deconvolution in the generator with linear interpolation followed by a 1*1 convolution to improve the quality of the style transfer. We then retrain the detection model on the translated training data and test its generalization performance.
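To make the data-selection step of Section 2.1 concrete, the sketch below embeds per-image feature vectors with t-SNE and plots them by dataset split, in the spirit of Figure 2. The choice of features (e.g. globally pooled CNN activations) is an assumption for illustration; the exact features behind our plots are an implementation detail not specified above.

```python
# Hedged sketch of the t-SNE inspection used for data selection (Section 2.1).
# `features_by_split` maps a split name (e.g. "train1") to an (N_i, D) array of
# per-image feature vectors; the feature choice is illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features_by_split):
    X = np.concatenate(list(features_by_split.values()), axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(X)
    start = 0
    for name, feats in features_by_split.items():
        end = start + len(feats)
        plt.scatter(emb[start:end, 0], emb[start:end, 1], s=4, label=name)
        start = end
    plt.legend()
    plt.show()
```

Outliers and redundant consecutive frames can then be identified visually (or by distance to the validation cluster) before trimming the training set.

The five parallel feature-extraction branches described in Section 2.2 correspond to a Deeplab v3+-style ASPP head; a minimal sketch follows, with illustrative channel widths.

```python
# Sketch of the five parallel branches of Section 2.2: one 1x1 conv, three 3x3
# dilated convs with rates 6/12/18, and a global-pooling branch, concatenated
# and projected. Channel widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.pool_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        # Global pooling branch, upsampled back to the input resolution.
        pooled = F.interpolate(self.pool_branch(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```

Finally, the generator change described in Section 2.3 (replacing the transposed convolution with interpolation plus a 1*1 convolution) might look like the sketch below; the normalization and activation choices are assumptions for illustration.

```python
# Hedged sketch of the upsampling block used in place of the Cycle-GAN
# generator's transposed convolution: bilinear interpolation + 1x1 conv.
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# For comparison, a typical transposed-convolution block it would replace:
# nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
#                    padding=1, output_padding=1)
```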
3. EXPERIMENTS AND RESULTS

In our experiments we evaluated the method for each task in detail. For task 1 we also compare our method against a Faster R-CNN [10] baseline.

3.1. Task 1: Multi-class artefact detection

We use SGD to optimize the improved Cascade R-CNN. The learning rate is 0.005 with staged decay, a total of 30 epochs are trained, the batch size is 2, and all images are resized to 1333*800. Table 2 compares the results of different detection models, and Table 3 compares the results obtained with different training-data strategies. For the validation data, we apply flip and contrast augmentation. Note that the best results for each setting are obtained with non-maximum suppression (NMS [11]). As shown in Table 2, the Cascade R-CNN achieves a good trade-off between mAP and IoU, with a final score 1.36 percentage points higher than the Faster R-CNN [10] baseline. We also compared the different data strategies in detail: as shown in Table 3, the final fused model (model 7) brings a 4.83 percentage-point improvement over the original Cascade R-CNN result.

Table 2. Results of different models.

Method            | mAP_d  | IoU_d  | score_d
Faster R-CNN [10] | 0.2618 | 0.3448 | 0.2950
Cascade R-CNN [1] | 0.2996 | 0.3221 | 0.3086

Table 3. Results of the improved Cascade R-CNN with different training-data strategies.

Method                                                      | mAP_d  | IoU_d  | score_d
Model 1 (only training data 1)                              | 0.2210 | 0.4504 | 0.3127
Model 2 (only training data 2)                              | 0.2138 | 0.4323 | 0.3012
Model 3 (training data 1 + training data 2)                 | 0.2996 | 0.3221 | 0.3086
Model 4 (selected by t-SNE)                                 | 0.2379 | 0.4512 | 0.3235
Model 5 (selected by t-SNE + augmentation of training data) | 0.2658 | 0.4476 | 0.3385
Model 6 (selected by t-SNE + augmentation of testing data)  | 0.2633 | 0.4663 | 0.3445
Model 7 (fusion of models 5 and 6 by NMS)                   | 0.3235 | 0.4172 | 0.3610

3.2. Task 2: Region segmentation

The Adam optimizer is used with an initial learning rate of 0.007; a total of 30k iterations are trained, the batch size is 10, and all images are resized to 513*513. We select a multi-class sigmoid loss, since different artefact classes may overlap. Table 4 shows the results of multi-class artefact segmentation with different configurations. Owing to the limited training data, the contours of some instruments cannot be extracted completely. To remedy this over-segmentation of instruments, we add a marker-based watershed post-processing that takes the partially extracted regions as markers, so that the multiple parts of an instrument are perceptually grouped together according to their regional homogeneity. A segmentation example is shown in Figure 4, where the post-processing clearly improves the result. As Table 4 also shows, ensembling the two backbones (ResNet101 [4] and MobileNet [5]) increases the score from 0.6414 to 0.6568, a gain of nearly 1.5 percentage points, and post-processing such as region growing further improves it to 0.6700.

Fig. 4. (a) the original image; (b) the result of the model without post-processing; (c) the final result.

Table 4. Segmentation results of different configurations.

Method                     | Overlap | F2-score | score_s
ResNet101 backbone         | 0.6288  | 0.6795   | 0.6414
Ensemble of two backbones  | 0.6592  | 0.6937   | 0.6568
Ensemble + post-processing | 0.6612  | 0.6964   | 0.6700
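Model 7 in Table 3 fuses the outputs of models 5 and 6 with NMS [11]. A minimal per-class fusion could look like the sketch below; the IoU threshold and the use of torchvision's NMS operator are assumptions for illustration rather than our exact settings.

```python
# Hedged sketch of fusing two detectors' outputs per class with NMS
# (cf. Model 7 in Table 3). Boxes are (N, 4) xyxy tensors; iou_thr is illustrative.
import torch
from torchvision.ops import nms

def fuse_detections(boxes_a, scores_a, labels_a,
                    boxes_b, scores_b, labels_b, iou_thr=0.5):
    boxes = torch.cat([boxes_a, boxes_b])
    scores = torch.cat([scores_a, scores_b])
    labels = torch.cat([labels_a, labels_b])
    keep = []
    for cls in labels.unique():
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        kept = nms(boxes[idx], scores[idx], iou_thr)   # indices into boxes[idx]
        keep.append(idx[kept])
    keep = torch.cat(keep)
    return boxes[keep], scores[keep], labels[keep]
```

Likewise, the marker-based watershed post-processing for fragmented instrument masks (Section 3.2) could be sketched as follows; the morphological closing radius and the inverted distance transform are illustrative choices, not a reproduction of our exact pipeline.

```python
# Hedged sketch of marker-based watershed merging of fragmented instrument
# regions (Section 3.2): the partially extracted fragments act as markers and
# are grown within a morphologically closed region of interest.
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.morphology import binary_closing, disk

def merge_instrument_parts(binary_mask, closing_radius=15):
    # Label each partially extracted fragment; these act as watershed markers.
    markers, _ = ndi.label(binary_mask)
    # Morphological closing bridges gaps between fragments (illustrative radius).
    region = binary_closing(binary_mask, disk(closing_radius))
    # Grow the markers over an inverted distance transform, restricted to the region.
    distance = ndi.distance_transform_edt(region)
    labels = watershed(-distance, markers, mask=region)
    # The union of the grown fragments gives the merged instrument mask.
    return labels > 0
```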
3.3. Task 3: Detection generalization

We trained the Cycle-GAN with the following hyper-parameters: the Adam optimizer, an initial learning rate of 0.002, 30 epochs, a batch size of 1, and all images resized to 512*512. Figure 5 shows an example of the style transfer. We then trained the task-1 detection model on the original and on the translated training data respectively and compared their detection-generalization performance. As shown in Table 5, the model trained on the original training data obtains mAP_g = 0.3187 and dev_g = 0.1018, while the model trained on the translated data obtains mAP_g = 0.3747 and dev_g = 0.0693. The style transfer therefore improves detection generalization.

Fig. 5. (a) the original image; (b) the translated image.

Table 5. Results of detection generalization.

Method                                                  | mAP_g  | dev_g
Trained with original training data                     | 0.3187 | 0.1018
Trained with training data translated by Cycle-GAN [6]  | 0.3747 | 0.0693

4. CONCLUSION

In task 1, better results are obtained by combining FPN with Cascade R-CNN, and the mAP and IoU metrics are better balanced. At the same time, using t-SNE to automatically select samples that are similar across the two batches of training data and the test data helps to accelerate model training. In task 2, the Deeplab v3+ model is complemented by a multi-class sigmoid loss function to improve segmentation quality. In task 3, we use Cycle-GAN to translate the task-1 training set towards the task-3 test set and fine-tune the task-1 detection model, which effectively improves its generalization.

ACKNOWLEDGMENT

We gratefully thank Dr. Peng Gao, Dr. Bin Lv, Dr. Jun Wang, Xia Zhou, Ge Li, Chenfeng Zhang and Yue Wang for their valuable discussion and kind support during this challenge.

5. REFERENCES

[1] Zhaowei Cai and Nuno Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.

[2] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[5] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.

[6] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

[7] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[8] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnires, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[9] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[10] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015.

[11] Alexander Neubeck and Luc Van Gool, "Efficient non-maximum suppression," in 18th International Conference on Pattern Recognition (ICPR'06), IEEE, 2006, vol. 3, pp. 850–855.