=Paper=
{{Paper
|id=Vol-2595/endoCV2020_Yu_Guo
|storemode=property
|title=Endoscopic Artefact Detection using Cascade R-CNN based Model
|pdfUrl=https://ceur-ws.org/Vol-2595/endoCV2020_paper_id_27.pdf
|volume=Vol-2595
|authors=Zhimiao Yu,Yuanfan Guo
|dblpUrl=https://dblp.org/rec/conf/isbi/YuG20
}}
==Endoscopic Artefact Detection using Cascade R-CNN based Model==
Zhimiao Yu and Yuanfan Guo
Shanghai Jiao Tong University, Shanghai, China
{gyfastas,Carboxy}@sjtu.edu.cn

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Accurate detection of artefacts is a core challenge in a wide range of endoscopic applications addressing multiple different disease areas. Our work aims to localise bounding boxes and predict class labels for 8 different artefact classes in given frames and clinical endoscopy video clips. To solve the task, we use Cascade R-CNN [1] as the network architecture and adopt an ImageNet-pretrained ResNet101 [2] backbone with a Feature Pyramid Network (FPN) [3] structure. To improve the network performance, methods such as data augmentation and multi-scale training are also adopted. Finally, we analyse the major challenge of the task.

1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers (e.g., nasopharyngeal, oesophageal adenocarcinoma, gastric, colorectal and bladder cancers), therapeutic procedures and minimally invasive surgery (e.g., laparoscopy). However, video frames captured by an endoscope usually contain multiple artefacts, which not only make it difficult to visualise the underlying tissue during diagnosis but also affect any post-analysis methods required for follow-ups. Existing endoscopy workflows are not qualified for restoring high-quality endoscopic frames because in most cases they can detect only one artefact class. Generally, the same video frame can be corrupted with multiple artefacts, e.g. motion blur, specular reflections and low contrast can all be present in the same frame. Besides, the types of corruption vary from frame to frame. Therefore, improving detection accuracy is a core challenge in a wide range of endoscopic applications.

Recently, deep ConvNets have significantly improved image classification and object detection accuracy [4]. In the deep learning era, object detectors can be grouped into two genres: "two-stage detection" (e.g. RCNN [5]) and "one-stage detection" (e.g. [6]) [7]. In this task, we use Cascade R-CNN [1], a multi-stage object detection architecture, because it achieves state-of-the-art detection performance.

2. DATASETS

The 8 artefact classes in the dataset for "Endoscopic Artefact Detection" are specularity, saturation, artifact, blur, contrast, bubbles, instrument and blood. The ground-truth bounding boxes are visualised in Fig. 2. The artefact detection task is evaluated on a test dataset drawn from a subset of the data collected for training. Specifically, the training dataset for detection consists of 2200 annotated frames in total over all 8 artifact classes, and the test dataset of 100 frames [8][?][9].

3. METHODS

3.1. Architecture

The model architecture is shown in Fig. 1. We use Cascade R-CNN [1] as the network architecture and adopt an ImageNet-pretrained ResNet101 [2] backbone with a Feature Pyramid Network (FPN) [3] structure. Taking the areas of the artefacts into consideration, the anchor base areas are tuned to range from 16^2 to 512^2 on pyramid levels P2 to P6. Specifically, the anchor scales, ratios and strides are [8], [0.5, 1.0, 2.0] and [4, 8, 16, 32, 64], respectively.

Fig. 1. Model architecture based on Cascade R-CNN.
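To make the anchor layout concrete, the following is a minimal Python sketch (ours, not the authors' code) that enumerates the per-level anchor shapes implied by these hyper-parameters. The rule that a level's base anchor side equals scale x stride is an assumption borrowed from common FPN implementations, not stated in the paper.

```python
import numpy as np

# Sketch of the anchor layout in Sec. 3.1: anchor scale [8], aspect ratios
# [0.5, 1.0, 2.0] and strides [4, 8, 16, 32, 64] on pyramid levels P2-P6.
def base_anchors(scale=8, ratios=(0.5, 1.0, 2.0), strides=(4, 8, 16, 32, 64)):
    """Return the (width, height) anchor shapes for each pyramid level."""
    anchors = {}
    for level, stride in enumerate(strides, start=2):  # levels P2 ... P6
        base = scale * stride  # assumed base anchor side at this level
        shapes = []
        for r in ratios:
            # vary the aspect ratio h/w = r while keeping area == base**2
            w = base * np.sqrt(1.0 / r)
            h = base * np.sqrt(r)
            shapes.append((w, h))
        anchors["P%d" % level] = shapes
    return anchors

if __name__ == "__main__":
    for level, shapes in base_anchors().items():
        print(level, [(round(w, 1), round(h, 1)) for w, h in shapes])
```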
3.2. Implementation Details

For data augmentation, each image is horizontally flipped with a 50 percent chance. We replace the NMS operation in the architecture with soft-NMS [10] and use cosine decay [11] as the learning rate scheduling strategy. The classification and regression loss functions are CrossEntropyLoss and SmoothL1Loss, respectively. The model is trained for 24 epochs.

In the experiments, we find that specularity, artifact and bubbles are hard to classify. A probable reason is that these three artefacts have a similar appearance (e.g. some of them all appear as spots of light). To address this, we modify the loss function: specifically, we up-weight the loss when the model misclassifies these three artefacts. The result turns out to be an improvement in AP for these three artefacts but a decline in mAP.
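For reference, soft-NMS [10] replaces the hard suppression of standard NMS with a score decay for overlapping boxes. Below is a minimal NumPy sketch of the Gaussian variant; the sigma and score-threshold values are illustrative defaults, not values reported in the paper.

```python
import numpy as np

def iou(box, others):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: decay the scores of boxes that overlap the
    current top-scoring box instead of discarding them outright."""
    scores = scores.astype(float).copy()
    keep, idxs = [], np.arange(len(scores))
    while len(idxs) > 0:
        top = idxs[np.argmax(scores[idxs])]
        keep.append(top)
        idxs = idxs[idxs != top]
        if len(idxs) == 0:
            break
        overlaps = iou(boxes[top], boxes[idxs])
        scores[idxs] *= np.exp(-(overlaps ** 2) / sigma)  # Gaussian penalty
        idxs = idxs[scores[idxs] > score_thresh]  # prune near-zero scores
    return keep, scores
```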
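The cosine decay schedule referenced above follows He et al. [11]: with initial learning rate eta_0 and T total training steps, the rate at step t is eta_t = 0.5 * eta_0 * (1 + cos(pi * t / T)), falling smoothly from eta_0 to near zero. The paper does not report eta_0, batch size or warmup settings, so those remain unspecified here.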
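One straightforward way to realise the loss up-weighting described above is a per-class weight in the cross-entropy loss, sketched below under our own assumptions; the weight value of 2.0 is a placeholder we chose, not a value reported by the authors.

```python
import torch
import torch.nn.functional as F

# The class order follows the dataset description in Sec. 2; the weight
# value is illustrative, since the paper does not report the exact weights.
CLASS_NAMES = ["specularity", "saturation", "artifact", "blur",
               "contrast", "bubbles", "instrument", "blood"]
UP_WEIGHTED = {"specularity", "artifact", "bubbles"}

def weighted_cls_loss(logits, targets, up_weight=2.0):
    """Cross-entropy with larger weights on the three confusable classes,
    so errors on them contribute more to the classification loss."""
    weights = torch.tensor([up_weight if name in UP_WEIGHTED else 1.0
                            for name in CLASS_NAMES])
    return F.cross_entropy(logits, targets, weight=weights.to(logits.device))
```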
4. RESULTS

We randomly divide the provided data into 5 subsets and use one of them for validation and the others for training. The following metrics are computed on the validation set.

4.1. Data augmentation with resizing

We report the baseline results in Table 1. Our baseline achieves 0.26 mAP. To improve the model performance, we added an image resizing operation to the data augmentation pipeline. Specifically, each image is randomly resized within the range from (512, 512) to (1024, 1024), keeping the same aspect ratio as the original. Since the image sizes vary considerably, we expected this operation to be effective.

The results are shown in Table 2. According to the results, the resizing operation clearly improves the model performance, with an increase in mAP of 0.017. We notice that the improvement is mainly in AP(small). The main reason is that in most cases the resizing operation enlarges the image and thus makes it possible to detect more small objects. Note that the test images are often larger than the training images, i.e. images in the test set often have height and width larger than 1000 while images in the training set are around 500, so the resizing operation also mitigates the scale mismatch between training and testing images. A sketch of one possible implementation is given after the tables below.

Table 1. Baseline performance on the validation set.

Artefacts    AP     AP(IoU=.50)  AP(IoU=.75)  AP(small)  AP(medium)  AP(large)
specularity  0.123  0.319        0.063        0.064      0.193       0.202
saturation   0.197  0.670        0.217        0.040      0.210       0.345
artifact     0.225  0.486        0.170        0.129      0.218       0.421
blur         0.184  0.275        0.167        0          0           0.191
contrast     0.414  0.760        0.416        0.033      0.187       0.439
bubbles      0.124  0.345        0.061        0.094      0.128       0.216
instrument   0.531  0.801        0.624        /          0           0.551
blood        0.181  0.454        0.103        /          0.079       0.221
mean         0.260  0.514        0.228        0.060      0.127       0.323

Table 2. Model performance on the validation set with the resizing operation.

Artefacts    AP     AP(IoU=.50)  AP(IoU=.75)  AP(small)  AP(medium)  AP(large)
specularity  0.138  0.380        0.062        0.091      0.216       0.199
saturation   0.295  0.669        0.246        0.050      0.212       0.338
artifact     0.243  0.516        0.185        0.140      0.239       0.427
blur         0.181  0.279        0.178        0          0           0.188
contrast     0.422  0.760        0.424        0          0.224       0.443
bubbles      0.153  0.384        0.085        0.125      0.151       0.254
instrument   0.569  0.830        0.649        /          0.044       0.587
blood        0.212  0.495        0.172        /          0.130       0.244
mean         0.277  0.539        0.250        0.068      0.152       0.335
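The sketch referenced in Sec. 4.1 follows. It assumes a target short side is drawn per image in [512, 1024] and that the image and its boxes are scaled by the same factor; the paper does not give implementation details, so this is our reading, not the authors' code.

```python
import random
from PIL import Image

def random_resize(img, boxes, low=512, high=1024):
    """Resize an image so its shorter side is a random value in [low, high],
    preserving aspect ratio; scale boxes by the same factor.
    Boxes are [x1, y1, x2, y2] in pixels."""
    target = random.randint(low, high)
    scale = target / min(img.size)  # img.size is (width, height)
    new_size = (round(img.size[0] * scale), round(img.size[1] * scale))
    img = img.resize(new_size, Image.BILINEAR)
    boxes = [[coord * scale for coord in box] for box in boxes]
    return img, boxes
```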
Fig. 2. Visualization of ground truth boxes for the 8 classes: (a) specularity, (b) saturation, (c) artifact, (d) blur, (e) contrast, (f) bubbles, (g) instrument, (h) blood.

4.2. Difficult classification among specularity, artifact and bubbles

In the experiments, we found that the network has difficulty distinguishing specularity, artifact and bubbles. To demonstrate the problem clearly, we calculated the confusion matrix shown in Table 4. According to Table 4, the network has two drawbacks. Firstly, it tends to confuse specularity, artifact and bubbles during classification. Secondly, it performs poorly at detecting blur.

To address the first problem, we modified the loss function as described in Sec. 3.2, increasing the loss weights for misclassifications of specularity, artifact and bubbles. The result turned out to be an improvement in AP for these three artefacts but a decline in mAP.

Table 4. Confusion matrix of the 8 classes (rows: predicted class; columns: ground-truth label).

Predicted    specularity  saturation  artifact  blur   contrast  bubbles  instrument  blood
specularity  70.3%        7.2%        10.9%     0.0%   0.0%      19.9%    0.0%        0.0%
saturation   1.4%         79.9%       0.9%      0.7%   0.0%      0.5%     1.7%        0.0%
artifact     19.4%        5.1%        75.4%     20.6%  1.9%      13.1%    6.0%        2.3%
blur         0.0%         1.3%        2.3%      49.1%  3.8%      0.1%     4.3%        2.3%
contrast     0.0%         0.0%        1.0%      11.8%  88.5%     0.0%     11.2%       12.8%
bubbles      8.4%         2.6%        8.7%      0.0%   0.0%      66.2%    0.0%        0.8%
instrument   0.3%         3.3%        0.7%      12.5%  4.5%      0.1%     75.0%       3.8%
blood        0.2%         0.5%        0.1%      5.2%   1.2%      0.2%     1.7%        78.2%

4.3. Qualitative Results

To find out what kinds of artefact our model can successfully detect, we show some qualitative results in Fig. 3 and Fig. 4. The qualitative results indicate that: a) for artefacts that are not too small, our model tends to generate accurate detections; b) more artefacts in an image make detection more difficult; c) our model generates a fair number of false positive blur detections. We are not sure whether the reason for problem c) lies in shortcomings of the model itself or in missing blur annotations, because the corresponding images do show blur characteristics.

Fig. 3. High quality detection examples (i.e. the model generates accurate detections). The first row shows ground truth, where artefacts are annotated with blue bounding boxes. The second row shows results, where detected artefacts are annotated with yellow bounding boxes.

Fig. 4. Low quality detection examples (i.e. the model generates inaccurate detections). The first row shows ground truth, where artefacts are annotated with blue bounding boxes. The second row shows results, where detected artefacts are annotated with yellow bounding boxes. The last two columns represent false positive blur.

4.4. Leaderboard Result

We added the image resizing operation to the data augmentation pipeline and fine-tuned the maximum number of boxes per image to 300. We then used this model to obtain the test set results; the performance is shown in Table 3.

Table 3. Final result on the leaderboard.

dataset       dscore
50% testset   0.2603
100% testset  0.2036

5. DISCUSSION & CONCLUSION

In our work, we found that the major challenge in the "Endoscopic Artefact Detection" task is the difficult classification among specularity, artifact and bubbles. One intuitive explanation is that some of them all appear as spots of light, sharing a high degree of similarity. In the future, we intend to train 3 separate classifiers for these 3 artefacts and adopt more advanced feature extraction networks, which may solve this challenge to some extent. A box ensemble method was also tried in our experiments; however, it seemed to lower the mAP.

To sum up, we constructed a Cascade R-CNN based model to solve the "Endoscopic Artefact Detection" task. We adopted several methods to improve the network performance, including data augmentation, loss function modification and box ensembling. We also identified the major challenge in this task.

6. REFERENCES

[1] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[3] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.

[4] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2015.

[6] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[7] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055, 2019.

[8] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Geroges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[9] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[10] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pages 5561–5569, 2017.

[11] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.