=Paper=
{{Paper
|id=Vol-2595/endoCV2020_Chen_et_al
|storemode=property
|title=Endoscopy Artefact Detection and Segmentation using Deep Convolutional Neural Network
|pdfUrl=https://ceur-ws.org/Vol-2595/endoCV2020_paper_id_26.pdf
|volume=Vol-2595
|authors=Haijian Chen,Chenyu Lian,Liansheng Wang
|dblpUrl=https://dblp.org/rec/conf/isbi/ChenLW20
}}
==Endoscopy Artefact Detection and Segmentation using Deep Convolutional Neural Network==
Haijian Chen, Chenyu Lian, Liansheng Wang
Department of Computer Science, School of Informatics, Xiamen University, China

ABSTRACT

Endoscopy Artefact Detection and Segmentation (EAD2020) includes three sub-tasks: multi-class artefact detection, semantic segmentation, and out-of-sample generalisation. This manuscript summarizes our solution. The challenge can be treated as two independent problems: object detection and semantic segmentation. For the detection problem, we use Cascade R-CNN with FPN and Hybrid Task Cascade. For the segmentation problem, we use the DeepLab V3+ model with a bce+dice loss.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers. However, a major drawback is that the video frames are heavily corrupted by multiple artifacts, so accurate detection and even segmentation of these artifacts are very helpful for improving endoscopy tools. The task aims to localise bounding boxes, predict class labels, and produce pixel-wise segmentations of eight different artifact classes for given frames and clinical endoscopy video clips.

2. DATASETS

The details of the Endoscopy Artifact Detection and Segmentation dataset are described in the original papers [1, 2, 3]. The following gives a brief analysis of the EAD2020 data.

2.1. Object detection

We combine the two phases of the dataset. As shown in Table 1, the distribution of the classes is very imbalanced. The counts of 'blur', 'instrument', and 'blood' are significantly smaller than the others, so they can become hard examples when training models. The counts of 'specularity' and 'artifact' are very large while their objects are very small in size. Because of this, we pay attention to the balance of each class when holding out 20% of the data as a validation set.

Table 1. Class distribution of the detection dataset

  Class         Count   Ratio
  specularity    9791   36.2%
  artifact       8012   29.6%
  bubbles        4670   17.3%
  contrast       1641    6.1%
  saturation     1277    4.7%
  blur            684    2.5%
  blood           491    1.8%
  instrument      470    1.7%

2.2. Semantic segmentation

Many ground-truth pixel values lie strictly between 0 and 255 in the dataset. We therefore divide all ground-truth pixel values by 255 and use a threshold of 0.5 to separate foreground from background pixels; the resulting pixel counts are shown in Table 2. Foreground pixels are significantly fewer than background pixels, and the foreground pixel counts of the different classes are imbalanced as well. As shown in Table 3, the most common image sizes are 512 × 512 and 1349 × 1079, while the remaining images come in various other sizes. We shuffle the dataset randomly and use 20% of the data as the validation set.

Table 2. Pixel distribution of the segmentation dataset (fg: foreground, bg: background)

  Class          fg          bg           fg/Σfg    fg/(fg+bg)
  Instrument     15997225    371567134    36.39%    4.13%
  Specularity     4700063    382864296    10.69%    1.21%
  Artefact        4100248    383464111     9.33%    1.06%
  Bubbles         8967902    378596457    20.40%    2.31%
  Saturation     10190545    377373814    23.18%    2.63%

Table 3. Image sizes of the segmentation dataset (Smaller: height < 800 and width < 700; Bigger: the contrary)

  Size          Count   Ratio
  512 × 512     138     25.36%
  1349 × 1079   118     21.69%
  Smaller       129     23.71%
  Bigger        159     29.23%
  Total         544     100%
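For concreteness, the following is a minimal sketch of the ground-truth preprocessing described in Section 2.2 (dividing mask values by 255 and thresholding at 0.5); it is our illustration rather than the authors' released code, and the (num_classes, H, W) channel layout and the helper names are assumptions made for the example.

```python
import numpy as np


def binarize_masks(mask: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert raw ground-truth masks (uint8 values in 0..255) into binary
    per-class foreground masks, as described in Section 2.2.

    `mask` is assumed to have shape (num_classes, H, W); this layout is an
    assumption made for the sketch, not something stated in the paper.
    """
    scaled = mask.astype(np.float32) / 255.0       # rescale to [0, 1]
    return (scaled > threshold).astype(np.uint8)   # 1 = foreground, 0 = background


def foreground_ratio(binary_mask: np.ndarray) -> float:
    """Fraction of foreground pixels, i.e. fg / (fg + bg) as reported in Table 2."""
    return float(binary_mask.sum()) / binary_mask.size


if __name__ == "__main__":
    # Toy example: random values stand in for a real ground-truth file.
    dummy = np.random.randint(0, 256, size=(5, 512, 512), dtype=np.uint8)
    binary = binarize_masks(dummy)
    print(f"foreground ratio: {foreground_ratio(binary):.4f}")
```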
3. METHODS

3.1. Object Detection

3.1.1. Model Overview

We use Cascade R-CNN [4] with a ResNeXt-101 [5] backbone and FPN [6] as the neck of the model. We also train a Hybrid Task Cascade (HTC) model [7] with the same backbone and neck.

3.1.2. Loss

We use cross-entropy loss for classification, and smooth L1 loss for bounding-box regression to improve the precision of detection.

3.1.3. Augmentation

On the training data we perform random flipping, normalization, and resizing. The images are resized to 512 × 512.

3.1.4. Implementation Details

We extract candidate bounding boxes with a region proposal network (RPN) and use non-maximum suppression (NMS) to filter the useful boxes. Observing that some small objects are ignored, we increase the NMS threshold from 0.7 to 0.8, which slightly improves the recall rate and the mAP. Soft-NMS [8] is applied to avoid mistakenly discarding bounding boxes outright.

We use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. To obtain better convergence, we add a warm-up period in which the learning rate increases linearly to 0.0025 over the first 500 iterations. The network is trained for 13 epochs in total.

3.2. Semantic Segmentation

3.2.1. Model Overview

We use the DeepLab V3+ network [9] with a ResNet-101 [5] backbone for semantic segmentation. DeepLab V3+ is an encoder-decoder network with dilated (atrous) convolutions; the ASPP module and the decoder are implemented as in the original paper. The output of the network is activated by a sigmoid function to obtain probability maps, since there may be overlap among the different channels of the mask. The segmentation problem is therefore treated as multiple binary segmentation tasks.

3.2.2. Loss

We evaluated different losses, including binary cross-entropy, Dice loss, Lovász-Hinge loss [10], and their combinations. Based on the test results discussed in Section 4.2, we finally choose bce+dice as the loss of our model:

\[
L = L_{bce} + L_{dice}
  = -\big(y_{gt}\log y_{pred} + (1 - y_{gt})\log(1 - y_{pred})\big)
    + 1 - \frac{2\sum y_{gt}\, y_{pred} + \epsilon}{\sum y_{gt} + \sum y_{pred} + \epsilon}
\]

(\(\epsilon = 10^{-7}\); \(y_{gt}\) and \(y_{pred}\) are flattened tensors.)
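A minimal PyTorch sketch of the bce+dice loss defined above; this is our interpretation of the formula rather than the authors' released code. It assumes sigmoid-activated predictions and binary targets for a single channel, and it reduces the cross-entropy term by the mean over pixels, which the formula itself leaves unspecified.

```python
import torch
import torch.nn.functional as F

EPS = 1e-7  # the epsilon term from the equation in Section 3.2.2


def bce_dice_loss(y_pred: torch.Tensor, y_gt: torch.Tensor) -> torch.Tensor:
    """L = L_bce + L_dice for one binary segmentation channel.

    y_pred: sigmoid probabilities; y_gt: binary targets of the same shape.
    Both tensors are flattened, matching the definition in Section 3.2.2.
    """
    y_pred = y_pred.reshape(-1)
    y_gt = y_gt.reshape(-1).float()

    # Binary cross-entropy term (mean over pixels).
    bce = F.binary_cross_entropy(y_pred, y_gt)

    # Soft Dice term: 1 - (2 * intersection + eps) / (sum_gt + sum_pred + eps).
    intersection = (y_gt * y_pred).sum()
    dice = 1.0 - (2.0 * intersection + EPS) / (y_gt.sum() + y_pred.sum() + EPS)

    return bce + dice


if __name__ == "__main__":
    logits = torch.randn(2, 5, 64, 64)              # e.g. 5 artefact classes
    targets = torch.randint(0, 2, (2, 5, 64, 64))
    loss = bce_dice_loss(torch.sigmoid(logits), targets)
    print(loss.item())
```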
3.2.3. Augmentation

We apply random brightness and contrast changes, random horizontal and vertical flips, random shift-scale-rotation, Gaussian blurring, resizing, and normalization to the images of the training set. All random transformations are applied with a probability of 0.5 using the default parameters of the Albumentations library [11]. On the validation set we apply only image normalization. The images are resized to 512 × 512 and 1024 × 1024 during the training phase; see Section 4.2.
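The training pipeline above can be expressed with Albumentations roughly as follows. This is a sketch of our reading of the description, not the authors' exact pipeline: the specific transform classes chosen to represent "random shift scale rotation" and "Gaussian blurring", and the ImageNet normalization defaults, are assumptions.

```python
import albumentations as A

IMG_SIZE = 512  # switched to 1024 for the second training stage (Section 4.2)

# Training-time transforms: each random transform fires with probability 0.5
# and otherwise uses the Albumentations default parameters.
train_transform = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.ShiftScaleRotate(p=0.5),
    A.GaussianBlur(p=0.5),
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(),
])

# Validation-time transforms: only resizing and normalization.
val_transform = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(),
])

# Usage: Albumentations operates on numpy arrays (H, W, C) and applies the
# same spatial transforms to the image and its mask.
# out = train_transform(image=image, mask=mask)
# image_aug, mask_aug = out["image"], out["mask"]
```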
3.2.4. Implementation Details

We load weights pre-trained on ImageNet for the backbone network. The network is trained using SGD with a momentum of 0.9, a weight decay of 0.0001, and mini-batches of size 4. The learning rate is increased linearly over a warm-up period of 5 epochs to a maximum value of 0.01, and is then adjusted by cosine annealing with warm restarts [12] with a period of 40 epochs. The images are resized to 512 × 512 to train for 200 epochs and then to 1024 × 1024 to train for another 100 epochs.
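A minimal PyTorch sketch of the optimization schedule described above (linear warm-up for 5 epochs to lr = 0.01, then cosine annealing with warm restarts with a 40-epoch period). The stand-in `model`, the commented-out `train_one_epoch` hook, and the exact warm-up implementation are assumptions; the paper does not specify how the schedule was coded.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

MAX_LR = 0.01
WARMUP_EPOCHS = 5
RESTART_PERIOD = 40          # epochs per cosine cycle
TOTAL_EPOCHS = 200           # first stage at 512 x 512; another 100 epochs at 1024 x 1024

model = torch.nn.Conv2d(3, 8, 3)   # stand-in for the DeepLab V3+ network

optimizer = torch.optim.SGD(
    model.parameters(), lr=MAX_LR, momentum=0.9, weight_decay=1e-4)
cosine = CosineAnnealingWarmRestarts(optimizer, T_0=RESTART_PERIOD)

for epoch in range(TOTAL_EPOCHS):
    if epoch < WARMUP_EPOCHS:
        # Linear warm-up: lr grows from MAX_LR / WARMUP_EPOCHS up to MAX_LR.
        warmup_lr = MAX_LR * (epoch + 1) / WARMUP_EPOCHS
        for group in optimizer.param_groups:
            group["lr"] = warmup_lr
    else:
        # Cosine annealing with warm restarts after the warm-up phase.
        cosine.step(epoch - WARMUP_EPOCHS)

    # train_one_epoch(model, optimizer)   # mini-batch size 4; loop omitted here
```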
4. RESULTS

4.1. Object Detection

Table 4 shows the mAPs of the different classes on the validation set, evaluated with the COCO metrics, and Table 5 gives more details of the evaluation results. The metrics of both models are very close to each other. In Figure 1 we find that the HTC model is good at detecting large objects while doing poorly on some small objects, even though its AP_small metric is slightly better than the other model's.

Table 4. mAPs of the different classes on the validation set

  Class         Cascade R-CNN   HTC       Faster R-CNN
  instrument    0.64791         0.64965   0.56197
  artifact      0.22540         0.22511   0.21733
  blood         0.10594         0.12520   0.10998
  blur          0.26506         0.26097   0.19428
  bubbles       0.11302         0.10491   0.10600
  contrast      0.40275         0.39182   0.38044
  saturation    0.27912         0.24990   0.26373
  specularity   0.09281         0.09485   0.08561

Table 5. AP metrics of the evaluation results on the validation set

  Metric      Cascade R-CNN   HTC     Faster R-CNN
  mAP         0.267           0.263   0.240
  AP50        0.501           0.505   0.498
  AP75        0.246           0.249   0.209
  AP_small    0.082           0.091   0.086
  AP_medium   0.162           0.166   0.166
  AP_large    0.337           0.337   0.299

Fig. 1. Predictions on two images (left: Hybrid Task Cascade; right: Cascade R-CNN).

The results in Tables 6, 7 and 8 are provided by the official leaderboard. Table 6 shows the detection scores in the first phase of the test data: the Hybrid Task Cascade network performs better in mAP while getting a lower score in IoU. Table 7 shows the scores in the final test, where we get a higher detection score with the Cascade R-CNN network. As shown in Table 8, resizing the images to 1024 × 1024 instead of 512 × 512 does not give a better detection score but contributes to generalization performance.

Table 6. Detection scores in the first phase of the test data

  Model           mAP_d    IoU_d    mAP_g    mAP_sq
  Cascade R-CNN   0.2238   0.1707   0.2405   0.3038
  HTC network     0.2393   0.0674   0.2621   0.3214

Table 7. Detection scores in the final test

  Model           Score_d   d_std    gmAP     g_dev
  Cascade R-CNN   0.2193    0.0871   0.2485   0.0552
  HTC network     0.2021    0.0901   0.2744   0.0556

Table 8. Detection scores in the final test with Cascade models trained with different image sizes

  Size          Score_d   d_std    gmAP     g_dev
  512 × 512     0.2193    0.0871   0.2485   0.0552
  1024 × 1024   0.2156    0.0991   0.2659   0.0764

4.2. Semantic Segmentation

4.2.1. Experiments with losses on the validation set

To evaluate the results of different losses, we train a DeepLab V3+ model with a ResNet-101 backbone and a modified U-Net [13] model with a ResNet-34 backbone for 160 epochs. The threshold for predicting foreground pixels is 0.5, and the other configurations are the same as in Section 3.2.4. In Tables 9 and 10, 'bce' is binary cross-entropy loss, 'dice' is Dice loss, 'bce+dice' is defined in Section 3.2.2, and 'p' and 'r' stand for precision and recall.

In Table 9, the experiment shows that bce+dice gets the best F1 (Dice), F2 and IoU scores; its precision is close to that of dice while not losing much recall. In Table 10, we can see a significant improvement of the U-Net model using bce+dice, showing the effectiveness of this loss. After 300 epochs, the DeepLab V3+ model using bce+dice reaches 0.7927 in F1, 0.8386 in F2, 0.6857 in IoU, 0.7422 in precision, and 0.887 in recall. The U-Net models do not reach much better scores, as they almost converge after 165 epochs. We choose bce+dice to train the final model.

We also tested the Lovász-Hinge loss. In our tests, it is hard to make the model converge when trained from scratch. Hence, we use the Lovász-Hinge loss to fine-tune the DeepLab model that was trained with bce+dice for 300 epochs. Table 11 shows the results over the fine-tuning epochs ('Epochs' means the number of training epochs with the Lovász loss). This model converges after 30 epochs, but the results are worse than those of the model before fine-tuning, so we give up this method.

Table 9. Metrics on the validation set with different losses for the DeepLab V3+ model with ResNet-101 backbone (160 epochs)

  Loss       F1       F2       IoU      p        r
  bce        0.585    0.6754   0.4447   0.5014   0.8136
  dice       0.5846   0.5881   0.4874   0.6755   0.601
  bce+dice   0.6728   0.7042   0.5523   0.6699   0.7585

Table 10. Metrics on the validation set with different losses for a modified U-Net model with ResNet-34 backbone (160 epochs)

  Loss       F1       F2       IoU      p        r
  bce        0.4666   0.4346   0.3743   0.7201   0.4196
  dice       0.561    0.5421   0.469    0.7188   0.5353
  bce+dice   0.6138   0.5837   0.5057   0.7415   0.5709

Table 11. Using the Lovász-Hinge loss to fine-tune a model trained with the bce+dice loss

  Epochs   F1       F2       IoU      p        r
  5        0.5491   0.5038   0.4486   0.8490   0.4828
  20       0.5115   0.4610   0.4173   0.8927   0.4367
  40       0.5373   0.4846   0.4387   0.8872   0.4594

4.2.2. Experiments with backbones on the validation set

Table 12 shows another experiment comparing different networks. We find that the Xception-based DeepLab V3+ converges significantly more slowly than the ResNet-101-based model and does not get better scores than the U-Net model.

Table 12. Metrics on the validation set with different networks (85 epochs; D: DeepLab V3+, U: U-Net, X: Xception, R: ResNet)

  Model    F1       F2       IoU      p        r
  D-X      0.4189   0.4248   0.3167   0.486    0.4388
  D-R101   0.5823   0.5967   0.4717   0.6288   0.6313
  U-R34    0.5535   0.5209   0.4507   0.7512   0.5078

4.2.3. Submission results

The training parameters are listed in Section 3.2.4, and all the results below are provided by the official leaderboard. In Table 13, Model 1 is trained with 512 × 512 images and uses a prediction threshold of 0.5. Model 2 is the same as Model 1 except that the threshold is changed to 0.7. Model 3 is trained with 1024 × 1024 images and uses a threshold of 0.7.

We resized the images to 512 × 512 at first. However, as discussed for Table 3, there are many bigger images, and the same holds for the first phase of the test images. Compared with the models trained only with images resized to 512 × 512, the models trained with 1024 × 1024 images get better scores on the validation set, and some predictions look smoother, as shown in Figures 2, 3 and 4.

We find that adding the segmentation data of EAD2019 to the training set also helps a little, although there is potential validation data leakage, which makes the validation metrics unreliable. It does not help in the detection task, however.

We chose Model 3 to predict the final test data and obtained the scores shown in Table 14.

Table 13. Segmentation scores in the first phase of the test data (50% of the final data)

  Model   F1       F2       p        r        sscore   s_std
  1       0.4872   0.5027   0.5250   0.5467   0.5154   0.2327
  2       0.4802   0.5156   0.4836   0.5872   0.5167   0.2403
  3       0.5012   0.5042   0.5817   0.5390   0.5315   0.2644

Table 14. Segmentation scores in the final test

  Model                                    sscore   s_std
  3: DeepLab V3+ / ResNet-101 / 1024 × 1024   0.5459   0.2682

Fig. 2. Test image. Fig. 3. Prediction of the model trained with 512 × 512 images. Fig. 4. Prediction of the model trained with 1024 × 1024 images.

5. DISCUSSION & CONCLUSION

In task 1, we compare Cascade R-CNN with Hybrid Task Cascade to obtain a better detection model. FPN and Soft-NMS are used to improve the detection precision under class imbalance, and a proper NMS threshold helps improve the recall rate of small objects.

In task 2, we select DeepLab V3+ to solve the problem and choose bce+dice as the loss function to balance precision and recall. The image sizes in the dataset are a noticeable factor during the training phase, and adjusting the prediction threshold also contributes to a more balanced model.

6. REFERENCES

[1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[3] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. CoRR, abs/1712.00726, 2017.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[6] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. CoRR, abs/1612.03144, 2016.

[7] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. Hybrid task cascade for instance segmentation. CoRR, abs/1901.07518, 2019.

[8] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Improving object detection with one line of code. CoRR, abs/1704.04503, 2017.

[9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation, 2018.

[10] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, 2017.

[11] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin. Albumentations: Fast and flexible image augmentations. ArXiv e-prints, 2018.

[12] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. CoRR, abs/1608.03983, 2016.

[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, 2015.