=Paper=
{{Paper
|id=Vol-2595/endoCV2020_Hung_et_al
|storemode=property
|title=Artefact Detection and Segmentation Using Cascade R-CNN and U-Net
|pdfUrl=https://ceur-ws.org/Vol-2595/endoCV2020_paper_id_31.pdf
|volume=Vol-2595
|authors=Hoang Manh Hung,Phan Tran Dac Thinh,Hyung-Jeong Yang,Soo-Hyung Kim,Guee-Sang Lee
|dblpUrl=https://dblp.org/rec/conf/isbi/HungTYKL20
}}
==Artefact Detection and Segmentation Using Cascade R-CNN and U-Net==
ARTEFACT DETECTION AND SEGMENTATION USING CASCADE R-CNN & U-NET

Hoang Manh Hung*, Phan Tran Dac Thinh*, Hyung-Jeong Yang, Soo-Hyung Kim, and Guee-Sang Lee

Department of Electronic and Computer Engineering, Chonnam National University, South Korea

*Both authors contributed equally to this manuscript.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Endoscopy is a widely adopted procedure for the early detection of various types of cancer, for therapeutic procedures, and for minimally invasive surgery. Nevertheless, its efficiency is degraded by several artefacts, namely pixel saturation, motion blur, defocus, bubbles, specular reflections, etc. Provided that all of these artefacts are spotted comprehensively, a contaminated frame can be restored to adequate quality, better visualizing the underlying tissue during diagnosis. In this paper, we present and discuss our methodology for detection and segmentation on endoscopy images. For artefact detection, we modified a deep neural network based on Cascade R-CNN with a ResNeXt101 backbone that includes deformable convolutions; a Feature Pyramid Network is added to refine the raw feature maps, enhancing feature extraction. For the semantic segmentation task, U-Net is utilized with SE-ResNeXt50 as the backbone; this classification backbone proved clearly dominant compared to the other models we tested. At the end of the Endoscopy Artefact Detection challenge 2020, we attain an mAP score of 0.2366 with a deviation of 0.0762 on the detection test dataset and a dice score of 0.5700 on the segmentation test dataset.

1. INTRODUCTION

In the last decade, medical imaging for disease diagnosis and early treatment has taken a great step forward thanks to the surge of machine learning applications in computer vision. The applications of this technology are indisputably numerous, shortening the time to diagnosis and accelerating treatment. Although the precision of medical-imaging software still cannot be compared to that of experts, recent advances have demonstrated its promising ability to stand in for humans in various tasks. Endoscopy is one clinical procedure that can greatly benefit from this burgeoning technology. In this procedure, a long, thin, flexible tube with a light source and a camera at its tip captures frames inside the body, revealing uncommon specks that could be indications of disease or cancer. Unfortunately, diagnosis is made difficult by frames corrupted with multiple artefacts such as pixel saturation, motion blur, defocus, fluid, debris, etc. However, an amendment technique called high-quality endoscopic frame restoration can thoroughly solve this issue, but only if the undesired objects are accurately tracked down. Consequently, detecting and identifying those artefacts is crucial for endoscopic diagnosis. Each endoscopic frame is tainted by multiple artefacts, and their influence varies from image to image. Unless the restoration technique knows the precise spatial locations of those artefacts, image quality cannot be guaranteed for further diagnosis.

Object detection and segmentation have been gaining a lot of attention lately, leading to the advent of many powerful neural network models. In object detection, Zhang et al. proposed Mask-Aided R-CNN, based on Mask R-CNN [1], which modifies the mask head to assist training on pixel-level labelled samples. RetinaNet [2], associated with focal loss, was introduced by Lin et al. to predict bounding-box sizes more precisely and cope with class imbalance. To reduce vanishing positive samples at large thresholds and increase hypothesis quality, Cascade R-CNN [3] with feature pyramid networks (FPN) [4] was presented by Cai et al. For object segmentation, U-Net [5], built by Ronneberger et al., is one of the most popular models; it is favored across many applications and performs effectively. Moreover, DeepLab [6], proposed by Chen et al., is a vigorous model because it can enlarge the field of view of its filters for larger-context learning without increasing the amount of computation. With its strong exploitation of global context information, the Pyramid Scene Parsing (PSP) network [7] introduced by Zhao et al. is another model that achieved high scores in segmentation.

The Endoscopy Artefact Detection 2020 (EAD) Challenge [8, 9, 10] is one of the competitions aimed at optimizing automatic artefact detection. The challenge provides two kinds of labelled data: for the detection task, images with bounding-box annotations; for the segmentation task, images with pixel-level annotations. In this paper, we present a different model for each task of the EAD challenge, chosen after several preliminary tests with a few candidate models: Cascade R-CNN is utilized for the former task and U-Net with SE-ResNeXt50 is applied for the latter. Both models are discussed in Section 2, and we delineate their training in Section 3.
2. METHODS

2.1. Artefact Detection

Our approach is based on Cascade R-CNN [3], which is trained sequentially using cascaded bounding-box regression. This network produced higher-quality proposals during inference than the other models we tested. In addition, the detector uses a resampling mechanism that reduces overfitting at high intersection-over-union (IoU) thresholds and also dismisses some outliers. The ResNeXt-101 (64x4d) backbone, followed by a Feature Pyramid Network (FPN) [4] in top-down mode as the neck, increases the network's feature extraction capability and improves recall. Furthermore, deformable convolutions (DCN) [11] are added to backbone stages 3 to 5, which helps the model perceive the image content and distinguish the desired objects from the large background. Our modified model for detection is illustrated in Fig. 1.

Fig. 1. The proposed method based on Cascade R-CNN.
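The paper includes no code, but the detector described in Sec. 2.1 maps naturally onto an MMDetection-style configuration. The sketch below is our illustration, not the authors' configuration: it assumes MMDetection's config conventions, elides the RPN and cascade RoI heads, and the checkpoint path is hypothetical.

```python
# Hedged MMDetection-style sketch of the Sec. 2.1 detector: Cascade R-CNN,
# a ResNeXt-101 (64x4d) backbone with deformable convolutions in stages 3-5,
# and an FPN neck. Illustrative only; not the authors' actual config.
model = dict(
    type='CascadeRCNN',
    backbone=dict(
        type='ResNeXt',
        depth=101,
        groups=64,                      # the 64x4d variant
        base_width=4,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        # deformable convolutions in backbone stages 3-5 (conv3_x..conv5_x)
        dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
        stage_with_dcn=(False, True, True, True)),
    neck=dict(
        type='FPN',                     # top-down feature pyramid as the neck
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    # rpn_head and the three-stage cascade roi_head follow MMDetection's
    # standard Cascade R-CNN definitions and are omitted here.
)
# Sec. 3.1 uses SGD with an initial learning rate of 0.02
# (momentum and weight decay are our assumptions).
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
# Sec. 3.1 initializes from COCO-2017-pretrained weights; in MMDetection this
# is typically done via `load_from` (the path below is hypothetical).
load_from = 'checkpoints/cascade_rcnn_x101_64x4d_fpn_coco.pth'
```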
2.2. Semantic Segmentation

The mask images for segmentation contain 5 classes, namely instrument, specularity, artefact, bubbles, and saturation. The distribution of desired objects across images is uneven, and some artefacts, specularity in particular, appear only sparsely against the background. Moreover, the mask images are sometimes ambiguous when we try to separate foreground from background. Therefore, we need a tool powerful enough to solve these problems. For this task, after assessing several models for segmentation, we decided to take advantage of U-Net [5], which has been validated in many papers, including work on medical segmentation. Many studies favor U-Net as their main network, mostly because of its flexible, interchangeable backbone. The classification performance of the backbone network determines the success of the whole segmentation model: the higher the backbone's accuracy, the better the U-Net model can distinguish the desired pixels from the background. Therefore, after a few trials with several backbones (variants of ResNet, ResNeXt, and SE-ResNeXt), we chose the SE-ResNeXt50 model for its balance of strong performance, reasonable computational cost, and acceptable training time; some models yielded better results, but took far longer to train. SE-ResNeXt50 is a modified version of ResNeXt with Squeeze-and-Excitation blocks [12], which improve the representational power of the network. This model handles the imbalanced dataset better than the other models and retrieves more hidden fragments in the picture.

Furthermore, binary cross-entropy loss and dice loss are combined to deal with the discrete distribution of the foreground. Another important point is the use of a pretrained backbone: it not only increases prediction accuracy but also reduces total training time. A threshold, tuned beforehand, is then applied to the predicted mask; this threshold is chosen from the range 0.2 to 0.9 as the value giving the best dice score on the whole dataset, as sketched in the code after this section.
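To make the segmentation setup concrete, here is a minimal sketch of the U-Net with an SE-ResNeXt50 encoder and the combined BCE + dice objective of Sec. 2.2. It assumes the segmentation_models_pytorch package for the encoder-decoder; the dice smoothing constant is our assumption, and this is not the authors' code.

```python
import torch
import torch.nn as nn
import segmentation_models_pytorch as smp  # assumed; provides U-Net with SE-ResNeXt encoders

# U-Net with a pretrained SE-ResNeXt50 encoder, one binary mask at a time
# (the paper trains each of the 5 classes separately, see Sec. 3.2).
model = smp.Unet(
    encoder_name="se_resnext50_32x4d",
    encoder_weights="imagenet",   # pretrained backbone, not frozen (Sec. 3.2)
    classes=1,
    activation=None,              # raw logits; sigmoid is applied in the loss
)

class BCEDiceLoss(nn.Module):
    """Sum of binary cross-entropy and soft dice loss (Sec. 2.2)."""
    def __init__(self, smooth: float = 1.0):  # smoothing constant is our assumption
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = (2 * inter + self.smooth) / (denom + self.smooth)
        return self.bce(logits, target) + (1 - dice).mean()

criterion = BCEDiceLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # constant LR, Sec. 3.2
```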
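The threshold tuning at the end of Sec. 2.2 is a plain sweep over [0.2, 0.9]. The following is a sketch under the assumption that per-image foreground probabilities and binary ground-truth masks are available as NumPy arrays; the 0.05 step size and the function names are illustrative.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks."""
    inter = (pred * gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def best_threshold(probs: np.ndarray, gts: np.ndarray):
    """Sweep thresholds in [0.2, 0.9] and keep the one with the best mean
    dice over the whole dataset (the step size of 0.05 is our assumption)."""
    best_t, best_d = 0.2, -1.0
    for t in np.arange(0.2, 0.9 + 1e-9, 0.05):
        d = np.mean([dice_score((p >= t).astype(np.uint8), g)
                     for p, g in zip(probs, gts)])
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```

On the authors' data, this kind of sweep settled on a threshold of 0.4 (Sec. 3.2).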
3. EXPERIMENTAL RESULTS

3.1. Artefact Detection

In the multi-class artefact detection task, the dataset consists of 2531 images, and each image can contain one or more of 8 classes, namely specularity, saturation, artifact, blur, contrast, bubbles, instrument, and blood. First, we preprocess the images by resizing all of them to 608x608. They are then normalized by our mean and standard deviation and augmented with random flips and crops. Stochastic gradient descent is the chosen optimizer, with an initial learning rate of 0.02. The backbone uses a model pretrained on the COCO dataset (2017) to enhance the performance of the main model, and 5-fold cross-validation is used for a stable deviation estimate. In the testing phase, the weights from the 5 training folds are used to run detection on the online test dataset. Non-maximum suppression (NMS) [13] filters redundant detections in post-processing for each fold. Subsequently, to combine the predictions from the 5 sets of weights, we employ the Weighted Boxes Fusion (WBF) [14] method instead of the more common NMS: WBF not only deletes redundant boxes but also uses the information from those boxes to align the final boxes more accurately (see the sketch after this section). In the end, we obtain an mAP score of 0.2366 with a deviation of 0.0762 on the test set, as shown in Table 1. Fig. 2 shows one of our predicted images from the test dataset.

Table 1. Detection results.

Dataset | Model                            | Score_d | gmAP
Test    | Cascade R-CNN + DCN + ResNeXt101 | 0.2366  | 0.2153

Fig. 2. Sample of predicted image from test dataset.
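The fold ensembling described above can be reproduced with the ensemble-boxes package released by the authors of WBF [14]. Below is a minimal sketch, assuming boxes are already normalized to [0, 1] and per-fold NMS has been applied; the iou_thr value is our assumption, not a setting reported in the paper.

```python
from ensemble_boxes import weighted_boxes_fusion  # package by the authors of [14]

# Toy example with two folds; real use fuses the predictions of all 5 folds.
# Boxes are [x1, y1, x2, y2], normalized to [0, 1].
boxes_list  = [[[0.10, 0.10, 0.50, 0.50]], [[0.12, 0.11, 0.52, 0.48]]]
scores_list = [[0.9], [0.8]]
labels_list = [[1], [1]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=None,      # equal weight per fold
    iou_thr=0.55,      # box-clustering threshold; our assumption
    skip_box_thr=0.0,  # keep all boxes regardless of confidence
)
print(boxes, scores, labels)
```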
3.2. Semantic Segmentation

The challenge provides 643 pixel-level annotated images for semantic segmentation. The image size is not fixed, so the images must be preprocessed before training, and since there is no separate validation set, we use 5-fold cross-validation to ensure the reliability of our own evaluation. First, we resize all images to 256x256; the original images vary in size and are larger than this normalized size, and empirically, a larger normalized size clearly improves the overall segmentation result. Second, we train the U-Net segmentation model with the SE-ResNeXt50 backbone. We also use the pretrained SE-ResNeXt50 weights to boost segmentation performance and reduce training time; the backbone is not frozen at the start of fine-tuning, but is loaded with pretrained weights and trained together with the main model. Augmentation is not applied in our training because it produced lower results than training without augmentation on a few lighter-weight networks; we did not check whether augmentation could boost this model's performance, due to lack of time. Our GPU is an Nvidia RTX 2080Ti, so we set the batch size to 32. The Adam optimizer is used, with the learning rate kept constant at 3x10^-4. Because the pixel-level labels overlap, meaning one pixel can belong to more than one class, we train a separate model per class; with 5-fold cross-validation, this amounts to 25 training runs in total for the 5 classes. Each training run lasts 100 epochs. The predicted mask for one class is the average of the scores from the 5 fold weights. After training, we search for the threshold for our predicted images; 0.4 is the threshold value that yields our best score, as shown in Table 2.

Table 2. Segmentation results.

Dataset | Model                 | s_score | s_std
Test    | U-Net + SE-ResNeXt50  | 0.5700  | 0.2703

4. CONCLUSION

In this paper, we present two validated models for the EAD2020 Challenge. The main difficulty we met in this challenge was choosing the most suitable components for a network that performs well on the given dataset. This shows that there are still many approaches we have not tested, and that the development of artefact detection for endoscopy is very promising.

5. ACKNOWLEDGMENT

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (NRF-2019M3E5D1A02067961), a grant (HCRI 19136) from the Chonnam National University Hwasun Hospital Institute for Biomedical Science, and a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2017R1A2B4011409).

6. REFERENCES

[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2020.

[2] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[4] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 9351, 2015.

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 2018.

[7] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[8] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[9] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[10] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Geroges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[11] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. IEEE International Conference on Computer Vision (ICCV), 2017.

[12] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[13] Alexander Neubeck and Luc Van Gool. Efficient non-maximum suppression. 18th International Conference on Pattern Recognition (ICPR), 3, 2006.

[14] Roman Solovyev and Weimin Wang. Weighted boxes fusion: Ensembling boxes for object detection models. CoRR, abs/1910.13302, 2019.