=Paper=
{{Paper
|id=Vol-2595/endoCV2020_Nguyen_et_al
|storemode=property
|title=Detection and Segmentation of Endoscopic Artefacts and Diseases Using Deep Architectures
|pdfUrl=https://ceur-ws.org/Vol-2595/endoCV2020_paper_id_16.pdf
|volume=Vol-2595
|authors=Nhan T. Nguyen,Dat Q. Tran,Dung B. Nguyen
|dblpUrl=https://dblp.org/rec/conf/isbi/NguyenTN20
}}
==Detection and Segmentation of Endoscopic Artefacts and Diseases Using Deep Architectures==
DETECTION AND SEGMENTATION OF ENDOSCOPIC ARTEFACTS AND DISEASES USING DEEP ARCHITECTURES

Nhan T. Nguyen*, Dat Q. Tran*, Dung B. Nguyen
Medical Imaging Department, Vingroup Big Data Institute (VinBDI), Hanoi, Vietnam
{v.nhannt64;v.dattq13;v.dungnb1}@vinbdi.org

*Equal contribution.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

We describe in this paper our deep learning-based approach for the EndoCV2020 challenge, which aims to detect and segment either artefacts or diseases in endoscopic images. For the detection task, we propose to train and optimize EfficientDet, a state-of-the-art detector, with different EfficientNet backbones using Focal loss. By ensembling multiple detectors, we obtain a mean average precision (mAP) of 0.2524 on EDD2020 and 0.2202 on EAD2020. For the segmentation task, two different architectures are proposed: U-Net with an EfficientNet-B3 encoder and Feature Pyramid Network (FPN) with a dilated ResNet-50 encoder. Each of them is trained with an auxiliary classification branch. Our model ensemble reports an sscore of 0.5972 on EAD2020 and 0.701 on EDD2020, which were among the top submissions of both challenges.

1. INTRODUCTION

Disease detection and segmentation in endoscopic imaging play an important role in the early detection of numerous cancers, such as gastric, colorectal, and bladder cancers [1]. Meanwhile, the detection and segmentation of endoscopic artefacts are necessary for image reconstruction and quality assessment [2]. Many approaches [3, 4, 5] have been proposed to detect and segment artefacts and diseases in endoscopy. This paper describes our solution for the EndoCV2020 challenge (https://endocv.grand-challenge.org), which consists of two tracks: one deals with artefacts (EAD2020) and the other with diseases (EDD2020). Each track is divided into two tasks: detection and segmentation. We tackle both tasks in both tracks by exploiting state-of-the-art deep architectures such as EfficientDet [6] and U-Net [7] with variants of EfficientNet [8] and ResNet [9] as backbones. In the next sections, we provide a short description of the datasets, the details of the proposed approach, and experimental results.

2. DATASETS

EDD2020 [1] is a comprehensive dataset established to benchmark algorithms for disease detection and segmentation in endoscopy. It is annotated for 5 different disease classes: BE, Suspicious, HGD, Cancer, and Polyp. The dataset comes with bounding boxes for disease detection and with masked image annotations for semantic segmentation. The training set includes a total of 386 endoscopy frames, each of which is annotated with either a single disease or multiple diseases. Regions of the same class are merged into a single mask, while a bounding box of multiple classes is treated as separate boxes at the same location. Figure 1 shows the number of bounding boxes for each disease class.

Fig. 1. The number of bounding boxes for each disease class in the training set provided by the EDD2020 dataset.

EAD2020 [10, 11], on the other hand, is used for the track of endoscopy artefact detection and segmentation. The training set contains 2,531 annotated frames for 8 artefact classes: specularity, bubbles, saturation, contrast, blood, instrument, blur, and imaging artefacts. Note that only the first 5 classes are used for the segmentation task.

3. PROPOSED METHODS

3.1. Multi-class detection task

Detection network: For the detection task, we deployed EfficientDet [6], currently a state-of-the-art architecture for object detection. It employs EfficientNet [8] as the backbone network, BiFPN as the feature network, and a shared class/box prediction network. Both the BiFPN layers and the class/box net layers are repeated multiple times depending on different resource constraints. Figure 3 illustrates the EfficientDet architecture.

Training procedure: Due to the limited training data available (386 images in EDD2020 and 2,531 images in EAD2020), we use various data augmentation techniques, including random shift, random crop, rotation, scale, horizontal flip, vertical flip, blur, Gaussian noise, sharpen, emboss, and contrast. In particular, we found that the use of mixup could significantly reduce overfitting. Given two input images x1 and x2, the mixup image x̃ is constructed as

x̃ = λx1 + (1 − λ)x2,

and the network maps x̃ to the prediction ŷ. During training, our goal is to minimize the mixup loss Lmixup, which is expressed as

Lmixup = λL(ŷ, y1) + (1 − λ)L(ŷ, y2),    (1)

where L denotes the Focal loss [12] and λ is drawn from a Beta(0.75, 0.75) distribution; y1 and y2 are the ground-truth labels, while ŷ is the predicted label produced by the network. Figure 4 visualizes a mixup example with λ fixed to 0.5.

Fig. 4. Mixup visualization with λ = 0.5.
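Eq. (1) translates into only a few lines of code. Below is a minimal PyTorch-style sketch of batch-level mixup, assuming a focal-loss `criterion` mapping (logits, multi-hot targets) to a scalar; the function names `mixup_batch` and `mixup_loss` are illustrative, not taken from the authors' implementation.

```python
import torch

def mixup_batch(images, targets, alpha=0.75):
    """Blend a batch with a shuffled copy of itself: x~ = lam*x1 + (1-lam)*x2.

    images:  (B, C, H, W) float tensor
    targets: (B, num_classes) multi-hot label tensor
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # lam ~ Beta(0.75, 0.75)
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, targets, targets[perm], lam

def mixup_loss(criterion, logits, y1, y2, lam):
    # Eq. (1): L_mixup = lam * L(y_hat, y1) + (1 - lam) * L(y_hat, y2)
    return lam * criterion(logits, y1) + (1.0 - lam) * criterion(logits, y2)
```

Note that λ is resampled for every batch, so each gradient step sees a differently blended pair of images.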
Our detectors are optimized by gradient descent using the Adam update rule [13] with weight decay. In addition, a cyclical learning rate [14] with restarts is used. An ensemble of 6 models with different backbones (D0, D1, D2, D3, D4, and D5), combined using weighted box fusion [15], serves as our final model. Additionally, we search for the non-maximum suppression (NMS) threshold and the confidence threshold of each category so that the resulting score (0.5 × mAP + 0.5 × IoU) is maximized.

3.2. Multi-class segmentation task

Segmentation network: We propose two different architectures for this task: U-Net with EfficientNet encoders and BiFPN with ResNet encoders.

U-Net: Our first network design makes use of U-Net with EfficientNet-B3/B4 as the backbone. We keep the original strides between blocks in EfficientNet and extract the feature maps from the last 5 blocks for the segmentation. A classification branch is used to provide the label predictions. The overall framework is depicted in Figure 2.

Fig. 2. The U-Net with EfficientNet-B3/B4 encoder and a classification branch.

BiFPN: To generate the segmentation output from the BiFPN features, we combine all levels of the BiFPN pyramid by following the design illustrated in Figure 5. Starting with the deepest BiFPN level (the stride-32 output), we apply three upsampling stages to obtain a feature map at the stride-4 scale. An upsampling stage consists of a 3×3 convolution, BatchNorm, ReLU, and 2× bilinear upsampling. This strategy is repeated for the other BiFPN levels with strides of 16, 8, and 4. The result is a set of feature maps at the same scale, which are then concatenated channel-wise. Finally, a 1×1 convolution, 4× bilinear upsampling, and a sigmoid activation are used to generate the mask at the image resolution.

Fig. 5. The BiFPN decoder for semantic segmentation.
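The decoder just described can be sketched as follows in PyTorch, assuming all BiFPN levels share one channel width; the class name, the width of 64 channels, and the module layout are illustrative guesses at the design, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def upsample_stage(channels):
    # One upsampling stage: 3x3 conv -> BatchNorm -> ReLU -> 2x bilinear upsampling.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

class SegDecoder(nn.Module):
    """Combines BiFPN levels (strides 32/16/8/4) into one stride-4 map."""

    def __init__(self, channels=64, num_classes=5):  # 5 mask classes, as in EDD2020
        super().__init__()
        # The level at stride 2**(2+k) needs k upsampling stages to reach stride 4.
        self.paths = nn.ModuleList([
            nn.Sequential(*[upsample_stage(channels) for _ in range(k)])
            for k in (3, 2, 1, 0)  # strides 32, 16, 8, 4
        ])
        self.head = nn.Conv2d(4 * channels, num_classes, kernel_size=1)

    def forward(self, feats):  # feats: [p32, p16, p8, p4] BiFPN outputs
        # Bring every level to stride 4 and concatenate channel-wise.
        fused = torch.cat([path(f) for path, f in zip(self.paths, feats)], dim=1)
        logits = self.head(fused)
        # 4x bilinear upsampling back to image resolution, then sigmoid.
        logits = F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)
```

Because each class mask passes through its own sigmoid, the decoder predicts the classes independently, which matches the multi-label nature of the annotations.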
Training procedure: All models are trained end-to-end with additional supervision from the multi-label classification task. The image labels are obtained directly from the segmentation masks: for example, if an image has a BE mask annotation, then its BE label is 1. Due to the class imbalance in the training dataset, we use Focal loss for the classification task. Our final loss is L = Lseg + λ × Lcls, where λ = 0.4.

Inference: Relying solely on the segmentation branch to predict masks results in many false positives. Hence, we make use of the class predictions to remove masks. We search for the optimal classification thresholds that maximize the macro F1 score on the validation set. For every image, if a class probability is less than the optimal threshold, then its predicted mask is completely removed.
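One way to realize this post-processing is sketched below, assuming validation-set class probabilities and labels are available as NumPy arrays. Since the macro F1 is the mean of per-class F1 scores, each class threshold can be searched independently; the grid and helper names are illustrative, not the authors' exact procedure.

```python
import numpy as np
from sklearn.metrics import f1_score

def search_thresholds(val_probs, val_labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, per class, the threshold maximizing F1 on the validation set.

    val_probs:  (N, num_classes) predicted class probabilities
    val_labels: (N, num_classes) binary ground-truth labels
    """
    num_classes = val_probs.shape[1]
    thresholds = np.zeros(num_classes)
    for c in range(num_classes):
        scores = [f1_score(val_labels[:, c], val_probs[:, c] >= t, zero_division=0)
                  for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

def gate_masks(masks, class_probs, thresholds):
    """Zero out the predicted mask of every class whose probability
    falls below its optimal threshold."""
    keep = class_probs >= thresholds        # (num_classes,) boolean
    return masks * keep[:, None, None]      # masks: (num_classes, H, W)
```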
4. EXPERIMENTAL RESULTS

Table 1 summarizes the detection and segmentation results of our submissions for both challenges. We describe the results of each sub-task below.

Fig. 3. The EfficientDet architecture. The class prediction network was modified to provide the probabilities of the 5 disease classes. The figure was reproduced from Tan et al. [6].

| Challenge | dscore | dstd   | sscore | sstd   |
|-----------|--------|--------|--------|--------|
| EAD2020   | 0.2202 | 0.1029 | 0.5972 | 0.2765 |
| EDD2020   | 0.2524 | 0.0948 | 0.7008 | 0.3211 |

Table 1. Detection and segmentation scores on the EndoCV2020 test set.

Results on the validation set of EDD2020 for the detection task are detailed in Table 2. Our best single model (i.e., EfficientDet-D5) obtained a detection score (dScore) of 0.41. The best detection performance was provided by the ensemble model, which reported a dScore of 0.44, a mean mAP of 0.36±0.05, and an IoU of 0.52. As shown in Table 1, our ensemble model yielded dScores of 0.2524±0.0948 and 0.2202±0.1029 on the hidden test sets of EDD2020 and EAD2020, respectively.

| Method                       | dScore | mAP       | IoU  |
|------------------------------|--------|-----------|------|
| ED0 [6]                      | 0.23   | 0.13±0.04 | 0.33 |
| ED0, Augs                    | 0.34   | 0.26±0.07 | 0.42 |
| ED0, Augs, Mixup, CLR [16]   | 0.40   | 0.30±0.05 | 0.51 |
| ED5, Augs, Mixup, CLR [16]   | 0.41   | 0.29±0.05 | 0.54 |
| Ensemble (ED0–ED5), WBF [15] | 0.44   | 0.36±0.05 | 0.52 |

Table 2. Experimental results on the EDD2020 validation set.

Results on the validation sets for the segmentation task are provided in Table 3 and Table 4. On the EDD2020 validation set, our best single model achieved a Dice score of 0.854 and an IoU of 0.832. On the EAD2020 validation set, we obtained a Dice score of 0.732 and an IoU of 0.578. As shown in Table 1, our ensemble achieved a segmentation score (sscore) of 0.5972 in the EAD2020 challenge and an sscore of 0.7008 in the EDD2020 challenge, both of which were among the top results for the segmentation task of the two tracks.

| Method                     | Dice            | IoU             |
|----------------------------|-----------------|-----------------|
| UNet-EfficientNetB4 [7, 8] | 0.8522 ± 0.0221 | 0.8279 ± 0.0213 |
| BiFPN-ResNet50             | 0.8544 ± 0.0232 | 0.8317 ± 0.0228 |

Table 3. 5-fold cross-validation results on EDD2020.

| Method              | Dice            | IoU            |
|---------------------|-----------------|----------------|
| UNet-EfficientNetB4 | 0.7131 ± 0.0379 | 0.555 ± 0.0451 |
| BiFPN-ResNet50      | 0.7325 ± 0.0162 | 0.578 ± 0.0201 |

Table 4. 3-fold cross-validation results on EAD2020.

5. CONCLUSION

We have described our solutions for the detection and segmentation tasks on both tracks of EndoCV2020: EAD for artefacts and EDD for diseases. By using EfficientDet for detection and U-Net/BiFPN for segmentation, we obtained strong results on both datasets, especially for the segmentation task. These results suggest that deep architectures that are effective for natural images can also be useful for medical images such as endoscopic ones, even with small training datasets.

6. REFERENCES

[1] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. CoRR, abs/2003.03376, February 2020.

[2] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[3] P. S. Hiremath, B. V. Dhandra, Iranna Humnabad, Ravindra Hegadi, and G. G. Rajput. Detection of esophageal cancer (necrosis) in the endoscopic images using color image segmentation. In Proceedings of the Second National Conference on Document Analysis and Recognition (NCDAR-2003), Mandya, India, pages 417–422, 2003.

[4] Piotr Szczypiński, Artur Klepaczko, Marek Pazurek, and Piotr Daniel. Texture and color based image segmentation and pathology detection in capsule endoscopy videos. Computer Methods and Programs in Biomedicine, 113(1):396–411, 2014.

[5] Eva Tuba, Milan Tuba, and Raka Jovanovic. An algorithm for automated segmentation for bleeding detection in endoscopic images. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4579–4586. IEEE, 2017.

[6] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.

[7] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), pages 234–241. Springer International Publishing, 2015.

[8] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.

[10] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[11] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Leslie N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.

[15] Roman Solovyev and Weimin Wang. Weighted boxes fusion: ensembling boxes for object detection models. arXiv preprint arXiv:1910.13302, 2019.

[16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.