A U-NET++ WITH PRE-TRAINED EFFICIENTNET BACKBONE FOR SEGMENTATION OF DISEASES AND ARTIFACTS IN ENDOSCOPY IMAGES AND VIDEOS

Le Duy Huỳnh, Nicolas Boutry

EPITA Research and Development Laboratory (LRDE), 14-16 rue Voltaire, F-94270 Le Kremlin-Bicêtre, France

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Endoscopy is a widely used clinical procedure for the early detection of numerous diseases. However, the images produced are usually heavily corrupted with multiple artifacts that reduce the visualization of the underlying tissue. Moreover, the localization of the actual diseased regions is also a complex problem. For these reasons, the EndoCV2020 challenges aim to advance the state of the art in the detection and segmentation of artifacts and diseases in endoscopy images. In this work, we propose approaches based on the U-Net and U-Net++ architectures to automate the segmentation tasks of EndoCV2020. We use EfficientNet as our encoder to extract powerful features for our decoders. Data augmentation and pre-trained weights are employed to prevent overfitting and improve generalization. Test-time augmentation also helps in improving the results of our models. Our methods perform well in this challenge and achieve a score of 60.20% for the EAD2020 semantic segmentation task and 59.81% for the EDD2020 one.

Index Terms— Endoscopy · U-Net++ · EfficientNet · Test-time augmentation · Segmentation · Detection

Fig. 1. Some output examples of our U-Net++ with EfficientNet B1 and YOLOv3. The top row shows the ground truths (GT): (a), (b) EAD2020 detection and segmentation GT; (c), (d) EDD2020 detection and segmentation GT. The bottom row, (e)-(h), shows our results for the corresponding tasks.

1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of cancers, for therapeutic procedures, and for minimally invasive surgery in hollow organs. Computer-assisted methods would improve diagnosis and assist in surgical planning. These endoscopic applications face two main challenges: 1) Endoscopy video frames are usually heavily corrupted with multiple artifacts that reduce the visualization of the underlying tissue and affect post-analysis. Accurate detection of these artifacts allows correction of the frames and is, therefore, a core challenge in a wide range of endoscopic applications [1]. 2) Even with uncorrupted video frames, temporally consistent localization and segmentation of disease ROIs is a challenge due to non-planar geometries, variations in imaging modalities, and deformations of organs. For these reasons, after the success of EAD2019 [2], the endoscopy computer vision challenges on segmentation and detection 2020 (EndoCV2020) are held to push the state of the art further. EndoCV2020 consists of two sub-challenges: Endoscopy Artefact Detection and Segmentation (EAD2020) and Endoscopy Disease Detection and Segmentation (EDD2020) [3]. Each challenge is further divided into a detection task and a semantic segmentation task.

In this paper, we introduce our work on these challenges. First, Sec. 2 presents our observations on the datasets. Then, we describe our methods in Sec. 3. We participate in all tasks of both challenges but focus mainly on the two semantic segmentation ones, using fully convolutional networks such as U-Net [4] and U-Net++ [5] with EfficientNet [6] backbones (Sec. 3.1). We also did some experiments on the detection tasks using our segmentation results or the YOLOv3 model [7] (Sec. 3.2). Finally, we present our results on the test data in Sec. 4 and conclude in Sec. 5.

2. DATASETS

2.1. About the dataset

The EndoCV2020 challenge consists of two datasets. The EAD2020 dataset is an extended version of last year's EAD2019 challenge dataset [8], with annotations corrected and a new class added. The EAD2020 train data were released in four subsets. They consist of a total of 3005 images, among which 2531 are annotated with bounding boxes for eight classes (specularity, saturation, artifact, blur, contrast, bubbles, instrument, and blood), and 643 come with segmentation ground truth for five classes (instrument, specularity, artifact, bubbles, saturation, and blood). The EDD2020 train data contain 386 images with bounding boxes and segmentation ground truth for five disease classes: non-dysplastic Barrett's oesophagus (BE), suspicious area, high-grade dysplasia (HGD), adenocarcinoma (cancer), and polyp.

2.2. Data correction

The ground truths for these challenges contain some issues, which can be classified into annotator disagreements and systematic errors.

Some annotator disagreements were spotted in the EAD2020 detection dataset. For example, an artifact region is identified in one frame but not in the next, or one connected region is marked with two bounding boxes in one frame but with only one in the other. There are also misclassified segmentation masks. We consider this type of error as noise in the dataset, since we do not have the resources to make adjustments.

We corrected all the systematic errors that we noticed. There are some small anomalies in the bounding-box ground truths, likely due to rounding when converting from Pascal-VOC to YOLO format. There is also a line at the bottom of some EAD2020 segmentation ground truths that does not correspond to any artifact.

3. OUR APPROACHES

We focus mainly on the two segmentation tasks, with models from the U-Net family. For disease detection, we take advantage of our segmentation results. For the artifact detection task, we train a separate YOLOv3 [7] network, due to the difference in the number of classes (eight classes for detection and five classes for segmentation) and due to the disagreement between the EAD2020 segmentation and detection ground truths (e.g., Fig. 1(a) and Fig. 1(b)).

3.1. Segmentation of Diseases and Artifacts

3.1.1. Models

The state of the art in semantic segmentation consists of methods based on encoder-decoder architectures such as U-Net and U-Net++. These architectures use skip-connections to combine the low-resolution, semantically rich feature maps of the deeper layers with shallower, fine-grained ones, in order to recover the fine details of the regions of interest.

Instead of the original U-Net encoder, we use EfficientNet [6], which is designed to balance network depth, width, and resolution. This architecture achieves better accuracy on ImageNet [9] with fewer parameters and fewer FLOPS than other networks such as ResNet [10] or DenseNet [11]. EfficientNet is available in several versions, from B0 with 5.3 million parameters to B7 with 66 million. We extract five feature maps at different scales from EfficientNet as the input of our decoders. An illustration of the EfficientNet B1 encoder, showing where the intermediate feature maps are extracted, is given in Fig. 2(a).

We started our experiments with the U-Net architecture and the EfficientNet B5 encoder. Subsequent tests showed that a deeper encoder is not needed, so we moved to a larger decoder (U-Net++) with a smaller encoder (EfficientNet B2 and later B1). An illustration of our networks can be found in Fig. 2.

We speed up the training process with pre-trained weights. Although the encoder was pre-trained on ImageNet, which is a database of natural images, the pre-trained weights do improve the training and the local validation scores.
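The paper does not specify the framework used to implement these networks. As a rough illustration of the architecture described above (a pre-trained EfficientNet encoder providing multi-scale feature maps to a U-Net++ decoder), the following is a minimal sketch using the third-party segmentation_models_pytorch library; the library choice, class names, and hyper-parameters shown are our assumptions, not the authors' code.

```python
# Sketch only: one possible way to build a U-Net++ with a pre-trained
# EfficientNet-B1 encoder, using the `segmentation_models_pytorch` package.
import torch
import segmentation_models_pytorch as smp

NUM_CLASSES = 5  # e.g., the five EAD2020 segmentation classes

model = smp.UnetPlusPlus(
    encoder_name="efficientnet-b1",   # EfficientNet-B1 backbone (Sec. 3.1.1)
    encoder_weights="imagenet",       # ImageNet pre-trained weights
    encoder_depth=5,                  # five feature maps at different scales
    in_channels=3,
    classes=NUM_CLASSES,
)

# Forward pass on a 512x512 input, the resolution used for training (Sec. 3.1.2).
x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    logits = model(x)                 # shape: (1, NUM_CLASSES, 512, 512)
print(logits.shape)
```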
3.1.2. Data augmentation

Input images are resized (while keeping their aspect ratio) and zero-padded to fit into 512x512 pixels. We randomly apply the following augmentation techniques, each with 50% probability: rotation by a random angle, RGB value shift, horizontal or vertical flipping, random scaling, elastic deformation [12], and cropping.

3.1.3. The loss function

The semantic segmentation tasks are evaluated with four metrics: F1 (i.e., the Dice score), F2, precision, and recall. The semantic segmentation score is the average of these four metrics [3]. Let us recall that

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

weights recall β times as important as precision, so the final score places more emphasis on recall. As a result, we train our networks using the F2-loss, which is defined similarly to the Dice loss: F2-loss = 1 - F2. We also apply L2 regularization with a factor of 0.0001.
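The paper does not give the F2-loss as code; a minimal soft F-beta loss written directly from the formula above (with β = 2), assuming PyTorch tensors of per-class probabilities and binary masks, could look as follows. The smoothing constant and the per-class averaging are our own choices.

```python
import torch

def f_beta_loss(probs: torch.Tensor, target: torch.Tensor,
                beta: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Soft F-beta loss (1 - F_beta), computed per class and averaged.

    probs:  predicted probabilities in [0, 1], shape (N, C, H, W)
    target: binary ground-truth masks,        shape (N, C, H, W)
    """
    dims = (0, 2, 3)                      # sum over batch and spatial dims, keep classes
    tp = (probs * target).sum(dims)       # soft true positives per class
    precision = tp / (probs.sum(dims) + eps)
    recall = tp / (target.sum(dims) + eps)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)
    return 1.0 - f_beta.mean()            # F2-loss = 1 - F2 for beta = 2 (Sec. 3.1.3)

# Usage example: loss = f_beta_loss(torch.sigmoid(logits), masks)
```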
3.1.4. Training

We train on 80% of the train set and use the other 20% for validation. We start by freezing the pre-trained encoder and training the decoder part for 80 epochs with the Adam optimizer at a learning rate (LR) of 10^-3. From the 41st epoch, we train the whole network, starting at LR = 10^-3 and decreasing it by a factor of 0.5 if the validation score does not improve for 40 epochs. We trained for a total of 1000 epochs; training is stopped early if the training score does not increase for 88 epochs. We select the weights that maximize the validation score for the evaluation on the test set.

3.1.5. Prediction and Post-processing

Test images are resized and zero-padded to 512x512. We keep the aspect ratio and do not enlarge small images. Since our predictions sometimes contain small holes, which rarely appear in the EAD2020 train set, a small hole-filling operation is applied at the end of the pipeline. This hole-filling step did not bring a significant improvement in the final score.

Fig. 2. Illustration of our versions of U-Net and U-Net++. 2(a): the EfficientNet B1 encoder and the extracted intermediate feature maps (X_i). 2(b): the EfficientNet building block MBConv and the building block of our decoder, Dec. 2(c), 2(d): our versions of U-Net and U-Net++.

3.1.6. Test-time Augmentation

The segmentation results can be further improved with test-time augmentation (TTA). This approach has been demonstrated in the literature (e.g., in semantic segmentation of brain tumors [13]). In short, we make predictions on the test image and on several of its transformed versions, and then combine these results. We use five transformations: horizontal flipping, vertical flipping, and rotations by 90°, 180°, and 270°.
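The paper only states that the TTA predictions are "combined"; a common choice is to map each prediction back to the original orientation and average the probability maps, which is what the sketch below assumes (square inputs, PyTorch tensors).

```python
import torch

def predict_with_tta(model, image: torch.Tensor) -> torch.Tensor:
    """Average predictions over the identity, horizontal/vertical flips,
    and 90/180/270 degree rotations (the five transformations of Sec. 3.1.6).

    image: tensor of shape (1, 3, H, W) with H == W (e.g., 512x512).
    Averaging the inverse-transformed probability maps is our assumption.
    """
    # Each entry: (forward transform, inverse transform).
    transforms = [
        (lambda x: x,                        lambda y: y),                          # original image
        (lambda x: torch.flip(x, dims=[-1]), lambda y: torch.flip(y, dims=[-1])),   # horizontal flip
        (lambda x: torch.flip(x, dims=[-2]), lambda y: torch.flip(y, dims=[-2])),   # vertical flip
        (lambda x: torch.rot90(x, 1, dims=[-2, -1]),
         lambda y: torch.rot90(y, -1, dims=[-2, -1])),                              # 90° rotation
        (lambda x: torch.rot90(x, 2, dims=[-2, -1]),
         lambda y: torch.rot90(y, -2, dims=[-2, -1])),                              # 180° rotation
        (lambda x: torch.rot90(x, 3, dims=[-2, -1]),
         lambda y: torch.rot90(y, -3, dims=[-2, -1])),                              # 270° rotation
    ]
    with torch.no_grad():
        probs = [inv(torch.sigmoid(model(fwd(image)))) for fwd, inv in transforms]
    return torch.stack(probs).mean(dim=0)
```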
3.2. Detection of Diseases and Artifacts

3.2.1. Detection of Diseases (EDD2020)

For this task, we take advantage of our segmentation results. The bounding boxes of the connected components (CCs) that are larger than 0.5% of the image area are reported as our detection results. This approach is not optimal because it cannot handle the split-and-merge of detection bounding boxes, that is, the cases where a single CC is marked with more than one box, or a single box marks several CCs.
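A possible implementation of this box-extraction step, assuming the per-class prediction is a binary NumPy mask and using scipy.ndimage for connected-component labelling (the library choice is ours, not stated in the paper), is sketched below.

```python
import numpy as np
from scipy import ndimage

def boxes_from_mask(mask: np.ndarray, min_area_ratio: float = 0.005):
    """Bounding boxes (x_min, y_min, x_max, y_max) of connected components (CCs)
    covering more than `min_area_ratio` of the image area (0.5% in Sec. 3.2.1).

    mask: binary 2-D array for a single class.
    """
    labeled, num_cc = ndimage.label(mask)
    if num_cc == 0:
        return []
    areas = np.bincount(labeled.ravel())           # areas[i] = pixel count of CC i
    min_area = min_area_ratio * mask.size          # 0.5% of the image area
    boxes = []
    for cc_id, cc_slice in enumerate(ndimage.find_objects(labeled), start=1):
        if cc_slice is not None and areas[cc_id] > min_area:
            ys, xs = cc_slice
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```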
3.2.2. Detection of Artifacts (EAD2020)

For the reasons stated earlier, we have to train a separate detector for this task. We choose YOLOv3 because this model's effectiveness has been shown on this type of data [1]. It is also relatively faster than two-stage detectors such as RetinaNet [14]. We tried to address the class imbalance issue with the focal loss [14]; however, it did not improve our local validation mAP, similarly to the remark in [7].

We train our standard YOLOv3 on 416x416 inputs. We start by training the last three detection layers for 20 epochs at LR = 10^-2, then the upscaling part of the network for 40 epochs at LR = 10^-3. Finally, we train the whole network at LR = 10^-5 and reduce the LR by a factor of 0.5 if the validation loss does not decrease for five epochs.

4. EVALUATION

As presented in Sec. 3.1.3, the semantic segmentation tasks are evaluated by the mean of F1, F2, precision, and recall. The detection tasks are evaluated by a combination of mean average precision (mAP) and intersection over union (IoU). However, how these metrics are weighted was not disclosed until after the challenge ended.

The evaluation was done online at https://endocv.grand-challenge.org/. It is divided into two phases; the first test phase only evaluates 50% of the test data. The sizes of the EAD2020 and EDD2020 test data are summarized in Tab. 1. The EAD2020 detection task also includes an out-of-sample set, which contains images provided separately from the training data and the other test datasets.

Table 1. Number of images in the test data for each task.

              Detection   Segmentation   Out-of-sample detection
  EAD2020     317         162            99
  EDD2020     43          43             0

The quantitative results of our models are presented in Tab. 2 and Tab. 3. Our segmentation models performed well and consistently on both test subsets. In Tab. 2, we can observe that the hole-filling operator adds a small boost, while TTA improves the final score significantly. Our best approach is the U-Net++ with the B1 encoder, run with TTA and post-processed with hole filling. We can also see that our detection models need improvement.

Table 2. Quantitative results of our segmentation models on EAD2020 and EDD2020 test data. (4): TTA using the first four transformations mentioned in Sec. 3.1.6; (5): TTA using all five transformations of Sec. 3.1.6; (F): hole filling applied; U-Net++512: model with the same architecture as in Fig. 2(d) but with double the number of decoder filters, i.e., Dec(x, 2) becomes Dec(2x, 2); (-): information not available because that method was not submitted to that test phase.

Endoscopy Diseases and Artifacts Segmentation
                          --------- 50% of test set ---------   Full test set
  Model                   F1      F2      Precision  Recall  Score    Final score
  EAD2020
  U-Net++ B1 (5, F)       -       -       -          -       -        0.6020
  U-Net++ B1 (4, F)       0.5777  0.5629  0.6584     0.5661  0.5913   0.5989
  U-Net++ B1 (4)          0.5766  0.5624  0.6556     0.5659  0.5901   -
  U-Net++ B1              0.5305  0.5246  0.5963     0.5391  0.5476   -
  U-Net++ B2              0.5210  0.4921  0.6586     0.4872  0.5397   -
  U-Net B5                0.5126  0.5093  0.5767     0.5436  0.5355   -
  EDD2020
  U-Net++512 B1 (5, F)    -       -       -          -       -        0.5981
  U-Net++ B1 (4)          0.4956  0.5676  0.4280     0.6562  0.5369   0.5436

Table 3. Quantitative results of our detection models on EAD2020 and EDD2020 test data.

Endoscopy Diseases and Artifacts Detection
  Method                  Score              Out-of-sample score
  EAD2020
  YOLOv3                  0.1702 ± 0.0567    0.1130 ± 0.0752
  EDD2020
  U-Net++512 B1 (5, F)    0.1565 ± 0.0547    -
  U-Net++ B1 (4)          0.1487 ± 0.0578    -

5. CONCLUSION AND PERSPECTIVE

In this work, we demonstrate the effectiveness of U-Net and U-Net++ with a pre-trained EfficientNet backbone for the segmentation of diseases and artifacts in endoscopy images. Our experiments also show that TTA provides better segmentation results than predicting only on the original images. For future work, we intend to improve the generalization of our segmentation and detection models by applying more data augmentation techniques and by using synthetic data. We are also considering improving the decoder of our models further with an attention mechanism.

6. REFERENCES

[1] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[2] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Geroges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[3] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. CoRR, abs/2003.03376, February 2020.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.

[5] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 3–11, 2018.

[6] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.

[7] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[8] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pages 770–778. IEEE Computer Society, 2016.

[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[12] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), volume 1, pages 958–963. IEEE Computer Society, 2003.

[13] Guotai Wang, Wenqi Li, Sébastien Ourselin, and Tom Vercauteren. Automatic brain tumor segmentation using convolutional neural networks with test-time augmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, pages 61–72, Cham, 2019. Springer International Publishing.

[14] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, October 2017.