A U-NET++ WITH PRE-TRAINED EFFICIENTNET BACKBONE FOR SEGMENTATION OF DISEASES AND ARTIFACTS IN ENDOSCOPY IMAGES AND VIDEOS

Le Duy Huỳnh, Nicolas Boutry

EPITA Research and Development Laboratory (LRDE), 14-16 rue Voltaire, F-94270 Le Kremlin-Bicêtre, France

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Endoscopy is a widely used clinical procedure for the early detection of numerous diseases. However, the images produced are usually heavily corrupted with multiple artifacts that reduce the visualization of the underlying tissue. Moreover, the localization of the actual diseased regions is also a complex problem. For these reasons, the EndoCV2020 challenges aim to advance the state of the art in the detection and segmentation of artifacts and diseases in endoscopy images. In this work, we propose approaches based on the U-Net and U-Net++ architectures to automate the segmentation tasks of EndoCV2020. We use EfficientNet as our encoder to extract powerful features for our decoders. Data augmentation and pre-trained weights are employed to prevent overfitting and improve generalization. Test-time augmentation also helps in improving the results of our models. Our methods perform well in this challenge and achieve a score of 60.20% for the EAD2020 semantic segmentation task and 59.81% for the EDD2020 one.

Index Terms— Endoscopy · U-Net++ · EfficientNet · Test-time augmentation · Segmentation · Detection

Fig. 1. Some output examples of our U-Net++ with EfficientNet B1 and YOLOv3. The top row shows the ground truths (GT): (a), (b) EAD2020 detection and segmentation GT; (c), (d) EDD2020 detection and segmentation GT. The bottom row, (e)-(h), shows our results for the corresponding tasks.

1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of cancers, for therapeutic procedures, and for minimally invasive surgery in hollow organs. Computer-assisted methods would improve diagnosis and assist in surgical planning. These endoscopic applications face two main challenges: 1) Endoscopy video frames are usually heavily corrupted with multiple artifacts that reduce the visualization of the underlying tissue and affect post-analysis. Accurate detection of these artifacts allows correction of the frames and is, therefore, a core challenge in a wide range of endoscopic applications [1]. 2) Even with uncorrupted video frames, temporally consistent localization and segmentation of disease ROIs is a challenge due to non-planar geometries, variations in imaging modalities, and deformations of organs. For these reasons, after the success of EAD2019 [2], the endoscopy computer vision challenges on segmentation and detection 2020 (EndoCV2020) are held to push the state of the art further. EndoCV2020 consists of two sub-challenges: Endoscopy Artefact Detection and Segmentation (EAD2020) and Endoscopy Disease Detection and Segmentation (EDD2020) [3]. Each challenge is further divided into a detection task and a semantic segmentation task.

In this paper, we introduce our work on these challenges. First, Sec. 2 presents our observations on the datasets. Then, we describe our methods in Sec. 3. We participate in all tasks of both challenges but focus mainly on the two semantic segmentation ones, using fully convolutional networks such as U-Net [4] and U-Net++ [5] with EfficientNet [6] backbones (Sec. 3.1). We also did some experiments on the detection tasks using our segmentation results or the YOLOv3 model [7] (Sec. 3.2). Finally, we present our results on the test data in Sec. 4 and conclude in Sec. 5.

2. DATASETS

2.1. About the dataset

The EndoCV2020 challenge consists of two datasets. The EAD2020 dataset is an extended version of last year's EAD2019 challenge dataset [8], with annotations corrected and a new class added. The EAD2020 train data were released in four subsets. They consist of a total of 3005 images, among which 2531 are annotated with bounding boxes for eight classes (specularity, saturation, artifact, blur, contrast, bubbles, instrument, and blood), and 643 come with segmentation ground truth for five classes (instrument, specularity, artifact, bubbles, saturation, and blood). The EDD2020 train data contain 386 images with bounding boxes and segmentation ground truth for five disease classes: non-dysplastic Barrett's oesophagus (BE), suspicious area, high-grade dysplasia (HGD), adenocarcinoma (cancer), and polyp.

2.2. Data correction

The ground truths for these challenges contain some issues, which can be classified into annotator disagreements and systematic errors.

Some annotator disagreements were spotted in the EAD2020 detection dataset. For example, an artifact region is identified in one frame but not in the next, or one connected region is marked with two bounding boxes in one frame but with only one in the other. There are also misclassified segmentation masks. We consider this type of error as noise in the dataset, since we do not have the resources to make adjustments.

We corrected all the systematic errors that we noticed. There are some small anomalies in the bounding-box ground truths, likely due to rounding when converting from Pascal-VOC to YOLO format. There is also a line at the bottom of some EAD2020 segmentation ground truths that does not correspond to any artifact.

3. OUR APPROACHES

We focus mainly on the two segmentation tasks, with models from the U-Net family. For disease detection, we take advantage of our segmentation results. For the artifact detection task, we train a separate YOLOv3 [7] network, due to the difference in the number of classes (eight classes for detection and five classes for segmentation) and due to the disagreement between the EAD2020 segmentation and detection ground truths (e.g., Fig. 1(a) and Fig. 1(b)).

3.1. Segmentation of Diseases and Artifacts

3.1.1. Models

The state of the art in semantic segmentation consists of methods based on encoder-decoder architectures such as U-Net and U-Net++. These architectures use skip-connections to combine the low-resolution, semantically rich feature maps of the deeper layers with shallower, fine-grained ones, in order to recover the fine details of the regions of interest.

Instead of the original U-Net encoder, we use EfficientNet [6], which is designed to balance network depth, width, and resolution. This architecture achieves better accuracy on ImageNet [9] with fewer parameters and fewer FLOPS than other networks such as ResNet [10] or DenseNet [11]. EfficientNet is available in several versions, from B0 with 5.3 million parameters to B7 with 66 million. We extract five feature maps at different scales from EfficientNet as the input of our decoders. An illustration of the EfficientNet B1 encoder, showing where the intermediate feature maps are extracted, is given in Fig. 2(a).

We started our experiments with the U-Net architecture and the EfficientNet B5 encoder. Subsequent tests showed that a deeper encoder is not needed, so we moved to a larger decoder (U-Net++) with a smaller encoder (EfficientNet B2 and later B1). An illustration of our networks can be found in Fig. 2.

We speed up the training process with pre-trained weights. Although the encoder was pre-trained on ImageNet, which is a database of natural images, the pre-trained weights do improve the training and the local validation scores.
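The paper does not specify the framework used to implement these networks. As a rough illustration of the architecture described above (a pre-trained EfficientNet encoder providing multi-scale feature maps to a U-Net++ decoder), the following is a minimal sketch using the third-party segmentation_models_pytorch library; the library choice, class names, and hyper-parameters shown are our assumptions, not the authors' code.

```python
# Sketch only: one possible way to build a U-Net++ with a pre-trained
# EfficientNet-B1 encoder, using the `segmentation_models_pytorch` package.
import torch
import segmentation_models_pytorch as smp

NUM_CLASSES = 5  # e.g., the five EAD2020 segmentation classes

model = smp.UnetPlusPlus(
    encoder_name="efficientnet-b1",   # EfficientNet-B1 backbone (Sec. 3.1.1)
    encoder_weights="imagenet",       # ImageNet pre-trained weights
    encoder_depth=5,                  # five feature maps at different scales
    in_channels=3,
    classes=NUM_CLASSES,
)

# Forward pass on a 512x512 input, the resolution used for training (Sec. 3.1.2).
x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    logits = model(x)                 # shape: (1, NUM_CLASSES, 512, 512)
print(logits.shape)
```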
3.1.2. Data augmentation

Input images are resized (while keeping their aspect ratio) and zero-padded to fit into 512x512 pixels. We randomly apply the following augmentation techniques, each with 50% probability: rotation by a random angle, RGB value shift, horizontal or vertical flipping, random scaling, elastic deformation [12], and cropping.

3.1.3. The loss function

The semantic segmentation tasks are evaluated with four metrics: F1 (i.e., the Dice score), F2, precision, and recall. The semantic segmentation score is the average of these four metrics [3]. Let us recall that

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

weights recall β times as important as precision, so the final score places more emphasis on recall. As a result, we train our networks using the F2-loss, which is defined similarly to the Dice loss: F2-loss = 1 - F2. We also apply L2 regularization with a factor of 0.0001.
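The paper does not give the F2-loss as code; a minimal soft F-beta loss written directly from the formula above (with β = 2), assuming PyTorch tensors of per-class probabilities and binary masks, could look as follows. The smoothing constant and the per-class averaging are our own choices.

```python
import torch

def f_beta_loss(probs: torch.Tensor, target: torch.Tensor,
                beta: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Soft F-beta loss (1 - F_beta), computed per class and averaged.

    probs:  predicted probabilities in [0, 1], shape (N, C, H, W)
    target: binary ground-truth masks,        shape (N, C, H, W)
    """
    dims = (0, 2, 3)                      # sum over batch and spatial dims, keep classes
    tp = (probs * target).sum(dims)       # soft true positives per class
    precision = tp / (probs.sum(dims) + eps)
    recall = tp / (target.sum(dims) + eps)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)
    return 1.0 - f_beta.mean()            # F2-loss = 1 - F2 for beta = 2 (Sec. 3.1.3)

# Usage example: loss = f_beta_loss(torch.sigmoid(logits), masks)
```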
3.1.4. Training

We train on 80% of the train set and use the other 20% for validation. We start by freezing the pre-trained encoder and training the decoder part for 80 epochs with the Adam optimizer at a learning rate (LR) of 10^-3. From the 41st epoch, we train the whole network, starting at LR = 10^-3 and decreasing it by a factor of 0.5 if the validation score does not improve for 40 epochs. We trained for a total of 1000 epochs; training is stopped early if the training score does not increase for 88 epochs. We select the weights that maximize the validation score for the evaluation on the test set.

3.1.5. Prediction and Post-processing

Test images are resized and zero-padded to 512x512. We keep the aspect ratio and do not enlarge small images. Since our predictions sometimes contain small holes, which rarely appear in the EAD2020 train set, a small hole-filling operation is applied at the end of the pipeline. This hole-filling step did not bring a significant improvement in the final score.

Fig. 2. Illustration of our versions of U-Net and U-Net++. 2(a): the EfficientNet B1 encoder and the extracted intermediate feature maps (X_i). 2(b): the EfficientNet building block MBConv and the building block of our decoder, Dec. 2(c), 2(d): our versions of U-Net and U-Net++.

3.1.6. Test-time Augmentation

The segmentation results can be further improved with test-time augmentation (TTA). This approach has been demonstrated in the literature (e.g., in semantic segmentation of brain tumors [13]). In short, we make predictions on the test image and on several of its transformed versions, and then combine these results. We use five transformations: horizontal flipping, vertical flipping, and rotations by 90°, 180°, and 270°.
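The paper only states that the TTA predictions are "combined"; a common choice is to map each prediction back to the original orientation and average the probability maps, which is what the sketch below assumes (square inputs, PyTorch tensors).

```python
import torch

def predict_with_tta(model, image: torch.Tensor) -> torch.Tensor:
    """Average predictions over the identity, horizontal/vertical flips,
    and 90/180/270 degree rotations (the five transformations of Sec. 3.1.6).

    image: tensor of shape (1, 3, H, W) with H == W (e.g., 512x512).
    Averaging the inverse-transformed probability maps is our assumption.
    """
    # Each entry: (forward transform, inverse transform).
    transforms = [
        (lambda x: x,                        lambda y: y),                          # original image
        (lambda x: torch.flip(x, dims=[-1]), lambda y: torch.flip(y, dims=[-1])),   # horizontal flip
        (lambda x: torch.flip(x, dims=[-2]), lambda y: torch.flip(y, dims=[-2])),   # vertical flip
        (lambda x: torch.rot90(x, 1, dims=[-2, -1]),
         lambda y: torch.rot90(y, -1, dims=[-2, -1])),                              # 90° rotation
        (lambda x: torch.rot90(x, 2, dims=[-2, -1]),
         lambda y: torch.rot90(y, -2, dims=[-2, -1])),                              # 180° rotation
        (lambda x: torch.rot90(x, 3, dims=[-2, -1]),
         lambda y: torch.rot90(y, -3, dims=[-2, -1])),                              # 270° rotation
    ]
    with torch.no_grad():
        probs = [inv(torch.sigmoid(model(fwd(image)))) for fwd, inv in transforms]
    return torch.stack(probs).mean(dim=0)
```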
3.2. Detection of Diseases and Artifacts

3.2.1. Detection of Diseases (EDD2020)

For this task, we take advantage of our segmentation results. The bounding boxes of the connected components (CCs) that are larger than 0.5% of the image area are reported as our detection results. This approach is not optimal because it cannot handle the split-and-merge of detection bounding boxes, that is, the cases where a single CC is marked with more than one box, or a single box marks several CCs.
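A possible implementation of this box-extraction step, assuming the per-class prediction is a binary NumPy mask and using scipy.ndimage for connected-component labelling (the library choice is ours, not stated in the paper), is sketched below.

```python
import numpy as np
from scipy import ndimage

def boxes_from_mask(mask: np.ndarray, min_area_ratio: float = 0.005):
    """Bounding boxes (x_min, y_min, x_max, y_max) of connected components (CCs)
    covering more than `min_area_ratio` of the image area (0.5% in Sec. 3.2.1).

    mask: binary 2-D array for a single class.
    """
    labeled, num_cc = ndimage.label(mask)
    if num_cc == 0:
        return []
    areas = np.bincount(labeled.ravel())           # areas[i] = pixel count of CC i
    min_area = min_area_ratio * mask.size          # 0.5% of the image area
    boxes = []
    for cc_id, cc_slice in enumerate(ndimage.find_objects(labeled), start=1):
        if cc_slice is not None and areas[cc_id] > min_area:
            ys, xs = cc_slice
            boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```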
3.2.2. Detection of Artifacts (EAD2020)

For the reasons stated earlier, we have to train a separate detector for this task. We choose YOLOv3 because this model's effectiveness has been shown on this type of data [1]. It is also relatively faster than two-stage detectors such as RetinaNet [14]. We tried to address the class imbalance issue with the focal loss [14]; however, it did not improve our local validation mAP, similarly to the remark in [7].

We train our standard YOLOv3 on 416x416 inputs. We start by training the last three detection layers for 20 epochs at LR = 10^-2, then the upscaling part of the network for 40 epochs at LR = 10^-3. Finally, we train the whole network at LR = 10^-5 and reduce the LR by a factor of 0.5 if the validation loss does not decrease for five epochs.

4. EVALUATION

As presented in Sec. 3.1.3, the semantic segmentation tasks are evaluated by the mean of F1, F2, precision, and recall. The detection tasks are evaluated by a combination of mean average precision (mAP) and intersection over union (IoU). However, how these metrics are weighted was not disclosed until after the challenge ended.

The evaluation was done online at https://endocv.grand-challenge.org/. It is divided into two phases; the first test phase only evaluates 50% of the test data. The sizes of the EAD2020 and EDD2020 test data are summarized in Tab. 1. The EAD2020 detection task also includes an out-of-sample set, which contains images provided separately from the training data and the other test datasets.

Table 1. Number of images in the test data for each task.

              Detection   Segmentation   Out-of-sample detection
  EAD2020     317         162            99
  EDD2020     43          43             0

The quantitative results of our models are presented in Tab. 2 and Tab. 3. Our segmentation models performed well and consistently on both test subsets. In Tab. 2, we can observe that the hole-filling operator adds a small boost, while TTA improves the final score significantly. Our best approach is the U-Net++ with the B1 encoder, run with TTA and post-processed with hole filling. We can also see that our detection models need improvement.

Table 2. Quantitative results of our segmentation models on EAD2020 and EDD2020 test data. (4): TTA using the first four transformations mentioned in Sec. 3.1.6; (5): TTA using all five transformations of Sec. 3.1.6; (F): hole filling applied; U-Net++512: model with the same architecture as in Fig. 2(d) but with double the number of decoder filters, i.e., Dec(x, 2) becomes Dec(2x, 2); (-): information not available because that method was not submitted to that test phase.

Endoscopy Diseases and Artifacts Segmentation
                          --------- 50% of test set ---------   Full test set
  Model                   F1      F2      Precision  Recall  Score    Final score
  EAD2020
  U-Net++ B1 (5, F)       -       -       -          -       -        0.6020
  U-Net++ B1 (4, F)       0.5777  0.5629  0.6584     0.5661  0.5913   0.5989
  U-Net++ B1 (4)          0.5766  0.5624  0.6556     0.5659  0.5901   -
  U-Net++ B1              0.5305  0.5246  0.5963     0.5391  0.5476   -
  U-Net++ B2              0.5210  0.4921  0.6586     0.4872  0.5397   -
  U-Net B5                0.5126  0.5093  0.5767     0.5436  0.5355   -
  EDD2020
  U-Net++512 B1 (5, F)    -       -       -          -       -        0.5981
  U-Net++ B1 (4)          0.4956  0.5676  0.4280     0.6562  0.5369   0.5436

Table 3. Quantitative results of our detection models on EAD2020 and EDD2020 test data.

Endoscopy Diseases and Artifacts Detection
  Method                  Score              Out-of-sample score
  EAD2020
  YOLOv3                  0.1702 ± 0.0567    0.1130 ± 0.0752
  EDD2020
  U-Net++512 B1 (5, F)    0.1565 ± 0.0547    -
  U-Net++ B1 (4)          0.1487 ± 0.0578    -

5. CONCLUSION AND PERSPECTIVE

In this work, we demonstrate the effectiveness of U-Net and U-Net++ with a pre-trained EfficientNet backbone for the segmentation of diseases and artifacts in endoscopy images. Our experiments also show that TTA provides better segmentation results than predicting only on the original images. For future work, we intend to improve the generalization of our segmentation and detection models by applying more data augmentation techniques and by using synthetic data. We are also considering improving the decoder of our models further with an attention mechanism.

6. REFERENCES

[1] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[2] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Geroges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[3] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. CoRR, abs/2003.03376, February 2020.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.

[5] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 3–11, 2018.

[6] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.

[7] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[8] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[9] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pages 770–778. IEEE Computer Society, 2016.

[11] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[12] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), volume 1, pages 958–963. IEEE Computer Society, 2003.

[13] Guotai Wang, Wenqi Li, Sébastien Ourselin, and Tom Vercauteren. Automatic brain tumor segmentation using convolutional neural networks with test-time augmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, pages 61–72, Cham, 2019. Springer International Publishing.

[14] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, October 2017.