     DETECTION AND SEGMENTATION OF ENDOSCOPIC ARTEFACTS AND DISEASES
                       USING DEEP ARCHITECTURES

                                       Nhan T. Nguyen∗ , Dat Q. Tran∗ , Dung B. Nguyen

              Medical Imaging Department, Vingroup Big Data Institute (VinBDI), Hanoi, Vietnam
                        {v.nhannt64;v.dattq13;v.dungnb1}@vinbdi.org


                           ABSTRACT

We describe in this paper our deep learning-based approach for the EndoCV2020 challenge, which aims to detect and segment either artefacts or diseases in endoscopic images. For the detection task, we propose to train and optimize EfficientDet, a state-of-the-art detector, with different EfficientNet backbones using Focal loss. By ensembling multiple detectors, we obtain a mean average precision (mAP) of 0.2524 on EDD2020 and 0.2202 on EAD2020. For the segmentation task, two different architectures are proposed: U-Net with an EfficientNet-B3 encoder and Feature Pyramid Network (FPN) with a dilated ResNet-50 encoder. Each of them is trained with an auxiliary classification branch. Our model ensemble reports an sscore of 0.5972 on EAD2020 and 0.701 on EDD2020, which placed among the top submissions in both challenges.

                      1. INTRODUCTION

Disease detection and segmentation in endoscopic imaging play an important role in the early detection of numerous cancers, such as gastric, colorectal, and bladder cancers [1]. Meanwhile, the detection and segmentation of endoscopic artefacts are necessary for image reconstruction and quality assessment [2]. Many approaches [3, 4, 5] have been proposed to detect and segment artefacts and diseases in endoscopy. This paper describes our solution for the EndoCV2020 challenge, which consists of two tracks¹: one deals with artefacts (EAD2020) and the other with diseases (EDD2020). Each track is divided into two tasks: detection and segmentation. We tackle both tasks in both tracks by exploiting state-of-the-art deep architectures such as EfficientDet [6] and U-Net [7] with variants of EfficientNet [8] and ResNet [9] as backbones. In the next sections, we provide a short description of the datasets, the details of the proposed approach, and the experimental results.

∗ Equal contribution.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://endocv.grand-challenge.org

                        2. DATASETS

EDD2020 [1] is a comprehensive dataset established to benchmark algorithms for disease detection and segmentation in endoscopy. It is annotated for 5 disease classes: BE, Suspicious, HGD, Cancer, and Polyp. The dataset comes with bounding boxes for disease detection and with masked image annotations for semantic segmentation. The training set includes a total of 386 endoscopy frames, each of which is annotated with one or multiple diseases. Regions of the same class are merged into a single mask, while a bounding box carrying multiple classes is treated as separate boxes at the same location. Figure 1 shows the number of bounding boxes for each disease class. EAD2020 [10, 11], on the other hand, is used for the endoscopy artefact detection and segmentation track. Its training set contains 2,531 annotated frames for 8 artefact classes: specularity, bubbles, saturation, contrast, blood, instrument, blur, and imaging artefacts. Note that only the first 5 classes are used for the segmentation task.

Fig. 1. The number of bounding boxes for each disease class in the training set of EDD2020.

                   3. PROPOSED METHODS

3.1. Multi-class detection task

Detection network: For the detection task, we deployed EfficientDet [6], currently a state-of-the-art architecture for object detection. It employs EfficientNet [8] as the backbone network, BiFPN as the feature network, and a shared class/box prediction network. Both the BiFPN layers and the class/box prediction layers are repeated multiple times depending on the resource constraints. Figure 3 illustrates the EfficientDet architecture.
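For illustration only, one way to instantiate this detector family in PyTorch is the effdet package (rwightman/efficientdet-pytorch); whether the authors used this library is not stated, and the exact signature of create_model may differ between versions:

    # Purely illustrative sketch; the effdet package wraps
    # EfficientDet-D0..D7 with a training bench.
    from effdet import create_model

    # 5 disease classes for EDD2020; 'tf_efficientdet_d1' ... 'd5'
    # give the larger members of the ensemble.
    model = create_model(
        'tf_efficientdet_d0',
        bench_task='train',   # wrap the network with its training loss
        num_classes=5,
        pretrained=True,      # start from COCO-pretrained weights
    )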
Training procedure: Due to the limited training data available (386 images in EDD2020 and 2,531 in EAD2020), we use various data augmentation techniques, including random shift, random crop, rotation, scaling, horizontal flip, vertical flip, blur, Gaussian noise, sharpen, emboss, and contrast. In particular, we found that mixup significantly reduces overfitting. Given two input images x1 and x2, the mixup image x̃ is constructed as

    x̃ = λx1 + (1 − λ)x2,

and fed to the network to produce the prediction ŷ. During training, our goal is to minimize the mixup loss

    Lmixup = λL(ŷ, y1) + (1 − λ)L(ŷ, y2),        (1)

where L denotes the Focal loss [12], λ is drawn from a Beta(0.75, 0.75) distribution, and y1 and y2 are the ground-truth labels of x1 and x2. Fig. 4 visualizes a mixup example with λ fixed to 0.5.

Fig. 4. Mixup visualization with λ = 0.5.
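A minimal PyTorch sketch of this mixup step is given below. It is written for a classification-style target for brevity (for detection, the box annotations of both images enter the loss analogously); model, focal_loss, and the batch tensors are placeholders, not the authors' released code:

    import torch

    def mixup_step(model, focal_loss, images, targets, alpha=0.75):
        """One training step of Eq. (1): blend two images, mix their losses."""
        # lambda ~ Beta(0.75, 0.75), as in the paper
        lam = torch.distributions.Beta(alpha, alpha).sample().item()

        # pair each image in the batch with a randomly chosen partner
        perm = torch.randperm(images.size(0))
        mixed = lam * images + (1.0 - lam) * images[perm]

        preds = model(mixed)
        # L_mixup = lambda * L(y_hat, y1) + (1 - lambda) * L(y_hat, y2)
        return lam * focal_loss(preds, targets) + \
               (1.0 - lam) * focal_loss(preds, targets[perm])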

Our detectors are optimized by gradient descent using the Adam update rule [13] with weight decay. In addition, a cyclical learning rate [14] with restarts is used. An ensemble of 6 models with different backbones (D0, D1, D2, D3, D4, and D5), combined using weighted box fusion [15], serves as our final model. Additionally, we search for the non-maximum suppression (NMS) threshold and the confidence threshold per category so that the resulting score (0.5 × mAP + 0.5 × IoU) is maximized.
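The fusion step can be sketched with the ensemble_boxes package (ZFTurbo/Weighted-Boxes-Fusion), which implements weighted box fusion [15]. The thresholds below are illustrative defaults, not the per-category values tuned in the paper:

    from ensemble_boxes import weighted_boxes_fusion

    def fuse_detections(boxes_list, scores_list, labels_list):
        """Fuse predictions of the 6 detectors (D0-D5) for one image.

        Each *_list has one entry per model; box coordinates are
        normalized to [0, 1].
        """
        boxes, scores, labels = weighted_boxes_fusion(
            boxes_list, scores_list, labels_list,
            weights=None,        # equal weight for every detector
            iou_thr=0.55,        # boxes closer than this are fused
            skip_box_thr=0.05,   # drop very low-confidence boxes first
        )
        return boxes, scores, labels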
3.2. Multi-class segmentation task

Segmentation network: We propose two different architectures for this task: U-Net with EfficientNet encoders and BiFPN with ResNet encoders.

U-Net: Our first network design makes use of U-Net with EfficientNet-B3/B4 as the backbone. We keep the original strides between blocks in EfficientNet and extract the feature maps from the last 5 blocks for segmentation. A classification branch is used to provide the label predictions. The overall framework is depicted in Figure 2.

Fig. 2. The U-Net architecture with an EfficientNet-B3/B4 encoder and a classification branch.
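One way to realize this design is the segmentation_models_pytorch package, whose aux_params option attaches exactly this kind of image-level classification head; whether the authors used this library is not stated, so treat the sketch as an assumption:

    import torch
    import segmentation_models_pytorch as smp

    # aux_params adds a classification head on top of the encoder,
    # mirroring the auxiliary branch in Figure 2.
    model = smp.Unet(
        encoder_name='efficientnet-b3',      # or 'efficientnet-b4'
        encoder_weights='imagenet',
        classes=5,                           # one mask channel per class
        activation=None,                     # sigmoid applied in the loss
        aux_params={'classes': 5, 'dropout': 0.2},
    )

    masks, labels = model(torch.randn(1, 3, 512, 512))  # two outputs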
BiFPN: To generate the segmentation output from the BiFPN features, we combine all levels of the BiFPN pyramid following the design illustrated in Figure 5. Starting with the deepest BiFPN level (the stride-32 output), we apply three upsampling stages to obtain a feature map at stride 4. Each upsampling stage consists of a 3×3 convolution, BatchNorm, ReLU, and 2× bilinear upsampling. This strategy is repeated for the other BiFPN levels with strides of 16, 8, and 4. The result is a set of feature maps at the same scale, which are then concatenated channel-wise. Finally, a 1×1 convolution, 4× bilinear upsampling, and a sigmoid activation generate the mask at the image resolution.

Fig. 5. The BiFPN decoder for semantic segmentation.
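A self-contained PyTorch sketch of this decoder follows; it assumes four BiFPN maps with equal channel width ch, and the module and argument names are ours:

    import torch
    import torch.nn as nn

    def upsampling_stage(ch):
        # 3x3 conv -> BatchNorm -> ReLU -> 2x bilinear upsampling
        return nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        )

    class BiFPNDecoder(nn.Module):
        def __init__(self, ch=64, num_classes=5):
            super().__init__()
            # stride 32 needs 3 stages to reach stride 4; stride 16
            # needs 2; stride 8 needs 1; stride 4 needs none (identity).
            self.paths = nn.ModuleList([
                nn.Sequential(*[upsampling_stage(ch) for _ in range(n)])
                for n in (3, 2, 1, 0)
            ])
            self.head = nn.Sequential(
                nn.Conv2d(4 * ch, num_classes, 1),   # 1x1 convolution
                nn.Upsample(scale_factor=4, mode='bilinear',
                            align_corners=False),    # back to image size
                nn.Sigmoid(),
            )

        def forward(self, feats):  # feats: [stride 32, 16, 8, 4]
            merged = torch.cat([p(f) for p, f in zip(self.paths, feats)],
                               dim=1)                # channel-wise concat
            return self.head(merged)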
Training procedure: All models are trained end-to-end with additional supervision from the multi-label classification task. The image labels are obtained directly from the segmentation masks; for example, if an image has a BE mask annotation, then its BE label is 1. Due to the class imbalance in the training dataset, we use Focal loss for the classification task. The final loss is L = Lseg + λ × Lcls, where λ = 0.4.
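A sketch of this combined objective is shown below. The paper does not pin down Lseg, so the Dice loss for masks and the binary focal variant for labels are our assumptions; only the combination L = Lseg + 0.4 × Lcls comes from the text:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        # binary focal loss over multi-label logits
        bce = F.binary_cross_entropy_with_logits(logits, targets,
                                                 reduction='none')
        p_t = torch.exp(-bce)                 # prob. of the true class
        return ((1 - p_t) ** gamma * bce).mean()

    def dice_loss(logits, masks, eps=1.0):
        probs = torch.sigmoid(logits)
        inter = (probs * masks).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + masks.sum(dim=(2, 3))
        return (1 - (2 * inter + eps) / (union + eps)).mean()

    def total_loss(mask_logits, label_logits, masks, labels, lam=0.4):
        # L = L_seg + lambda * L_cls, with lambda = 0.4
        return dice_loss(mask_logits, masks) + \
               lam * focal_loss(label_logits, labels)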
Inference: Relying solely on the segmentation branch to predict masks results in many false positives. Hence, we use the class predictions to remove masks. We search for the classification thresholds that maximize the macro F1 score on the validation set; for every image, if a class probability is below its optimal threshold, the corresponding predicted mask is removed entirely.
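The filtering step can be sketched as follows; since macro F1 averages per-class F1 scores, each threshold can be tuned independently. Array shapes and the search grid are illustrative:

    import numpy as np
    from sklearn.metrics import f1_score

    def search_thresholds(val_probs, val_labels,
                          grid=np.linspace(0.05, 0.95, 19)):
        """val_probs, val_labels: (num_images, num_classes) arrays."""
        best = np.full(val_probs.shape[1], 0.5)
        for c in range(val_probs.shape[1]):
            scores = [f1_score(val_labels[:, c], val_probs[:, c] >= t)
                      for t in grid]
            best[c] = grid[int(np.argmax(scores))]
        return best

    def filter_masks(masks, class_probs, thresholds):
        """masks: (num_classes, H, W); zero out low-confidence classes."""
        keep = (class_probs >= thresholds).astype(masks.dtype)
        return masks * keep[:, None, None]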
Fig. 3. The EfficientDet architecture. The class prediction network was modified to provide the probabilities of the 5 disease classes. The figure is reproduced from Tan et al. [6].


                 4. EXPERIMENTAL RESULTS

Table 1 summarizes the detection and segmentation results of our submissions for both challenges. We describe the results of each sub-task below.

    Challenge    dscore    dstd      sscore    sstd
    EAD2020      0.2202    0.1029    0.5972    0.2765
    EDD2020      0.2524    0.0948    0.7008    0.3211

Table 1. Detection and segmentation scores on the EndoCV2020 test set.

Results on the validation set of EDD2020 for the detection task are detailed in Table 2. Our best single model (EfficientDet-D5) obtained a detection score (dScore) of 0.41. The best detection performance was provided by the ensemble model, which reported a dScore of 0.44, a mean mAP of 0.36±0.05, and an IoU of 0.52. As shown in Table 1, our ensemble model yielded dScores of 0.2524±0.0948 and 0.2202±0.1029 on the hidden test sets of EDD2020 and EAD2020, respectively.

    Method                            dScore    mAP          IoU
    ED0 [6]                           0.23      0.13±0.04    0.33
    ED0, Augs                         0.34      0.26±0.07    0.42
    ED0, Augs, Mixup, CLR [16]        0.40      0.30±0.05    0.51
    ED5, Augs, Mixup, CLR [16]        0.41      0.29±0.05    0.54
    Ensemble (ED0-ED5), WBF [15]      0.44      0.36±0.05    0.52

Table 2. Experimental results on the EDD2020 validation set.

Results on the validation sets for the segmentation task are provided in Table 3 and Table 4. On the EDD2020 validation set, our best single model achieved a Dice score of 0.854 and an IoU of 0.832. On the EAD2020 validation set, we obtained a Dice score of 0.732 and an IoU of 0.578. As shown in Table 1, our ensemble achieved a segmentation score (sscore) of 0.5972 in the EAD2020 challenge and an sscore of 0.7008 in the EDD2020 challenge, both among the top results for the segmentation task of both tracks.

    Method                       Dice             IoU
    UNet-EfficientNetB4 [7, 8]   0.8522±0.0221    0.8279±0.0213
    BiFPN-ResNet50               0.8544±0.0232    0.8317±0.0228

Table 3. 5-fold cross-validation results on EDD2020.

    Method                       Dice             IoU
    UNet-EfficientNetB4          0.7131±0.0379    0.555±0.0451
    BiFPN-ResNet50               0.7325±0.0162    0.578±0.0201

Table 4. 3-fold cross-validation results on EAD2020.
                      5. CONCLUSION

We have described our solutions for the detection and segmentation tasks on both tracks of EndoCV2020: EAD for artefacts and EDD for diseases. By using EfficientDet for detection and U-Net/BiFPN for segmentation, we obtained competitive results on both datasets, especially for the segmentation task. These results suggest that deep architectures that are effective for natural images can also be useful for medical images such as endoscopic ones, even with small training datasets.
                                                                     2019.
                      6. REFERENCES

 [1] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. CoRR, abs/2003.03376, February 2020.

 [2] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

 [3] P. S. Hiremath, B. V. Dhandra, Iranna Humnabad, Ravindra Hegadi, and G. G. Rajput. Detection of esophageal cancer (necrosis) in endoscopic images using color image segmentation. In Proceedings of the Second National Conference on Document Analysis and Recognition (NCDAR-2003), Mandya, India, pages 417–422, 2003.

 [4] Piotr Szczypiński, Artur Klepaczko, Marek Pazurek, and Piotr Daniel. Texture and color based image segmentation and pathology detection in capsule endoscopy videos. Computer Methods and Programs in Biomedicine, 113(1):396–411, 2014.

 [5] Eva Tuba, Milan Tuba, and Raka Jovanovic. An algorithm for automated segmentation for bleeding detection in endoscopic images. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4579–4586. IEEE, 2017.

 [6] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.

 [7] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), pages 234–241. Springer International Publishing, 2015.

 [8] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

 [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE CVPR, pages 770–778, 2016.

[10] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[11] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Leslie N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.

[15] Roman Solovyev and Weimin Wang. Weighted boxes fusion: Ensembling boxes for object detection models. arXiv preprint arXiv:1910.13302, 2019.

[16] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.