DEEP LAYER AGGREGATION APPROACHES FOR REGION SEGMENTATION OF ENDOSCOPIC IMAGES

Qingtian Ning, Xu Zhao, Jingyi Wang

Department of Automation, Shanghai Jiao Tong University

ABSTRACT

This paper presents our approaches for the EAD2019 challenge. For multi-class region segmentation (task 2), we employ a deep layer aggregation network, which achieves the best results in our experiments compared to U-net. For the completeness of the challenge, we use the Cascade R-CNN framework for multi-class artefact detection (task 1) and multi-class artefact generalisation (task 3).

1. INTRODUCTION

In this paper, we introduce our methods and results for the Endoscopic Artefact Detection challenge (EAD2019) [1, 2] in detail. The challenge consists of three tasks: artefact detection (task 1), region segmentation (task 2) and generalisation (task 3). Task 1 aims at localising bounding boxes and assigning class labels for seven artefact classes in the given frames. Task 2 requires the algorithm to obtain a precise boundary delineation of the detected artefacts. Task 3 aims to verify that the detection performance is independent of the specific data type and source.

2. DETAILS ON OUR METHOD

2.1. Detection and generalisation tasks

2.1.1. Cascade R-CNN

In object detection, an intersection over union (IoU) threshold is needed to define positives and negatives. An object detector usually generates noisy detections if it is trained with a low IoU threshold, e.g. 0.5, but detection performance tends to degrade as the IoU threshold increases [3]. Cascade R-CNN was proposed to address two problems: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) an inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses [3]. It consists of a sequence of detectors trained with increasing IoU thresholds, so that they are sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of one detector is a good distribution for training the next, higher-quality detector [3]. We therefore simply apply the Cascade R-CNN framework and use an L1 loss to optimise the network.

2.2. Region segmentation

2.2.1. Deep Layer Aggregation

Visual recognition requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse [4]. Even with the depth of features in a convolutional network, a layer in isolation is not enough: compounding and aggregating these representations improves inference of what and where [4]. Deep layer aggregation (DLA) structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters [4]. For region segmentation, we use the provided DLA-60 model. In addition, we apply post-processing, such as a fully connected conditional random field [5], to refine the segmentation results. One issue is that, in the ground-truth labels, a single pixel may correspond to multiple categories. To avoid this, we apply a simple procedure that assigns each pixel to exactly one class. To overcome the class imbalance problem, we propose a weighted multi-class dice loss as the segmentation loss:

$$\mathcal{L}_{\mathrm{Dice}} = 1 - 2 \sum_{c=1}^{C} \frac{w^{c}\, \hat{Y}_{n}^{c}\, Y_{n}^{c}}{w^{c}\, (\hat{Y}_{n}^{c} + Y_{n}^{c})}, \qquad (1)$$

where $\hat{Y}_{n}^{c}$ denotes the predicted probability of pixel $n$ belonging to class $c$ (i.e. background, instrument, specularity, artifact, bubbles, saturation), $Y_{n}^{c}$ denotes the ground-truth probability, and $w^{c}$ denotes a class-dependent weighting factor. Empirically, we set the weights to 1 for background, 1.5 for instrument, 2.5 for specularity, 2 for artifact, 2.5 for bubbles and 2 for saturation.
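As a concrete illustration, the following is a minimal PyTorch sketch of a weighted multi-class dice loss in the spirit of Eq. (1). Note that in Eq. (1) as printed the weights cancel within each term, so the sketch implements the commonly used generalised-dice reading, where the weights scale each class's contribution to both sums; the tensor shapes, the `WeightedDiceLoss` name and the smoothing term `eps` are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class WeightedDiceLoss(nn.Module):
    """Sketch of a weighted multi-class dice loss in the spirit of Eq. (1).

    Expects `probs` of shape (N, C, H, W) with softmax already applied,
    and `target` as a one-hot tensor of the same shape.
    """

    def __init__(self, weights, eps=1e-6):
        super().__init__()
        # Class weights, e.g. [1.0, 1.5, 2.5, 2.0, 2.5, 2.0] for background,
        # instrument, specularity, artifact, bubbles and saturation.
        self.register_buffer("weights",
                             torch.as_tensor(weights, dtype=torch.float32))
        self.eps = eps  # small constant for numerical stability (our addition)

    def forward(self, probs, target):
        dims = (0, 2, 3)  # sum over batch and spatial dimensions
        intersection = (probs * target).sum(dim=dims)  # per-class, shape (C,)
        union = (probs + target).sum(dim=dims)         # per-class, shape (C,)
        w = self.weights
        # Generalised weighted dice over all classes.
        dice = 2.0 * (w * intersection).sum() / ((w * union).sum() + self.eps)
        return 1.0 - dice

# Usage: criterion = WeightedDiceLoss([1.0, 1.5, 2.5, 2.0, 2.5, 2.0])
#        loss = criterion(probs, one_hot_target)
```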
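For the CRF post-processing, the fully connected CRF of [5] is commonly applied via the `pydensecrf` package; a sketch under that assumption is shown below. The kernel parameters and the number of inference steps are illustrative defaults, not settings reported in the paper. The final argmax also yields the one-class-per-pixel assignment discussed above.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_steps=5):
    """Refine softmax outputs with a fully connected CRF [5].

    image: (H, W, 3) uint8 RGB frame.
    probs: (C, H, W) float32 softmax probabilities from the network.
    Returns an (H, W) label map with one class per pixel.
    """
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # Smoothness kernel (spatial) and appearance kernel (spatial + colour);
    # the sxy/srgb/compat values below are illustrative, not the paper's.
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_steps)
    return np.argmax(np.array(q).reshape(c, h, w), axis=0)
```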
3. EXPERIMENTS

3.1. Detection and generalisation tasks

For the detection and generalisation tasks, the experiments are built with the Caffe framework on a single NVIDIA TITAN X GPU. We use the Adam optimizer with a learning rate of $6.25 \times 10^{-4}$ and a weight decay of 0.0001 for 250,000 iterations with batch size 1. Table 1 shows the evaluation results on EAD2019 detection and Table 2 shows the score gap for the generalisation task. What surprises us is that our detection algorithm generalises well.

Table 1. Results on EAD2019 detection and generalisation.

Method          Score_d   IoU_d    mAP_d
Cascade R-CNN   0.2330    0.1222   0.3068

Table 2. Score gap for the generalisation task.

Method          dev_g    mAP_g
Cascade R-CNN   0.0515   0.3154

3.2. Region segmentation

For region segmentation, the experiments are built with the PyTorch framework on two NVIDIA 1080 Ti GPUs. We use the SGD optimizer with a weight decay of 0.0001, adopt the poly learning-rate schedule $\left(1 - \frac{epoch - 1}{total\_epoch}\right)$ with momentum 0.9, and train the model for 200 epochs with batch size 64. The initial learning rate is 0.01 and the crop size is 256. Table 3 shows the evaluation results on EAD2019 region segmentation, and Table 4 compares U-net [6] and DLA on our validation set.

Table 3. Results on EAD2019 region segmentation.

Methods        Score    Overlap   F2-score
DLA-60 (CRF)   0.5320   0.5206    0.5661
DLA-60         0.4460   0.4352    0.4784

Table 4. Results on our validation set.

Methods   Dice    Jaccard
DLA-60    0.517   0.480
U-net     0.339   0.296
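For reference, the poly learning-rate schedule described in Section 3.2 maps directly onto PyTorch's `LambdaLR`; a minimal sketch with the stated hyper-parameters follows. The `model` below is a stand-in for the DLA-60 network, and `LambdaLR`'s 0-indexed epoch counter plays the role of $epoch - 1$ in the schedule.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

total_epochs = 200  # training length from Section 3.2
model = torch.nn.Conv2d(3, 6, 1)  # stand-in for the DLA-60 network

# SGD with the paper's initial learning rate, momentum and weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)

# Poly decay: lr(epoch) = 0.01 * (1 - (epoch - 1) / total_epochs) with
# 1-indexed epochs, i.e. a multiplicative factor of 1 - epoch / total_epochs
# for LambdaLR's 0-indexed epoch counter.
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda epoch: 1.0 - epoch / total_epochs)

for epoch in range(total_epochs):
    # ... one training epoch with batch size 64 and 256x256 crops ...
    scheduler.step()
```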
4. CONCLUSION

Overall, EAD2019 is a very meaningful challenge, and we gained a lot in the process of completing it. We finally ranked 20th, 11th and 3rd for detection, segmentation and generalisation, respectively. The final result exceeded our expectations, which is considerably delightful. Of course, our method still has shortcomings. For example, for the segmentation task we assign each pixel to only one class, which leads to some holes in the results. In addition, we could also employ data augmentation, among other improvements. All in all, there is still much to improve.

5. REFERENCES

[1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[3] Zhaowei Cai and Nuno Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.

[4] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell, "Deep layer aggregation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2403–2412.

[5] Philipp Krähenbühl and Vladlen Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems, 2011, pp. 109–117.

[6] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.