ENDOSCOPIC ARTEFACT DETECTION AND SEGMENTATION WITH DEEP CONVOLUTIONAL NEURAL NETWORK

Suhui Yang, Guanju Cheng
Ping An Technology (Shenzhen) Co. Ltd., Shenzhen, China

ABSTRACT

The Endoscopic Artefact Detection challenge (EAD2019 [8, 9]) comprises three tasks: (1) multi-class artefact detection: localization of bounding boxes and class labels for seven artefact classes in given frames (specularity, saturation, artefact, blur, contrast, bubbles and instrument); (2) region segmentation: precise boundary delineation of detected artefacts (instrument, specularity, artefact, bubbles and saturation); (3) detection generalization: detection performance independent of specific data type and source. We participated in all three tasks of EAD2019, and this manuscript summarizes our deep-learning-based solution for each task. In short, for task 1 we apply an improved Cascade R-CNN [1] combined with a feature pyramid network (FPN) [2] for multi-class artefact detection; for task 2 we apply a Deeplab v3+ [3] architecture with different backbones (ResNet101 [4] and MobileNet [5]) to segment multi-class artefact regions; for task 3 we use Cycle-GAN [6] to perform image translation between the training and testing datasets, improving the generalization of the multi-class artefact detector. In addition, we apply unsupervised t-SNE [7] to visualize the data distribution and perform targeted data reduction and augmentation before training the detection and segmentation models; finally, model fusion and post-processing strategies are used to obtain the final results.

Index Terms— Endoscopic artefact detection challenge, t-SNE, Cascade R-CNN, generalization

1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers (e.g., nasopharyngeal, oesophageal adenocarcinoma, gastric, colorectal and bladder cancers), for therapeutic procedures and for minimally invasive surgery (e.g., laparoscopy). The EAD2019 challenge [8, 9] aims to address the following key problems inherent in all video endoscopy. 1) Multi-class artefact detection: existing endoscopy workflows detect only one artefact class, which is insufficient to obtain high-quality frame restoration. In general, the same video frame can be corrupted by multiple artefacts; for example, motion blur, specular reflections and low contrast can be present in the same frame. Further, not all artefact types contaminate the frame equally. So, unless the multiple artefacts present in a frame are known together with their precise spatial locations, clinically relevant frame restoration quality cannot be guaranteed. Another advantage of such detection is that frame quality assessment can be guided to minimise the number of frames discarded during automated video analysis. 2) Multi-class artefact region segmentation: frame artefacts typically have irregular, non-rectangular shapes and are consequently over-estimated by the detected bounding boxes. Accurate semantic segmentation methods that precisely delineate the boundaries of each detected frame artefact will enable optimized restoration of video frames without sacrificing information. 3) Multi-class artefact generalisation: it is important for algorithms to avoid biases induced by specific training data sets. It is also well known that expert annotation is time consuming and can be infeasible for many institutions. The challenge therefore encourages participants to develop machine learning algorithms that can be used across different endoscopic datasets worldwide, based on a large combined dataset from six different institutions.

2. MATERIALS AND METHODS

2.1. Task 1: Multi-class artefact detection

EAD2019 [8, 9] provides two batches of training data for multi-class artefact detection: the first batch contains 886 endoscopic images labeled with 9352 bounding boxes, and the second batch contains 1306 endoscopic images labeled with 8466 bounding boxes.
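For reference, per-class bounding-box counts such as those reported in Table 1 below can be tallied directly from the annotation files. The following is a minimal, hedged sketch that assumes COCO-style JSON annotations purely for illustration; the actual EAD2019 annotation format may differ.

```python
# Hypothetical sketch: tally per-class bounding-box counts for a training batch.
# Assumes COCO-style JSON annotations for illustration only; the real EAD2019
# annotation format may be different.
import json
from collections import Counter

def class_counts(annotation_file):
    with open(annotation_file) as f:
        coco = json.load(f)
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    total = sum(counts.values())
    # Return absolute counts and rounded percentages per class.
    return {name: (n, round(100.0 * n / total)) for name, n in counts.items()}

# Example call (hypothetical file name):
# print(class_counts("ead2019_batch1.json"))
```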
After checking the training data, we noticed two main difficulties in this task. One is the unbalanced class distribution; the other is the wide variation in image size and in the size/aspect ratio of the detection targets. As shown in Table 1, there are 4074 specularity boxes but only 327 blur boxes in training data 1, and 3487 artefact boxes but only 46 instrument boxes in training data 2.

Table 1. Statistics of the two batches of training data.

Class       | First batch (886 images) | Second batch (1306 images)
specularity | 4074 (44%)               | 1761 (21%)
saturation  | 511 (5%)                 | 611 (7%)
artefact    | 1609 (17%)               | 3487 (41%)
blur        | 327 (4%)                 | 348 (4%)
contrast    | 686 (7%)                 | 872 (10%)
bubbles     | 1738 (19%)               | 1341 (16%)
instrument  | 407 (4%)                 | 46 (1%)
total       | 9352                     | 8466

Based on this observation, we propose an improved Cascade R-CNN [1] as our detection model (Figure 1). Compared with the original Cascade R-CNN, we add an FPN [2] module during feature extraction. As shown in Figure 1, there are two main sub-modules: multi-scale feature representation and multi-stage object detection with cascade structures.

Fig. 1. The flowchart of our detection network, an improved Cascade R-CNN with an added FPN module.

The multi-scale feature representation module consists of a bottom-up pathway and a top-down pathway [2]. Using ResNet101 [4] as the backbone, the input image is processed through the bottom-up pathway with a series of residual blocks. We denote the feature activation outputs of the last residual blocks as {C2, C3, C4, C5}; the output of the first residual block is not included due to memory constraints. In the top-down pathway, each feature map is constructed by merging the corresponding bottom-up map with the map upsampled by a factor of 2 from the coarser-resolution level. The final feature maps are denoted {P2, P3, P4, P5}, with spatial sizes corresponding to {C2, C3, C4, C5}. With the FPN we obtain multi-scale feature representations, which improve the detection of small objects (e.g. specularity) by combining low-level features with high-level semantic information.
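To make the top-down merging just described concrete, the sketch below shows a minimal PyTorch FPN neck. The channel widths, the nearest-neighbour upsampling and the 3*3 smoothing convolutions are illustrative choices and need not match our exact implementation.

```python
# Minimal sketch of the FPN top-down pathway: 1x1 lateral convs reduce {C2..C5}
# to a common width, each coarser map is upsampled by 2x and added to the finer
# one, and 3x3 convs produce {P2..P5}. Sizes are illustrative assumptions.
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        laterals = [conv(f) for conv, f in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down pathway: upsample the coarser map and add it to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # 3x3 smoothing convs give the final pyramid maps {P2..P5}.
        return [conv(x) for conv, x in zip(self.smooth, laterals)]
```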
In general, IoU and mAP are a pair of mutually conflicting indices; for example, an object detector tuned for higher IoU typically produces noisier detections and thus lower mAP. We therefore apply three cascaded stages of detection heads (R-CNN) to improve performance [1]. This structure prevents mAP from dropping sharply when a high IoU is required between the predicted and ground-truth boxes.

Besides the network architecture, we also pay attention to the data distribution. We apply an unsupervised nonlinear dimensionality reduction method, t-SNE [7], to visualize all of the data, including training data 1, training data 2, and the validation data for detection and generalization (Figure 2). We find that the two training datasets and the validation data follow different distributions. We therefore delete some outliers and consecutive frames from training data 2, and then perform data augmentation for categories with fewer samples (e.g. saturation and blur), as illustrated in Figure 3. Similar operations are also carried out for task 2 and task 3.

Fig. 2. Data visualization with t-SNE for training data 1, training data 2, and the validation data for detection (task 1) and generalization (task 3).

Fig. 3. According to the distribution of the testing data for the detection and generalization tasks, the training data within the shaded area were selected to feed the model, ignoring noisy outliers that contribute little to model training.

2.2. Task 2: Region segmentation

We select the Deeplab v3+ [3] network for multi-class artefact segmentation, with different backbones (ResNet101 [4] and MobileNet [5]). After the backbone network, we add five parallel convolution layers as feature extraction layers: one 1*1 convolutional layer, three 3*3 dilated convolutional layers with dilation rates of 6, 12 and 18, and one global pooling layer. These feature maps are then merged and upsampled to produce the region segmentation.

2.3. Task 3: Detection generalization

For detection generalization, we translate the training data into the style of the validation data with Cycle-GAN [6]. We replace the deconvolution in the generator with linear interpolation followed by a 1*1 convolution to improve the quality of the style transfer. We then retrain the detection model on the translated training data and test its generalization performance.
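To make the data-selection step of Section 2.1 concrete, the sketch below embeds per-image feature vectors with t-SNE and plots them by dataset split, in the spirit of Figure 2. The choice of features (e.g. globally pooled CNN activations) is an assumption for illustration; the exact features behind our plots are an implementation detail not specified above.

```python
# Hedged sketch of the t-SNE inspection used for data selection (Section 2.1).
# `features_by_split` maps a split name (e.g. "train1") to an (N_i, D) array of
# per-image feature vectors; the feature choice is illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features_by_split):
    X = np.concatenate(list(features_by_split.values()), axis=0)
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(X)
    start = 0
    for name, feats in features_by_split.items():
        end = start + len(feats)
        plt.scatter(emb[start:end, 0], emb[start:end, 1], s=4, label=name)
        start = end
    plt.legend()
    plt.show()
```

Outliers and redundant consecutive frames can then be identified visually (or by distance to the validation cluster) before trimming the training set.

The five parallel feature-extraction branches described in Section 2.2 correspond to a Deeplab v3+-style ASPP head; a minimal sketch follows, with illustrative channel widths.

```python
# Sketch of the five parallel branches of Section 2.2: one 1x1 conv, three 3x3
# dilated convs with rates 6/12/18, and a global-pooling branch, concatenated
# and projected. Channel widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.pool_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        # Global pooling branch, upsampled back to the input resolution.
        pooled = F.interpolate(self.pool_branch(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```

Finally, the generator change described in Section 2.3 (replacing the transposed convolution with interpolation plus a 1*1 convolution) might look like the sketch below; the normalization and activation choices are assumptions for illustration.

```python
# Hedged sketch of the upsampling block used in place of the Cycle-GAN
# generator's transposed convolution: bilinear interpolation + 1x1 conv.
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# For comparison, a typical transposed-convolution block it would replace:
# nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
#                    padding=1, output_padding=1)
```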
3. EXPERIMENTS AND RESULTS

In our experiments we evaluated the method for each task in detail. For task 1 we also compare our method against a Faster R-CNN [10] baseline.

3.1. Task 1: Multi-class artefact detection

We use SGD to optimize the improved Cascade R-CNN. The learning rate is 0.005 with staged decay, a total of 30 epochs are trained, the batch size is 2, and all images are resized to 1333*800. Table 2 compares the results of different detection models, and Table 3 compares the results obtained with different training-data strategies. For the validation data, we apply flip and contrast augmentation. Note that the best results for each setting are obtained with non-maximum suppression (NMS [11]). As shown in Table 2, the Cascade R-CNN achieves a good trade-off between mAP and IoU, with a final score 1.36 percentage points higher than the Faster R-CNN [10] baseline. We also compared the different data strategies in detail: as shown in Table 3, the final fused model (model 7) brings a 4.83 percentage-point improvement over the original Cascade R-CNN result.

Table 2. Results of different models.

Method            | mAP_d  | IoU_d  | score_d
Faster R-CNN [10] | 0.2618 | 0.3448 | 0.2950
Cascade R-CNN [1] | 0.2996 | 0.3221 | 0.3086

Table 3. Results of the improved Cascade R-CNN with different training-data strategies.

Method                                                      | mAP_d  | IoU_d  | score_d
Model 1 (only training data 1)                              | 0.2210 | 0.4504 | 0.3127
Model 2 (only training data 2)                              | 0.2138 | 0.4323 | 0.3012
Model 3 (training data 1 + training data 2)                 | 0.2996 | 0.3221 | 0.3086
Model 4 (selected by t-SNE)                                 | 0.2379 | 0.4512 | 0.3235
Model 5 (selected by t-SNE + augmentation of training data) | 0.2658 | 0.4476 | 0.3385
Model 6 (selected by t-SNE + augmentation of testing data)  | 0.2633 | 0.4663 | 0.3445
Model 7 (fusion of models 5 and 6 by NMS)                   | 0.3235 | 0.4172 | 0.3610

3.2. Task 2: Region segmentation

The Adam optimizer is used with an initial learning rate of 0.007; a total of 30k iterations are trained, the batch size is 10, and all images are resized to 513*513. We select a multi-class sigmoid loss, since different artefact classes may overlap. Table 4 shows the results of multi-class artefact segmentation with different configurations. Owing to the limited training data, the contours of some instruments cannot be extracted completely. To remedy this over-segmentation of instruments, we add a marker-based watershed post-processing that takes the partially extracted regions as markers, so that the multiple parts of an instrument are perceptually grouped together according to their regional homogeneity. A segmentation example is shown in Figure 4, where the post-processing clearly improves the result. As Table 4 also shows, ensembling the two backbones (ResNet101 [4] and MobileNet [5]) increases the score from 0.6414 to 0.6568, a gain of nearly 1.5 percentage points, and post-processing such as region growing further improves it to 0.6700.

Fig. 4. (a) the original image; (b) the result of the model without post-processing; (c) the final result.

Table 4. Segmentation results of different configurations.

Method                     | Overlap | F2-score | score_s
ResNet101 backbone         | 0.6288  | 0.6795   | 0.6414
Ensemble of two backbones  | 0.6592  | 0.6937   | 0.6568
Ensemble + post-processing | 0.6612  | 0.6964   | 0.6700
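Model 7 in Table 3 fuses the outputs of models 5 and 6 with NMS [11]. A minimal per-class fusion could look like the sketch below; the IoU threshold and the use of torchvision's NMS operator are assumptions for illustration rather than our exact settings.

```python
# Hedged sketch of fusing two detectors' outputs per class with NMS
# (cf. Model 7 in Table 3). Boxes are (N, 4) xyxy tensors; iou_thr is illustrative.
import torch
from torchvision.ops import nms

def fuse_detections(boxes_a, scores_a, labels_a,
                    boxes_b, scores_b, labels_b, iou_thr=0.5):
    boxes = torch.cat([boxes_a, boxes_b])
    scores = torch.cat([scores_a, scores_b])
    labels = torch.cat([labels_a, labels_b])
    keep = []
    for cls in labels.unique():
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        kept = nms(boxes[idx], scores[idx], iou_thr)   # indices into boxes[idx]
        keep.append(idx[kept])
    keep = torch.cat(keep)
    return boxes[keep], scores[keep], labels[keep]
```

Likewise, the marker-based watershed post-processing for fragmented instrument masks (Section 3.2) could be sketched as follows; the morphological closing radius and the inverted distance transform are illustrative choices, not a reproduction of our exact pipeline.

```python
# Hedged sketch of marker-based watershed merging of fragmented instrument
# regions (Section 3.2): the partially extracted fragments act as markers and
# are grown within a morphologically closed region of interest.
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.morphology import binary_closing, disk

def merge_instrument_parts(binary_mask, closing_radius=15):
    # Label each partially extracted fragment; these act as watershed markers.
    markers, _ = ndi.label(binary_mask)
    # Morphological closing bridges gaps between fragments (illustrative radius).
    region = binary_closing(binary_mask, disk(closing_radius))
    # Grow the markers over an inverted distance transform, restricted to the region.
    distance = ndi.distance_transform_edt(region)
    labels = watershed(-distance, markers, mask=region)
    # The union of the grown fragments gives the merged instrument mask.
    return labels > 0
```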
3.3. Task 3: Detection generalization

We trained the Cycle-GAN with the following hyper-parameters: the Adam optimizer, an initial learning rate of 0.002, 30 epochs, a batch size of 1, and all images resized to 512*512. Figure 5 shows an example of the style transfer. We then trained the task-1 detection model on the original and on the translated training data respectively and compared their detection-generalization performance. As shown in Table 5, the model trained on the original training data obtains mAP_g = 0.3187 and dev_g = 0.1018, while the model trained on the translated data obtains mAP_g = 0.3747 and dev_g = 0.0693. The style transfer therefore improves detection generalization.

Fig. 5. (a) the original image; (b) the translated image.

Table 5. Results of detection generalization.

Method                                                  | mAP_g  | dev_g
Trained with original training data                     | 0.3187 | 0.1018
Trained with training data translated by Cycle-GAN [6]  | 0.3747 | 0.0693

4. CONCLUSION

In task 1, better results are obtained by combining FPN with Cascade R-CNN, and the mAP and IoU metrics are better balanced. At the same time, using t-SNE to automatically select samples that are similar across the two batches of training data and the test data helps to accelerate model training. In task 2, the Deeplab v3+ model is complemented by a multi-class sigmoid loss function to improve segmentation quality. In task 3, we use Cycle-GAN to translate the task-1 training set towards the task-3 test set and fine-tune the task-1 detection model, which effectively improves its generalization.

ACKNOWLEDGMENT

We gratefully thank Dr. Peng Gao, Dr. Bin Lv, Dr. Jun Wang, Xia Zhou, Ge Li, Chenfeng Zhang and Yue Wang for their valuable discussion and kind support during this challenge.

5. REFERENCES

[1] Zhaowei Cai and Nuno Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.

[2] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[5] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.

[6] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

[7] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

[8] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnires, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[9] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," CoRR, vol. abs/1904.07073, 2019.

[10] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497, 2015.

[11] Alexander Neubeck and Luc Van Gool, "Efficient non-maximum suppression," in 18th International Conference on Pattern Recognition (ICPR'06), IEEE, 2006, vol. 3, pp. 850–855.