            ENDOSCOPY ARTEFACT DETECTION AND SEGMENTATION USING DEEP
                        CONVOLUTIONAL NEURAL NETWORK

                                         Haijian Chen, Chenyu Lian, Liansheng Wang

               Department of Computer Science, School of Informatics, Xiamen University, China


                                                      ABSTRACT

Endoscopy Artefact Detection and Segmentation (EAD2020) includes three sub-tasks: multi-class artefact detection, semantic segmentation, and out-of-sample generalisation. This manuscript summarizes our solution. The challenge can be treated as two independent problems: object detection and semantic segmentation. For the detection problem, we use Cascade R-CNN with FPN and Hybrid Task Cascade. For the segmentation problem, we use a DeepLab v3+ model with a bce+dice loss.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                                                 1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early detection of numerous cancers. However, a major drawback of the recorded video frames is that they are heavily corrupted with multiple artifacts. Accurate detection, and even segmentation, of these artifacts is therefore very helpful for improving endoscopy tools. This task aims to localise bounding boxes, predict class labels, and produce pixel-wise segmentations of 8 different artifact classes for given frames and clinical endoscopy video clips.

                                                   2. DATASETS

The details of the Endoscopy Artifact Detection and Segmentation dataset are described well in the original papers [1, 2, 3]. The following part gives a brief analysis of the EAD2020 data.

2.1. Object detection

We combine the two phases of the dataset together. As shown in Table 1, the distribution of the different classes is very imbalanced. The counts of 'blur', 'instrument', and 'blood' are significantly smaller than the others, which could make them hard examples when training models. The counts of 'specularity' and 'artifact' are very large, and their objects are very small in size. Based on this, we pay attention to the balance of each class when we set aside 20% of the data as the validation set.

  Class          Count    Ratio      Class          Count    Ratio
  specularity     9791    36.2%      contrast        1641     6.1%
  saturation      1277     4.7%      bubbles         4670    17.3%
  artifact        8012    29.6%      instrument       470     1.7%
  blur             684     2.5%      blood            491     1.8%

          Table 1. Class distribution of the detection dataset

2.2. Semantic segmentation

Many ground-truth pixel values lie strictly between 0 and 255 in the dataset. After dividing all ground-truth pixel values by 255 and using a threshold of 0.5 to classify foreground and background pixels, we obtain the statistics shown in Table 2. Foreground pixels are significantly fewer than background pixels, and the foreground pixels of the different classes are imbalanced as well. As shown in Table 3, the most common image sizes are 512 × 512 and 1349 × 1079; the remaining images have various other sizes. We shuffle the dataset randomly and use 20% of the data as the validation set.

  Class          fg            bg            fg / Σ fg    fg / (fg+bg)
  Instrument     15997225      371567134       36.39%        4.13%
  Specularity     4700063      382864296       10.69%        1.21%
  Artefact        4100248      383464111        9.33%        1.06%
  Bubbles         8967902      378596457       20.40%        2.31%
  Saturation     10190545      377373814       23.18%        2.63%

  Table 2. Pixel distribution of the segmentation dataset (fg: foreground, bg: background)

  Size           Count    Ratio       Size       Count    Ratio
  512 × 512      138      25.36%      Smaller    129      23.71%
  1349 × 1079    118      21.69%      Bigger     159      29.23%
  Total          544      100%

  Table 3. Image sizes of the segmentation dataset (Smaller: height < 800 and width < 700; Bigger is the contrary)
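As a concrete illustration of this preprocessing, here is a minimal sketch of the binarisation used for Table 2 (the file name and the OpenCV-based loading are our own assumptions, not part of the challenge code):

```python
import cv2
import numpy as np

# Hypothetical ground-truth mask file; many of its pixel values lie
# strictly between 0 and 255.
gt = cv2.imread("instrument_mask.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Rescale to [0, 1] and binarise with a 0.5 threshold, as done for Table 2.
binary = (gt / 255.0 >= 0.5).astype(np.uint8)
fg = int(binary.sum())
bg = int(binary.size - fg)
print(fg, bg, fg / (fg + bg))
```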
                                                    3. METHODS

3.1. Object Detection

3.1.1. Model Overview

We use Cascade R-CNN [4] with a ResNeXt-101 [5] backbone and FPN [6] as the neck of the model. We also train a Hybrid Task Cascade model [7] with the same backbone and neck.

3.1.2. Loss

We use Cross Entropy Loss for classification. Smooth L1 Loss is used for bounding-box regression to improve the precision of detection.
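To make the two terms concrete, here is a minimal PyTorch sketch (the tensor shapes and the unweighted sum are illustrative assumptions rather than the exact losses inside the detection framework):

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: 8 proposals, 9 logits (8 artefact classes + background),
# and 4 regression offsets per proposal.
cls_logits = torch.randn(8, 9)
cls_labels = torch.randint(0, 9, (8,))
box_pred   = torch.randn(8, 4)
box_target = torch.randn(8, 4)

cls_loss = F.cross_entropy(cls_logits, cls_labels)   # classification term
reg_loss = F.smooth_l1_loss(box_pred, box_target)    # bounding-box regression term
loss = cls_loss + reg_loss
```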
3.1.3. Augmentation

On the training data, we perform random flipping, normalization, and resizing. The images are resized to 512 × 512.

3.1.4. Implementation Details

We extract candidate bounding boxes with a region proposal network (RPN) and use non-maximum suppression (NMS) to filter the proposals. Observing that some small objects are ignored, the NMS threshold is increased from 0.7 to 0.8, which slightly improves the recall rate and mAP. Soft-NMS [8] is applied to avoid mistakenly discarding bounding boxes outright.
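For illustration, here is a minimal sketch of Gaussian Soft-NMS written directly with PyTorch and torchvision; in practice we rely on the detection framework's built-in implementation, so the decay function and the hyper-parameters below are assumptions:

```python
import torch
from torchvision.ops import box_iou

def gaussian_soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Decay the scores of overlapping boxes instead of discarding them outright.

    boxes: (N, 4) tensor of x1, y1, x2, y2; scores: (N,) confidences (N >= 1).
    """
    boxes, scores = boxes.clone(), scores.clone()
    kept_boxes, kept_scores = [], []
    while scores.numel() > 0:
        top = torch.argmax(scores)
        kept_boxes.append(boxes[top])
        kept_scores.append(scores[top])
        keep = torch.ones_like(scores, dtype=torch.bool)
        keep[top] = False
        boxes, scores = boxes[keep], scores[keep]
        if scores.numel() == 0:
            break
        # Gaussian decay: the higher the IoU with the kept box, the stronger the decay.
        ious = box_iou(kept_boxes[-1].unsqueeze(0), boxes).squeeze(0)
        scores = scores * torch.exp(-(ious ** 2) / sigma)
        # Drop candidates whose decayed score falls below the threshold.
        keep = scores > score_thresh
        boxes, scores = boxes[keep], scores[keep]
    return torch.stack(kept_boxes), torch.stack(kept_scores)
```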
We use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. To obtain better results at convergence, we add a warm-up period in which the learning rate increases linearly to 0.0025 over the first 500 iterations. The network is trained for 13 epochs in total.
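A minimal sketch of this warm-up schedule using a PyTorch LambdaLR scheduler (the model and the training loop are stand-ins; the actual training runs inside the detection framework):

```python
import torch

model = torch.nn.Linear(16, 9)                      # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=1e-4)

# Scale the base LR linearly up to 1.0 over the first 500 iterations.
warmup_iters = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters))

for it in range(1000):                              # stand-in training loop
    loss = model(torch.randn(4, 16)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                # step once per iteration
```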
3.2. Semantic Segmentation

3.2.1. Model Overview

We use the DeepLab V3+ network [9] with a ResNet101 [5] backbone for semantic segmentation. DeepLab V3+ is an encoder-decoder network with dilated (atrous) convolutions; the ASPP module and the decoder are implemented as in the original paper.

The output of the network is activated by a sigmoid function to obtain the probability maps, since there may be overlap among the different channels of the mask. The segmentation problem is therefore treated as multiple binary segmentation tasks.
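In other words, every class channel receives an independent binary decision; a small sketch (the channel count follows Table 2 and the 0.5 threshold follows Section 4.2.1):

```python
import torch

logits = torch.randn(1, 5, 512, 512)   # stand-in logits: 5 segmentation classes
probs = torch.sigmoid(logits)          # per-channel probability maps (channels may overlap)
masks = probs >= 0.5                   # independent binary mask per class
```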
3.2.2. Loss

We evaluated different losses, including Binary Cross Entropy, Dice Loss, Lovász-Hinge Loss [10], and their combinations. Based on the results discussed in Section 4.2, we finally choose bce+dice as the loss of our model, which simply means

    L = L_{bce} + L_{dice}
      = -\, y_{gt} \log y_{pred} - (1 - y_{gt}) \log(1 - y_{pred})
        + 1 - \frac{2 \sum y_{gt} \cdot y_{pred} + \epsilon}{\sum y_{gt} + \sum y_{pred} + \epsilon}

(\epsilon = 10^{-7}; y_{gt} and y_{pred} are flattened tensors)
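A minimal PyTorch sketch of this loss as defined above (flattening over the whole batch and the mean reduction of the BCE term are our assumptions):

```python
import torch

def bce_dice_loss(logits, target, eps=1e-7):
    """L = L_bce + L_dice on flattened sigmoid probability maps."""
    y_pred = torch.sigmoid(logits).reshape(-1)
    y_gt = target.reshape(-1).float()
    bce = torch.nn.functional.binary_cross_entropy(y_pred, y_gt)
    dice = (2 * (y_gt * y_pred).sum() + eps) / (y_gt.sum() + y_pred.sum() + eps)
    return bce + (1 - dice)

# loss = bce_dice_loss(model(images), masks)
```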
3.2.3. Augmentation

We apply random brightness and contrast changes, random horizontal and vertical flips, random shift-scale-rotation, Gaussian blurring, resizing, and normalization to the images of the training set. All random transformations are applied with a probability of 0.5, using the default parameters of the Albumentations library [11]. For the validation set, we only apply image normalization.

The images are resized to 512 × 512 and 1024 × 1024 during the training phase; see Section 4.2.
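A sketch of this pipeline with Albumentations (the transform names follow the library, but the exact composition shown here is a reconstruction rather than the verbatim training code):

```python
import albumentations as A

train_tfms = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.ShiftScaleRotate(p=0.5),
    A.GaussianBlur(p=0.5),
    A.Resize(512, 512),        # or 1024 x 1024, see Section 4.2
    A.Normalize(),
])
val_tfms = A.Compose([A.Resize(512, 512), A.Normalize()])

# augmented = train_tfms(image=image, mask=mask)
# image, mask = augmented["image"], augmented["mask"]
```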

3.2.4. Implementation Details

We load weights pre-trained on ImageNet for the backbone network. The network is trained using SGD with a momentum of 0.9 and a weight decay of 0.0001, with mini-batches of size 4. The learning rate is increased linearly over a warm-up period of 5 epochs to a maximum value of 0.01, and is then adjusted by cosine annealing with warm restarts [12] with a period of 40 epochs. The images are resized to 512 × 512 to train for 200 epochs, and then resized to 1024 × 1024 to train for another 100 epochs.
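A sketch of this learning-rate schedule using PyTorch's CosineAnnealingWarmRestarts (the epoch-level warm-up below is our own simple formulation and only illustrates the shape of the schedule):

```python
import torch

model = torch.nn.Conv2d(3, 5, 3)          # stand-in for the DeepLab V3+ network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

warmup_epochs, max_lr = 5, 0.01
cosine = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=40)

for epoch in range(300):
    if epoch < warmup_epochs:
        # linear warm-up towards the maximum learning rate of 0.01
        for group in optimizer.param_groups:
            group["lr"] = max_lr * (epoch + 1) / warmup_epochs
    else:
        # cosine annealing with warm restarts, restart period of 40 epochs
        cosine.step(epoch - warmup_epochs)

    # stand-in training epoch (mini-batches of size 4 in our setting)
    loss = model(torch.randn(4, 3, 64, 64)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```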
                                                    4. RESULTS

4.1. Object Detection

Table 4 shows the mAPs of the different classes on the validation set, evaluated with COCO metrics, and Table 5 gives more details of the evaluation results. The metrics of the two models are fairly close to each other. In Figure 1, we find that the HTC model is good at detecting large objects while doing poorly on some small objects, even though its AP_small metric is slightly better than the other model's.

  Class          Cascade R-CNN     HTC         Faster R-CNN
  instrument        0.64791        0.64965        0.56197
  artifact          0.22540        0.22511        0.21733
  blood             0.10594        0.12520        0.10998
  blur              0.26506        0.26097        0.19428
  bubbles           0.11302        0.10491        0.10600
  contrast          0.40275        0.39182        0.38044
  saturation        0.27912        0.24990        0.26373
  specularity       0.09281        0.09485        0.08561

          Table 4. mAPs of different classes on the validation set

  Metric         Cascade R-CNN     HTC         Faster R-CNN
  mAP               0.267          0.263          0.240
  AP50              0.501          0.505          0.498
  AP75              0.246          0.249          0.209
  AP_small          0.082          0.091          0.086
  AP_medium         0.162          0.166          0.166
  AP_large          0.337          0.337          0.299

     Table 5. AP metrics of the evaluation results on the validation set

  Fig. 1. Predictions for two images (left: Hybrid Task Cascade; right: Cascade R-CNN)

The results in Tables 6, 7, and 8 are provided by the official leaderboard. Table 6 shows the detection scores on the first phase of the test data: the Hybrid Task Cascade network performs better in mAP while getting a lower score in IoU. Table 7 shows the scores of the final test, where we get a higher detection score with the Cascade R-CNN network. As shown in Table 8, resizing the images to 1024 × 1024 instead of 512 × 512 does not give a better detection score but contributes to generalization performance.

  Model             mAP_d      IoU_d      mAP_g      mAP_sq
  Cascade R-CNN     0.2238     0.1707     0.2405     0.3038
  HTC network       0.2393     0.0674     0.2621     0.3214

        Table 6. Detection scores in the first phase of test data

  Model             score_d     d_std      gmAP       g_dev
  Cascade R-CNN     0.2193      0.0871     0.2485     0.0552
  HTC network       0.2021      0.0901     0.2744     0.0556

               Table 7. Detection scores in the final test

  Size              score_d     d_std      gmAP       g_dev
  512 × 512         0.2193      0.0871     0.2485     0.0552
  1024 × 1024       0.2156      0.0991     0.2659     0.0764

  Table 8. Detection scores in the final test with Cascade models trained with different input sizes

4.2. Semantic Segmentation

4.2.1. Experiments with losses on the validation set

To evaluate the results of different losses, we train a DeepLab V3+ model with a ResNet101 backbone and a modified U-Net [13] model with a ResNet-34 backbone for 160 epochs. The threshold used to predict foreground pixels is 0.5, and the other configurations are the same as in Section 3.2.4. In Table 9 and Table 10, 'bce' is the Binary Cross Entropy loss, 'dice' is the Dice Loss, and 'bce+dice' is defined in Section 3.2.2. 'p' and 'r' in the tables stand for precision and recall.

  Loss         F1        F2        IoU       p         r
  bce          0.585     0.6754    0.4447    0.5014    0.8136
  dice         0.5846    0.5881    0.4874    0.6755    0.601
  bce+dice     0.6728    0.7042    0.5523    0.6699    0.7585

  Table 9. Metrics on the validation set with different losses for the DeepLab V3+ model with ResNet101 backbone (160 epochs)

  Loss         F1        F2        IoU       p         r
  bce          0.4666    0.4346    0.3743    0.7201    0.4196
  dice         0.561     0.5421    0.469     0.7188    0.5353
  bce+dice     0.6138    0.5837    0.5057    0.7415    0.5709

  Table 10. Metrics on the validation set with different losses for a modified U-Net model with ResNet-34 backbone (160 epochs)

In Table 9, the experiment shows that bce+dice gets the best scores in Dice (F1), F2, and IoU, and its precision is close to that of dice while not losing much recall. In Table 10, we can see a significant improvement of the U-Net when using bce+dice, showing the effectiveness of this loss.

After 300 epochs, the DeepLab V3+ model trained with bce+dice reaches 0.7927 in F1, 0.8386 in F2, 0.6857 in IoU, 0.7422 in precision, and 0.887 in recall. The U-Net models do not reach much better scores, as they almost converge after 165 epochs.

We also tested the Lovász-Hinge loss. In our experiments, it is hard to converge when the model is trained from scratch, so we instead use the Lovász-Hinge loss to fine-tune the DeepLab model already trained with bce+dice for 300 epochs. Table 11 shows the results of the first 20 epochs ('Epochs' means the number of training epochs with the Lovász-Hinge loss). The model converges after 30 epochs, but these results are worse than the model before fine-tuning, so we give up this method.

We choose bce+dice to train the final model.
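For reference, the pixel-level metrics reported in this section can be computed per image as in the following sketch (our own formulation, with F2 being the F-beta score at beta = 2; the official challenge evaluation scripts may differ):

```python
import numpy as np

def seg_metrics(pred, gt, beta=2.0, eps=1e-7):
    """Pixel-level precision, recall, F1, F-beta, and IoU for one pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f1 = 2 * p * r / (p + r + eps)
    fbeta = (1 + beta ** 2) * p * r / (beta ** 2 * p + r + eps)
    iou = tp / (tp + fp + fn + eps)
    return p, r, f1, fbeta, iou
```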
  Epochs     F1        F2        IoU       p         r
  5          0.5491    0.5038    0.4486    0.8490    0.4828
  20         0.5115    0.4610    0.4173    0.8927    0.4367
  40         0.5373    0.4846    0.4387    0.8872    0.4594

  Table 11. Using the Lovász-Hinge loss to fine-tune a model trained with the BCE + Dice loss

4.2.2. Experiments with backbones on the validation set

Table 12 shows another experiment comparing different networks. We find that the Xception-based DeepLab V3+ converges significantly more slowly than the ResNet101-based model and does not reach better scores than the U-Net model.

  Model       F1        F2        IoU       p         r
  D-X         0.4189    0.4248    0.3167    0.486     0.4388
  D-R101      0.5823    0.5967    0.4717    0.6288    0.6313
  U-R34       0.5535    0.5209    0.4507    0.7512    0.5078

  Table 12. Metrics on the validation set with different networks (85 epochs; D: DeepLab V3+, U: U-Net, X: Xception, R: ResNet)

4.2.3. Submission results

The training parameters are listed in Section 3.2.4, and all the results in this subsection are provided by the official leaderboard. In Table 13, Model 1 is trained with 512 × 512 images and uses a threshold of 0.5. Model 2 is the same as Model 1 except that the threshold is changed to 0.7. Model 3 is trained with 1024 × 1024 images and uses a threshold of 0.7.

        F1        F2        p         r         sscore     sd
  1     0.4872    0.5027    0.5250    0.5467    0.5154     0.2327
  2     0.4802    0.5156    0.4836    0.5872    0.5167     0.2403
  3     0.5012    0.5042    0.5817    0.5390    0.5315     0.2644

  Table 13. Segmentation scores in the first phase of test data (50% of final data)

We resized the images to 512 × 512 at first. However, as discussed in Table 3, there are many bigger images, which can also be found in the first phase of the test images. Compared with the models trained only on images resized to 512 × 512, the models trained with 1024 × 1024 images get better scores on the validation set, and some predictions look smoother, as shown in Figures 2, 3, and 4.

  Fig. 2. Test image     Fig. 3. Pred-512     Fig. 4. Pred-1024

We find that adding the segmentation data of EAD2019 to the training set also helps a little, although there is potential validation data leakage, which makes the validation metrics unreliable. However, it does not help in the detection task.

We chose Model 3 to predict the final test data and got the scores shown in Table 14.

  Model                                        sscore     sstd
  3: DeepLabV3+ / ResNet101 / 1024 × 1024      0.5459     0.2682

               Table 14. Segmentation scores in the final test

                                          5. DISCUSSION & CONCLUSION

In task 1, we compare Cascade R-CNN with Hybrid Task Cascade to obtain a better detection model. FPN and Soft-NMS are used to improve the detection precision in the presence of class imbalance, and a proper NMS threshold helps to improve the recall rate of small objects.

In task 2, we select DeepLab V3+ to solve the problem and choose bce+dice as the loss function to balance precision and recall. The image sizes in the dataset are a noticeable factor in the training phase, and adjusting the prediction threshold also contributes to a more balanced model.

                                                 6. REFERENCES

 [1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

 [2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

 [3] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.
 [5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
     Sun. Deep residual learning for image recognition,
     2015.
 [6] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming
     He, Bharath Hariharan, and Serge J. Belongie. Fea-
     ture pyramid networks for object detection. CoRR,
     abs/1612.03144, 2016.
 [7] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xi-
     aoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jian-
     ping Shi, Wanli Ouyang, Chen Change Loy, and Dahua
     Lin. Hybrid task cascade for instance segmentation.
     CoRR, abs/1901.07518, 2019.
 [8] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and
     Larry S. Davis. Improving object detection with one
     line of code. CoRR, abs/1704.04503, 2017.

 [9] Liang-Chieh Chen, Yukun Zhu, George Papandreou,
     Florian Schroff, and Hartwig Adam. Encoder-decoder
     with atrous separable convolution for semantic image
     segmentation, 2018.
[10] Maxim Berman, Amal Rannen Triki, and Matthew B.
     Blaschko. The Lovász-softmax loss: A tractable surro-
     gate for the optimization of the intersection-over-union
     measure in neural networks, 2017.
[11] A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov,
     and A. A. Kalinin. Albumentations: fast and flexible
     image augmentations. ArXiv e-prints, 2018.
[12] Ilya Loshchilov and Frank Hutter. SGDR: stochastic
     gradient descent with restarts. CoRR, abs/1608.03983,
     2016.

[13] Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
     U-net: Convolutional networks for biomedical im-
     age segmentation. In Medical Image Computing and
     Computer-Assisted Intervention – MICCAI 2015, pages
     234–241, 2015.