       ENDOSCOPIC ARTEFACT DETECTION USING CASCADE R-CNN BASED MODEL

                                                    Zhimiao Yu, and Yuanfan Guo

                                       Shanghai Jiao Tong University, Shanghai, China
                                                  {gyfastas,Carboxy}@sjtu.edu.cn



                           ABSTRACT

Accurate detection of artefacts is a core challenge in a wide
range of endoscopic applications addressing multiple different
disease areas. Our work aims to localise bounding boxes and
predict class labels of 8 different artefact classes in given
frames and clinical endoscopy video clips. To solve the task, we
use Cascade R-CNN [1] as the network architecture and adopt an
ImageNet-pretrained ResNet101 [2] backbone with a Feature
Pyramid Network (FPN) [3] structure. To improve the network
performance, data augmentation and multi-scale training are also
adopted. Finally, we analyze the major challenge of the task.

                      1. INTRODUCTION

Endoscopy is a widely used clinical procedure for the early
detection of numerous cancers (e.g., nasopharyngeal, oesophageal
adenocarcinoma, gastric, colorectal and bladder cancers),
therapeutic procedures and minimally invasive surgery (e.g.,
laparoscopy). However, video frames captured by an endoscope
usually contain multiple artefacts, which not only make it
difficult to visualise the underlying tissue during diagnosis
but also affect any post-analysis methods required for
follow-ups. Existing endoscopy workflows are not qualified for
restoring high-quality endoscopic frames because in most cases
they can detect only one artefact class. In general, the same
video frame can be corrupted by multiple artefacts, e.g. motion
blur, specular reflections and low contrast can all be present
in the same frame. Moreover, the artefact types vary from frame
to frame. Therefore, improving detection accuracy is a core
challenge in a wide range of endoscopic applications.
    Recently, deep ConvNets have significantly improved image
classification and object detection accuracy [4]. In the deep
learning era, object detectors can be grouped into two genres:
"two-stage detection" (e.g. R-CNN [5]) and "one-stage detection"
(e.g. [6]) [7]. In this task, we use Cascade R-CNN [1], a
multi-stage object detection architecture, because it achieves
state-of-the-art detection performance.

    Copyright © 2020 for this paper by its authors. Use permitted
under Creative Commons License Attribution 4.0 International
(CC BY 4.0).

                        2. DATASETS

The 8 artefact classes in the dataset for "Endoscopic Artefact
Detection" are specularity, saturation, artifact, blur, contrast,
bubbles, instrument and blood. Ground truth boxes are visualized
in Fig 2. The artefact detection task is evaluated on a provided
test dataset drawn from the same data collection as the training
data. Specifically, the training dataset for detection consists
of 2200 annotated frames over all 8 artifact classes, and the
test dataset contains 100 frames [8, 9].

                        3. METHODS

3.1. Architecture

The model architecture is shown in Fig 1. We use Cascade
R-CNN [1] as the network architecture and adopt an
ImageNet-pretrained ResNet101 [2] backbone with a Feature
Pyramid Network (FPN) [3] structure. Taking the typical areas of
artefacts into consideration, the anchor base areas are tuned
from 16² to 512² on pyramid levels P2 to P6. Specifically, the
anchor scales, ratios and strides are [8], [0.5, 1.0, 2.0] and
[4, 8, 16, 32, 64], respectively.
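As a concrete illustration, the sketch below generates the
per-level anchor boxes from the scales, ratios and strides listed
above, assuming the common convention that the anchor side length
equals scale × stride; it illustrates the setting and is not the
authors' code.

    import numpy as np

    def level_anchors(stride, scale=8, ratios=(0.5, 1.0, 2.0)):
        """Zero-centred (x1, y1, x2, y2) anchors for one FPN level.
        All anchors of a level share the same area; the ratios only
        change the aspect."""
        side = scale * stride
        anchors = []
        for r in ratios:
            w = side / np.sqrt(r)  # wider box when r < 1
            h = side * np.sqrt(r)  # taller box when r > 1
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
        return np.array(anchors)

    # One anchor set per pyramid level P2..P6, with the strides of Sec. 3.1.
    anchors = {f"P{i + 2}": level_anchors(s)
               for i, s in enumerate([4, 8, 16, 32, 64])}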
3.2. Implementation Details

For data augmentation, each image is horizontally flipped with a
50 percent chance. We replace the NMS operation in the
architecture with soft-NMS [10] and use cosine decay [11] as the
learning rate schedule. The classification and regression loss
functions are cross-entropy loss and smooth L1 loss,
respectively. The model is trained for 24 epochs.
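For reference, here is a minimal sketch of the two non-default
training choices above: Gaussian soft-NMS rescoring in the spirit
of [10], and a cosine learning-rate schedule as in [11]. The
sigma, score threshold and base learning rate are assumed values,
not reported in the paper.

    import math
    import numpy as np

    def iou(box, boxes):
        # IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes.
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = (box[2] - box[0]) * (box[3] - box[1])
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area + areas - inter)

    def soft_nms(boxes, scores, sigma=0.5, score_thr=1e-3):
        """Gaussian soft-NMS: instead of discarding boxes that overlap
        the current top-scoring box, decay their scores by
        exp(-IoU^2 / sigma)."""
        scores = scores.astype(float)
        keep, idxs = [], np.arange(len(scores))
        while idxs.size > 0:
            top = idxs[np.argmax(scores[idxs])]
            keep.append(top)
            idxs = idxs[idxs != top]
            if idxs.size == 0:
                break
            scores[idxs] *= np.exp(-iou(boxes[top], boxes[idxs]) ** 2 / sigma)
            idxs = idxs[scores[idxs] > score_thr]  # drop decayed boxes
        return keep

    def cosine_lr(step, total_steps, base_lr=0.02):
        # Cosine decay from base_lr down to 0 over the whole training run.
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))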
    In the experiments, we find that specularity, artifact and
bubbles are hard to classify. A probable reason is that these
three artefacts have a similar appearance (e.g. some of them all
appear as spots of light). To address this problem, we modify
the loss function: specifically, we up-weight the loss when the
model misclassifies one of these three artefacts. The result
turns out to be an improvement in AP for these three artefacts
but a decline in mAP.
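A minimal sketch of this loss reweighting, assuming hypothetical
class indices for the three confusable artefacts and an
illustrative weight value; the exact scheme is not specified in
the paper beyond the description above.

    import torch
    import torch.nn.functional as F

    # Hypothetical indices of specularity, artifact and bubbles.
    CONFUSABLE = (0, 2, 5)

    def upweighted_ce(logits, targets, w=2.0):
        """Cross-entropy that up-weights samples whose ground-truth class
        is one of the three confusable artefacts and whose prediction is
        wrong."""
        ce = F.cross_entropy(logits, targets, reduction="none")
        preds = logits.argmax(dim=1)
        confusable = torch.zeros_like(targets, dtype=torch.bool)
        for c in CONFUSABLE:
            confusable |= targets == c
        mistaken = confusable & (preds != targets)
        weights = torch.where(mistaken, torch.full_like(ce, w),
                              torch.ones_like(ce))
        return (weights * ce).mean()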
                                               Fig. 1. Model architecture based on Cascade R-CNN.


Fig. 2. Visualization of ground truth boxes for the 8 artefact
classes: (a) specularity, (b) saturation, (c) artifact, (d) blur,
(e) contrast, (f) bubbles, (g) instrument, (h) blood.

                        4. RESULTS

We randomly divide the provided data into 5 subsets and use one
of them for validation and the others for training. The following
metrics are computed on the validation set.
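A minimal sketch of such a random split; the seed and fold index
are illustrative, not values from the paper.

    import random

    def five_fold_split(frames, val_fold=0, k=5, seed=0):
        """Shuffle the annotated frames, partition them into k subsets,
        and use one subset for validation and the rest for training."""
        frames = list(frames)
        random.Random(seed).shuffle(frames)
        folds = [frames[i::k] for i in range(k)]
        val = folds[val_fold]
        train = [f for i, fold in enumerate(folds)
                 if i != val_fold for f in fold]
        return train, val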
4.1. Data augmentation of resizing

We report the baseline results in Table 1. The baseline achieves
0.26 mAP. To improve model performance, we added an image
resizing operation to the data augmentation pipeline.
Specifically, each image is randomly resized to a scale between
(512, 512) and (1024, 1024) while keeping its original aspect
ratio (see the sketch at the end of this subsection). Since the
image sizes in the dataset vary considerably, we expected this
operation to be effective.
    The results are shown in Table 2. The resizing operation
clearly improves model performance, with an increase in mAP of
0.017. We notice that the improvement is mainly on AP_small. The
main reason is that in most cases the resizing operation enlarges
the image and thus makes it possible to detect more small
objects.
    Note that the test images are often larger than the training
images (test images often have height and width above 1000, while
training images are around 500), so the resizing operation also
resolves the scale mismatch between training and test images.
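The sketch below shows one way to implement the resizing
described above, rescaling each image to fit inside a randomly
sampled square between (512, 512) and (1024, 1024) while keeping
the aspect ratio; it mirrors the description in the text, not
necessarily the exact pipeline code.

    import random
    from PIL import Image

    def random_resize(img, lo=512, hi=1024):
        """Rescale a PIL image so it fits inside an (s, s) window with
        s sampled uniformly from {lo, ..., hi}, preserving the original
        aspect ratio."""
        s = random.randint(lo, hi)
        w, h = img.size
        scale = min(s / w, s / h)  # one factor for both sides keeps the ratio
        return img.resize((round(w * scale), round(h * scale)),
                          Image.BILINEAR)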
4.2. Difficult classification among specularity, artifact and
bubbles

In the experiments, we found that the network has difficulty
distinguishing among specularity, artifact and bubbles. To
demonstrate the problem clearly, we calculated the confusion
matrix shown in Table 4. According to Table 4, the network has
two drawbacks. First, it tends to confuse specularity, artifact
and bubbles in the classification procedure. Second, it performs
poorly at detecting blur.
    To address the first problem, we modified the loss function
as described in Section 3.2: we increased the loss weights for
misclassification among specularity, artifact and bubbles. The
result turned out to be an improvement in AP for these three
artefacts but a decline in mAP.
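For reference, a column-normalized confusion matrix like Table 4
can be computed from matched detection/ground-truth label pairs;
the matching of detections to ground-truth boxes (e.g. by IoU) is
assumed to have been done beforehand.

    import numpy as np

    def confusion_matrix(pred_labels, gt_labels, num_classes=8):
        """Entry (p, g) is the fraction of matched ground-truth boxes of
        class g whose detection was predicted as class p, so each column
        sums to 1."""
        m = np.zeros((num_classes, num_classes))
        for p, g in zip(pred_labels, gt_labels):
            m[p, g] += 1
        return m / np.clip(m.sum(axis=0, keepdims=True), 1, None)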
                                        Table 1. Baseline performance on validation set.
             Artefacts          AP       AP@IoU=.50       AP@IoU=.75         AP_small        AP_medium         AP_large
           specularity      0.123          0.319             0.063              0.064            0.193            0.202
            saturation      0.197          0.670             0.217              0.040            0.210            0.345
             artifact       0.225          0.486             0.170              0.129            0.218            0.421
               blur         0.184          0.275             0.167                0                0              0.191
             contrast       0.414          0.760             0.416              0.033            0.187            0.439
             bubbles        0.124          0.345             0.061              0.094            0.128            0.216
           instrument       0.531          0.801             0.624                /                0              0.551
              blood         0.181          0.454             0.103                /              0.079            0.221
              mean          0.260          0.514             0.228              0.060            0.127            0.323


                            Table 2. Model performance on validation set with resizing operation.
             Artefacts          AP       AP@IoU=.50       AP@IoU=.75         AP_small        AP_medium         AP_large
           specularity      0.138          0.380             0.062              0.091            0.216            0.199
            saturation      0.295          0.669             0.246              0.050            0.212            0.338
             artifact       0.243          0.516             0.185              0.140            0.239            0.427
               blur         0.181          0.279             0.178                0                0              0.188
             contrast       0.422          0.760             0.424                0              0.224            0.443
             bubbles        0.153          0.384             0.085              0.125            0.151            0.254
           instrument       0.569          0.830             0.649                /              0.044            0.587
              blood         0.212          0.495             0.172                /              0.130            0.244
              mean          0.277          0.539             0.250              0.068            0.152            0.335


4.3. Qualitative Results

To find out what kinds of artefacts our model can successfully
detect, we show some qualitative results in Fig 3 and Fig 4. The
qualitative results indicate that: a) for artefacts that are not
too small, our model tends to generate accurate detections;
b) more artefacts in an image lead to more difficulty in
detection; c) our model generates a fair number of false positive
blur detections. For problem c), we are not sure whether the
reason is a shortcoming of the model itself or missing blur
annotations, because the corresponding images do show blur
characteristics.

4.4. Leaderboard Result

We added the image resizing operation to the data augmentation
pipeline and tuned the maximum number of boxes per image to 300.
We then used this model to obtain the test set results; the
performance is shown in Table 3.

             Table 3. Final result on leaderboard.

                dataset                dscore
                 50% testset           0.2603
                 100% testset          0.2036

                5. DISCUSSION & CONCLUSION

In our work, we found that the major challenge in the "Endoscopic
Artefact Detection" task is the difficult classification among
specularity, artifact and bubbles. One intuitive explanation is
that some of them all appear as spots of light, sharing a high
degree of similarity. In the future, we intend to train 3
separate classifiers for these 3 artefacts and adopt more
advanced feature extraction networks, which may solve this
challenge to some extent. A box ensemble method was also tried in
our experiments; however, it seemed to lower the mAP.
    To sum up, we constructed a Cascade R-CNN based model to
solve the "Endoscopic Artefact Detection" task. We adopted
several methods to improve network performance, including data
augmentation, loss function modification and box ensembling. We
also identified the major challenge of this task.

    Table 4. Confusion matrix of the 8 classes. Rows are predicted
classes, columns are ground-truth labels; each column is normalized
over the ground-truth label.

                  specularity saturation artifact    blur   contrast   bubbles   instrument   blood
    specularity     70.3%        7.2%      10.9%     0.0%     0.0%      19.9%       0.0%       0.0%
    saturation       1.4%       79.9%       0.9%     0.7%     0.0%       0.5%       1.7%       0.0%
    artifact        19.4%        5.1%      75.4%    20.6%     1.9%      13.1%       6.0%       2.3%
    blur             0.0%        1.3%       2.3%    49.1%     3.8%       0.1%       4.3%       2.3%
    contrast         0.0%        0.0%       1.0%    11.8%    88.5%       0.0%      11.2%      12.8%
    bubbles          8.4%        2.6%       8.7%     0.0%     0.0%      66.2%       0.0%       0.8%
    instrument       0.3%        3.3%       0.7%    12.5%     4.5%       0.1%      75.0%       3.8%
    blood            0.2%        0.5%       0.1%     5.2%     1.2%       0.2%       1.7%      78.2%




Fig. 3. High quality detection examples (i.e. the model generates accurate detections). The first row shows ground truth, where
artefacts are annotated with blue bounding boxes. The second row shows results, where detected artefacts are annotated with
yellow bounding boxes.




Fig. 4. Low quality detection examples (i.e. the model generates inaccurate detections). The first row shows ground truth, where
artefacts are annotated with blue bounding boxes. The second row shows results, where detected artefacts are annotated with
yellow bounding boxes. The last two columns represent false positive blur.


                     6. REFERENCES

 [1] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn:
     Delving into high quality object detection. In Proceed-
     ings of the IEEE conference on computer vision and pat-
     tern recognition, pages 6154–6162, 2018.

 [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
     Sun. Deep residual learning for image recognition. In
     Proceedings of the IEEE conference on computer vision
     and pattern recognition, pages 770–778, 2016.
 [3] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He,
     Bharath Hariharan, and Serge Belongie. Feature pyra-
     mid networks for object detection. In Proceedings of the
     IEEE conference on computer vision and pattern recog-
     nition, pages 2117–2125, 2017.
 [4] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE
     international conference on computer vision, pages
     1440–1448, 2015.
 [5] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jiten-
     dra Malik. Region-based convolutional networks for ac-
     curate object detection and segmentation. IEEE trans-
     actions on pattern analysis and machine intelligence,
     38(1):142–158, 2015.
 [6] Joseph Redmon, Santosh Divvala, Ross Girshick, and
     Ali Farhadi. You only look once: Unified, real-time ob-
     ject detection. In Proceedings of the IEEE conference
     on computer vision and pattern recognition, pages 779–
     788, 2016.
 [7] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping
     Ye. Object detection in 20 years: A survey. arXiv
     preprint arXiv:1905.05055, 2019.
 [8] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bai-
     ley, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiao-
     qiong Li, Maxime Kayser, Roger D. Soberanis-Mukul,
     Shadi Albarqouni, Xiaokang Wang, Chunqing Wang,
     Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shu-
     fan Yang, Mohammad Azam Khan, Xiaohong W. Gao,
     Stefano Realdon, Maxim Loshchenov, Julia A. Schn-
     abel, James E. East, Geroges Wagnieres, Victor B.
     Loschenov, Enrico Grisan, Christian Daul, Walter Blon-
     del, and Jens Rittscher. An objective comparison of de-
     tection and segmentation algorithms for artefacts in clin-
     ical endoscopy. Scientific Reports, 10, 2020.
 [9] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden,
     James East, Xin Lu, and Jens Rittscher. A deep learn-
     ing framework for quality assessment and restoration
     in video endoscopy. arXiv preprint arXiv:1904.07073,
     2019.
[10] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and
     Larry S Davis. Soft-nms–improving object detection
     with one line of code. In Proceedings of the IEEE in-
     ternational conference on computer vision, pages 5561–
     5569, 2017.
[11] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang,
     Junyuan Xie, and Mu Li. Bag of tricks for image clas-
     sification with convolutional neural networks. In Pro-
     ceedings of the IEEE Conference on Computer Vision
     and Pattern Recognition, pages 558–567, 2019.