      ARTEFACT DETECTION AND SEGMENTATION USING CASCADE R-CNN & U-NET

Hoang Manh Hung∗, Phan Tran Dac Thinh∗, Hyung-Jeong Yang, Soo-Hyung Kim, and Guee-Sang Lee

     Department of Electronic and Computer Engineering, Chonnam National University, South Korea


                            ABSTRACT

Endoscopy is a widely adopted procedure for the early detection of various types of cancer, for therapeutic procedures and for minimally invasive surgery. Nevertheless, the efficiency of this method is badly affected by several artefacts, namely pixel saturation, motion blur, defocus, bubbles, specular reflections, etc. Provided that all of these artefacts are spotted comprehensively, contaminated frames could be restored to an adequate quality in order to better visualize the underlying tissue during diagnosis. In this paper, we present and discuss our methodology for detection and segmentation on endoscopy images. For artefact detection, we modify a deep neural network structure based on Cascade R-CNN with ResNeXt101 as the backbone, including deformable convolutions. Moreover, a Feature Pyramid Network is added to refine the raw feature maps, enhancing the performance of feature extraction. In the semantic segmentation task, U-Net is utilized with SE-ResNeXt50 as the backbone; this classification backbone proved clearly stronger than the other models we tested. At the end of the Endoscopy Artefact Detection challenge 2020, we attain an mAP score of 0.2366 with a deviation of 0.0762 on the test dataset of the detection task and a dice score of 0.5700 on the test dataset of the segmentation task.

                        1. INTRODUCTION

In the last decade, medical imaging for disease diagnosis and early treatment has taken a great step forward thanks to the surge of machine learning applications in computer vision. The applications of this new technique are indisputably innumerable, shortening the time of diagnosis and accelerating treatment. Although the precision of medical-imaging software still cannot be compared to that of experts, recent advances have demonstrated its auspicious capability to stand in for humans in various tasks. Endoscopy is one of the clinical procedures that can greatly benefit from this burgeoning technology. In this procedure, a long, thin, flexible tube with a light source and a camera at the tip captures frames inside the body, revealing the uncommon specks that could be indications of disease or cancer. Unfortunately, diagnosis is made difficult by frames corrupted with multiple artefacts such as pixel saturation, motion blur, defocus, fluid, debris, etc. An amendment technique called high-quality endoscopic frame restoration is able to thoroughly solve this issue, but only if the undesired objects are accurately tracked down. Consequently, the task of detecting and identifying those artefacts is crucial for the procedure of endoscopic diagnosis. Each endoscopic frame is tainted by multiple artefacts, and their influence varies from image to image. Only when the restoration technique knows the precise spatial location of those artefacts can the quality of the image be guaranteed for further diagnosis.

    Object detection and segmentation have been gaining a lot of attention lately, leading to the advent of many powerful neural network models. In terms of object detection, Zhang et al. proposed Mask-Aided R-CNN, based on Mask R-CNN [1], which modifies the mask head to assist training on pixel-level labelled samples. RetinaNet [2], associated with the focal loss, was introduced by Lin et al. in order to predict bounding-box sizes more precisely and cope with the class-imbalance issue. To reduce the vanishing of positive samples at large IoU thresholds and to increase the quality of hypotheses, Cascade R-CNN [3] with feature pyramid networks (FPN) [4] was presented by Cai et al. For object segmentation, U-Net [5], built by Ronneberger et al., is one of the most popular models; it is favored in a multitude of applications and performs effectively. Moreover, DeepLab [6], proposed by Chen et al., is a vigorous model because it is capable of enlarging the field of view of the filters for larger context learning without compromising on the amount of computation. With its high capability for exploiting global context information, the Pyramid Scene Parsing (PSP) network [7], introduced by Zhao et al., is another model that achieved a high record in the task of segmentation.

    *Both authors contributed equally to this manuscript.
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    The Endoscopy Artefact Detection 2020 (EAD) Challenge [8, 9, 10] is one of the competitions interested in optimizing automatic artefact detection. The challenge contributes two kinds of labelled data: for the detection task, images with bounding-box annotations; for the segmentation task, images with pixel-level annotations. In this paper, we present two different models, one for each task of the EAD challenge, chosen after several preliminary tests with a few models. Cascade R-CNN is utilized for the former task and U-Net with SE-ResNeXt50 is applied for the latter. Both models are discussed in Section 2, and we delineate the process of training them in Section 3.

                          2. METHODS

2.1. Artefact Detection

Our approach is based on Cascade R-CNN [3], which is trained sequentially using cascaded bounding-box regression. This network can produce higher-quality proposals during inference than the other models we tested. In addition, the detector uses a resampling mechanism which reduces overfitting at high intersection-over-union (IoU) thresholds and also dismisses some outliers. The ResNeXt-101 (64x4d) backbone, followed by a Feature Pyramid Network (FPN) [4] in top-down mode as the neck, increases the network's capability for feature extraction and improves recall. Furthermore, deformable convolutions (DCN) [11] are added into backbone stages 3 to 5, which assists the model in perceiving the image content and differentiating the desired objects from the large background. Our modified model for detection is illustrated in Fig. 1.

Fig. 1. The proposed method based on Cascade R-CNN.
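As a concrete illustration, the sketch below expresses this architecture as an MMDetection-style model configuration. It is a minimal reconstruction from the description above, not our exact training configuration: the normalization and deformable-group settings are assumptions, and the RPN and cascade heads, which follow the standard Cascade R-CNN layout, are omitted for brevity.

    # Illustrative MMDetection-style config: Cascade R-CNN, ResNeXt-101 (64x4d)
    # backbone with deformable convolutions in stages 3-5, and an FPN neck.
    model = dict(
        type='CascadeRCNN',
        backbone=dict(
            type='ResNeXt',
            depth=101,
            groups=64,
            base_width=4,
            num_stages=4,
            out_indices=(0, 1, 2, 3),      # expose C2-C5 to the neck
            frozen_stages=1,
            # deformable convolutions in conv3_x, conv4_x and conv5_x
            dcn=dict(type='DCN', deform_groups=1, fallback_on_stride=False),
            stage_with_dcn=(False, True, True, True),
            style='pytorch'),
        neck=dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],  # channel widths of C2-C5
            out_channels=256,
            num_outs=5))                         # P2-P6 pyramid levels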
2.2. Semantic Segmentation

The mask images for segmentation contain 5 classes, namely instrument, specularity, artefact, bubbles and saturation. The distribution of the desired objects is uneven across images, and some artefacts, especially specularity, occupy only a small part of the image. Moreover, the mask images are sometimes ambiguous when we try to separate foreground from background. Therefore, we need a tool powerful enough to solve the previously mentioned problems. For this task, after assessing several models for segmentation, we decided to take advantage of U-Net [5], which has been validated in many papers, including papers on medical segmentation. Many studies favor U-Net as their main network, mostly because of its flexible and interchangeable backbone. The classification performance of the backbone network determines the success of the whole segmentation model: the higher the accuracy the backbone can achieve, the better the U-Net model can tell desired pixels from background. Therefore, after a few trials with several backbones (variants of ResNet, ResNeXt and SE-ResNeXt), we chose the SE-ResNeXt50 model because of its equilibrium among strong performance, reasonable computational cost and acceptable training time. Some models yielded better results than our chosen model, but they took a huge amount of time to train. SE-ResNeXt50 is a modified version of the ResNeXt model with Squeeze-and-Excitation blocks [12], which improve the representational power of a network. This model tackles the problem of the imbalanced dataset better than other models and retrieves more hidden fragments inside the picture. Furthermore, binary cross-entropy loss and dice loss are combined to deal with the discrete distribution of the foreground. Another important point is the use of a pretrained backbone: not only does it increase the accuracy of prediction, it also reduces the total training time. A threshold, validated beforehand, is subsequently applied to the predicted mask; this threshold is picked from the range 0.2 to 0.9 as the value that yields the best dice score on the whole dataset.
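To make this setup concrete, the following minimal PyTorch sketch builds the model with the segmentation_models_pytorch package and defines the combined loss. It illustrates the approach rather than reproducing our released code: the equal weighting of the two loss terms and the dice smoothing constant are assumptions.

    import torch
    import torch.nn as nn
    import segmentation_models_pytorch as smp

    # U-Net with an SE-ResNeXt50 encoder pretrained on ImageNet; the encoder is
    # not frozen. One output channel, since each class is trained as a binary task.
    model = smp.Unet(
        encoder_name='se_resnext50_32x4d',
        encoder_weights='imagenet',
        classes=1,
        activation=None)          # raw logits; sigmoid is applied inside the loss

    class BCEDiceLoss(nn.Module):
        """Equal-weighted sum of binary cross-entropy and soft dice loss."""
        def __init__(self, smooth=1.0):
            super().__init__()
            self.bce = nn.BCEWithLogitsLoss()
            self.smooth = smooth

        def forward(self, logits, target):
            prob = torch.sigmoid(logits)
            inter = (prob * target).sum(dim=(2, 3))
            union = prob.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
            dice = (2 * inter + self.smooth) / (union + self.smooth)
            return self.bce(logits, target) + (1 - dice.mean())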
                   3. EXPERIMENTAL RESULTS

3.1. Artefact Detection

In the multi-class artefact detection task, the dataset consists of 2531 images, and each image can contain one or more of 8 classes, namely specularity, saturation, artifact, blur, contrast, bubbles, instrument and blood. First, we preprocess the images by resizing all of them to 608x608. Then they are normalized with our mean and standard deviation and augmented with random flips and crops. Stochastic gradient descent is the chosen optimizer, with an initial learning rate of 0.02. The backbone uses weights pretrained on the COCO (2017) dataset to enhance the performance of the main model. 5-fold cross validation is used to obtain a stable deviation.
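The preprocessing above can be sketched with the albumentations library as follows. This is an assumed re-implementation: the text fixes only the input size (608x608) and the initial learning rate (0.02), so the crop size, normalization statistics and SGD momentum settings shown here are placeholders.

    import numpy as np
    import torch
    import albumentations as A

    # Random crop and flip, then resize to 608x608 and normalize.
    train_transform = A.Compose(
        [
            A.RandomCrop(height=560, width=560, p=0.5),    # illustrative crop size
            A.HorizontalFlip(p=0.5),
            A.Resize(height=608, width=608),
            A.Normalize(mean=(0.485, 0.456, 0.406),        # placeholder statistics
                        std=(0.229, 0.224, 0.225)),
        ],
        bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']))

    sample = train_transform(
        image=np.zeros((720, 1280, 3), dtype=np.uint8),    # dummy endoscopy frame
        bboxes=[(100, 120, 400, 300)],                     # dummy (x1, y1, x2, y2) box
        labels=[2])
    assert sample['image'].shape == (608, 608, 3)

    # SGD with the initial learning rate of 0.02 (momentum/decay are assumptions).
    dummy_params = [torch.zeros(1, requires_grad=True)]    # stand-in for the detector
    optimizer = torch.optim.SGD(dummy_params, lr=0.02, momentum=0.9, weight_decay=1e-4)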
In the testing phase, the weights from the 5 training folds are used to run detection on the online test dataset. Non-maximum suppression (NMS) [13] is applied in post-processing to filter redundant detections for each fold. Subsequently, to combine the predictions of the 5 sets of weights, we employ the Weighted Boxes Fusion (WBF) [14] method instead of the more common NMS. WBF not only deletes redundant boxes but also takes advantage of the information from these boxes to align the final boxes more accurately. In the end, we obtain an mAP score of 0.2366 with a deviation of 0.0762 on the test set, as shown in Table 1. Fig. 2 shows one of our predicted images from the test dataset.

                     Table 1. Detection results.
    Dataset    Model                                score_d    gmAP
    Test       Cascade R-CNN + DCN + ResNeXt101     0.2366     0.2153

Fig. 2. Sample of a predicted image from the test dataset.
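The fold-fusion step described above can be sketched with the reference ensemble_boxes implementation of WBF [14]. The IoU and score thresholds below are illustrative assumptions; the library expects box coordinates normalized to [0, 1].

    from ensemble_boxes import weighted_boxes_fusion

    def fuse_folds(boxes_per_fold, scores_per_fold, labels_per_fold):
        """Fuse per-image predictions of the 5 fold models into one box set."""
        boxes, scores, labels = weighted_boxes_fusion(
            boxes_per_fold,                      # list of [N_i, 4] arrays in [0, 1]
            scores_per_fold,                     # list of [N_i] confidence arrays
            labels_per_fold,                     # list of [N_i] class-id arrays
            weights=[1] * len(boxes_per_fold),   # equal trust in every fold
            iou_thr=0.55,                        # assumed cluster threshold
            skip_box_thr=0.05)                   # assumed minimum confidence
        return boxes, scores, labels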
3.2. Semantic Segmentation

The challenge provides 643 pixel-level annotated images for semantic segmentation. The image sizes are not fixed, so the images need to be preprocessed before training. There is no separate validation data, so we use 5-fold cross validation to ensure the correctness of our own evaluation. First, we resize all images to 256x256; the original images vary in size and are larger than this normalized size, and, empirically, a larger normalized size clearly improves the overall segmentation result. Second, we train the U-Net segmentation model with the SE-ResNeXt50 backbone network. We also use the pretrained SE-ResNeXt50 model to boost segmentation performance and reduce training time. The backbone is not frozen at the beginning of fine-tuning: we load the backbone weights and train them together with the main model. Augmentation is not applied in our training method because, on a few lighter-weight neural network models, it gave lower results than training without augmentation; due to lack of time, we did not check whether augmentation could boost the performance of the final model. Our GPU is an Nvidia RTX 2080Ti, so we set the batch size to 32. The Adam optimizer is used for our network and the learning rate is kept constant at 3×10⁻⁴. Because the pixel-level labels overlap, meaning one pixel can belong to more than one class, we train each class separately as a binary problem. Thus, we have to train 25 times in total: 5 classes, each evaluated with 5-fold cross validation. The number of epochs for one training run is 100. The predicted mask of one class is the average score of the 5 model weights from the 5 folds. After training, we search for the threshold for our predicted masks; 0.4 is the threshold value that leads to our best score, as shown in Table 2.

                   Table 2. Segmentation results.
    Dataset    Model                     s_score    s_std
    Test       U-Net + SE-ResNeXt50      0.5700     0.2703
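The per-class fold averaging and threshold search described above can be sketched as follows; the helper names are ours, and the sweep step of 0.05 over [0.2, 0.9] is an assumption.

    import numpy as np
    import torch

    @torch.no_grad()
    def predict_mask(fold_models, image, threshold=0.4):
        """image: [1, 3, 256, 256] tensor; returns a binary [256, 256] mask."""
        probs = [torch.sigmoid(m(image))[0, 0] for m in fold_models]
        avg = torch.stack(probs).mean(dim=0)       # average over the 5 folds
        return (avg > threshold).cpu().numpy().astype(np.uint8)

    def best_threshold(avg_probs, gt_masks, candidates=np.arange(0.2, 0.95, 0.05)):
        """Pick the threshold in [0.2, 0.9] that maximizes the mean dice score."""
        def dice(pred, gt, eps=1e-6):
            inter = (pred * gt).sum()
            return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
        means = [np.mean([dice((p > t).astype(np.float32), g)
                          for p, g in zip(avg_probs, gt_masks)])
                 for t in candidates]
        return float(candidates[int(np.argmax(means))])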
                       4. CONCLUSION

In this paper, we demonstrate two validated models for the EAD2020 Challenge. The difficulty we met when participating in this challenge was picking the most suitable components for our networks so that they perform well on the given dataset. There are still many approaches that we have not tested, and the development of artefact detection for endoscopy is very promising.

                    5. ACKNOWLEDGMENT

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (NRF-2019M3E5D1A02067961), by a grant (HCRI 19136) from the Chonnam National University Hwasun Hospital Institute for Biomedical Science, and by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2017R1A2B4011409).

                       6. REFERENCES

 [1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2020.

 [2] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

 [3] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

 [4] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

 [5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. Lecture Notes in Computer Science: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 9351, 2015.

 [6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 2018.

 [7] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

 [8] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

 [9] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[10] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[11] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. IEEE International Conference on Computer Vision (ICCV), 2017.

[12] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[13] Alexander Neubeck and Luc Van Gool. Efficient non-maximum suppression. 18th International Conference on Pattern Recognition (ICPR), 3, 2006.

[14] Roman Solovyev and Weimin Wang. Weighted boxes fusion: ensembling boxes for object detection models. CoRR, abs/1910.13302, 2019.