DNN MODELS AND POSTPROCESSING THRESHOLDS FOR ENDOSCOPY ARTIFACT DETECTION IN PRACTICE

Seiryo Watanabe1,2, Shigeto Seno1, Hideo Matsuda1
1 Department of Bioinformatic Engineering, Osaka University, Japan
2 Autonomous Mobile Systems Laboratory, Meiji University, 1-1-1 Higashimita, Japan

ABSTRACT

We tackled the problem of multi-class artefact detection and segmentation in endoscopic video frames using different deep learning algorithms and strategies. In particular, we proposed to combine the advantages of two state-of-the-art deep object detection algorithms, YOLOv3 and Mask R-CNN. With Mask R-CNN we achieved improved performance by leveraging the available image segmentation annotations to aid bounding box detection. Appropriate thresholding of the objectness score and non-maximum suppression (NMS) were subsequently applied to achieve high leaderboard test scores.

Index Terms— Endoscopic artefact detection, Mask R-CNN, YOLOv3

1. INTRODUCTION

This paper details the work done taking part in the Endoscopic artefact detection challenge (EAD2019) [1, 2] on all three sub-challenges: artefact detection (task #1), segmentation (task #2) and generalization (task #3). Specifically, we investigated i) the use of Mask R-CNN to exploit segmentation annotations to improve the quality of artefact bounding box detection, ii) the combination of single-stage YOLOv3 and two-stage Mask R-CNN bounding box predictions, and iii) the tuning of post-processing thresholds to reduce false positive detections in practical applications.

Fig. 1: Example images (left) and extracted image patches after cropping (right) from the provided segmentation dataset of EAD2019.

2. METHODS

2.1. Segmentation Aided Artefact Detection

The provided segmentation training data was composed of around 589 images, of which 498 were semantically annotated as binary masks for 5 artefact classes: specularity, image artifact, bubbles, saturation and instrument. After removal of duplicate masks, we obtained 474 uniquely annotated segmentation video frames. Bounding box annotations for artefact detection were released in two phases: 886 images in training data I and 1306 images in training data II. We first trained a deep convolutional neural network (DNN) using the 474 segmentation images. A total of 3312 individual image patches were then extracted from the available segmentation masks, each cropped to the width and height of its mask region (Fig. 1). The extracted patches were then used to refine the previously trained DNN, giving a network that could quickly assess whether candidate regions of the training data contained artifacts. We then created an augmented dataset by adding the 474 images of the segmentation dataset to the 2186 images with bounding box annotations provided in the detection challenge. To further increase the number of training images, we carried out data augmentation, rotating each image three times in increments of 90 degrees as well as flipping horizontally. This gave a total of (474 + 2186) x 8 = 21280 training images. For image patches with bounding box annotations but without segmentation masks, we generated corresponding segmentation masks using the patch-trained DNN (Fig. 2). We then trained a Mask R-CNN [3] model with a Feature Pyramid Network and ResNet101 backbone on the constructed augmented image dataset.

Fig. 2: Example of images with only bounding box annotations provided (left) and corresponding generated segmentation masks for each extracted patch (right) using a DNN trained on images with provided segmentation masks.
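As an illustration of this pipeline, the sketch below shows how per-region patches can be cropped from the binary masks and how the eight-fold augmentation (four 90-degree rotations, each with an optional horizontal flip) can be generated. It is a minimal example; the use of scipy connected-component labelling and the function names are assumptions made for illustration rather than our exact implementation.

```python
# Minimal sketch (illustrative, not our exact code) of mask-based patch
# extraction and the 8-fold augmentation described in Section 2.1.
import numpy as np
from scipy import ndimage


def extract_mask_patches(image, mask):
    """Crop one patch per connected mask region, sized to the region's bounding box."""
    labeled, num_regions = ndimage.label(mask > 0)
    patches = []
    for region_slice in ndimage.find_objects(labeled):
        if region_slice is not None:
            # region_slice is a (row_slice, col_slice) pair covering one mask region
            patches.append(image[region_slice])
    return patches


def augment_8x(image):
    """Return the 8 variants used to enlarge the training set: 4 rotations x 2 flips."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # horizontally flipped copy
    return variants
```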
2.2. Combining Mask R-CNN and YOLOv3 for artefact detection

While Mask R-CNN uses a Region Proposal Network (RPN) to obtain high accuracies, it loses the spatial relation between object and non-object regions. Specularity, saturation, artifact, bubbles and instrument are object-like classes with clearly defined image boundaries, whereas blur and contrast artefacts are image areas without a clear boundary. We therefore used YOLOv3 [4] for blur and contrast artefact detection (Fig. 3). Unlike Mask R-CNN, YOLOv3 splits the image into several spatial grids and predicts bounding boxes and class probabilities per grid cell, thus retaining the spatial relation between objects and background. The test dataset was processed with both the Mask R-CNN and YOLOv3 models, and the blur and contrast predictions of Mask R-CNN were replaced with the corresponding YOLOv3 results. By doing this, IoU increased by 13% and mAP by 2% compared to the result without integrating YOLOv3.

Fig. 3: Example bounding box detections using YOLOv3 (top) and Mask R-CNN (bottom). For Mask R-CNN, the predicted segmentation mask for each bounding box is additionally visualised.

2.3. Postprocessing thresholds for detection and segmentation

At this point we have several bounding boxes, masks, class probabilities and a mean average precision (mAP) for each class. Normally the research process ends here when we only care about a good model for artifact detection. However, the rules of this competition include Intersection over Union (IoU) as an additional score for object detection, 0.6 mAP + 0.4 IoU, which requires tuning of detection thresholds. This means one not only has to reduce the number of false positives (FP) and increase the number of true positives (TP), where a positive match covers at least one quarter of the ground truth area for mAP, but must additionally care about how precisely positive areas are detected and how much of the true area is covered. Similarly, the segmentation score uses the Jaccard index, Dice coefficient and F2-score on the binary mask of each class, so a strict threshold is needed to reduce FP areas regardless of their probabilities. These metrics are well suited to practical application: during an endoscopy examination the ideal algorithm should not remove crucial areas, so reducing FP is more important than reducing FN. We thus investigated different thresholds for detection and segmentation. In the end we used an NMS (non-maximum suppression) threshold of 0.5 for detection, because an object should not overlap with other objects of the same score, and an objectness threshold of 0.01 to balance the mAP and IoU scores. Applying NMS increased IoU by ~10% while decreasing mAP by ~1%. For the segmentation task we used a score threshold of 0.5 and did not apply NMS, because the same image region can be labelled with several classes.
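The fusion and post-processing described in Sections 2.2 and 2.3 can be sketched as follows: blur and contrast boxes are taken from YOLOv3, all other classes from Mask R-CNN, then an objectness threshold of 0.01 and per-class greedy NMS at an IoU threshold of 0.5 are applied for the detection task. The data layout and helper names below are illustrative assumptions, not our exact code.

```python
# Minimal sketch of detection fusion (Section 2.2) and post-processing
# thresholds (Section 2.3).  Detections are plain dicts for illustration.
UNCLEAR_BOUNDARY = {"blur", "contrast"}      # classes taken from YOLOv3


def merge_detections(mask_rcnn_dets, yolo_dets):
    """Each detection is a dict: {'class', 'score', 'box': [x1, y1, x2, y2]}."""
    kept = [d for d in mask_rcnn_dets if d["class"] not in UNCLEAR_BOUNDARY]
    kept += [d for d in yolo_dets if d["class"] in UNCLEAR_BOUNDARY]
    return kept


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def postprocess(dets, score_thresh=0.01, nms_thresh=0.5):
    """Objectness threshold followed by per-class greedy NMS."""
    dets = [d for d in dets if d["score"] >= score_thresh]
    kept = []
    for cls in {d["class"] for d in dets}:
        cls_dets = sorted((d for d in dets if d["class"] == cls),
                          key=lambda d: d["score"], reverse=True)
        while cls_dets:
            best = cls_dets.pop(0)
            kept.append(best)
            cls_dets = [d for d in cls_dets
                        if iou(best["box"], d["box"]) < nms_thresh]
    return kept
```

For the segmentation task, the analogous step would keep masks with score >= 0.5 and skip the NMS stage, since overlapping regions may legitimately carry several class labels.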
3. CONCLUSIONS

We show that the provided segmentation training dataset was good enough to produce segmentation masks for images with only bounding box annotations. The quality of the constructed dataset was sufficient to improve the DNN models across every artifact detection metric. We show that Mask R-CNN is well suited to locating endoscopic artefacts with clear image boundaries, while YOLOv3 was better at locating artefacts with unclear boundaries such as blur and contrast. We found that careful setting of the thresholds on objectness score and NMS steeply increased IoU at only a small cost in mAP.

4. REFERENCES

[1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," 2019.

[2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," 2019.

[3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

[4] Joseph Redmon and Ali Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.