DNN MODELS AND POSTPROCESSING THRESHOLDS FOR ENDOSCOPY ARTIFACT DETECTION IN PRACTICE

Seiryo Watanabe1,2, Shigeto Seno1, Hideo Matsuda1
1 Department of Bioinformatic Engineering, Osaka University, Japan
2 Autonomous Mobile Systems Laboratory, Meiji University, 1-1-1 Higashimita, Japan

ABSTRACT

We tackled the problem of multi-class artefact detection and segmentation in endoscopic video frames using different deep learning algorithms and strategies. In particular, we proposed to combine the advantages of two state-of-the-art deep object detection algorithms, YOLOv3 and Mask R-CNN. With Mask R-CNN we achieved improved performance by leveraging the available image segmentation annotations to aid bounding box detection. Appropriate thresholding of the objectness score and non-maximum suppression (NMS) were subsequently applied to achieve high leaderboard test scores.

Index Terms— Endoscopic artefact detection, Mask R-CNN, YOLOv3

1. INTRODUCTION

This paper details the work done taking part in the Endoscopic artefact detection challenge (EAD2019) [1, 2] on all three sub-challenges: artefact detection (task #1), segmentation (task #2) and generalization (task #3). Specifically, we investigated i) the use of Mask R-CNN to exploit segmentation annotations to improve the quality of artefact bounding box detection, ii) the combination of single-stage YOLOv3 and two-stage Mask R-CNN bounding box predictions, and iii) the tuning of post-processing thresholds to reduce false positive detections in practical applications.

Fig. 1: Example images (left) and extracted image patches after cropping (right) from the provided segmentation dataset of EAD2019.

2. METHODS

2.1. Segmentation Aided Artefact Detection

The provided segmentation training data was composed of around 589 images, of which 498 were semantically annotated as binary masks for 5 artefact classes: specularity, image artifact, bubbles, saturation and instrument. After removal of duplicate masks, we obtained 474 uniquely annotated segmentation video frames. Bounding box annotations for artefact detection were released in two phases: 886 images in training data I and 1306 images in training data II. We first trained a deep convolutional neural network (DNN) using the 474 segmentation images. A total of 3312 individual image patches were then extracted from the available segmentation masks, each cropped to the width and height of its mask region (Fig. 1). The extracted patches were then used to refine the previously trained DNN, giving a network that could quickly assess whether candidate regions of the training data contained artifacts. We then created an augmented dataset by adding the 474 images of the segmentation dataset to the 2186 images with bounding box annotations provided in the detection challenge. To further increase the number of training images, we carried out data augmentation, rotating each image three times in increments of 90 degrees as well as flipping horizontally. This gave a total of (474 + 2186) x 8 = 21280 training images. For image patches with bounding box annotations but without segmentation masks, we generated corresponding segmentation masks using the patch-trained DNN (Fig. 2). We then trained a Mask R-CNN [3] model with a Feature Pyramid Network and ResNet101 backbone on the constructed augmented image dataset.

Fig. 2: Example of images with only bounding box annotations provided (left) and corresponding generated segmentation masks for each extracted patch (right) using a DNN trained on images with provided segmentation masks.
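As an illustration of this pipeline, the sketch below shows how per-region patches can be cropped from the binary masks and how the eight-fold augmentation (four 90-degree rotations, each with an optional horizontal flip) can be generated. It is a minimal example; the use of scipy connected-component labelling and the function names are assumptions made for illustration rather than our exact implementation.

```python
# Minimal sketch (illustrative, not our exact code) of mask-based patch
# extraction and the 8-fold augmentation described in Section 2.1.
import numpy as np
from scipy import ndimage


def extract_mask_patches(image, mask):
    """Crop one patch per connected mask region, sized to the region's bounding box."""
    labeled, num_regions = ndimage.label(mask > 0)
    patches = []
    for region_slice in ndimage.find_objects(labeled):
        if region_slice is not None:
            # region_slice is a (row_slice, col_slice) pair covering one mask region
            patches.append(image[region_slice])
    return patches


def augment_8x(image):
    """Return the 8 variants used to enlarge the training set: 4 rotations x 2 flips."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # horizontally flipped copy
    return variants
```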
2.2. Combining Mask R-CNN and YOLOv3 for artefact detection

While Mask R-CNN uses a Region Proposal Network (RPN) to obtain high accuracies, it loses the spatial relation between object and non-object regions. Specularity, saturation, artifact, bubbles and instrument are object-like classes with clearly defined image boundaries, whereas blur and contrast artefacts are image areas without a clear boundary. We therefore used YOLOv3 [4] for blur and contrast artefact detection (Fig. 3). Unlike Mask R-CNN, YOLOv3 splits the image into several spatial grids and predicts bounding boxes and class probabilities per grid cell, thus retaining the spatial relation between objects and background. The test dataset was processed with both the Mask R-CNN and YOLOv3 models, and the blur and contrast predictions of Mask R-CNN were replaced with the corresponding YOLOv3 results. By doing this, IoU increased by 13% and mAP by 2% compared to the result without integrating YOLOv3.

Fig. 3: Example bounding box detections using YOLOv3 (top) and Mask R-CNN (bottom). For Mask R-CNN, the predicted segmentation mask for each bounding box is additionally visualised.

2.3. Postprocessing thresholds for detection and segmentation

At this point we have several bounding boxes, masks, class probabilities and a mean average precision (mAP) for each class. Normally the research process ends here when we only care about a good model for artifact detection. However, the rules of this competition include Intersection over Union (IoU) as an additional score for object detection, 0.6 mAP + 0.4 IoU, which requires tuning of detection thresholds. This means one not only has to reduce the number of false positives (FP) and increase the number of true positives (TP), where a positive match covers at least one quarter of the ground truth area for mAP, but must additionally care about how precisely positive areas are detected and how much of the true area is covered. Similarly, the segmentation score uses the Jaccard index, Dice coefficient and F2-score on the binary mask of each class, so a strict threshold is needed to reduce FP areas regardless of their probabilities. These metrics are well suited to practical application: during an endoscopy examination the ideal algorithm should not remove crucial areas, so reducing FP is more important than reducing FN. We thus investigated different thresholds for detection and segmentation. In the end we used an NMS (non-maximum suppression) threshold of 0.5 for detection, because an object should not overlap with other objects of the same score, and an objectness threshold of 0.01 to balance the mAP and IoU scores. Applying NMS increased IoU by ~10% while decreasing mAP by ~1%. For the segmentation task we used a score threshold of 0.5 and did not apply NMS, because the same image region can be labelled with several classes.
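The fusion and post-processing described in Sections 2.2 and 2.3 can be sketched as follows: blur and contrast boxes are taken from YOLOv3, all other classes from Mask R-CNN, then an objectness threshold of 0.01 and per-class greedy NMS at an IoU threshold of 0.5 are applied for the detection task. The data layout and helper names below are illustrative assumptions, not our exact code.

```python
# Minimal sketch of detection fusion (Section 2.2) and post-processing
# thresholds (Section 2.3).  Detections are plain dicts for illustration.
UNCLEAR_BOUNDARY = {"blur", "contrast"}      # classes taken from YOLOv3


def merge_detections(mask_rcnn_dets, yolo_dets):
    """Each detection is a dict: {'class', 'score', 'box': [x1, y1, x2, y2]}."""
    kept = [d for d in mask_rcnn_dets if d["class"] not in UNCLEAR_BOUNDARY]
    kept += [d for d in yolo_dets if d["class"] in UNCLEAR_BOUNDARY]
    return kept


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def postprocess(dets, score_thresh=0.01, nms_thresh=0.5):
    """Objectness threshold followed by per-class greedy NMS."""
    dets = [d for d in dets if d["score"] >= score_thresh]
    kept = []
    for cls in {d["class"] for d in dets}:
        cls_dets = sorted((d for d in dets if d["class"] == cls),
                          key=lambda d: d["score"], reverse=True)
        while cls_dets:
            best = cls_dets.pop(0)
            kept.append(best)
            cls_dets = [d for d in cls_dets
                        if iou(best["box"], d["box"]) < nms_thresh]
    return kept
```

For the segmentation task, the analogous step would keep masks with score >= 0.5 and skip the NMS stage, since overlapping regions may legitimately carry several class labels.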
3. CONCLUSIONS

We show that the provided segmentation training dataset was good enough to produce segmentation masks for images with only bounding box annotations. The quality of the constructed dataset was sufficient to improve the DNN models across every artifact detection metric. We show that Mask R-CNN is well suited to locating endoscopic artefacts with clear image boundaries, while YOLOv3 was better at locating artefacts with unclear boundaries such as blur and contrast. We found that careful setting of the thresholds on objectness score and NMS steeply increased IoU at only a small cost in mAP.

4. REFERENCES

[1] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," 2019.

[2] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," 2019.

[3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.

[4] Joseph Redmon and Ali Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.