ENSEMBLE MASK-AIDED R-CNN

Pengyi Zhang, Xiaoqiong Li, YunXin Zhong

Beijing Institute of Technology, Beijing, China

ABSTRACT

Recently, the strategy of integrating an instance mask prediction header into a one-stage or two-stage object detector has become immensely popular for instance segmentation (e.g., RetinaMask or Mask R-CNN). This strategy notably improves the object detector while it learns to predict instance masks. In this paper, we introduce a Mask-aided R-CNN model with a flexible, multi-stage training protocol to address the problems of the EAD2019 Challenge (multi-class artefact detection in video endoscopy). The proposed training protocol aims to facilitate the implementation of this strategy for the detection and segmentation tasks, and to improve detection and segmentation performance using pixel-level labeled samples with incomplete categories. The protocol consists of three principal steps, of which the core part is augmenting the training set with soft pixel-level labels. Mask-aided R-CNN is modified from Mask R-CNN by pruning its mask header to support training on pixel-level labeled samples with incomplete categories. We further propose a simple yet effective ensemble method for object detectors based on graph cliques to improve detection performance: it votes on graph cliques to fuse the detection results from different detectors and produces robust detection results, which is quite important for clinical application. Extensive experiments on the challenging EAD2019 dataset demonstrate the effectiveness of the proposed ensemble Mask-aided R-CNN. As a result, we won first place in the detection task of the EAD2019 Challenge.

Index Terms— Soft label, Ensemble, Graph clique, Mask-aided R-CNN

Fig. 1. Illustration of detection and segmentation results of the proposed ensemble Mask-aided R-CNN.
1. INTRODUCTION

Recently, with the rapid development of medical imaging technology, medical imaging diagnosis and treatment equipment and digital health records have been widely adopted in the clinic. Among these medical imaging technologies, endoscopy is an important clinical procedure for the early detection of cancers in hollow organs. However, endoscopy video frames are easily corrupted by multiple artefacts (e.g., motion blur, specular reflections, bubbles), which increases the difficulty of visual diagnosis. In order to retrieve high-quality endoscopic frames and facilitate visual diagnosis, endoscopic frame restoration algorithms based on prior knowledge of the artefacts are generally used in existing endoscopy workflows. Accurately identifying the types and locations of these artefacts is therefore essential for high-quality endoscopic frame restoration and crucial for realizing reliable computer-assisted endoscopy tools for improved patient care. However, the artefact-identification methods in existing endoscopy workflows support only a single artefact type per endoscopic frame, whereas a frame generally contains multiple artefacts, as shown in Figure 1. Moreover, different types of artefacts contaminate the frame unequally, requiring restoration algorithms specific to each artefact type. Developing accurate detection algorithms for the multi-class artefact detection task is thus an urgent problem.

Driven by the growth of computing power (e.g., graphical processing units and dedicated deep learning chips) and the availability of large labelled datasets (e.g., ImageNet [1] and COCO [2]), deep neural networks have been extensively studied due to their fast, scalable, end-to-end learning framework. In recent years, Convolutional Neural Network (CNN) [3] models have achieved significant improvements over conventional shallow methods in image classification (e.g., ResNet [4] and DenseNet [5]), object detection (e.g., Faster R-CNN [6] and SSD [7]) and semantic segmentation (e.g., UNet [8] and Mask R-CNN [9]). The advantages of CNN models, i.e., modular design and an end-to-end learning architecture, allow existing CNN models to be easily applied to complex problems by adding task-specific network branches. Recently, the strategy of integrating an instance mask prediction header into a one-stage or two-stage object detector has become immensely popular for instance segmentation (e.g., RetinaMask [10] or Mask R-CNN [9]). This strategy notably improves the object detector while it learns to predict instance masks. In this paper, we aim at addressing the problems of multi-class endoscopic artefact detection by developing an instance segmentation algorithm using this strategy in the EAD2019 Challenge [11][12]. The EAD2019 Challenge provides two kinds of labelled samples, i.e., endoscopic frames with bounding box annotations for the detection task and endoscopic frames with pixel-level annotations for the segmentation task. The frames for the segmentation task are contained in the frames for the detection task, which means only part of the endoscopic frames in the detection task have pixel-level annotations.

We present ensemble Mask-aided R-CNN for multi-class endoscopic artefact detection with three highlights. First, we propose to integrate the detection task and the segmentation task into an end-to-end instance segmentation framework, i.e., Mask-aided R-CNN, which is able to take full advantage of all labelled samples to improve the performance of multi-class endoscopic artefact detection.
Second, we design a flexible, multi-stage training protocol based on soft pixel-level annotations to train the proposed Mask-aided R-CNN. The soft pixel-level annotations are first generated by initially trained Mask R-CNN models and further refined by subsequently retrained models. The effectiveness of the designed training protocol has been verified in training and improving Mask-aided R-CNN. Third, we propose a simple yet effective ensemble method for object detectors based on graph cliques to further improve detection performance. Extensive experiments on the challenging EAD2019 dataset demonstrate the effectiveness of the proposed ensemble Mask-aided R-CNN. As a result, we won first place in the detection task of the EAD2019 Challenge.

2. METHOD

2.1. Training Protocol of Mask-aided R-CNN

Adding a mask header branch to a one-stage or two-stage object detector is a common strategy to enable instance segmentation. The effectiveness of this strategy in improving both detection and segmentation performance has been witnessed in recent years (e.g., RetinaMask [10] and Mask R-CNN [9], illustrated in Figure 2). To take full advantage of this strategy, we introduce a Mask-aided R-CNN model with a flexible, multi-stage training protocol.

Fig. 2. Illustration of adding a mask header to enable instance segmentation (based on Faster R-CNN [6] and Mask R-CNN [9]).

The proposed multi-stage training protocol, outlined in Figure 3, consists of three principal steps, of which the core part is augmenting the training set with soft pixel-level labels.

Fig. 3. Outline of the proposed multi-stage training protocol (Step 1: training a basic Mask R-CNN model; Step 2: augmenting the training set with soft labels; Step 3: training a Mask-aided R-CNN model).

2.1.1. Step 1: training a basic Mask R-CNN model for the segmentation task

We first train a basic Mask R-CNN model on the training set of the segmentation task to implement instance segmentation. In order to maintain consistency between semantic segmentation and object detection, the instance masks are bounded by the bounding box annotations acquired from the training set of the detection task. The process is illustrated in Figure 4.

Fig. 4. Illustration of maintaining consistency between semantic segmentation and object detection. (a) is the original image ("00024.jpg"); (b) shows the ground truth mask of the segmentation task ("Specularity"), where the bounding boxes marked in red are ground truth of the detection task; (c) is the bounded mask used to train a Mask R-CNN model.
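For illustration, the bounding operation of Step 1 can be sketched in a few lines of NumPy. This is our own sketch rather than code released with this work; the array layout and the (x1, y1, x2, y2) box format are assumptions. Each ground-truth detection box simply crops the binary semantic mask of its category into one per-instance mask:

```python
import numpy as np

def bound_masks(semantic_mask: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Clip a binary semantic mask of one category to each ground-truth box.

    semantic_mask: (H, W) binary mask of one segmentation category.
    boxes: (K, 4) ground-truth boxes as (x1, y1, x2, y2) pixel coordinates.
    Returns: (K, H, W) per-instance masks bounded by the boxes.
    """
    h, w = semantic_mask.shape
    instance_masks = np.zeros((len(boxes), h, w), dtype=semantic_mask.dtype)
    for k, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        # Keep only the mask pixels that fall inside the k-th ground-truth box.
        instance_masks[k, y1:y2, x1:x2] = semantic_mask[y1:y2, x1:x2]
    return instance_masks
```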
2.1.2. Step 2: augmenting the training set of the segmentation task with soft pixel-level labels

The trained Mask R-CNN model is subsequently used to predict instance masks for the training samples of the detection task that have no pixel-level labels. Notably, during this inference process the results of object detection are directly replaced with the ground truth bounding boxes; that is, we enforce mask prediction only for the ground truth instances. This trick, shown in Figure 5, aims to improve segmentation accuracy and to maintain consistency between semantic segmentation and object detection.

Fig. 5. Illustration of retrieving soft pixel-level labels. We perform mask prediction only on the ground-truth bounding boxes to maintain the consistency of semantic segmentation and object detection.

These predicted instance masks, called soft pixel-level labels, are assigned to the corresponding training samples. The training samples with soft pixel-level labels are then added to the training set of the segmentation task, and the Mask R-CNN model is retrained on the augmented training set. Subsequently, the soft pixel-level labels are refined with the new instance masks predicted by the retrained Mask R-CNN model. This step may be performed multiple times for higher segmentation accuracy. The final augmented training set is used in the next step, while the final retrained Mask R-CNN model can be used by the ensemble module.
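Standard inference APIs run the RPN and the detection header before the mask header, so this trick needs a short detour into the model internals. The sketch below is our illustration against torchvision's Mask R-CNN; the attribute names (transform, backbone, and the roi_heads mask branch) are assumptions about that library's implementation, not code from this work. It feeds the ground-truth boxes straight into the mask branch:

```python
import torch
import torchvision

@torch.no_grad()
def masks_from_gt_boxes(model, image, gt_boxes, gt_labels):
    """Run only the mask header on ground-truth boxes (Step 2 trick).

    image: (3, H, W) float tensor; gt_boxes: (K, 4) tensor in (x1, y1, x2, y2);
    gt_labels: (K,) tensor of category indices (0 is background in torchvision).
    Returns per-instance mask probabilities of shape (K, 1, 28, 28).
    """
    model.eval()
    # Resize/normalize the image; the transform rescales the boxes to match.
    images, targets = model.transform(
        [image], [{"boxes": gt_boxes, "labels": gt_labels}]
    )
    features = model.backbone(images.tensors)
    boxes = [targets[0]["boxes"]]
    # Skip the RPN and detection header: pool RoI features on GT boxes only.
    mask_feats = model.roi_heads.mask_roi_pool(features, boxes, images.image_sizes)
    mask_logits = model.roi_heads.mask_predictor(model.roi_heads.mask_head(mask_feats))
    # Keep the mask channel of each instance's own ground-truth category.
    idx = torch.arange(len(gt_labels), device=mask_logits.device)
    return mask_logits[idx, targets[0]["labels"]].sigmoid().unsqueeze(1)

# 5 segmentation categories + background, matching the Step 1 model.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=6)
```

Projecting the resulting 28x28 probabilities back to full image resolution would follow the library's standard mask-pasting postprocessing.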
2.1.3. Step 3: training a Mask-aided R-CNN model for the detection and segmentation tasks

To take full advantage of all available training samples and of the strategy of boosting object detection with a mask prediction branch, we generate soft pixel-level labels for the training samples without pixel-level annotations through the first two steps. In this step, we train multiple Mask-aided R-CNN models with different backbone networks on the final augmented training set. The Mask-aided R-CNN model, which supports training on pixel-level labeled samples with incomplete categories, is detailed in the next section. These trained Mask-aided R-CNN models are used by the ensemble module to further improve detection performance.

2.2. Mask-aided R-CNN

The Mask-aided R-CNN shown in Figure 6 is modified from Mask R-CNN by pruning its mask header to support training on pixel-level labeled samples with incomplete categories. In the EAD2019 Challenge [11], the detection task has seven categories while the segmentation task has five, the five segmentation categories being a subset of the seven detection categories. Therefore, the Mask-aided R-CNN model for the EAD2019 Challenge is designed in two steps: (1) design a Mask R-CNN model with seven semantic categories; (2) prune the neural units and connections related to the two extra categories in the mask header of this Mask R-CNN to obtain a mask header with five semantic categories. When training such a Mask-aided R-CNN model, we compute the mask loss only for the five segmentation categories. The remaining training defaults are kept unchanged.

Fig. 6. Proposed Mask-aided R-CNN. We compute the mask loss only for the five segmentation categories.
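A minimal sketch of the pruned mask loss follows. It is our own illustration: the tensor shapes and the assumption that the five segmentation categories occupy the first five label indices are ours, not specified by the paper. Binary cross-entropy over the predicted mask logits is accumulated only for instances of segmentation categories, so instances of the two detection-only categories contribute no mask gradient:

```python
import torch
import torch.nn.functional as F

SEG_CATEGORIES = {0, 1, 2, 3, 4}  # assumed indices of the 5 segmentation categories (of 7)

def pruned_mask_loss(mask_logits, gt_masks, labels):
    """Mask loss restricted to the five segmentation categories.

    mask_logits: (N, 5, 28, 28) logits from the pruned mask header.
    gt_masks: (N, 28, 28) float soft (or hard) pixel-level targets in [0, 1].
    labels: (N,) category index of each RoI (0..6 over the detection categories).
    """
    keep = torch.tensor([int(l) in SEG_CATEGORIES for l in labels], dtype=torch.bool)
    if not keep.any():
        # No segmentation-category instance in this batch: zero loss, graph intact.
        return mask_logits.sum() * 0.0
    idx = torch.arange(len(labels))[keep]
    # Pick each kept instance's own category channel from the mask header.
    logits = mask_logits[idx, labels[keep]]
    return F.binary_cross_entropy_with_logits(logits, gt_masks[keep])
```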
2.3. Ensemble method

Ensemble strategies are commonly used to improve performance in image classification tasks. For the detection task of the EAD2019 challenge, we propose a simple yet effective ensemble method for object detection based on a graph model, to further improve detection performance. The proposed ensemble method fuses the detection results from multiple object detectors by voting within each graph clique for the same object, with the cliques mutually reinforcing one another.

2.3.1. Construction of the graph model

Given a single image I, C semantic categories and N object detectors, the detection result set can be formalized as $\{Det_n^c \mid n = 1, 2, \ldots, N;\ c = 1, 2, \ldots, C\}$. For convenience, we extract the detection results of a single category, $\{Det_n^c \mid n = 1, 2, \ldots, N\}$, to introduce and formalize our ensemble method (illustrated in Figure 7). Each detection $Det_n^c$ consists of $\{uuid, score, bbox\}$, where uuid denotes the universally unique identifier of the detector, score denotes the confidence score of the detection, and bbox denotes its bounding box.

A weighted undirected graph $G^c(V, E)$ with dense connections can be established from the detections $\{Det_n^c \mid n = 1, 2, \ldots, N\}$, where V denotes the set of vertexes and E the set of edges. Each vertex represents a single detection, and the vertexes are densely connected with each other by edges. We assign to each edge a weight, namely the intersection-over-union (IoU) score of the two detections it connects.

Fig. 7. The construction process of the graph model for the proposed ensemble method. Each rectangle denotes one detection and each color denotes one detector model.

2.3.2. Inference of the graph model

We formulate the inference of the established graph model as a maximum clique problem, which here aims to maximize the sum of edge weights within each clique. Several reasonable constraints are introduced to simplify the partition process. Post-processing such as non-maximum suppression (NMS) is commonly applied in object detectors to remove redundant detections; accordingly, the vertexes in one clique are required to differ in their uuid attribute, which amounts to removing the edges constructed by the same detector. Moreover, we introduce a threshold on the IoU score to remove edges with low weights, since two detections with a higher IoU score are more likely to correspond to the same object.

After this simplification step, we design a greedy approach to solve the maximum clique problem iteratively. Initially, each vertex is adopted as a clique. We then iteratively merge the two cliques that share the largest edge weight and whose vertexes all have different uuid attributes.

The last step is voting on the partitioned cliques: each clique outputs a single detection by calculating its confidence score and bounding box. Given a clique $\{Det_k^c = \{uuid_k, score_k, bbox_k\} \mid k = 1, 2, \ldots, K\}$, the voting result $Det^c = \{0, score, bbox\}$ (the fused detection carries no single detector uuid, hence 0) is formalized as follows:

$$score = 1 - \prod_{k=1}^{K} (1 - score_k) \qquad (1)$$

$$bbox = \frac{\sum_{k=1}^{K} score_k \times bbox_k}{\sum_{k=1}^{K} score_k} \qquad (2)$$

where K denotes the number of vertexes in the clique.

Fig. 8. The inference process of the graph model for the proposed ensemble method. Each clique yields one detection result, calculated by Formulas (1) and (2).
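The following self-contained sketch reconstructs the greedy clique partition and the voting of Formulas (1) and (2). It is our reconstruction: in particular, measuring the weight between two cliques as the best IoU across their members is an assumption the paper does not spell out.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ensemble_detections(dets, iou_thresh=0.4):
    """Fuse single-category detections from multiple detectors.

    dets: list of (uuid, score, box) tuples, box = np.array([x1, y1, x2, y2]).
    Returns a list of fused (score, box) pairs, one per clique.
    """
    cliques = [[d] for d in dets]  # initially, every detection is its own clique
    while True:
        best = None  # (weight, i, j) of the best mergeable clique pair
        for i in range(len(cliques)):
            for j in range(i + 1, len(cliques)):
                uuids_i = {d[0] for d in cliques[i]}
                uuids_j = {d[0] for d in cliques[j]}
                if uuids_i & uuids_j:
                    continue  # each detector contributes at most one vertex per clique
                # Inter-clique weight: best IoU across the two cliques (assumption).
                w = max(iou(a[2], b[2]) for a in cliques[i] for b in cliques[j])
                if w >= iou_thresh and (best is None or w > best[0]):
                    best = (w, i, j)
        if best is None:
            break  # no remaining pair satisfies the uuid and IoU constraints
        _, i, j = best
        cliques[i] += cliques.pop(j)

    fused = []
    for clique in cliques:
        scores = np.array([d[1] for d in clique])
        boxes = np.stack([d[2] for d in clique])
        score = 1.0 - np.prod(1.0 - scores)                      # Formula (1)
        box = (scores[:, None] * boxes).sum(0) / scores.sum()    # Formula (2)
        fused.append((float(score), box))
    return fused
```

Note how Formula (1) rewards agreement: any single detection of score 0.6 keeps that score, while two agreeing detections of 0.6 fuse to 1 - 0.4 * 0.4 = 0.84.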
3. EXPERIMENTS

Experiments on the EAD2019 Challenge (https://ead2019.grand-challenge.org) are performed by following the proposed training protocol of Mask-aided R-CNN. We train our models on servers with two 1080Ti GPUs.

3.1. Experiments on training a basic Mask R-CNN model for the segmentation task

First, we generate the bounded masks for the released samples of the segmentation task to enable instance segmentation. The released data (498 images with pixel-level labels in total) is split into a training set (90%, 448 images) and a validation set (10%, 50 images). Second, we train a Mask R-CNN model with a ResNet101 backbone and a feature pyramid network (FPN) [13] on these training samples. We perform two augmentation operations, i.e., random scaling and random horizontal flipping. The network is trained end-to-end using SGD with a momentum of 0.9 and a weight decay of 0.0001. We train the model using mini-batches of size 2, with an initial learning rate of 0.005 that is decayed by a factor of 10 at iteration steps 24000 and 48000. The maximum number of training iterations is set to 72000. The trained model is tested on the validation set and the evaluation results are shown in Table 1.

Table 1. The evaluation results of the basic Mask R-CNN model tested on the validation set.
Task          AP    AP50  AP75  APS   APM   APL
Detection     25.8  50.4  22.4  12.0  34.6  23.8
Segmentation  23.3  44.6  20.3   7.7  28.4  26.5

3.2. Experiments on augmenting the training set of the segmentation task with soft pixel-level labeled samples

The trained Mask R-CNN is then used to generate soft pixel-level labels for the training samples of the detection task. We follow the trick detailed in Section 2, enforcing mask prediction only for the ground truth instances, to maintain consistency between semantic segmentation and object detection and to improve segmentation accuracy. We evaluate the generated soft masks on the validation set of the segmentation task; the results are shown in Table 2. Compared with Table 1, the quality of the predicted masks is improved significantly, which verifies the effectiveness of the proposed trick.

Table 2. The evaluation results of the proposed trick of enforcing mask prediction only for the ground truth instances.
Task          AP    AP50  AP75  APS   APM   APL
Detection     -     -     -     -     -     -
Segmentation  31.5  61.4  26.3  19.0  32.1  33.6

The second step of the proposed training protocol is performed only once in this experiment. We generate a soft mask for each released sample of the detection task. These soft mask annotations, together with the released samples of the detection task and the corresponding bounding box annotations, constitute the whole dataset for the instance segmentation task.

3.3. Experiments on training the Mask-aided R-CNN models for the detection task

The whole dataset consists of two released datasets, of which the first contains 889 images and the second contains 1306 images. We split the whole dataset into one training set (90%: 800 images from the first released dataset and 1175 images from the second), a "release 1" validation set (10%, 89 images from the first released dataset) and a "release 2" validation set (10%, 131 images from the second released dataset). We successively train three Faster R-CNN models and three Mask-aided R-CNN models on the training set; the corresponding backbone networks are ResNet50, ResNet50+FPN and ResNet101+FPN, respectively. The Faster R-CNN models are trained only with bounding box annotations, while the Mask-aided R-CNN models are trained with both soft mask annotations and bounding box annotations. Here, we perform data augmentation on the training set with random scaling, random horizontal flipping, random vertical flipping and random cropping on-the-fly. Each model is trained end-to-end using SGD with a momentum of 0.9. A weight decay of 0.0002 is adopted when training the models with a ResNet101+FPN backbone, and 0.0001 for the other models. We train each model using mini-batches of size 2, with an initial learning rate of 0.005 that is decayed by a factor of 10 at iteration steps 24000 and 48000. The maximum number of training iterations is set to 72000.
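For reference, this schedule maps directly onto standard PyTorch components. The sketch below is our illustration; `model` stands in for whichever detector implementation is used and `train_one_step` is a hypothetical training-step helper:

```python
import torch

# Hyperparameters as reported above; weight decay is 2e-4 for the
# ResNet101+FPN models and 1e-4 for the others.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4
)
# Decay the learning rate by a factor of 10 at iterations 24000 and 48000.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[24000, 48000], gamma=0.1
)
for step in range(72000):  # maximum number of training iterations
    loss = train_one_step(model, optimizer)  # hypothetical helper
    scheduler.step()
```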
We evaluate the iteration snapshots of each model on the "release 1" and "release 2" validation sets. The average precision curves of each model are shown in Figures 9-14.

Fig. 9. Evaluation results of Mask-aided R-CNN with a ResNet50 backbone on the detection and segmentation tasks ((a) "release 1" validation set; (b) "release 2" validation set).
Fig. 10. Evaluation results of Faster R-CNN with a ResNet50 backbone on the detection task.
Fig. 11. Evaluation results of Mask-aided R-CNN with a ResNet50+FPN backbone on the detection and segmentation tasks.
Fig. 12. Evaluation results of Faster R-CNN with a ResNet50+FPN backbone on the detection task.
Fig. 13. Evaluation results of Mask-aided R-CNN with a ResNet101+FPN backbone on the detection and segmentation tasks.
Fig. 14. Evaluation results of Faster R-CNN with a ResNet101+FPN backbone on the detection task.

For quantitative evaluation, we uniformly select two iteration snapshots (iterations 40000 and 72000) from each trained model and evaluate them on the "release 1" and "release 2" validation sets. The evaluation results in terms of average precision (AP) and average recall (AR) are shown in Tables 3 and 4. The average of AP and AR is adopted as the Evaluation Criterion (EC) score in the experiments.

Table 3. The evaluation results of three Faster R-CNN models, three Mask-aided R-CNN models and three ensemble models on the "release 1" validation set.
Model                     iter   AP    AP50  AP75  APS   APM   APL   AP1   AP10  AP100 ARS   ARM   ARL   EC
faster+ResNet50           40000  24.9  52.9  20.6  13.6  24.9  26.1  18.5  35.2  38.9  19.8  36.1  45.2
faster+ResNet50           72000  25.4  52.9  21.9  14    24.9  26.4  19.1  35.3  38.7  19.9  36    44.8  29.8
mask-aided+ResNet50       40000  24.7  52.2  22.3  13.9  24.9  26.4  19    34.1  38.3  19.9  35.9  48
mask-aided+ResNet50       72000  24.9  51.9  21.5  13.2  25.3  27.4  19.7  35.3  38.8  18.7  35.8  48.8  30
faster+ResNet50+FPN       40000  25.4  52.8  21.6  14.4  23.3  27    19.2  34.7  39    21.1  34.1  45.9
faster+ResNet50+FPN       72000  25.2  52    20.9  14.8  23.9  26    19.8  34.7  38.8  20.8  35.2  42.9  29.7
mask-aided+ResNet50+FPN   40000  25.4  52.3  21.9  13.9  21.5  27.2  18.6  35.5  40.3  21    36    45.9
mask-aided+ResNet50+FPN   72000  25.9  53.1  21.7  13.9  23.4  28.7  20.1  36.6  41    20.3  36.6  51.7  30.5
faster+ResNet101+FPN      40000  26    52.1  22.9  13.6  23.4  27.2  20.2  35.8  39.9  20.6  36.1  48
faster+ResNet101+FPN      72000  26.2  51.8  24.1  13.8  24.3  27    20.3  35.5  39.7  20.5  36.3  43.7  30.4
mask-aided+ResNet101+FPN  40000  26.7  54.3  23.9  17.2  23.6  28.9  20.7  37    41.5  26.2  36.2  49.2
mask-aided+ResNet101+FPN  72000  26.5  52.7  24.9  14.1  24    28.4  21    36.9  40.6  20    36.2  46    31.5
faster ensemble           -      28.5  53.4  26.6  15.5  27.5  30    20.8  38.4  43.3  21.3  39.2  47.5  32.7
mask-aided ensemble       -      28.4  54.6  26.8  15    27.1  31    21.7  38    42.6  21.3  37.9  52.6  33.1
all ensemble              -      29.6  55.5  28    16.2  28.5  31.7  21.9  39.9  45.8  23.3  41    54.3  34.6

Table 4. The evaluation results of three Faster R-CNN models, three Mask-aided R-CNN models and three ensemble models on the "release 2" validation set.
Model                     iter   AP    AP50  AP75  APS   APM   APL   AP1   AP10  AP100 ARS   ARM   ARL   EC
faster+ResNet50           40000  27.4  56.7  25.3  16.9  24.7  37.6  28.3  40    42.8  25.1  33.3  50.1
faster+ResNet50           72000  27.6  55.1  24.5  16.7  24.4  38.6  28.5  40    42.6  24.3  32.7  51.1  33.9
mask-aided+ResNet50       40000  27.1  53.6  21.8  17    15.6  43.8  25.5  37.1  39.8  26.2  25.4  57.7
mask-aided+ResNet50       72000  27.3  52.4  21.2  14.9  16    44.2  25.7  37.9  40.7  22.3  25.7  58.7  32.4
faster+ResNet50+FPN       40000  26.2  55.4  20.8  15    18    37    26.9  38.5  41.2  24.4  33.4  52.3
faster+ResNet50+FPN       72000  24.2  52.7  19.9  14.9  14.7  36.8  23.1  34.6  37.1  22.8  24.8  49.2  31.0
mask-aided+ResNet50+FPN   40000  24.7  54.7  21.2  15.8  17.8  35.2  23.4  36    39.1  24.8  28.3  52.3
mask-aided+ResNet50+FPN   72000  25    52.8  20.5  14.3  16.8  35.2  24.3  36.6  39.4  22.6  27.1  55.6  31.0
faster+ResNet101+FPN      40000  27.8  57.6  22    13.6  21.6  40.9  27.2  39.7  42.3  21.9  32.9  55.7
faster+ResNet101+FPN      72000  27    55.7  22.1  14    19.9  41.3  26.8  38.2  40.6  21.4  31.6  57.3  33.3
mask-aided+ResNet101+FPN  40000  30.7  60.4  22.8  14.4  26.1  42.5  30    42.2  45.6  24.1  35.3  60.9
mask-aided+ResNet101+FPN  72000  28.4  60.5  20.9  13.3  21.8  41.6  27    39.7  42.5  22.4  31.7  57.6  35.1
faster ensemble           -      27.7  56.5  23.9  17.4  21.6  40.6  27.1  39.8  42.5  27    32.6  56.6  34.4
mask-aided ensemble       -      29.5  57.6  23.4  15.2  25.8  43.4  26.2  41.9  44.7  23    34.9  58.3  35.3
all ensemble              -      30.1  58    24.1  17.5  26.2  44.2  26    43.6  46.8  28.2  36.5  59.5  36.7

In Table 3, the EC scores of the Mask-aided R-CNN models are consistently higher than those of the Faster R-CNN models. Specifically, the EC score gap between Mask-aided R-CNN and Faster R-CNN widens as the complexity of the backbone network increases. This indicates that the generated soft pixel-level labels help to train a deeper convolutional network, thus improving detection performance. The EC scores in Table 4 reveal a consistent implication.
3.4. Experiments on the ensemble method

The two selected iteration snapshots of each model in Section 3.3 are enrolled in the proposed ensemble method. In this experiment, we implement three ensemble models: an ensemble of the Faster R-CNN models, an ensemble of the Mask-aided R-CNN models, and an ensemble of all the Faster R-CNN and Mask-aided R-CNN models. The IoU threshold of the ensemble method is consistently set to 0.4. We evaluate the three ensemble models on the validation sets; the evaluation results are shown in Tables 3 and 4.

The EC scores of the ensembled Faster R-CNN models and Mask-aided R-CNN models in Tables 3 and 4 are significantly higher than those of the corresponding single models. Furthermore, the ensemble of all the Faster R-CNN and Mask-aided R-CNN models improves the EC scores even more, and it is adopted as the final model for the EAD2019 challenge. Such robust and significant improvements verify the effectiveness of the proposed ensemble method.
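As a usage illustration, this configuration corresponds to calling the Section 2.3 sketch with iou_thresh=0.4 on the per-category detections pooled from the enrolled snapshots. The snippet below reuses ensemble_detections() from that sketch; the detector uuids, scores and boxes are made-up examples:

```python
import numpy as np

# Three hypothetical detections of the same artefact from different detectors.
dets = [
    ("faster_r50_iter40k",          0.72, np.array([40.0, 30.0, 120.0, 110.0])),
    ("mask_aided_r101fpn_iter72k",  0.81, np.array([43.0, 28.0, 118.0, 112.0])),
    ("faster_r101fpn_iter40k",      0.55, np.array([45.0, 33.0, 116.0, 108.0])),
]
fused = ensemble_detections(dets, iou_thresh=0.4)  # threshold used in our experiments
for score, box in fused:
    print(f"score={score:.3f}, box={np.round(box, 1)}")
```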
4. CONCLUSION

In this paper, we introduce ensemble Mask-aided R-CNN with a flexible, multi-stage training protocol for the detection and segmentation tasks of the EAD2019 Challenge. Numerous experiments have demonstrated the effectiveness of our work. More specifically, the Mask-aided strategy using soft pixel-level labels of incomplete categories facilitates training a deeper convolutional network and improves detection performance, and the proposed ensemble method fuses detection results from different detectors to further improve detection performance at no training cost. Certain parts of the proposed method remain to be explored further, such as how to further improve segmentation performance with soft pixel-level labels.

5. REFERENCES

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255.

[2] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, 2014.

[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.

[5] Li Da, Li Lin, and Li Xiang, "Classification of remote sensing images based on densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.

[7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision, 2016.

[8] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.

[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.

[10] Cheng-Yang Fu, Mykhailo Shvets, and Alexander C. Berg, "RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free," arXiv preprint arXiv:1901.03353, 2019.

[11] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[12] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," arXiv preprint arXiv:1904.07073, 2019.

[13] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117-2125.