ENSEMBLE MASK-AIDED R-CNN

Pengyi Zhang, Xiaoqiong Li, YunXin Zhong

Beijing Institute of Technology, Beijing, China

ABSTRACT

Recently, the strategy of integrating an instance mask prediction header into a one-stage or two-stage object detector has become immensely popular for instance segmentation (e.g., RetinaMask or Mask R-CNN). This strategy notably improves the object detector while it learns to predict instance masks. In this paper, we introduce a Mask-aided R-CNN model with a flexible, multi-stage training protocol to address the problems of the EAD2019 Challenge (multi-class artefact detection in video endoscopy). The proposed training protocol aims to facilitate the implementation of this strategy for the detection and segmentation tasks, and to improve detection and segmentation performance using pixel-level labeled samples with incomplete categories. The protocol consists of three principal steps, of which the core part is augmenting the training set with soft pixel-level labels. Mask-aided R-CNN is modified from Mask R-CNN by pruning its mask header to support training on pixel-level labeled samples with incomplete categories. We further propose a simple yet effective ensemble method for object detectors based on graph cliques to improve detection performance: it votes on graph cliques to fuse the detection results from different detectors and produces robust detection results, which is quite important for clinical application. Extensive experiments on the challenging EAD2019 dataset demonstrate the effectiveness of the proposed ensemble Mask-aided R-CNN. As a result, we won first place in the detection task of the EAD2019 Challenge.

Index Terms— Soft label, Ensemble, Graph clique, Mask-aided R-CNN

Fig. 1. Illustration of detection and segmentation results of the proposed ensemble Mask-aided R-CNN.
1. INTRODUCTION

Recently, with the rapid development of medical imaging technology, medical imaging diagnosis and treatment equipment and digital health records have been widely adopted in the clinic. Among these medical imaging technologies, endoscopy is an important clinical procedure for the early detection of cancers in hollow organs. However, endoscopy video frames are easily corrupted by multiple artefacts (e.g., motion blur, specular reflections, bubbles), which increases the difficulty of visual diagnosis. In order to retrieve high-quality endoscopic frames and facilitate visual diagnosis, endoscopic frame restoration algorithms based on prior knowledge of the artefacts are generally used in existing endoscopy workflows. Accurately identifying the types and locations of these artefacts is therefore essential for high-quality endoscopic frame restoration and crucial for realizing reliable computer-assisted endoscopy tools for improved patient care. However, the artefact-identification methods in existing endoscopy workflows support only a single artefact type per endoscopic frame, whereas a frame generally contains multiple artefacts, as shown in Figure 1. Moreover, different types of artefacts contaminate the frame unequally, requiring restoration algorithms specific to each artefact type. Developing accurate detection algorithms for the multi-class artefact detection task is thus an urgent problem.

Driven by the growth of computing power (e.g., graphical processing units and dedicated deep learning chips) and the availability of large labelled datasets (e.g., ImageNet [1] and COCO [2]), deep neural networks have been extensively studied due to their fast, scalable, end-to-end learning framework. In recent years, Convolutional Neural Network (CNN) [3] models have achieved significant improvements over conventional shallow methods in image classification (e.g., ResNet [4] and DenseNet [5]), object detection (e.g., Faster R-CNN [6] and SSD [7]) and semantic segmentation (e.g., UNet [8] and Mask R-CNN [9]). The advantages of CNN models, i.e., modular design and an end-to-end learning architecture, allow existing CNN models to be easily applied to complex problems by adding task-specific network branches. Recently, the strategy of integrating an instance mask prediction header into a one-stage or two-stage object detector has become immensely popular for instance segmentation (e.g., RetinaMask [10] or Mask R-CNN [9]). This strategy notably improves the object detector while it learns to predict instance masks. In this paper, we aim at addressing the problems of multi-class endoscopic artefact detection by developing an instance segmentation algorithm using this strategy in the EAD2019 Challenge [11][12]. The EAD2019 Challenge provides two kinds of labelled samples, i.e., endoscopic frames with bounding box annotations for the detection task and endoscopic frames with pixel-level annotations for the segmentation task. The frames for the segmentation task are contained in the frames for the detection task, which means only part of the endoscopic frames in the detection task have pixel-level annotations.

We present ensemble Mask-aided R-CNN for multi-class endoscopic artefact detection with three highlights. First, we propose to integrate the detection task and the segmentation task into an end-to-end instance segmentation framework, i.e., Mask-aided R-CNN, which is able to take full advantage of all labelled samples to improve the performance of multi-class endoscopic artefact detection.
Second, we design a flexible, multi-stage training protocol based on soft pixel-level annotations to train the proposed Mask-aided R-CNN. The soft pixel-level annotations are first generated by initially trained Mask R-CNN models and further refined by subsequently retrained models. The effectiveness of the designed training protocol has been verified in training and improving Mask-aided R-CNN. Third, we propose a simple yet effective ensemble method for object detectors based on graph cliques to further improve detection performance. Extensive experiments on the challenging EAD2019 dataset demonstrate the effectiveness of the proposed ensemble Mask-aided R-CNN. As a result, we won first place in the detection task of the EAD2019 Challenge.

2. METHOD

2.1. Training Protocol of Mask-aided R-CNN

Adding a mask header branch to a one-stage or two-stage object detector is a common strategy to enable instance segmentation. The effectiveness of this strategy in improving both detection and segmentation performance has been witnessed in recent years (e.g., RetinaMask [10] and Mask R-CNN [9], illustrated in Figure 2). To take full advantage of this strategy, we introduce a Mask-aided R-CNN model with a flexible, multi-stage training protocol.

Fig. 2. Illustration of adding a mask header to enable instance segmentation (based on Faster R-CNN [6] and Mask R-CNN [9]).

The proposed multi-stage training protocol, outlined in Figure 3, consists of three principal steps, of which the core part is augmenting the training set with soft pixel-level labels.

Fig. 3. Outline of the proposed multi-stage training protocol (Step 1: training a basic Mask R-CNN model; Step 2: augmenting the training set with soft labels; Step 3: training a Mask-aided R-CNN model).

2.1.1. Step 1: training a basic Mask R-CNN model for the segmentation task

We first train a basic Mask R-CNN model on the training set of the segmentation task to implement instance segmentation. In order to maintain consistency between semantic segmentation and object detection, the instance masks are bounded by the bounding box annotations acquired from the training set of the detection task. The process is illustrated in Figure 4.

Fig. 4. Illustration of maintaining consistency between semantic segmentation and object detection. (a) is the original image ("00024.jpg"); (b) shows the ground truth mask of the segmentation task ("Specularity"), where the bounding boxes marked in red are ground truth of the detection task; (c) is the bounded mask used to train a Mask R-CNN model.
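For illustration, the bounding operation of Step 1 can be sketched in a few lines of NumPy. This is our own sketch rather than code released with this work; the array layout and the (x1, y1, x2, y2) box format are assumptions. Each ground-truth detection box simply crops the binary semantic mask of its category into one per-instance mask:

```python
import numpy as np

def bound_masks(semantic_mask: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Clip a binary semantic mask of one category to each ground-truth box.

    semantic_mask: (H, W) binary mask of one segmentation category.
    boxes: (K, 4) ground-truth boxes as (x1, y1, x2, y2) pixel coordinates.
    Returns: (K, H, W) per-instance masks bounded by the boxes.
    """
    h, w = semantic_mask.shape
    instance_masks = np.zeros((len(boxes), h, w), dtype=semantic_mask.dtype)
    for k, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        # Keep only the mask pixels that fall inside the k-th ground-truth box.
        instance_masks[k, y1:y2, x1:x2] = semantic_mask[y1:y2, x1:x2]
    return instance_masks
```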
2.1.2. Step 2: augmenting the training set of the segmentation task with soft pixel-level labels

The trained Mask R-CNN model is subsequently used to predict instance masks for the training samples of the detection task that have no pixel-level labels. Notably, during this inference process the results of object detection are directly replaced with the ground truth bounding boxes; that is, we enforce mask prediction only for the ground truth instances. This trick, shown in Figure 5, aims to improve segmentation accuracy and to maintain consistency between semantic segmentation and object detection.

Fig. 5. Illustration of retrieving soft pixel-level labels. We perform mask prediction only on the ground-truth bounding boxes to maintain the consistency of semantic segmentation and object detection.

These predicted instance masks, called soft pixel-level labels, are assigned to the corresponding training samples. The training samples with soft pixel-level labels are then added to the training set of the segmentation task, and the Mask R-CNN model is retrained on the augmented training set. Subsequently, the soft pixel-level labels are refined with the new instance masks predicted by the retrained Mask R-CNN model. This step may be performed multiple times for higher segmentation accuracy. The final augmented training set is used in the next step, while the final retrained Mask R-CNN model can be used by the ensemble module.
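Standard inference APIs run the RPN and the detection header before the mask header, so this trick needs a short detour into the model internals. The sketch below is our illustration against torchvision's Mask R-CNN; the attribute names (transform, backbone, and the roi_heads mask branch) are assumptions about that library's implementation, not code from this work. It feeds the ground-truth boxes straight into the mask branch:

```python
import torch
import torchvision

@torch.no_grad()
def masks_from_gt_boxes(model, image, gt_boxes, gt_labels):
    """Run only the mask header on ground-truth boxes (Step 2 trick).

    image: (3, H, W) float tensor; gt_boxes: (K, 4) tensor in (x1, y1, x2, y2);
    gt_labels: (K,) tensor of category indices (0 is background in torchvision).
    Returns per-instance mask probabilities of shape (K, 1, 28, 28).
    """
    model.eval()
    # Resize/normalize the image; the transform rescales the boxes to match.
    images, targets = model.transform(
        [image], [{"boxes": gt_boxes, "labels": gt_labels}]
    )
    features = model.backbone(images.tensors)
    boxes = [targets[0]["boxes"]]
    # Skip the RPN and detection header: pool RoI features on GT boxes only.
    mask_feats = model.roi_heads.mask_roi_pool(features, boxes, images.image_sizes)
    mask_logits = model.roi_heads.mask_predictor(model.roi_heads.mask_head(mask_feats))
    # Keep the mask channel of each instance's own ground-truth category.
    idx = torch.arange(len(gt_labels), device=mask_logits.device)
    return mask_logits[idx, targets[0]["labels"]].sigmoid().unsqueeze(1)

# 5 segmentation categories + background, matching the Step 1 model.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=6)
```

Projecting the resulting 28x28 probabilities back to full image resolution would follow the library's standard mask-pasting postprocessing.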
2.1.3. Step 3: training a Mask-aided R-CNN model for the detection and segmentation tasks

To take full advantage of all available training samples and of the strategy of boosting object detection with a mask prediction branch, we generate soft pixel-level labels for the training samples without pixel-level annotations through the first two steps. In this step, we train multiple Mask-aided R-CNN models with different backbone networks on the final augmented training set. The Mask-aided R-CNN model, which supports training on pixel-level labeled samples with incomplete categories, is detailed in the next section. These trained Mask-aided R-CNN models are used by the ensemble module to further improve detection performance.

2.2. Mask-aided R-CNN

The Mask-aided R-CNN shown in Figure 6 is modified from Mask R-CNN by pruning its mask header to support training on pixel-level labeled samples with incomplete categories. In the EAD2019 Challenge [11], the detection task has seven categories while the segmentation task has five, the five segmentation categories being a subset of the seven detection categories. Therefore, the Mask-aided R-CNN model for the EAD2019 Challenge is designed in two steps: (1) design a Mask R-CNN model with seven semantic categories; (2) prune the neural units and connections related to the two extra categories in the mask header of this Mask R-CNN to obtain a mask header with five semantic categories. When training such a Mask-aided R-CNN model, we compute the mask loss only for the five segmentation categories. The remaining training defaults are kept unchanged.

Fig. 6. Proposed Mask-aided R-CNN. We compute the mask loss only for the five segmentation categories.
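A minimal sketch of the pruned mask loss follows. It is our own illustration: the tensor shapes and the assumption that the five segmentation categories occupy the first five label indices are ours, not specified by the paper. Binary cross-entropy over the predicted mask logits is accumulated only for instances of segmentation categories, so instances of the two detection-only categories contribute no mask gradient:

```python
import torch
import torch.nn.functional as F

SEG_CATEGORIES = {0, 1, 2, 3, 4}  # assumed indices of the 5 segmentation categories (of 7)

def pruned_mask_loss(mask_logits, gt_masks, labels):
    """Mask loss restricted to the five segmentation categories.

    mask_logits: (N, 5, 28, 28) logits from the pruned mask header.
    gt_masks: (N, 28, 28) float soft (or hard) pixel-level targets in [0, 1].
    labels: (N,) category index of each RoI (0..6 over the detection categories).
    """
    keep = torch.tensor([int(l) in SEG_CATEGORIES for l in labels], dtype=torch.bool)
    if not keep.any():
        # No segmentation-category instance in this batch: zero loss, graph intact.
        return mask_logits.sum() * 0.0
    idx = torch.arange(len(labels))[keep]
    # Pick each kept instance's own category channel from the mask header.
    logits = mask_logits[idx, labels[keep]]
    return F.binary_cross_entropy_with_logits(logits, gt_masks[keep])
```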
2.3. Ensemble method

Ensemble strategies are commonly used to improve performance in image classification tasks. For the detection task of the EAD2019 challenge, we propose a simple yet effective ensemble method for object detection based on a graph model, to further improve detection performance. The proposed ensemble method fuses the detection results from multiple object detectors by voting within each graph clique for the same object, with the cliques mutually reinforcing one another.

2.3.1. Construction of the graph model

Given a single image I, C semantic categories and N object detectors, the detection result set can be formalized as $\{Det_n^c \mid n = 1, 2, \ldots, N;\ c = 1, 2, \ldots, C\}$. For convenience, we extract the detection results of a single category, $\{Det_n^c \mid n = 1, 2, \ldots, N\}$, to introduce and formalize our ensemble method (illustrated in Figure 7). Each detection $Det_n^c$ consists of $\{uuid, score, bbox\}$, where uuid denotes the universally unique identifier of the detector, score denotes the confidence score of the detection, and bbox denotes its bounding box.

A weighted undirected graph $G^c(V, E)$ with dense connections can be established from the detections $\{Det_n^c \mid n = 1, 2, \ldots, N\}$, where V denotes the set of vertexes and E the set of edges. Each vertex represents a single detection, and the vertexes are densely connected with each other by edges. We assign to each edge a weight, namely the intersection-over-union (IoU) score of the two detections it connects.

Fig. 7. The construction process of the graph model for the proposed ensemble method. Each rectangle denotes one detection and each color denotes one detector model.

2.3.2. Inference of the graph model

We formulate the inference of the established graph model as a maximum clique problem, which here aims to maximize the sum of edge weights within each clique. Several reasonable constraints are introduced to simplify the partition process. Post-processing such as non-maximum suppression (NMS) is commonly applied in object detectors to remove redundant detections; accordingly, the vertexes in one clique are required to differ in their uuid attribute, which amounts to removing the edges constructed by the same detector. Moreover, we introduce a threshold on the IoU score to remove edges with low weights, since two detections with a higher IoU score are more likely to correspond to the same object.

After this simplification step, we design a greedy approach to solve the maximum clique problem iteratively. Initially, each vertex is adopted as a clique. We then iteratively merge the two cliques that share the largest edge weight and whose vertexes all have different uuid attributes.

The last step is voting on the partitioned cliques: each clique outputs a single detection by calculating its confidence score and bounding box. Given a clique $\{Det_k^c = \{uuid_k, score_k, bbox_k\} \mid k = 1, 2, \ldots, K\}$, the voting result $Det^c = \{0, score, bbox\}$ (the fused detection carries no single detector uuid, hence 0) is formalized as follows:

$$score = 1 - \prod_{k=1}^{K} (1 - score_k) \qquad (1)$$

$$bbox = \frac{\sum_{k=1}^{K} score_k \times bbox_k}{\sum_{k=1}^{K} score_k} \qquad (2)$$

where K denotes the number of vertexes in the clique.

Fig. 8. The inference process of the graph model for the proposed ensemble method. Each clique yields one detection result, calculated by Formulas (1) and (2).
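The following self-contained sketch reconstructs the greedy clique partition and the voting of Formulas (1) and (2). It is our reconstruction: in particular, measuring the weight between two cliques as the best IoU across their members is an assumption the paper does not spell out.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ensemble_detections(dets, iou_thresh=0.4):
    """Fuse single-category detections from multiple detectors.

    dets: list of (uuid, score, box) tuples, box = np.array([x1, y1, x2, y2]).
    Returns a list of fused (score, box) pairs, one per clique.
    """
    cliques = [[d] for d in dets]  # initially, every detection is its own clique
    while True:
        best = None  # (weight, i, j) of the best mergeable clique pair
        for i in range(len(cliques)):
            for j in range(i + 1, len(cliques)):
                uuids_i = {d[0] for d in cliques[i]}
                uuids_j = {d[0] for d in cliques[j]}
                if uuids_i & uuids_j:
                    continue  # each detector contributes at most one vertex per clique
                # Inter-clique weight: best IoU across the two cliques (assumption).
                w = max(iou(a[2], b[2]) for a in cliques[i] for b in cliques[j])
                if w >= iou_thresh and (best is None or w > best[0]):
                    best = (w, i, j)
        if best is None:
            break  # no remaining pair satisfies the uuid and IoU constraints
        _, i, j = best
        cliques[i] += cliques.pop(j)

    fused = []
    for clique in cliques:
        scores = np.array([d[1] for d in clique])
        boxes = np.stack([d[2] for d in clique])
        score = 1.0 - np.prod(1.0 - scores)                      # Formula (1)
        box = (scores[:, None] * boxes).sum(0) / scores.sum()    # Formula (2)
        fused.append((float(score), box))
    return fused
```

Note how Formula (1) rewards agreement: any single detection of score 0.6 keeps that score, while two agreeing detections of 0.6 fuse to 1 - 0.4 * 0.4 = 0.84.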
3. EXPERIMENTS

Experiments on the EAD2019 Challenge (https://ead2019.grand-challenge.org) are performed by following the proposed training protocol of Mask-aided R-CNN. We train our models on servers with two 1080Ti GPUs.

3.1. Experiments on training a basic Mask R-CNN model for the segmentation task

First, we generate the bounded masks for the released samples of the segmentation task to enable instance segmentation. The released data (498 images with pixel-level labels in total) is split into a training set (90%, 448 images) and a validation set (10%, 50 images). Second, we train a Mask R-CNN model with a ResNet101 backbone and a feature pyramid network (FPN) [13] on these training samples. We perform two augmentation operations, i.e., random scaling and random horizontal flipping. The network is trained end-to-end using SGD with a momentum of 0.9 and a weight decay of 0.0001. We train the model using mini-batches of size 2, with an initial learning rate of 0.005 that is decayed by a factor of 10 at iteration steps 24000 and 48000. The maximum number of training iterations is set to 72000. The trained model is tested on the validation set and the evaluation results are shown in Table 1.

Table 1. The evaluation results of the basic Mask R-CNN model tested on the validation set.
Task          AP    AP50  AP75  APS   APM   APL
Detection     25.8  50.4  22.4  12.0  34.6  23.8
Segmentation  23.3  44.6  20.3   7.7  28.4  26.5

3.2. Experiments on augmenting the training set of the segmentation task with soft pixel-level labeled samples

The trained Mask R-CNN is then used to generate soft pixel-level labels for the training samples of the detection task. We follow the trick detailed in Section 2, enforcing mask prediction only for the ground truth instances, to maintain consistency between semantic segmentation and object detection and to improve segmentation accuracy. We evaluate the generated soft masks on the validation set of the segmentation task; the results are shown in Table 2. Compared with Table 1, the quality of the predicted masks is improved significantly, which verifies the effectiveness of the proposed trick.

Table 2. The evaluation results of the proposed trick of enforcing mask prediction only for the ground truth instances.
Task          AP    AP50  AP75  APS   APM   APL
Detection     -     -     -     -     -     -
Segmentation  31.5  61.4  26.3  19.0  32.1  33.6

The second step of the proposed training protocol is performed only once in this experiment. We generate a soft mask for each released sample of the detection task. These soft mask annotations, together with the released samples of the detection task and the corresponding bounding box annotations, constitute the whole dataset for the instance segmentation task.

3.3. Experiments on training the Mask-aided R-CNN models for the detection task

The whole dataset consists of two released datasets, of which the first contains 889 images and the second contains 1306 images. We split the whole dataset into one training set (90%: 800 images from the first released dataset and 1175 images from the second), a "release 1" validation set (10%, 89 images from the first released dataset) and a "release 2" validation set (10%, 131 images from the second released dataset). We successively train three Faster R-CNN models and three Mask-aided R-CNN models on the training set; the corresponding backbone networks are ResNet50, ResNet50+FPN and ResNet101+FPN, respectively. The Faster R-CNN models are trained only with bounding box annotations, while the Mask-aided R-CNN models are trained with both soft mask annotations and bounding box annotations. Here, we perform data augmentation on the training set with random scaling, random horizontal flipping, random vertical flipping and random cropping on-the-fly. Each model is trained end-to-end using SGD with a momentum of 0.9. A weight decay of 0.0002 is adopted when training the models with a ResNet101+FPN backbone, and 0.0001 for the other models. We train each model using mini-batches of size 2, with an initial learning rate of 0.005 that is decayed by a factor of 10 at iteration steps 24000 and 48000. The maximum number of training iterations is set to 72000.
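For reference, this schedule maps directly onto standard PyTorch components. The sketch below is our illustration; `model` stands in for whichever detector implementation is used and `train_one_step` is a hypothetical training-step helper:

```python
import torch

# Hyperparameters as reported above; weight decay is 2e-4 for the
# ResNet101+FPN models and 1e-4 for the others.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4
)
# Decay the learning rate by a factor of 10 at iterations 24000 and 48000.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[24000, 48000], gamma=0.1
)
for step in range(72000):  # maximum number of training iterations
    loss = train_one_step(model, optimizer)  # hypothetical helper
    scheduler.step()
```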
We evaluate the iteration snapshots of each model on the "release 1" and "release 2" validation sets. The average precision curves of each model are shown in Figures 9-14.

Fig. 9. Evaluation results of Mask-aided R-CNN with a ResNet50 backbone on the detection and segmentation tasks ((a) "release 1" validation set; (b) "release 2" validation set).
Fig. 10. Evaluation results of Faster R-CNN with a ResNet50 backbone on the detection task.
Fig. 11. Evaluation results of Mask-aided R-CNN with a ResNet50+FPN backbone on the detection and segmentation tasks.
Fig. 12. Evaluation results of Faster R-CNN with a ResNet50+FPN backbone on the detection task.
Fig. 13. Evaluation results of Mask-aided R-CNN with a ResNet101+FPN backbone on the detection and segmentation tasks.
Fig. 14. Evaluation results of Faster R-CNN with a ResNet101+FPN backbone on the detection task.

For quantitative evaluation, we uniformly select two iteration snapshots (iterations 40000 and 72000) from each trained model and evaluate them on the "release 1" and "release 2" validation sets. The evaluation results in terms of average precision (AP) and average recall (AR) are shown in Tables 3 and 4. The average of AP and AR is adopted as the Evaluation Criterion (EC) score in the experiments.

Table 3. The evaluation results of three Faster R-CNN models, three Mask-aided R-CNN models and three ensemble models on the "release 1" validation set.
Model                     iter   AP    AP50  AP75  APS   APM   APL   AP1   AP10  AP100 ARS   ARM   ARL   EC
faster+ResNet50           40000  24.9  52.9  20.6  13.6  24.9  26.1  18.5  35.2  38.9  19.8  36.1  45.2
faster+ResNet50           72000  25.4  52.9  21.9  14    24.9  26.4  19.1  35.3  38.7  19.9  36    44.8  29.8
mask-aided+ResNet50       40000  24.7  52.2  22.3  13.9  24.9  26.4  19    34.1  38.3  19.9  35.9  48
mask-aided+ResNet50       72000  24.9  51.9  21.5  13.2  25.3  27.4  19.7  35.3  38.8  18.7  35.8  48.8  30
faster+ResNet50+FPN       40000  25.4  52.8  21.6  14.4  23.3  27    19.2  34.7  39    21.1  34.1  45.9
faster+ResNet50+FPN       72000  25.2  52    20.9  14.8  23.9  26    19.8  34.7  38.8  20.8  35.2  42.9  29.7
mask-aided+ResNet50+FPN   40000  25.4  52.3  21.9  13.9  21.5  27.2  18.6  35.5  40.3  21    36    45.9
mask-aided+ResNet50+FPN   72000  25.9  53.1  21.7  13.9  23.4  28.7  20.1  36.6  41    20.3  36.6  51.7  30.5
faster+ResNet101+FPN      40000  26    52.1  22.9  13.6  23.4  27.2  20.2  35.8  39.9  20.6  36.1  48
faster+ResNet101+FPN      72000  26.2  51.8  24.1  13.8  24.3  27    20.3  35.5  39.7  20.5  36.3  43.7  30.4
mask-aided+ResNet101+FPN  40000  26.7  54.3  23.9  17.2  23.6  28.9  20.7  37    41.5  26.2  36.2  49.2
mask-aided+ResNet101+FPN  72000  26.5  52.7  24.9  14.1  24    28.4  21    36.9  40.6  20    36.2  46    31.5
faster ensemble           -      28.5  53.4  26.6  15.5  27.5  30    20.8  38.4  43.3  21.3  39.2  47.5  32.7
mask-aided ensemble       -      28.4  54.6  26.8  15    27.1  31    21.7  38    42.6  21.3  37.9  52.6  33.1
all ensemble              -      29.6  55.5  28    16.2  28.5  31.7  21.9  39.9  45.8  23.3  41    54.3  34.6

Table 4. The evaluation results of three Faster R-CNN models, three Mask-aided R-CNN models and three ensemble models on the "release 2" validation set.
Model                     iter   AP    AP50  AP75  APS   APM   APL   AP1   AP10  AP100 ARS   ARM   ARL   EC
faster+ResNet50           40000  27.4  56.7  25.3  16.9  24.7  37.6  28.3  40    42.8  25.1  33.3  50.1
faster+ResNet50           72000  27.6  55.1  24.5  16.7  24.4  38.6  28.5  40    42.6  24.3  32.7  51.1  33.9
mask-aided+ResNet50       40000  27.1  53.6  21.8  17    15.6  43.8  25.5  37.1  39.8  26.2  25.4  57.7
mask-aided+ResNet50       72000  27.3  52.4  21.2  14.9  16    44.2  25.7  37.9  40.7  22.3  25.7  58.7  32.4
faster+ResNet50+FPN       40000  26.2  55.4  20.8  15    18    37    26.9  38.5  41.2  24.4  33.4  52.3
faster+ResNet50+FPN       72000  24.2  52.7  19.9  14.9  14.7  36.8  23.1  34.6  37.1  22.8  24.8  49.2  31.0
mask-aided+ResNet50+FPN   40000  24.7  54.7  21.2  15.8  17.8  35.2  23.4  36    39.1  24.8  28.3  52.3
mask-aided+ResNet50+FPN   72000  25    52.8  20.5  14.3  16.8  35.2  24.3  36.6  39.4  22.6  27.1  55.6  31.0
faster+ResNet101+FPN      40000  27.8  57.6  22    13.6  21.6  40.9  27.2  39.7  42.3  21.9  32.9  55.7
faster+ResNet101+FPN      72000  27    55.7  22.1  14    19.9  41.3  26.8  38.2  40.6  21.4  31.6  57.3  33.3
mask-aided+ResNet101+FPN  40000  30.7  60.4  22.8  14.4  26.1  42.5  30    42.2  45.6  24.1  35.3  60.9
mask-aided+ResNet101+FPN  72000  28.4  60.5  20.9  13.3  21.8  41.6  27    39.7  42.5  22.4  31.7  57.6  35.1
faster ensemble           -      27.7  56.5  23.9  17.4  21.6  40.6  27.1  39.8  42.5  27    32.6  56.6  34.4
mask-aided ensemble       -      29.5  57.6  23.4  15.2  25.8  43.4  26.2  41.9  44.7  23    34.9  58.3  35.3
all ensemble              -      30.1  58    24.1  17.5  26.2  44.2  26    43.6  46.8  28.2  36.5  59.5  36.7

In Table 3, the EC scores of the Mask-aided R-CNN models are consistently higher than those of the Faster R-CNN models. Specifically, the EC score gap between Mask-aided R-CNN and Faster R-CNN widens as the complexity of the backbone network increases. This indicates that the generated soft pixel-level labels help to train a deeper convolutional network, thus improving detection performance. The EC scores in Table 4 reveal a consistent implication.
3.4. Experiments on the ensemble method

The two selected iteration snapshots of each model in Section 3.3 are enrolled in the proposed ensemble method. In this experiment, we implement three ensemble models: an ensemble of the Faster R-CNN models, an ensemble of the Mask-aided R-CNN models, and an ensemble of all the Faster R-CNN and Mask-aided R-CNN models. The IoU threshold of the ensemble method is consistently set to 0.4. We evaluate the three ensemble models on the validation sets; the evaluation results are shown in Tables 3 and 4.

The EC scores of the ensembled Faster R-CNN models and Mask-aided R-CNN models in Tables 3 and 4 are significantly higher than those of the corresponding single models. Furthermore, the ensemble of all the Faster R-CNN and Mask-aided R-CNN models improves the EC scores even more, and it is adopted as the final model for the EAD2019 challenge. Such robust and significant improvements verify the effectiveness of the proposed ensemble method.
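As a usage illustration, this configuration corresponds to calling the Section 2.3 sketch with iou_thresh=0.4 on the per-category detections pooled from the enrolled snapshots. The snippet below reuses ensemble_detections() from that sketch; the detector uuids, scores and boxes are made-up examples:

```python
import numpy as np

# Three hypothetical detections of the same artefact from different detectors.
dets = [
    ("faster_r50_iter40k",          0.72, np.array([40.0, 30.0, 120.0, 110.0])),
    ("mask_aided_r101fpn_iter72k",  0.81, np.array([43.0, 28.0, 118.0, 112.0])),
    ("faster_r101fpn_iter40k",      0.55, np.array([45.0, 33.0, 116.0, 108.0])),
]
fused = ensemble_detections(dets, iou_thresh=0.4)  # threshold used in our experiments
for score, box in fused:
    print(f"score={score:.3f}, box={np.round(box, 1)}")
```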
4. CONCLUSION

In this paper, we introduce ensemble Mask-aided R-CNN with a flexible, multi-stage training protocol for the detection and segmentation tasks of the EAD2019 Challenge. Numerous experiments have demonstrated the effectiveness of our work. More specifically, the Mask-aided strategy using soft pixel-level labels of incomplete categories facilitates training a deeper convolutional network and improves detection performance, and the proposed ensemble method fuses detection results from different detectors to further improve detection performance at no training cost. Certain parts of the proposed method remain to be explored further, such as how to further improve segmentation performance with soft pixel-level labels.

5. REFERENCES

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255.

[2] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, 2014.

[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.

[5] Li Da, Li Lin, and Li Xiang, "Classification of remote sensing images based on densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.

[7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision, 2016.

[8] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.

[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.

[10] Cheng-Yang Fu, Mykhailo Shvets, and Alexander C. Berg, "RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free," arXiv preprint arXiv:1901.03353, 2019.

[11] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, "Endoscopy artifact detection (EAD 2019) challenge dataset," CoRR, vol. abs/1905.03209, 2019.

[12] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," arXiv preprint arXiv:1904.07073, 2019.

[13] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117-2125.