<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pengyi Zhang, Xiaoqiong Li, YunXin Zhong</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing Institute of Technology</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Soft label</institution>
          ,
          <addr-line>Ensemble, Graph clique</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, the strategy of integrating an instance mask prediction header into a one-stage or two-stage object detector has become immensely popular for instance segmentation (e.g., RetinaMask or Mask R-CNN). This strategy notably improves the object detector while it learns to predict instance masks. In this paper, we introduce a Mask-aided R-CNN model with a flexible, multi-stage training protocol to address the problems of the EAD2019 Challenge (multi-class artefact detection in video endoscopy). The proposed training protocol aims to facilitate the implementation of this strategy for the detection and segmentation tasks and to improve detection and segmentation performance using pixel-level labeled samples with incomplete categories. The protocol consists of three principal steps, the core of which is augmenting the training set with soft pixel-level labels. The Mask-aided R-CNN is modified from Mask R-CNN by pruning its mask header to support training on pixel-level labeled samples with incomplete categories. We also propose a simple yet effective ensemble method based on graph cliques to further improve detection performance. The ensemble method votes on graph cliques to fuse the detection results from different detectors. It produces robust detection results, which is important for clinical application. Extensive experiments on the challenging EAD2019 dataset demonstrate the effectiveness of the proposed ensemble Mask-aided R-CNN. As a result, we won first place in the detection task of the EAD2019 Challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Index Terms: Mask-aided R-CNN, Soft label, Ensemble, Graph clique</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>With the rapid development of medical imaging
technology, medical imaging equipment for diagnosis and treatment
and digital health records have come into wide clinical use.
Among medical imaging technologies,
endoscopy is an important clinical procedure for the early detection
of cancers in hollow organs. However, endoscopy video
frames are easily corrupted by multiple artefacts (e.g.,
motion blur, specular reflections and bubbles), which increases
the difficulty of visual diagnosis. To retrieve
high-quality endoscopic frames and facilitate visual diagnosis,
existing endoscopy workflows generally use frame restoration
algorithms based on prior knowledge of the artefacts.
Accurately identifying the types and
locations of those artefacts is therefore essential for
high-quality endoscopic frame restoration and crucial for
realizing reliable computer-assisted endoscopy tools for
improved patient care. However, the methods for identifying
artefacts in existing endoscopy workflows support only a
single artefact type per endoscopic frame, whereas a frame generally
contains multiple artefacts, as shown in Figure 1.
Moreover, different types of artefacts contaminate the
frame unequally, so specific restoration algorithms are required for
specific types of artefacts. Accurate algorithms for the multi-class
artefact detection task are therefore urgently needed.</p>
      <p>
        Driven by the growth of computing power (e.g.,
Graphical Processing Units and dedicated deep learning chips) and
the availability of large labelled data sets (e.g., ImageNet
[
        <xref ref-type="bibr" rid="ref2">1</xref>
        ] and COCO [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ]), deep neural networks have been
extensively studied owing to their fast, scalable and end-to-end
learning framework. In recent years, Convolutional Neural
Network (CNN) [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] models have achieved significant
improvements compared with conventional shallow methods
in image classification (e.g., ResNet [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ] and DenseNet [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ]),
object detection (e.g., Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] and SSD [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ]) and
semantic segmentation (e.g., UNet [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ] and Mask R-CNN [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ]).
The advantages of CNN models, i.e. modular design
and an end-to-end learning architecture, allow existing CNN
models to be applied easily to complex problems by adding
a task-specific network branch. Recently, the strategy of
integrating an instance mask prediction header into a one-stage or
two-stage object detector has become immensely popular for
instance segmentation (e.g., RetinaMask [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ] or Mask R-CNN
[
        <xref ref-type="bibr" rid="ref10">9</xref>
        ]). This strategy notably improves the object detector while
it learns to predict instance masks. In this paper,
we address the problem of multi-class endoscopic
artefact detection by developing an instance segmentation
algorithm based on this strategy for the EAD2019 Challenge [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ][
        <xref ref-type="bibr" rid="ref13">12</xref>
        ].
The EAD2019 Challenge provides two kinds of labelled
samples: endoscopic frames with bounding box annotations
for the detection task and endoscopic frames with pixel-level
annotations for the segmentation task. The frames for the segmentation
task are a subset of the frames for the detection task, which
means that only part of the endoscopic frames in the detection task have
pixel-level annotations.
      </p>
      <p>We present ensemble Mask-aided R-CNN for multi-class
endoscopic artefact detection with three highlights. First,
we propose to integrate the detection task and the segmentation
task into an end-to-end instance segmentation framework,
i.e. Mask-aided R-CNN, which is able to take full
advantage of all labelled samples to improve the performance of
multi-class endoscopic artefact detection. Second, we
design a flexible, multi-stage training protocol based on
soft pixel-level annotations to train the proposed Mask-aided
R-CNN. The soft pixel-level annotations are first generated
by initially trained Mask R-CNN models and then
refined by subsequently retrained models. The effectiveness of
the designed training protocol has been verified in training and
improving Mask-aided R-CNN. Third, we propose a simple
yet effective ensemble method based on graph cliques for
object detectors to further improve detection performance.
Extensive experiments on the challenging EAD2019 dataset
demonstrate the effectiveness of the proposed ensemble
Mask-aided R-CNN. As a result, we won first place in the
detection task of the EAD2019 Challenge.</p>
    </sec>
    <sec id="sec-3">
      <title>2. METHOD</title>
    </sec>
    <sec id="sec-4">
      <title>2.1. Training Protocol of Mask-aided R-CNN</title>
      <p>
        Adding a mask header branch to a one-stage or two-stage
object detector is a common strategy to enable instance
segmentation. The effectiveness of this strategy in improving both
detection and segmentation performance has been witnessed
in recent years (e.g., RetinaMask [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ] and Mask R-CNN [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ],
illustrated in Figure 2). To take full advantage of this
strategy, we introduce a Mask-aided R-CNN model with a flexible
and multi-stage training protocol.
      </p>
      <p>
        Fig. 2. Illustration of adding mask header to enable instance
segmentation (based on Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] and Mask R-CNN [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ]).
      </p>
      <p>[Figure 3 panel labels: Step 1: training a basic Mask R-CNN model;
Step 2: augmenting training set with soft labels (repeat);
Step 3: training a Mask-aided R-CNN model, followed by the ensemble method
for the segmentation task and the detection task.]</p>
      <p>The proposed multi-stage training protocol, outlined
in Figure 3, consists of three principal steps, the core of
which is augmenting the training set with soft
pixel-level labels.</p>
      <sec id="sec-4-1">
        <title>2.1.1. Step 1: training a basic Mask R-CNN model for the segmentation task</title>
        <p>We first train a basic Mask R-CNN model on the training set
of the segmentation task to implement instance segmentation. To
maintain consistency between semantic segmentation and
object detection, the instance masks are bounded by the bounding
box annotations acquired from the training set of the detection
task. The process is illustrated in Figure 4.</p>
        <p>Fig. 4. Illustration of maintaining consistency of semantic
segmentation and object detection. (a) is the original image
(“00024.jpg”); (b) shows the ground truth mask of the
segmentation task (“Specularity”), where the bounding boxes marked
in red are the ground truth of the detection task; (c) is the bounded
mask used to train a Mask R-CNN model.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2.1.2. Step 2: augmenting the training set of segmentation task with soft pixel-level labels</title>
        <p>The trained Mask R-CNN model is subsequently used to
predict instance masks for the training samples of the detection task
that have no pixel-level labels. Note that during inference the
object detection results are forcibly replaced with the ground truth
bounding boxes, so that masks are predicted only
for the ground truth instances. This trick, shown in Figure 5,
aims to improve segmentation accuracy and to maintain
consistency between semantic segmentation and object detection.</p>
        <p>These predicted instance masks, called soft pixel-level
labels, are assigned to the corresponding training samples.
The training samples with soft pixel-level labels are then
added to the training set of the segmentation task. We
retrain the Mask R-CNN model on the augmented training
set of the segmentation task. Subsequently, the soft pixel-level
labels are refined with the new instance masks predicted by
the retrained Mask R-CNN model. This step may be
performed multiple times for higher segmentation accuracy. The
final augmented training set is used in the next step, while
the final retrained Mask R-CNN model can be used by the
ensemble module.</p>
      </sec>
      <sec id="sec-4-3">
        <title>2.1.3. Step 3: training a Mask-aided R-CNN model for detection and segmentation task</title>
        <p>To take full advantage of all available training samples and
of the strategy of boosting object detection by adding a mask
prediction branch, we generate soft pixel-level labels for the
training samples with no pixel-level annotations through the first
two steps. In this step, we train multiple Mask-aided R-CNN
models with different backbone networks on the final
augmented training set. The Mask-aided R-CNN model, which
supports training on pixel-level labeled samples with
incomplete categories, is detailed in Section 2.2. These trained
Mask-aided R-CNN models are used by the ensemble
module to further improve detection performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.2. Mask-aided R-CNN</title>
      <p>
        The Mask-aided R-CNN shown in Figure 6 is modified from
Mask R-CNN by pruning its mask header to support training
on pixel-level labeled samples with incomplete categories. In the
EAD2019 Challenge [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ], the detection task has seven
categories while the segmentation task has five categories, and
the five segmentation categories are a subset of the seven
detection categories. The Mask-aided R-CNN model
for the EAD2019 Challenge is therefore designed in two
steps: (1) design a Mask R-CNN model with seven
semantic categories; (2) prune the neural units and
connections related to the two extra categories in the mask header
of this Mask R-CNN to obtain a mask header with five semantic
categories. When training such a Mask-aided R-CNN model,
we compute the mask loss only for the five segmentation
categories. All other training settings are kept at their
defaults.
      </p>
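      <p>The restriction of the mask loss to the five segmentation categories can be sketched as follows. This is a minimal illustration and not the authors' implementation: the category ids, the choice of which five ids carry masks, the per-pixel binary cross-entropy and the dictionary layout of a region of interest (RoI) are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import math

# Hypothetical layout: seven detection categories with ids 0..6.
# Which five of them have pixel-level labels is an assumption of this sketch.
SEG_CLASS_IDS = (0, 1, 2, 3, 4)
# Map a detection category id to its channel in the pruned 5-channel mask header.
MASK_CHANNEL = {cls_id: ch for ch, cls_id in enumerate(SEG_CLASS_IDS)}


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between two flat lists."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def mask_loss(rois):
    """Average mask loss over the RoIs whose category has a mask channel.

    Each RoI is a dict with 'cls_id', 'mask_probs' (one flat list of
    probabilities per mask channel) and 'mask_target' (flat list of 0/1).
    """
    losses = []
    for roi in rois:
        channel = MASK_CHANNEL.get(roi["cls_id"])
        if channel is None:
            continue  # category without pixel-level labels: no mask loss
        losses.append(
            binary_cross_entropy(roi["mask_probs"][channel], roi["mask_target"])
        )
    return sum(losses) / len(losses) if losses else 0.0
```
]]></preformat>
      <p>In this sketch an RoI of one of the two extra categories still contributes to the classification and box regression losses of the detector (not shown) but is skipped by the mask loss.</p>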
    </sec>
    <sec id="sec-6">
      <title>2.3. Ensemble method</title>
      <p>Ensemble strategies are commonly used to improve
performance in image classification tasks. For the detection task of the
EAD2019 Challenge, we propose a simple yet effective
ensemble method based on a graph model for object detection
tasks to further improve detection performance. The
proposed ensemble method fuses the detection
results from multiple object detectors by voting within each graph
clique, which groups the detections of the same object, and by
mutual reinforcement among graph cliques.</p>
      <sec id="sec-6-1">
        <title>2.3.1. Construction of Graph model</title>
        <p>Given a single image <inline-formula><tex-math>I</tex-math></inline-formula>, <inline-formula><tex-math>C</tex-math></inline-formula> semantic categories and <inline-formula><tex-math>N</tex-math></inline-formula>
object detectors, the detection result set can be formalized as
<inline-formula><tex-math>\{Det_{n}^{c} \mid n = 1, 2, \ldots, N;\ c = 1, 2, \ldots, C\}</tex-math></inline-formula>. For convenience,
we simply extract the detection results of a single category
<inline-formula><tex-math>\{Det_{n}^{c} \mid n = 1, 2, \ldots, N\}</tex-math></inline-formula> to introduce and formalize our
ensemble method (illustrated in Figure 7). Each detection <inline-formula><tex-math>Det_{n}^{c}</tex-math></inline-formula>
consists of <inline-formula><tex-math>\{uuid, score, bbox\}</tex-math></inline-formula>, where uuid denotes the
universally unique identifier of the detector, score denotes
the confidence score of this detection and bbox denotes the
bounding box of this detection.</p>
        <p>[Figure 7 labels: detections from Model A, Model B and Model C, grouped into cliques.]</p>
        <p>A weighted undirected graph <inline-formula><tex-math>G^{c}(V, E)</tex-math></inline-formula> with dense
connections can be established from the detections
<inline-formula><tex-math>\{Det_{n}^{c} \mid n = 1, 2, \ldots, N\}</tex-math></inline-formula>, where <inline-formula><tex-math>V</tex-math></inline-formula> denotes the set of vertexes and <inline-formula><tex-math>E</tex-math></inline-formula>
denotes the set of edges. Each vertex represents a single
detection. The vertexes are densely connected with each other by
edges. We assign a weight, i.e. the intersection over union
(IOU) score of two detections, to the corresponding edge.</p>
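        <p>The graph construction can be illustrated with a short sketch. It assumes that each detection is a dictionary with the keys uuid, score and bbox and that a bounding box is stored as (x1, y1, x2, y2); both the layout and the function names are choices made for this example only.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_graph(detections):
    """Densely connect the detections of one category.

    Vertexes are indices into `detections`; the returned dict maps each
    undirected edge (i, j) with i < j to the IoU of the two boxes.
    """
    return {
        (i, j): iou(detections[i]["bbox"], detections[j]["bbox"])
        for i, j in combinations(range(len(detections)), 2)
    }
```
]]></preformat>
        <p>For M detections the dense graph has M(M-1)/2 edges, which stays small for the number of detections a single category typically produces in one frame.</p>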
      </sec>
      <sec id="sec-6-2">
        <title>2.3.2. Inference of Graph model</title>
        <p>We formulate inference on the established graph model as a
maximum clique problem, which here aims to maximize the sum
of edge weights in each clique. Several reasonable
constraints are introduced to simplify the partition process. Object
detectors commonly apply post-processing, e.g. non-maximum
suppression (NMS), to remove redundant detections, so two
detections from the same detector are unlikely to cover the same
object. The vertexes in one clique are therefore required to
differ in the uuid attribute, which means removing the
edges between detections of the same detector. Moreover, we
introduce an IOU threshold to remove the edges with
low weights, since two detections with a higher
IOU score are more likely to be the same object.</p>
        <p>After the simplification step, we design a greedy approach
to solve this maximum clique problem iteratively. Initially,
each vertex is treated as a clique. We then iteratively merge the
two cliques that are connected by the largest edge weight and whose
vertexes all have different uuid attributes, until no such pair remains.</p>
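        <p>A minimal sketch of such a greedy merge is given below. The text leaves some details open, so the sketch fixes them by assumption: two cliques may be merged only if every cross pair of detections reaches the IOU threshold (so the union is still a clique of the pruned graph), and candidate pairs are ranked by their largest cross edge weight. The detection layout (dictionaries with uuid, score and bbox) is also an assumption.</p>
        <preformat><![CDATA[
```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_cliques(detections, iou_threshold=0.4):
    """Partition detections of one category into cliques of indices."""
    cliques = [[i] for i in range(len(detections))]  # one clique per vertex
    while True:
        best = None  # (weight, index of clique a, index of clique b)
        for a in range(len(cliques)):
            for b in range(a + 1, len(cliques)):
                uuids_a = {detections[i]["uuid"] for i in cliques[a]}
                uuids_b = {detections[i]["uuid"] for i in cliques[b]}
                if uuids_a & uuids_b:
                    continue  # at most one detection per detector in a clique
                weights = [
                    iou(detections[i]["bbox"], detections[j]["bbox"])
                    for i in cliques[a]
                    for j in cliques[b]
                ]
                if min(weights) < iou_threshold:
                    continue  # a pruned edge: the union would not be a clique
                if best is None or max(weights) > best[0]:
                    best = (max(weights), a, b)
        if best is None:
            return cliques  # no admissible pair is left
        _, a, b = best
        cliques[a] = cliques[a] + cliques[b]
        del cliques[b]
```
]]></preformat>
        <p>The default threshold of 0.4 is the value reported in the experiments; the quadratic search over clique pairs is kept for clarity rather than speed.</p>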
        <p>Fig. 8. The inference process of graph model for proposed
ensemble method. Each clique denotes one detection result,
which can be calculated by Formula (1) and (2).</p>
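        <p>The last step is voting on the partitioned cliques, so that
each clique outputs a single detection by calculating its
confidence score and bounding box. Given a clique
<inline-formula><tex-math>\{Det_{k}^{c} = \{uuid_{k}, score_{k}, bbox_{k}\} \mid k = 1, 2, \ldots, K\}</tex-math></inline-formula>, the voting result
<inline-formula><tex-math>Det^{c} = \{0, score, bbox\}</tex-math></inline-formula> is formalized as follows:</p>
        <p>
          <disp-formula id="eq-1"><label>(1)</label><tex-math>score = 1 - \prod_{k=1}^{K} (1 - score_{k})</tex-math></disp-formula>
          <disp-formula id="eq-2"><label>(2)</label><tex-math>bbox = \frac{\sum_{k=1}^{K} score_{k} \cdot bbox_{k}}{\sum_{k=1}^{K} score_{k}}</tex-math></disp-formula>
        </p>
        <p>where <inline-formula><tex-math>K</tex-math></inline-formula> denotes the number of vertexes in this clique.</p>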
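        <p>The voting step can be sketched as below, reading Formula (1) as one minus the product of the complements of the member scores and Formula (2) as the score-weighted average of the member boxes. The dictionary layout of a detection and the assumption that at least one score in the clique is positive are choices made for this example.</p>
        <preformat><![CDATA[
```python
def vote(clique):
    """Fuse the detections of one clique into a single detection.

    `clique` is a non-empty list of dicts with 'score' in (0, 1] and
    'bbox' as (x1, y1, x2, y2).
    """
    # Formula (1): the fused score grows with every agreeing detector.
    miss = 1.0
    for det in clique:
        miss *= 1.0 - det["score"]
    score = 1.0 - miss

    # Formula (2): score-weighted average of the box coordinates.
    total = sum(det["score"] for det in clique)
    bbox = tuple(
        sum(det["score"] * det["bbox"][i] for det in clique) / total
        for i in range(4)
    )
    # The fused detection gets uuid 0, as in the voting result {0, score, bbox}.
    return {"uuid": 0, "score": score, "bbox": bbox}
```
]]></preformat>
        <p>A clique with a single member is returned with its score and box unchanged, while two members with score 0.5 each yield a fused score of 0.75.</p>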
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3. EXPERIMENTS</title>
      <p>Experiments on the EAD2019 Challenge<sup>1</sup> are performed
following the proposed training protocol of Mask-aided R-CNN.
We train our models on servers with two 1080Ti GPUs.</p>
    </sec>
    <sec id="sec-8">
      <title>3.1. Experiments on training a basic Mask R-CNN model for the segmentation task</title>
      <p>
        First, we generate the bounded masks for the released samples
of the segmentation task to enable instance segmentation. The
released data (498 images with pixel-level labels in total) is
split into a training set (90%, 448 images) and a validation set
(10%, 50 images). Second, we train a Mask R-CNN model
with a ResNet101 backbone and a feature
pyramid network (FPN) [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ] on these training samples. We
perform two augmentation operations, i.e., random scaling and
random horizontal flipping. The network is trained
end-to-end using SGD with a momentum of 0.9 and a weight decay
of 0.0001. We train the model using mini-batches of size 2.
We use an initial learning rate of 0.005 that is decayed by a
factor of 10 at iterations 24000 and 48000. The
maximum number of training iterations is set to 72000.
      </p>
      <p>Footnote 1: https://ead2019.grand-challenge.org</p>
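      <p>The step learning-rate schedule described above can be written as a small function. This is a sketch under one assumption that the text does not settle: the decay takes effect from the milestone iteration itself onwards.</p>
      <preformat><![CDATA[
```python
def learning_rate(iteration, base_lr=0.005, steps=(24000, 48000), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone reached."""
    passed = sum(1 for step in steps if iteration >= step)
    return base_lr * gamma ** passed
```
]]></preformat>
      <p>With the values from the text, the rate is 0.005 up to iteration 23999, 0.0005 from iteration 24000 and 0.00005 from iteration 48000 until training stops at iteration 72000.</p>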
      <p>The trained model is tested on the validation set and the
evaluation results are shown in Table 1.</p>
    </sec>
    <sec id="sec-9">
      <title>3.2. Experiments on augmenting the training set of segmentation task with soft pixel-level labeled samples</title>
      <p>The trained Mask R-CNN is then used to generate soft
pixel-level labels for the training samples of the detection task. We follow
the trick detailed in Section 2.1.2 and enforce mask
prediction only for the ground truth instances, both to maintain
consistency between semantic segmentation and object detection and to
improve segmentation accuracy. We evaluate the generated
soft masks on the validation set of the segmentation task and the
results are shown in Table 2. Compared with Table 1, the
quality of the predicted masks has improved significantly, which
verifies the effectiveness of the proposed trick.</p>
      <p>The second step of the proposed training protocol is
performed only once in this experiment. We generate soft masks
for each released sample of the detection task. These soft mask
annotations, together with the released samples of the
detection task and the corresponding bounding box annotations,
constitute the whole dataset for the instance segmentation task.</p>
    </sec>
    <sec id="sec-10">
      <title>3.3. Experiments on training the Mask-aided R-CNN models for detection task</title>
      <p>The whole dataset consists of two released datasets, of which
the first contains 889 images and the second
contains 1306 images. We split the whole
dataset into one training set (90%, 800 images of the first
released dataset and 90%, 1175 images of the second released
dataset), a validation set of “release 1” (10%, 89 images of the
first released dataset) and a validation set of “release 2” (10%,
131 images of the second released dataset). We successively
train three Faster R-CNN models and three Mask-aided
R-CNN models on the training set. The corresponding backbone
networks of these models are ResNet50, ResNet50+FPN and
ResNet101+FPN, respectively. The Faster R-CNN models
are trained only with bounding box annotations, while the
Mask-aided R-CNN models are trained with both soft mask
annotations and bounding box annotations. Here, we perform
on-the-fly data augmentation on the training set with random scaling,
random horizontal flipping, random vertical flipping and
random cropping. Each model is trained end-to-end
using SGD with a momentum of 0.9. A weight decay
factor of 0.0002 is adopted when training the models with a
ResNet101+FPN backbone, while a weight decay factor of
0.0001 is adopted when training the other models. We train each
model using mini-batches of size 2. We use an initial learning
rate of 0.005 that is decayed by a factor of 10 at
iterations 24000 and 48000. The maximum number of training
iterations is set to 72000.</p>
      <p>We evaluate the iteration snapshots of each model on the
validation sets of “release 1” and “release 2”. The average
precision curves of each model are shown in Figures 9, 10,
11, 12, 13 and 14.</p>
      <p>For quantitative evaluation, we uniformly select two
iteration snapshots (iterations 40000 and 72000)
from each trained model and evaluate them on the
validation sets of “release 1” and “release 2”. The evaluation
results in terms of average precision (AP) and average recall (AR) are
shown in Table 3 and Table 4. The average of AP and AR
is adopted as the Evaluation Criterion (EC) score in the
experiments. In Table 3, the EC scores of the Mask-aided R-CNN
models are consistently higher than those of the Faster
R-CNN models. Specifically, the gap in EC score
between Mask-aided R-CNN and Faster R-CNN widens as
the complexity of the backbone network increases. This suggests
that the generated soft pixel-level labels help to train a
deeper convolutional network and thus improve detection
performance. The EC scores in Table 4 reveal a consistent
pattern.</p>
    </sec>
    <sec id="sec-11">
      <title>3.4. Experiments on ensemble method</title>
      <p>The two selected iteration snapshots of each model in Section
3.3 are used in the proposed ensemble method. In this
experiment, we implement three ensemble models: an
ensemble of the Faster R-CNN models, an ensemble of the Mask-aided
R-CNN models, and an ensemble of all the Faster R-CNN
and Mask-aided R-CNN models. The IOU threshold
in the ensemble method is consistently set to 0.4. We evaluate the
three ensemble models on the validation sets. The evaluation
results are shown in Table 3 and Table 4.</p>
      <p>The EC scores of the ensemble of Faster R-CNN models and the
ensemble of Mask-aided R-CNN models in Table 3 and Table 4 are significantly
higher than the EC scores of the corresponding single models.
Furthermore, the ensemble of all the Faster R-CNN and
Mask-aided R-CNN models significantly improves the EC
scores and is adopted as the final model for the EAD2019
Challenge. Such robust and significant improvements verify
the effectiveness of the proposed ensemble method.</p>
      <p>Fig. 11. Evaluation results of Mask-aided R-CNN with the
backbone of ResNet50 and FPN on the detection and
segmentation task.</p>
      <p>Fig. 13. Evaluation results of Mask-aided R-CNN with the
backbone of ResNet101 and FPN on the detection and
segmentation task.</p>
      <p>Fig. 14. Evaluation results of Faster R-CNN with the
backbone of ResNet101 and FPN on the detection task.</p>
      <p>[Recovered panel captions of the curve figures: (a) Evaluation results of
detection and segmentation task on the validation set of “release 1” dataset;
(b) Evaluation results of detection and segmentation task on the validation set
of “release 2” dataset. The horizontal axis is the training iteration, from 0 to 70000.]</p>
    </sec>
    <sec id="sec-12">
      <title>4. CONCLUSION</title>
      <p>In this paper, we introduce ensemble Mask-aided R-CNN
with a flexible, multi-stage training protocol for the
detection and segmentation tasks of the EAD2019 Challenge.
Our experiments demonstrate the effectiveness
of this work. More specifically, the Mask-aided strategy using
soft pixel-level labels of incomplete categories helps to
train a deeper convolutional network and to improve
detection performance. The proposed ensemble method
fuses detection results from different detectors and further
improves detection performance with no training cost. Certain
parts of the proposed method remain to be explored,
such as how to further improve segmentation
performance with soft pixel-level labels.</p>
    </sec>
    <sec id="sec-13">
      <title>5. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          , Wei Dong, Richard Socher,
          <string-name>
            <surname>Li-Jia</surname>
            <given-names>Li</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>Li</surname>
          </string-name>
          Fei-Fei, “
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          ,” in
          <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          . IEEE
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tsung</given-names>
            <surname>Yi Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Maire</surname>
          </string-name>
          , Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, “
          <article-title>Microsoft coco: Common objects in context</article-title>
          ,”
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y</given-names>
            <surname>Lecun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and G Hinton,
          “
          <article-title>Deep learning</article-title>
          ,”
          <source>Nature</source>
          , vol.
          <volume>521</volume>
          , no.
          <issue>7553</issue>
          , pp.
          <fpage>436</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “
          <article-title>Deep residual learning for image recognition</article-title>
          ,” in
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Li</given-names>
            <surname>Da</surname>
          </string-name>
          ,
          <string-name>
            <surname>Li Lin</surname>
            ,
            <given-names>and Li</given-names>
          </string-name>
          <string-name>
            <surname>Xiang</surname>
          </string-name>
          , “
          <article-title>Classification of remote sensing images based on densely connected convolutional networks</article-title>
          ,” in
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , Kaiming He,
          <string-name>
            <surname>Ross Girshick</surname>
          </string-name>
          , and Jian Sun, “
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          ,” in
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wei</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Dragomir Anguelov, Dumitru Erhan,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Scott Reed, Cheng Yang Fu, and
          <string-name>
            <surname>Alexander</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berg</surname>
          </string-name>
          , “
          <article-title>SSD: Single shot multibox detector</article-title>
          ,” in
          <source>European Conference on Computer Vision</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Olaf</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and Thomas Brox, “
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>
          ,” in
          <source>International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “
          <article-title>Mask R-CNN</article-title>
          ,” in
          <source>Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Cheng-Yang</surname>
            <given-names>Fu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mykhailo Shvets</surname>
          </string-name>
          , and Alexander C Berg, “
          <article-title>RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free</article-title>
          ,” arXiv preprint arXiv:1901.03353,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Sharib</surname>
            <given-names>Ali</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Felix Zhou</surname>
          </string-name>
          , Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnières, Victor Loschenov, Enrico Grisan, Walter Blondel, and Jens Rittscher, “
          <article-title>Endoscopy artifact detection (EAD 2019) challenge dataset</article-title>
          ,”
          <source>CoRR</source>
          , vol. abs/1905.03209,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Sharib</surname>
            <given-names>Ali</given-names>
          </string-name>
          , Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher, “
          <article-title>A deep learning framework for quality assessment and restoration in video endoscopy</article-title>
          ,” arXiv preprint arXiv:1904.07073,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Tsung-Yi</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Piotr Dollár, Ross Girshick, Kaiming He,
          <string-name>
            <surname>Bharath Hariharan</surname>
          </string-name>
          , and Serge Belongie, “
          <article-title>Feature pyramid networks for object detection</article-title>
          ,” in
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2117</fpage>
          -
          <lpage>2125</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>