<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
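        <article-title>ENDOSCOPIC DETECTION AND SEGMENTATION OF GASTROENTEROLOGICAL DISEASES WITH DEEP CONVOLUTIONAL NEURAL NETWORKS</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrian Krenzer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amar Hekalo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Puppe</string-name>
        </contrib>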
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence and Knowledge Systems, University of Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Previous endoscopic computer vision research focused mostly on the detection of a single disease, e.g. polyps. The Endoscopic Disease Detection challenge (EDD2020) extends this task by providing data for different diseases in various organs. EDD2020 includes two sub-tasks: (1) multi-class disease detection: localization of bounding boxes and class labels for the five disease classes polyp, Barrett's esophagus (BE), suspicious, high-grade dysplasia (HGD) and cancer; (2) region segmentation: boundary delineation of detected diseases. In this paper, we describe our approach, which leverages deep convolutional neural networks (CNNs). We compare two general state-of-the-art object detection approaches: single-shot detection (SSD) and two-step, region-proposal-based CNNs. Accordingly, we compare two models: YOLOv3 (SSD) and Faster R-CNN with a ResNet-101 backbone. For the second task, we use the state-of-the-art Cascade Mask R-CNN with various backbones and compare the results. To minimize the generalization error, we apply data augmentation; finally, we use knowledge from the endoscopic domain to refine our models during post-processing and compare the resulting performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Endoscopy covers many different areas and organs of the human body, such as the bladder, the stomach or the colon, allowing gastroenterologists to potentially discover a wide array of diseases and abscesses, like polyps, cancer and Barrett’s esophagus. To help ensure that all diseases are detected and to improve the workflow, real-time detection using Deep Learning is becoming more prevalent. Previous publications report good results on real-time detection of endoscopic polyps using Single Shot Detector [1] based CNNs [2] as well as an anchor-free approach called AFP-Net [3]. Existing work usually focuses on one disease class, like polyp or cancer detection, mostly due to a lack of annotated data. The Endoscopic Disease Detection Challenge 2020 [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ] partially solves this issue by providing endoscopic images of three different organs, namely colon, esophagus and stomach, with five disease classes. Additionally, the challenge provides corresponding bounding boxes for object detection as well as polygonal masks for image segmentation. In this paper, we train state-of-the-art Deep Learning models for both tasks using various architectures and compare their performance.
      </p>
      <p>Challenge website (footnote 1): https://edd2020.grand-challenge.org</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATASETS AND DATA ANALYSIS</title>
      <p>
        In order to choose and prepare the right deep CNN for the
task, we start by analyzing the given training data in detail.
The EDD2020 challenge [
        <xref ref-type="bibr" rid="ref1">4</xref>
        ] provides a training data set for
multi-class disease detection, which contains 386 endoscopic
images labeled with 684 bounding boxes and 502
segmentation masks. While analyzing the data, we noticed a class
imbalance. We therefore counted the occurrences of each
class throughout the dataset based on the bounding boxes.
The dataset has more than 200 images with polyps and BE
but fewer than 100 samples for each of the three remaining
classes. It might therefore be challenging to learn the correct
assessment of the classes HGD, suspicious and cancer. This
unbalanced sample distribution is one difficulty of the dataset
and is taken into account when choosing our model and its
hyperparameters. The second difficulty we noticed is the
variation in box sizes. We therefore calculated the area of
all boxes. The mean box area is nearly the same for most classes,
while the variation of the areas differs enormously,
especially for the polyp class, where the standard deviation is
significantly larger than for the other classes.
      </p>
      <p>Finally, for the segmentation task, every image comes with masks that specify the regions of interest separately for each class. While most of the images belong to a unique class, some of them have several masks with overlapping regions, which is especially apparent for the “suspicious” class. The latter is often only part of a region of an already existing class. Hence this is a multi-class, multi-label segmentation task with independent classes. We randomly split the dataset into a 90% training set and a 10% validation set, where the best model is chosen by minimum validation loss during training.</p>
      <p>[Fig. 1: panel (a) Detection and panel (b) Segmentation. Recoverable labels: Input; YOLOv3; Faster R-CNN; Cascade R-CNN; Post-processing with domain knowledge.]</p>
      <p>
        Additional data: In order to improve generalization, we
extend the training dataset by including images from openly
accessible databases. We include two datasets from a
previous endoscopic vision challenge [
        <xref ref-type="bibr" rid="ref2">5</xref>
        ], namely the ETIS-Larib
Polyp database [
        <xref ref-type="bibr" rid="ref3">6</xref>
        ], which consists of 196 polyp images, and
the CVC-ClinicDB [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ], which consists of 612 polyp images,
as well as the dataset from the Gastrointestinal Image
Analysis (GIANA) challenge [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ], with 412 polyp images. All three
datasets come with segmentation masks. We derive the
corresponding bounding boxes from these masks
ourselves. In addition, we include the Kvasir-SEG dataset
[
        <xref ref-type="bibr" rid="ref7">9</xref>
        ], which consists of 1000 polyp images with both
segmentation masks and bounding boxes. Finally, we extract
images annotated with esophagitis from the Kvasir2 dataset [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ].
Esophagitis and Barrett’s esophagus occur at the same
position in the esophagus, and some symptoms of
esophagitis are very similar to those of Barrett’s esophagus.
We therefore add images with esophagitis symptoms that look
similar to Barrett’s esophagus and test whether they improve our
results. We observe a slight improvement in the BE results and
therefore include 103 additional images, for a total of 2323
additional training images. Nevertheless, Barrett’s esophagus and
esophagitis are different diseases and have to be distinguished
in further research if more classes are included in the
classification task.
      </p>
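      <p>Deriving a bounding box from a segmentation mask, as described above for the three polyp datasets, can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes a binary mask stored as nested Python lists, and the helper name mask_to_bbox as well as the inclusive (x_min, y_min, x_max, y_max) convention are hypothetical choices made for this example.</p>
      <preformat><![CDATA[
```python
from typing import List, Optional, Tuple

# Binary mask as rows of 0/1 values (hypothetical in-memory format).
Mask = List[List[int]]


def mask_to_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return the tight box (x_min, y_min, x_max, y_max), inclusive,
    around all foreground pixels, or None for an empty mask."""
    # Row indices that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


example = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_bbox(example))  # (1, 1, 3, 2)
```
]]></preformat>
      <p>A mask that contains several separate regions would yield a single enclosing box with this sketch; splitting it into one box per connected region would require an additional labeling step that is not shown here.</p>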
    </sec>
    <sec id="sec-3">
      <title>3. METHODS</title>
      <p>
        In this section, we describe our approaches for the two
sub-tasks. All our models are trained on an Nvidia Tesla P100
GPU. After exploring the data, we chose CNNs
for the challenge, as they have proven to be very stable in
classic multi-class detection tasks like the COCO challenge [
        <xref ref-type="bibr" rid="ref9">11</xref>
        ].
In the domain of object detection, we consider two main
concepts that have proven successful in multi-class object
detection: first, a two-step method of region proposals and
subsequent classification of the proposed regions, like Faster
R-CNN; second, single-shot detection (SSD), which is mostly
applicable in real time. We compare the results of the SSD
model and Faster R-CNN. To improve our results further, we
combine those two algorithms in our final architecture. For
the second task, since both bounding boxes and segmentation
masks are available, we choose the Cascade Mask R-CNN.
Incorporating both types of annotations achieves the best
results. For both tasks, we add a post-processing step based on
gastroenterological knowledge. Figure 1 depicts our final
architecture for the detection and segmentation tasks. For
training the Faster R-CNN, we leverage the open-source
Detectron2 framework [
        <xref ref-type="bibr" rid="ref10">12</xref>
        ].
      </p>
      <p>
        By including the 2220 additional polyp images, we
significantly increase the class imbalance of the training data. Class
balance is crucial for training and inference of neural
networks. To tackle this problem, we use class weights in the
algorithms: the loss of an underrepresented class
is multiplied by a weight that balances its contribution to the total
loss function. With those weights, we observe an
improvement in polyp detection without losing detection
performance in the other classes [
        <xref ref-type="bibr" rid="ref11">13</xref>
        ].
      </p>
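      <p>The idea of class weights can be illustrated with a short sketch. The text does not state how the weights were computed, so the inverse-frequency scheme below is an assumption, and the class counts are hypothetical placeholders rather than the statistics of the actual training set.</p>
      <preformat><![CDATA[
```python
import math
from typing import Dict, Sequence


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by total / (num_classes * count).

    Rare classes receive weights above 1, frequent classes below 1.
    This scheme is an assumption; the paper does not specify one.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


def weighted_cross_entropy(probs: Sequence[float], target: int,
                           weights: Sequence[float]) -> float:
    """Cross-entropy of one sample, scaled by the weight of its true class."""
    return -weights[target] * math.log(probs[target])


# Hypothetical per-class box counts, NOT the paper's numbers.
counts = {"polyp": 2400, "BE": 250, "suspicious": 90, "HGD": 80, "cancer": 70}
weights = inverse_frequency_weights(counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```
]]></preformat>
      <p>With this scheme every class contributes the same total weight to the loss, regardless of how many samples it has.</p>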
    </sec>
    <sec id="sec-4">
      <title>3.1. Task 1 multi-class bounding box detection:</title>
      <p>
        As mentioned above, we want to compare two common object
detection approaches, namely SSD and what we call a classic
region proposal approach. Compared to classic approaches,
SSD enables real-time detection. In practice, real-time
detection is critical: gastroenterological diseases often
receive treatment directly (e.g., ablation of a polyp). Therefore,
a low inference time is required to apply the
models in clinical practice. In contrast, larger architectures may
perform better in tasks such as determining
the stage of the disease, which mostly have no real-time
restrictions. Nevertheless, a larger architecture may perform
well on our challenge task, too. Therefore, we use one
model from each of these two approaches. The SSD model
we use is the YOLOv3 algorithm [
        <xref ref-type="bibr" rid="ref12">14</xref>
        ], which is the
third version of the well-known YOLO architecture [
        <xref ref-type="bibr" rid="ref13">15</xref>
        ] and
adds residual blocks that allow training deeper networks
while mitigating the vanishing gradient problem. We use the
YOLOv3 algorithm with initial weights pre-trained on the
COCO dataset [
        <xref ref-type="bibr" rid="ref9">11</xref>
        ]. In the next step, we unfreeze the last
two layers of the network and train them using the Adam
optimizer [
        <xref ref-type="bibr" rid="ref4">16</xref>
        ]. We train for 50 epochs. Afterwards, we
unfreeze the whole network and train until early
stopping ends the training, resulting in an additional 33 epochs.
      </p>
      <p>
        As a classic larger architecture, we use a Faster R-CNN
[
        <xref ref-type="bibr" rid="ref14">17</xref>
        ] with a ResNet-101 backbone. We use a batch
size of 2 because of the computational expense of this large
network. We initialize the network with weights pre-trained
on the COCO dataset. We choose a learning rate of 0.00025
for the training.
      </p>
      <p>Post-processing: The YOLOv3 architecture is more
successful in classifying polyps and HGD, whereas the Faster R-CNN
architecture is better at detecting BE, suspicious and cancer. We
therefore combine both networks in an ensemble to improve our detection
results: YOLOv3 predicts HGD and polyps, while
the Faster R-CNN algorithm predicts BE, suspicious and
cancer. Both algorithms can predict all labels, but we only use
the predictions of the specified classes from each algorithm.
To further improve our results, we use
gastroenterological knowledge and knowledge of the dataset
structure. As the probability is low that BE and polyps occur
in the same image, we implement a simple rule: if both polyps
and BE are detected, we only produce boxes for the class with
the higher probability, i.e., if the probability for polyps is higher
than for BE, no bounding boxes are predicted for BE.</p>
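      <p>The class-wise ensemble and the polyp/BE rule can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the Detection type and function names are hypothetical, and comparing the highest confidence per class is an assumption, since the text does not say how the two probabilities are compared when several boxes exist.</p>
      <preformat><![CDATA[
```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


# Class assignment as described in the text.
YOLO_CLASSES = {"polyp", "HGD"}
FASTER_RCNN_CLASSES = {"BE", "suspicious", "cancer"}


def ensemble(yolo: List[Detection], faster_rcnn: List[Detection]) -> List[Detection]:
    """Keep only the classes each detector is responsible for."""
    kept = [d for d in yolo if d.label in YOLO_CLASSES]
    kept += [d for d in faster_rcnn if d.label in FASTER_RCNN_CLASSES]
    return kept


def apply_polyp_be_rule(detections: List[Detection]) -> List[Detection]:
    """If polyp and BE both occur, drop the class with the lower top score.

    Assumption: the per-class "probability" is the maximum box confidence.
    On a tie this sketch keeps BE, which is an arbitrary choice.
    """
    polyp_scores = [d.score for d in detections if d.label == "polyp"]
    be_scores = [d.score for d in detections if d.label == "BE"]
    if not polyp_scores or not be_scores:
        return detections
    loser = "BE" if max(polyp_scores) > max(be_scores) else "polyp"
    return [d for d in detections if d.label != loser]


yolo_out = [Detection("polyp", 0.90, (0, 0, 10, 10)),
            Detection("BE", 0.95, (0, 0, 5, 5))]
frcnn_out = [Detection("BE", 0.60, (1, 1, 4, 4)),
             Detection("polyp", 0.30, (2, 2, 6, 6)),
             Detection("cancer", 0.50, (3, 3, 8, 8))]
final = apply_polyp_be_rule(ensemble(yolo_out, frcnn_out))
print([d.label for d in final])  # ['polyp', 'cancer']
```
]]></preformat>
      <p>In the example, the BE box from YOLOv3 is ignored because BE is assigned to Faster R-CNN, and the remaining BE box is removed because the polyp confidence is higher.</p>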
    </sec>
    <sec id="sec-5">
      <title>3.2. Task 2 region segmentation:</title>
      <p>
        For the image segmentation task, we train two similar
architectures with various backbones, namely Mask R-CNN [
        <xref ref-type="bibr" rid="ref15">18</xref>
        ]
and its successor, Cascade Mask R-CNN [
        <xref ref-type="bibr" rid="ref16">19</xref>
        ]. Both
architectures are primarily two-stage object detection models based
on Faster R-CNN, i.e. a region proposal network first
proposes candidate bounding boxes (Regions of Interest, RoI)
before the final prediction. Here, they add another branch
used to predict segmentation masks, where the proposed RoIs
are used to enhance the segmentation mask predictions in
contrast to using fully convolutional networks only. Cascade
Mask R-CNN is an extended framework using a cascade-like
structure and is essentially an ensemble of several Mask
R-CNNs with weight sharing on the backbones.
      </p>
      <p>Fig. 2: In order to train Mask and Cascade Mask R-CNN
for semantic segmentation, some bounding boxes had to be
adjusted. We transform the boxes from including several
instances (left) to be only one instance (right).</p>
      <p>We choose these types of models for two reasons. First,
since we have both bounding boxes and segmentation masks
available as training data, we can fully utilize the Mask R-CNN
approach, in which the RoIs influence the segmentation.
Second, since these networks are designed for instance
segmentation, each class is predicted independently of the
others, which is a perfect fit for our multi-class multi-label
problem. As this is a semantic segmentation task, we treat it as an
instance segmentation with only one instance per occurrence
per class. We therefore had to adjust some of the ground truth
bounding boxes in our data, as shown in Fig. 2.</p>
      <p>
        For Mask R-CNN we use the ResNeXt-101-32x8d [
        <xref ref-type="bibr" rid="ref17">20</xref>
        ]
and for Cascade Mask R-CNN the ResNeXt-152-32x8d [
        <xref ref-type="bibr" rid="ref17">20</xref>
        ]
models as backbones, both of which are CNN classifiers
pre-trained on the ImageNet-1k dataset [
        <xref ref-type="bibr" rid="ref18">21</xref>
        ]. Additionally, both
full architectures are pre-trained on the COCO dataset [
        <xref ref-type="bibr" rid="ref9">11</xref>
        ],
hence we utilize transfer learning due to the small size of our
training dataset.
      </p>
      <p>
        The networks are trained using the Detectron2 framework
[
        <xref ref-type="bibr" rid="ref10">12</xref>
        ] which provides a wide range of pre-trained object
detection and segmentation models. As a pre-processing step,
we convert our data to the COCO dataset format. Image
preprocessing, i.e. padding, resizing, rescaling the pixel values
etc., is then performed automatically within the framework.
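The total loss is the sum of classification, box-regression and
mask loss, <inline-formula><tex-math>L = L_{cls} + L_{box} + L_{mask}</tex-math></inline-formula> [
        <xref ref-type="bibr" rid="ref15">18</xref>
        ], where <inline-formula><tex-math>L_{mask}</tex-math></inline-formula> is
the binary cross-entropy for independent segmentation of all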
masks. The models are trained using stochastic gradient
descent with a learning rate of 0.00025 and a batch size of 2.
They are trained for up to 10000 iterations with checkpoints
every 500 iterations. We then choose the checkpoint with the
lowest validation loss as our final model. We also apply data
augmentation in the form of random horizontal and vertical
flipping as well as random resizing with retained aspect ratio
in order to minimize the generalization error.
      </p>
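      <p>For reference, the loss above can be written out as follows. The mask term follows the definition given in the Mask R-CNN paper [18]: an average binary cross-entropy over the mask predicted for the ground-truth class of a RoI. The symbols m (mask resolution), y (ground-truth mask), ŷ (predicted mask probability) and k (ground-truth class) are notation introduced here for illustration and do not appear in the original text.</p>
      <preformat><![CDATA[
```latex
\begin{align}
L &= L_{cls} + L_{box} + L_{mask}, \\
L_{mask} &= -\frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m}
  \left[ y_{ij} \log \hat{y}^{k}_{ij}
       + (1 - y_{ij}) \log\left(1 - \hat{y}^{k}_{ij}\right) \right]
\end{align}
```
]]></preformat>
      <p>Because only the mask of class k enters the loss for a given RoI, the masks of different classes do not compete with each other, which matches the independent, multi-label setting described in Section 2.</p>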
      <p>Post-processing: To further improve our results, we use
knowledge from gastroenterology and knowledge of the
dataset structure. As mentioned above, the probability that
BE and polyps are present in the same image is very low. We
apply the following procedure to the polyp/BE predictions:</p>
      <list list-type="bullet">
        <list-item>
          <p>We utilize the predictions from object detection and only predict masks where bounding boxes from YOLOv3 and Faster R-CNN are present.</p>
        </list-item>
        <list-item>
          <p>As an additional criterion, pixels within bounding boxes of probability &lt; 0.2 are labeled with 0, i.e. no disease present.</p>
        </list-item>
        <list-item>
          <p>If both polyps and BE are detected, we only produce masks for the class with the higher probability, as with the detection model.</p>
        </list-item>
      </list>
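      <p>The three post-processing steps for segmentation can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: masks are nested lists of 0/1 values, boxes use inclusive pixel coordinates, and a pixel is kept if it lies inside at least one box with a score of 0.2 or more. The function names and the per-class score dictionary are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive
Mask = List[List[int]]

MIN_BOX_SCORE = 0.2  # threshold stated in the text


def gate_mask_by_boxes(mask: Mask, boxes: List[Tuple[Box, float]]) -> Mask:
    """Keep mask pixels only inside detection boxes that pass the threshold.

    Pixels outside every box, or only inside low-confidence boxes,
    are set to 0 (no disease present).
    """
    confident = [box for box, score in boxes if score >= MIN_BOX_SCORE]
    gated = []
    for y, row in enumerate(mask):
        gated.append([
            value if any(x0 <= x <= x1 and y0 <= y <= y1
                         for x0, y0, x1, y1 in confident) else 0
            for x, value in enumerate(row)
        ])
    return gated


def resolve_polyp_be(masks: Dict[str, Mask],
                     scores: Dict[str, float]) -> Dict[str, Mask]:
    """If polyp and BE masks both exist, keep the higher-scoring class."""
    if "polyp" in masks and "BE" in masks:
        loser = "BE" if scores["polyp"] > scores["BE"] else "polyp"
        return {name: m for name, m in masks.items() if name != loser}
    return masks


full = [[1, 1, 1, 1] for _ in range(3)]
boxes = [((0, 0, 1, 1), 0.9), ((2, 0, 3, 2), 0.1)]
print(gate_mask_by_boxes(full, boxes))
```
]]></preformat>
      <p>In the example, only the 2x2 region covered by the confident box survives; the region under the box with score 0.1 is cleared.</p>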
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS</title>
      <p>In this section, we describe our results on the two sub-tasks. In both settings, we highlight the performance of the algorithms for every single disease. For this purpose, we create a validation set consisting of 40 images randomly chosen from the provided data (no additional data is included). We test the detection as well as the segmentation on this validation set.</p>
      <sec>
        <title>4.1. Task 1</title>
        <p>
          Table 1 shows our results on our validation set for the detection task, where YOLOv3 is the described SSD algorithm, Faster R-CNN is the Faster R-CNN algorithm with ResNet-101 backbone, and ensemble with pp (post-processing) is the ensemble of those two combined with the hard-coded rule. We report the mean average precision with a minimum IoU of 0.5 (mAP) [
          <xref ref-type="bibr" rid="ref9">11</xref>
          ]. We highlight the performance of the algorithms split by the five diseases. All of the algorithms perform excellently in detecting polyps; this is mostly due to our additional polyp training data (see Section 2). BE is better detected by the Faster R-CNN algorithm, which is why we use this algorithm for detecting BE in the ensemble. Notably, suspicious is one of the harder classes to classify correctly, as YOLOv3 only reaches a detection performance of 10% mAP. As shown in Table 1, cancer is detected quite well by all of the algorithms. All things considered, the ensemble with post-processing is the best algorithm in this task. The post-processing and combination of YOLOv3 and Faster R-CNN (ensemble with pp) improves the performance compared to the single YOLOv3 method by 7.95%. Figure 3 shows a detection result of the YOLOv3 algorithm and a segmentation result of the Cascade Mask R-CNN. On the EDD2020 challenge [
          <xref ref-type="bibr" rid="ref1">4</xref>
          ] test set, the ensemble architecture achieves a detection score of 0.3360 ± 0.0852.
        </p>
      </sec>
      <sec>
        <title>4.2. Task 2</title>
        <p>
          As in task 1, we evaluate our models on our validation set, a subset of the provided data, using both the Dice coefficient and the intersection over union (IoU). Table 2 summarizes these results. While Mask R-CNN outperforms Cascade Mask R-CNN in both polyp and BE classes, Cascade Mask R-CNN provides better results overall, especially on the other three classes, which are comparatively underrepresented in our training data. Applying the post-processing steps described in Section 3 further improves the results of Cascade Mask R-CNN, but interestingly worsens the micro-averaged score, which we discuss below. On the EDD2020 challenge [
          <xref ref-type="bibr" rid="ref1">4</xref>
          ] test set, Cascade Mask R-CNN achieves a segmentation score of 0.6526 ± 0.3418.
        </p>
        <p>Fig. 3: Exemplary results for both detection with YOLOv3 (upper) and segmentation with Cascade Mask R-CNN (lower).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. DISCUSSION &amp; CONCLUSION</title>
      <p>All of our models in both tasks perform best on the polyp class
and worst on the suspicious class. Since data on polyps
is abundant in our training set, it is clear why the networks
show good results for this class. The suspicious class, however,
has a similar number of samples as HGD and cancer, yet,
with the exception of Cascade Mask R-CNN, all models
perform significantly worse on this class. This is most likely due
to the unclear nature of this class, as it often denotes regions
belonging to different types of diseases, i.e. in some images
it denotes possible cancer, whereas in others it signifies
possible BE. Additionally, gastroenterologists performing the examination often
have differing opinions on which areas can be considered
suspicious, which adds further noise to our data. The
performance of Cascade Mask R-CNN on suspicious and the other
less represented classes can be attributed to its ensemble-like
structure. The discrepancy in the micro-averaged scores can
be explained as follows: our post-processing severely reduces
the number of false positives, but also adds some false
negatives. This improves the class-based score, since classes with
empty masks on an image receive perfect scores this way.
With micro-averaging, however, since precision and recall are
the same, we essentially look at the per-pixel accuracy of the
entire mask, which ultimately worsens this score.</p>
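      <p>The effect described above can be reproduced on a toy example. The sketch below computes Dice and IoU from pixel counts and contrasts a class-averaged Dice score with a micro-averaged one. It rests on two assumptions: an empty prediction for an empty ground truth scores 1.0, as the text implies, and micro-averaging pools the pixel counts of all classes before computing the score, which is one common definition and may differ from the challenge's exact metric. The masks are made up for illustration.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def confusion(pred: Sequence[int], truth: Sequence[int]) -> Counts:
    """Count TP, FP and FN pixels for one flattened binary mask."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if t and not p)
    return tp, fp, fn


def dice(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom  # empty vs. empty counts as perfect


def iou(tp: int, fp: int, fn: int) -> float:
    denom = tp + fp + fn
    return 1.0 if denom == 0 else tp / denom


def class_averaged_dice(preds: List[Sequence[int]], truths: List[Sequence[int]]) -> float:
    """Mean of the per-class Dice scores."""
    scores = [dice(*confusion(p, t)) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


def micro_averaged_dice(preds: List[Sequence[int]], truths: List[Sequence[int]]) -> float:
    """Dice computed on pixel counts pooled over all classes."""
    totals = [confusion(p, t) for p, t in zip(preds, truths)]
    tp, fp, fn = (sum(c[i] for c in totals) for i in range(3))
    return dice(tp, fp, fn)


# Two classes; the second class is absent in the ground truth.
truths = [[1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 0]]
raw = [[1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 1, 0]]            # one false-positive pixel
post_processed = [[1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]  # FP removed, two FNs added

for name, preds in (("raw", raw), ("post-processed", post_processed)):
    print(name, round(class_averaged_dice(preds, truths), 3),
          round(micro_averaged_dice(preds, truths), 3))
```
]]></preformat>
      <p>In this toy case the class-averaged Dice rises from 0.5 to about 0.833 after post-processing, because the now-empty second class scores 1.0, while the micro-averaged Dice falls from about 0.889 to about 0.667 because of the added false negatives.</p>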
      <p>Our model outperforms the best network from [2], namely
SSD with an InceptionV3 backbone, which was partially
trained using the same polyp databases and showed a
precision of 73.6% on the MICCAI 2015 evaluation dataset,
compared to our 84.19% with YOLOv3. AFP-Net [3] performs
better than our model, with a precision of 88.89% on
the ETIS-Larib dataset and 99.36% on the CVC-Clinic-train
dataset. However, in both cases, direct comparison is
difficult since both the training and the evaluation data
differ. Additionally, we perform multi-class prediction,
which can be a more difficult task than binary
prediction.</p>
      <p>We applied state-of-the-art Deep Learning architectures
for the detection and semantic segmentation of five
different gastroenterological diseases. For detection, we evaluated
three approaches: YOLOv3, Faster R-CNN, and
our combination of those algorithms. Furthermore, our
ensemble includes domain-knowledge-based post-processing,
which further enhances our results in the challenge. For
segmentation, we evaluated three models: Cascade Mask
R-CNN, its predecessor Mask R-CNN, and Cascade Mask
R-CNN combined with post-processing. In the region
segmentation task, the Cascade Mask R-CNN with additional
post-processing reliably performs as well as or better than the
other networks. For future work, we intend to improve our
results by adding more training data, applying additional forms
of data augmentation and further tuning hyperparameters. All
in all, we present state-of-the-art results in the EDD challenge
with our detection and segmentation applications.</p>
    </sec>
    <sec id="sec-8">
      <title>6. REFERENCES</title>
      <p>[1] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.</p>
      <p>[2] M. Liu, J. Jiang, and Z. Wang. Colonic polyp detection in endoscopic videos with single shot detection based deep convolutional neural network. IEEE Access, 7:75058–75066, 2019.</p>
      <p>[3] Dechun Wang, Ning Zhang, Xinzi Sun, Pengfei Zhang, Chenxi Zhang, Yu Cao, and Benyuan Liu. AFP-Net: Realtime anchor-free polyp detection in colonoscopy, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sharib</given-names>
            <surname>Ali</surname>
          </string-name>
          , Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Daul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>East</surname>
          </string-name>
          .
          <article-title>Endoscopy disease detection challenge 2020</article-title>
          .
          <source>arXiv preprint arXiv:2003.03376</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bernal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tajbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Matuszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Angermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rustad</surname>
          </string-name>
          , I. Balasingham,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Debard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Maier-Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Speidel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brandao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Córdova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sánchez-Montes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Gurudu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fernández-Esparrach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Histace</surname>
          </string-name>
          .
          <article-title>Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge</article-title>
          .
          <source>IEEE Transactions on Medical Imaging</source>
          ,
          <volume>36</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1231</fpage>
          -
          <lpage>1249</lpage>
          ,
          <year>June 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Histace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Romain</surname>
          </string-name>
          , et al.
          <article-title>Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer</article-title>
          .
          <source>Int J CARS</source>
          ,
          <volume>9</volume>
          :
          <fpage>283</fpage>
          -
          <lpage>293</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Bernal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Javier</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gloria</given-names>
            <surname>Fernández-Esparrach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Debora</given-names>
            <surname>Gil</surname>
          </string-name>
          , Cristina Rodríguez, and Fernando Vilariño.
          <article-title>WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians</article-title>
          .
          <source>Computerized Medical Imaging and Graphics</source>
          ,
          <volume>43</volume>
          :
          <fpage>99</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y. B.</given-names>
            <surname>Guo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bogdan J.</given-names>
            <surname>Matuszewski</surname>
          </string-name>
          .
          <article-title>GIANA polyp segmentation with fully convolutional dilation neural networks</article-title>
          .
          <source>In VISIGRAPP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          , Pia H. Smedsrud, Michael Riegler, Pål Halvorsen, Dag Johansen, Thomas de Lange, and
          <string-name>
            <given-names>Håvard D.</given-names>
            <surname>Johansen</surname>
          </string-name>
          .
          <article-title>Kvasir-seg: A segmented polyp dataset</article-title>
          .
          In
          <source>Proceedings of the International Conference on Multimedia Modeling (MMM)</source>
          . Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
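          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kristin Ranheim</given-names>
            <surname>Randel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Griwodz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sigrun Losada</given-names>
            <surname>Eskeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas</given-names>
            <surname>de Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dag</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Concetto</given-names>
            <surname>Spampinato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Duc-Tien</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mathias</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter Thelin</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Riegler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .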
          <article-title>Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection</article-title>
          .
          In
          <source>Proceedings of the 8th ACM on Multimedia Systems Conference, MMSys'17</source>
          , pages
          <fpage>164</fpage>
          -
          <lpage>169</lpage>
          , New York, NY, USA,
          <year>2017</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [11]
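          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Serge</given-names>
            <surname>Belongie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pietro</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Deva</given-names>
            <surname>Ramanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. Lawrence</given-names>
            <surname>Zitnick</surname>
          </string-name>
          .
          <article-title>Microsoft COCO: Common objects in context</article-title>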
          .
          In
          <source>European Conference on Computer Vision</source>
          , pages
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [12]
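          <string-name>
            <given-names>Yuxin</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Kirillov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wan-Yen</given-names>
            <surname>Lo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          .
          <source>Detectron2</source>
          . https://github.com/facebookresearch/detectron2,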
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Chen</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yining</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chen Change</given-names>
            <surname>Loy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>Learning deep representation for imbalanced classification</article-title>
          .
          In
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>5375</fpage>
          -
          <lpage>5384</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Redmon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Farhadi</surname>
          </string-name>
          .
          <article-title>YOLOv3: An incremental improvement</article-title>
          .
          <source>arXiv preprint arXiv:1804.02767</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Santosh</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Farhadi</surname>
          </string-name>
          .
          <article-title>You only look once: Unified, real-time object detection</article-title>
          .
          In
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          .
          In
          <source>Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
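          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Georgia</given-names>
            <surname>Gkioxari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ross B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          .
          <article-title>Mask R-CNN</article-title>
          .
          <source>CoRR</source>
          , abs/1703.06870,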
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Zhaowei</given-names>
            <surname>Cai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nuno</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          .
          <article-title>Cascade R-CNN: high quality object detection and instance segmentation</article-title>
          .
          <source>CoRR</source>
          , abs/1906.09756,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [20]
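          <string-name>
            <given-names>Saining</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ross B.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhuowen</given-names>
            <surname>Tu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          .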
          <article-title>Aggregated residual transformations for deep neural networks</article-title>
          .
          <source>CoRR</source>
          , abs/1611.05431
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [21]
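          <string-name>
            <given-names>Olga</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jia</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hao</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sean</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhiheng</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrej</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fei-Fei</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>ImageNet large scale visual recognition challenge</article-title>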
          .
          <source>CoRR</source>
          , abs/1409.0575
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>