ENDOSCOPIC DETECTION AND SEGMENTATION OF GASTROENTEROLOGICAL DISEASES WITH DEEP CONVOLUTIONAL NEURAL NETWORKS

Adrian Krenzer, Amar Hekalo, Frank Puppe
Department of Artificial Intelligence and Knowledge Systems, University of Würzburg, Germany

ABSTRACT

Previous endoscopic computer vision research focused mostly on the detection of a singular disease, e.g. polyps. The Endoscopic Disease Detection Challenge (EDD2020) extends this classification task by providing data for different diseases in various organs. The EDD2020 includes two sub-tasks¹: (1) multi-class disease detection: localization of bounding boxes and class labels for the five disease classes polyp, Barrett's esophagus (BE), suspicious, high grade dysplasia (HGD) and cancer; (2) region segmentation: boundary delineation of detected diseases. In this paper, we describe our approach leveraging deep convolutional neural networks (CNNs). We highlight the comparison of two general state-of-the-art object detection approaches: the first is Single Shot Detection (SSD), the second is two-step region proposal based CNNs. We therefore compare two different models: YOLOv3 (SSD) and Faster R-CNN with a ResNet-101 backbone. For the second task, we leverage the state-of-the-art Cascade Mask R-CNN with various backbones and compare the results. In order to minimize generalization error, we apply data augmentation; finally, we use knowledge from the endoscopic domain to further refine our models during post-processing and compare the resulting performances.

1. INTRODUCTION

Endoscopy is a procedure that covers many different areas and organs of the human body, such as the bladder, the stomach or the colon, allowing gastroenterologists to potentially discover a wide array of diseases and abscesses, like polyps, cancer and Barrett's esophagus. Naturally, in order to assure detection of all diseases and to improve the workflow, the application of real-time detection using Deep Learning is becoming more prevalent. There have been previous publications with good results on real-time detection of endoscopic polyps using Single Shot Detector [1] based CNNs [2] as well as an anchor-free approach called AFP-Net [3]. Existing work usually focuses on one disease class, like polyp or cancer detection, mostly due to a lack of annotated data. The Endoscopic Disease Detection Challenge 2020 [4] partially solves this issue by providing endoscopic images of three different organs, namely colon, esophagus and stomach, with five disease classes. Additionally, they provide corresponding bounding boxes for object detection as well as polygonal masks for image segmentation. In this paper, we apply and train state-of-the-art Deep Learning models for both tasks using various architectures and compare their performance.

¹ https://edd2020.grand-challenge.org
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. DATASETS AND DATA ANALYSIS

In order to choose and prepare the right deep CNN for the task, we start by analyzing the given training data in detail. The EDD2020 challenge [4] provides a training data set for multi-class disease detection, which contains 386 endoscopic images labeled with 684 bounding boxes and 502 segmentation masks. While analyzing the data, we recognize class imbalance. We therefore counted the occurrences of each class throughout the dataset based on the bounding boxes. The dataset has more than 200 samples each for polyps and BE, but fewer than 100 samples for each of the three remaining classes. It might therefore be challenging to learn the correct assessment of the classes HGD, suspicious and cancer. This unbalanced sample distribution is one difficulty of the dataset and is considered while choosing our model and its hyperparameters. The second difficulty we recognize is the variation in box sizes. We therefore calculated the area of all the boxes.
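A minimal sketch of this analysis, assuming the annotations have been converted to a COCO-style JSON file (the file name below is a hypothetical placeholder, and boxes follow the COCO [x, y, width, height] convention):

```python
import json
from collections import defaultdict
from statistics import mean, stdev

# Load COCO-style annotations; "edd2020_train.json" is an assumed path.
with open("edd2020_train.json") as f:
    coco = json.load(f)

id_to_class = {c["id"]: c["name"] for c in coco["categories"]}
areas = defaultdict(list)

# Collect per-class box counts and areas.
for ann in coco["annotations"]:
    _, _, w, h = ann["bbox"]
    areas[id_to_class[ann["category_id"]]].append(w * h)

# Report count, mean area and standard deviation per class.
for cls, a in sorted(areas.items()):
    std = stdev(a) if len(a) > 1 else 0.0
    print(f"{cls:>12} n={len(a):4d} mean_area={mean(a):10.1f} std={std:10.1f}")
```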
The per-class mean areas are similar, but the spread of the areas differs enormously, especially for the polyp class, where the standard deviation is significantly larger than for the other classes.

Finally, for the segmentation task, every image comes with masks specifying which regions are of interest, given separately for each class. While most of the images belong to a unique class, some have several masks with overlapping regions, which is especially apparent for the "suspicious" class. The latter is often only part of a region of an already existing class. Hence this is a multi-class multi-label segmentation task with independent classes. We randomly split the dataset into a 90% training and a 10% validation set, where the best model is chosen by minimum validation loss during training.

Fig. 1: This figure illustrates our final pipeline for the detection and segmentation task. At step (a) the predictions for polyps and HGD of the YOLOv3 algorithm and the predictions for BE, suspicious and cancer of the Faster R-CNN are applied for the final result. At step (b) the box output of the detection architecture is utilized to filter the segmentation masks.

Additional data: In order to improve generalization, we extend the training dataset with images from openly accessible databases. We include two datasets from a previous endoscopic vision challenge [5], namely the ETIS-Larib Polyp database [6], which consists of 196 polyp images, and the CVC-ClinicDB [7], which consists of 612 polyp images, as well as the dataset from the Gastrointestinal Image Analysis (GIANA) challenge [8], with 412 polyp images. All three datasets have corresponding segmentation masks; we derive the corresponding bounding boxes from the segmentation masks ourselves. In addition, we include the Kvasir-SEG dataset [9], which consists of 1000 polyp images with both segmentation masks and bounding boxes. Finally, we extract images annotated with esophagitis from the Kvasir2 dataset [10]. Esophagitis and Barrett's esophagus occur at the same position in the esophagus, and some symptoms of esophagitis are very similar to those of Barrett's esophagus. We therefore add images with esophagitis symptoms that look close to Barrett's esophagus and test whether they improve our results. We observe a slight improvement in the BE results and therefore include 103 additional images, for a total of 2323 additional training images. Nevertheless, Barrett's esophagus and esophagitis are different diseases and have to be distinguished in further research if more classes are included in the classification task.

By including 2220 additional polyp images, we significantly increase the class imbalance of the training data. Class balance, however, is crucial for training and inference of neural networks. To tackle this problem, we use class weights in the algorithms: the loss of an underrepresented class is multiplied by a weight that balances its contribution to the total loss function. By adding those weights, we observe an improvement in polyp detection without losing detection score in the other classes [13].
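The weighting idea can be sketched as follows in PyTorch; the class counts below are illustrative placeholders, not our exact training statistics:

```python
import torch
import torch.nn as nn

# Illustrative per-class sample counts (polyp, BE, suspicious, HGD, cancer);
# placeholders, not the exact numbers from our training set.
class_counts = torch.tensor([2500., 230., 90., 75., 80.])

# Inverse-frequency weights: rare classes contribute more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 5)            # batch of 4 predictions over 5 classes
targets = torch.tensor([0, 2, 3, 1])  # ground-truth class indices
loss = criterion(logits, targets)     # underrepresented classes weigh more
```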
3. METHODS

In this section, we illustrate our approaches for the two sub-tasks. All our models are trained on an Nvidia Tesla P100 GPU. After exploring the data, we decided on CNNs for the challenge, as they have proven to be very stable in classic multi-class detection tasks like the COCO challenge [11]. In the domain of object detection, we consider two main concepts that have proven successful in multi-class object detection: first, a two-step method of region proposals and subsequent classification of the proposed regions, like Faster R-CNN; second, single-shot detection (SSD), which is mostly applicable in real time. We compare the results of the SSD model and Faster R-CNN. To improve our results further, we combine those two algorithms in our final architecture. For the second task, since both bounding boxes and segmentation masks are available, we choose the Cascade Mask R-CNN; incorporating both types of annotations achieves the best results. For both tasks we add post-processing based on gastroenterological knowledge. Figure 1 depicts our final pipeline for the detection and segmentation task. For training the Faster R-CNN we leverage the open source Detectron2 framework [12].

3.1. Task 1 multi-class bounding box detection:

As mentioned above, we want to compare two common object detection approaches, namely SSD and what we call a classic region proposal approach. Compared to classical approaches, SSD enables real-time detection. In practice, real-time detection is critical: gastroenterological diseases often receive treatment directly (e.g., ablation of a polyp), so a low inference time has to be considered to apply the models in practice. On the contrary, larger architectures may perform better in tasks without real-time restrictions, such as determining the stage of a disease. Nevertheless, a larger architecture may perform well on our challenge task, too. Therefore, we evaluate one model from each of these two categories. The SSD model we utilize is the YOLOv3 algorithm [14], the third version of the well-known YOLO architecture [15], which adds residual blocks that allow training deeper networks while preventing the vanishing gradient problem. We use the YOLOv3 algorithm with initial weights pre-trained on the COCO dataset [11]. In the next step, we unfreeze the last two layers of the network and train them using the Adam optimizer [16] for 50 epochs. Then we unfreeze the whole network and train until early stopping terminates training, resulting in an additional 33 epochs.
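A hedged sketch of this two-phase schedule for a generic PyTorch model follows; `model`, `train_one_epoch` and `validate` are hypothetical helpers rather than functions from a specific YOLOv3 codebase, and selecting the "last two layers" is deliberately simplified:

```python
import torch
from torch.optim import Adam

# Phase 1: freeze everything except the head, train 50 epochs.
for p in model.parameters():
    p.requires_grad = False
head_params = list(model.parameters())[-2:]  # crude stand-in for "last two layers"
for p in head_params:
    p.requires_grad = True
optimizer = Adam(head_params)
for epoch in range(50):
    train_one_epoch(model, optimizer)  # hypothetical training helper

# Phase 2: unfreeze the whole network and train with early stopping.
for p in model.parameters():
    p.requires_grad = True
optimizer = Adam(model.parameters())
best_loss, patience, bad_epochs = float("inf"), 5, 0
while bad_epochs < patience:
    train_one_epoch(model, optimizer)
    val_loss = validate(model)         # hypothetical validation helper
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
```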
As a classic larger architecture, we use a Faster R-CNN [17] with a ResNet backbone of depth 101 (ResNet-101). We use a batch size of 2 because of the computational expense of this large network. We initialize the network with weights pre-trained on the COCO dataset and choose a learning rate of 0.00025 for training.

Post-processing: The YOLOv3 architecture is more successful in classifying polyps and HGD, whereas the classic architecture is better at detecting BE, suspicious and cancer. We therefore ensemble both networks to improve our detection results: YOLOv3 predicts HGD and polyps, while Faster R-CNN predicts BE, suspicious and cancer. Both algorithms can predict all labels, but we only use the predictions of the specified classes from each algorithm. To further improve our results we use gastroenterological knowledge and knowledge of the data set structure. As the probability is low that BE and polyps are predicted in the same image, we implement a simple rule: if both polyps and BE are detected, we only produce boxes for the class with the higher probability, i.e., if the probability for polyps is higher than for BE, no bounding boxes are predicted for BE.

3.2. Task 2 region segmentation:

For the image segmentation task, we train two similar architectures with various backbones, namely Mask R-CNN [18] and its successor, Cascade Mask R-CNN [19]. Both architectures are primarily two-stage object detection models based on Faster R-CNN, i.e. a region proposal network first proposes candidate bounding boxes (Regions of Interest, RoI) before the final prediction. They add another branch used to predict segmentation masks, where the proposed RoIs are used to enhance the segmentation mask predictions, in contrast to using fully convolutional networks only. Cascade Mask R-CNN is an extended framework using a cascade-like structure and is essentially an ensemble of several Mask R-CNNs with weight sharing on the backbones.

We choose these types of models for two reasons. First, since we have both bounding boxes and segmentation masks available as training data, we can utilize the Mask R-CNN approach, where the RoI influences the segmentation, to the fullest. Second, since these networks perform instance segmentation, each class is predicted independently from the others, which is a perfect fit for our multi-class multi-label problem. As this is a semantic task, we treat it as instance segmentation with only one instance per occurrence per class. As such, we had to adjust some of the ground truth bounding boxes in our data, as shown in Fig. 2.

Fig. 2: In order to train Mask and Cascade Mask R-CNN for semantic segmentation, some bounding boxes had to be adjusted. We transform the boxes from including several instances (left) to only one instance (right).

For Mask R-CNN we use the ResNeXt-101-32x8d [20] and for Cascade Mask R-CNN the ResNeXt-151-32x8d [20] models as backbones, both of which are CNN classifiers pre-trained on the ImageNet-1k dataset [21]. Additionally, both full architectures are pre-trained on the COCO dataset [11]; hence we utilize transfer learning due to the small size of our training dataset.

The networks are trained using the Detectron2 framework [12], which provides a wide range of pre-trained object detection and segmentation models. As a pre-processing step, we convert our data to the COCO dataset format. Image pre-processing, i.e. padding, resizing, rescaling the pixel values etc., is then performed automatically within the framework. The total loss is the sum of classification, box-regression and mask loss, L = L_cls + L_box + L_mask [18], where L_mask is the binary cross-entropy for independent segmentation of all masks. The models are trained using stochastic gradient descent with a learning rate of 0.00025 and a batch size of 2. They are trained for up to 10000 iterations with checkpoints every 500 iterations. We then choose the checkpoint with the lowest validation loss as our final model.
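The corresponding Detectron2 setup can be sketched as follows; the dataset names, file paths and the exact model-zoo config file are assumptions rather than our exact configuration (Detectron2 ships Cascade Mask R-CNN variants under "Misc/", which may differ from the precise backbone used here):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the COCO-format data; names and paths are hypothetical.
register_coco_instances("edd2020_train", {}, "edd_train.json", "images/")
register_coco_instances("edd2020_val", {}, "edd_val.json", "images/")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "Misc/cascade_mask_rcnn_X_152_32x8d_FPN_IN5k_gn_dconv.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "Misc/cascade_mask_rcnn_X_152_32x8d_FPN_IN5k_gn_dconv.yaml")
cfg.DATASETS.TRAIN = ("edd2020_train",)
cfg.DATASETS.TEST = ("edd2020_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5   # polyp, BE, suspicious, HGD, cancer
cfg.SOLVER.IMS_PER_BATCH = 2          # batch size 2
cfg.SOLVER.BASE_LR = 0.00025          # learning rate from the paper
cfg.SOLVER.MAX_ITER = 10000           # up to 10000 iterations
cfg.SOLVER.CHECKPOINT_PERIOD = 500    # checkpoint every 500 iterations

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```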
We also apply data augmentation in the form of random horizontal and vertical flipping as well as random resizing with retained aspect ratio in order to minimize the generalization error.

Post-processing: To further improve our results we use knowledge from gastroenterology and knowledge of the data set structure. As mentioned above, the probability that BE and polyps are present in the same image is very low. We apply the following procedure to the polyp/BE predictions (a minimal sketch follows the list):

• We utilize the predictions from object detection and only predict masks where bounding boxes from YOLOv3 and Faster R-CNN are present.
• As an additional criterion, pixels within bounding boxes of probability < 0.2 are labeled with 0, i.e. no disease present.
• If both polyps and BE are detected, we only produce masks for the class with the higher probability, as with the detection model.
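The three rules can be sketched as follows; the input structures (`masks` mapping class names to binary H×W arrays, `boxes` mapping class names to (x1, y1, x2, y2, score) detections) are illustrative assumptions, not our actual data structures:

```python
import numpy as np

def postprocess(masks, boxes):
    out = {}
    # Rule 3: polyp and BE rarely co-occur; keep only the more confident class.
    if boxes.get("polyp") and boxes.get("BE"):
        p_best = max(b[4] for b in boxes["polyp"])
        be_best = max(b[4] for b in boxes["BE"])
        weaker = "BE" if p_best >= be_best else "polyp"
        boxes = {k: v for k, v in boxes.items() if k != weaker}
    for cls, mask in masks.items():
        keep = np.zeros_like(mask)
        # Rules 1 and 2: keep mask pixels only inside detector boxes,
        # and only for boxes with confidence of at least 0.2.
        for (x1, y1, x2, y2, score) in boxes.get(cls, []):
            if score >= 0.2:
                keep[int(y1):int(y2), int(x1):int(x2)] = 1
        out[cls] = mask * keep
    return out
```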
4. RESULTS

In this section, we describe our results for the two sub-tasks. In both settings, we highlight the performance of the algorithms for every single disease. For this purpose, we create a validation set consisting of 40 images randomly chosen from the provided data (no additional data included). We test both detection and segmentation on this validation set.

4.1. Task 1

Table 1 shows the results on our validation set for the detection task, where YOLOv3 is the described SSD algorithm, Faster R-CNN is the Faster R-CNN algorithm with ResNet-101 backbone, and Ensemble_pp is the ensemble of the two combined with the hardcoded post-processing rule. We display the mean average precision with a minimum IoU of 0.5 (mAP) [11], split by the five diseases. All of the algorithms show excellent performance in detecting polyps; this is mostly due to our additional polyp training data (see Section 2). BE is better detected by the Faster R-CNN algorithm, which is why we use this algorithm for detecting BE in the ensembled version. Notably, suspicious is one of the harder classes to classify correctly, as YOLOv3 only shows a detection performance of 10% mAP. As depicted in Table 1, cancer is detected quite well by all of the algorithms. All things considered, the ensemble with post-processing is the best algorithm for this task. The post-processing and combination of YOLOv3 and Faster R-CNN (Ensemble_pp) enhances the performance compared to the single YOLOv3 method by 7.95 percentage points. Figure 3 shows a detection result of the YOLOv3 algorithm and a segmentation result of the Cascade Mask R-CNN. Our detection score on the EDD2020 challenge [4] test set using the ensemble architecture is 0.3360 ± 0.0852.

Table 1: Detection results on the validation data (mAP). mAP is the mean average precision over the five classes. Ensemble_pp denotes the ensemble of YOLOv3 and Faster R-CNN with additional post-processing. All values are in %.

          YOLOv3   Faster R-CNN   Ensemble_pp
Polyp      84.19          73.50         84.46
BE         38.25          50.40         50.88
Suspic.    10.00          33.70         33.70
HGD        39.98          28.31         39.98
Cancer     49.99          53.20         53.20
mAP        44.49          37.29         52.44

Fig. 3: Exemplary results for both detection with YOLOv3 (upper) and segmentation with Cascade Mask R-CNN (lower).

4.2. Task 2

As in task 1, we evaluate our models on our validation set, a subset of the provided data, using both the Dice coefficient and intersection over union (IoU). Table 2 summarizes these results. While Mask R-CNN outperforms Cascade Mask R-CNN on both the polyp and BE classes, Cascade Mask R-CNN provides better results overall, especially on the other three classes, which are comparatively underrepresented in our training data. Applying the post-processing steps described in Section 3 further improves the results of Cascade Mask R-CNN, but interestingly worsens the micro (µ) averaged score, which we discuss below. Our segmentation score on the EDD2020 challenge [4] test set using Cascade Mask R-CNN is 0.6526 ± 0.3418.

Table 2: Segmentation results on the validation data. R-CNN_M, R-CNN_CM and R-CNN_CMpp denote Mask R-CNN, Cascade Mask R-CNN and Cascade Mask R-CNN with post-processing respectively. We also compute the micro averaged scores, denoted by µ mean, in contrast to mean, which is averaged over class scores. All values are in %.

           R-CNN_M         R-CNN_CM        R-CNN_CMpp
          Dice    IoU     Dice    IoU     Dice    IoU
Polyp    69.41  67.03    61.57  60.08    69.07  67.58
BE       46.41  43.84    44.48  41.06    46.56  43.08
Suspic.  27.64  25.94    40.03  38.83    52.53  51.33
HGD      41.83  38.28    63.59  60.25    68.25  65.75
Cancer   53.77  52.14    55.86  54.96    57.24  57.00
mean     47.81  45.45    53.11  51.04    58.73  56.95
µ mean   36.57  27.05    47.66  38.44    45.36  37.17
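To make the distinction between the class-averaged (macro) mean and the micro-averaged µ mean in Table 2 concrete, the following sketch computes both for binary masks; the helper names are ours, not from the challenge's evaluation toolkit:

```python
import numpy as np

def dice(p, g):
    # Dice coefficient for binary masks; an all-empty pair scores 1.0,
    # which is how empty-mask classes receive perfect macro scores.
    inter = np.logical_and(p, g).sum()
    total = p.sum() + g.sum()
    return 2 * inter / total if total else 1.0

def iou(p, g):
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union else 1.0

def macro_micro_dice(preds, gts):
    # preds/gts: dicts mapping class name to a binary mask of equal shape.
    # Macro: average the per-class Dice scores.
    macro = np.mean([dice(preds[c], gts[c]) for c in preds])
    # Micro: pool all pixels across classes, then compute one Dice score.
    p_all = np.concatenate([preds[c].ravel() for c in preds])
    g_all = np.concatenate([gts[c].ravel() for c in preds])
    return macro, dice(p_all, g_all)
```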
5. DISCUSSION & CONCLUSION

All of our models in both tasks perform best on the polyp class and worst on the suspicious category. Since data on polyps is abundant in our training set, it is clear why the networks show good results in this area. The suspicious class, however, has a similar number of samples as HGD and cancer, yet, with the exception of Cascade Mask R-CNN, all models perform significantly worse on this class. This is most likely due to the unclear nature of this class, as it often denotes regions belonging to different types of diseases, i.e. in some images it denotes possible cancer, whereas in others it signifies possible BE. Additionally, performing gastroenterologists often have differing opinions on what areas can be considered suspicious, which adds further noise to our data. The performance of Cascade Mask R-CNN on suspicious and the other less represented classes can be attributed to its ensemble-like structure. The discrepancy of the micro-averaged scores can be explained as follows: our post-processing severely reduces the amount of false positives, but also adds some false negatives. This improves the class-based score, since classes with empty masks on an image receive perfect scores this way. With micro-averaging, however, since precision and recall are computed over the pooled pixels, we essentially look at the per-pixel accuracy of the entire mask, ultimately worsening this score.

Our model outperforms the best network from [2], namely SSD with an InceptionV3 backbone, which was partially trained using the same polyp databases and showed a precision of 73.6% on the MICCAI 2015 evaluation dataset, compared to our 84.19% with YOLOv3. AFP-Net [3] performs better than our model, with a precision of 88.89% on the ETIS-Larib dataset and 99.36% on the CVC-Clinic-train dataset. However, in both cases direct comparison is difficult, since both different training and different evaluation data are used. Additionally, we perform multi-class prediction, which can be a more difficult task than binary prediction.

We applied state-of-the-art Deep Learning architectures for the detection and semantic segmentation of five different gastroenterological diseases. For detection, we evaluated three architectures: YOLOv3, Faster R-CNN, and our combination of those algorithms. Furthermore, our ensemble includes domain knowledge-based post-processing, which further enhances our results in the challenge. For segmentation, we evaluated three models: Cascade Mask R-CNN, its predecessor Mask R-CNN, and the Cascade Mask R-CNN combined with post-processing. In the region segmentation task, the Cascade Mask R-CNN with additional post-processing reliably performs as well as or better than the other networks. For future work we intend to improve our results by adding more training data, applying additional forms of data augmentation and further hyperparameter tuning. All in all, we present state-of-the-art results in the EDD challenge with our detection and segmentation applications.

6. REFERENCES

[1] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.

[2] M. Liu, J. Jiang, and Z. Wang. Colonic polyp detection in endoscopic videos with single shot detection based deep convolutional neural network. IEEE Access, 7:75058-75066, 2019.

[3] Dechun Wang, Ning Zhang, Xinzi Sun, Pengfei Zhang, Chenxi Zhang, Yu Cao, and Benyuan Liu. AFP-Net: Realtime anchor-free polyp detection in colonoscopy, 2019.

[4] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. arXiv preprint arXiv:2003.03376, 2020.

[5] J. Bernal, N. Tajkbaksh, F. J. Sánchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham, K. Pogorelov, S. Choi, Q. Debard, L. Maier-Hein, S. Speidel, D. Stoyanov, P. Brandao, H. Córdova, C. Sánchez-Montes, S. R. Gurudu, G. Fernández-Esparrach, X. Dray, J. Liang, and A. Histace. Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge. IEEE Transactions on Medical Imaging, 36(6):1231-1249, June 2017.

[6] J. Silva, A. Histace, O. Romain, et al. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int J CARS, 9:283-293, 2014.

[7] Jorge Bernal, F. Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics, 43:99-111, 2015.

[8] Y. B. Guo and Bogdan J. Matuszewski. GIANA polyp segmentation with fully convolutional dilation neural networks. In VISIGRAPP, 2019.
[9] Debesh Jha, Pia H. Smedsrud, Michael Riegler, Pål Halvorsen, Dag Johansen, Thomas de Lange, and Håvard D. Johansen. Kvasir-SEG: A segmented polyp dataset. In Proceedings of the International Conference on Multimedia Modeling (MMM). Springer, 2020.

[10] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference, MMSys'17, pages 164-169, New York, NY, USA, 2017. ACM.

[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[12] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[13] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375-5384, 2016.

[14] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[15] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.

[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.

[19] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. CoRR, abs/1906.09756, 2019.

[20] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.

[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.