=Paper= {{Paper |id=Vol-2595/endoCV2020_Krenzer_et_al |storemode=property |title=Endoscopic Detection And Segmentation Of Gastroenterological Diseases With Deep Convolutional Neural Networks |pdfUrl=https://ceur-ws.org/Vol-2595/endoCV2020_paper_id_6.pdf |volume=Vol-2595 |authors=Adrian Krenzer,Amar Hekalo,Frank Puppe |dblpUrl=https://dblp.org/rec/conf/isbi/KrenzerHP20 }} ==Endoscopic Detection And Segmentation Of Gastroenterological Diseases With Deep Convolutional Neural Networks == https://ceur-ws.org/Vol-2595/endoCV2020_paper_id_6.pdf

ENDOSCOPIC DETECTION AND SEGMENTATION OF GASTROENTEROLOGICAL
DISEASES WITH DEEP CONVOLUTIONAL NEURAL NETWORKS

Adrian Krenzer, Amar Hekalo, Frank Puppe

Department of Artificial Intelligence and Knowledge Systems, University of Würzburg, Germany

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Previous endoscopic computer vision research focused mostly on the detection of a single disease, e.g. polyps. The endoscopic disease detection challenge (EDD2020) extends this classification task by providing data for different diseases in various organs. The EDD2020 (https://edd2020.grand-challenge.org) includes two sub-tasks: (1) Multi-class disease detection: localization of bounding boxes and class labels for the five disease classes polyp, Barrett's esophagus (BE), suspicious, high-grade dysplasia (HGD) and cancer; (2) Region segmentation: boundary delineation of detected diseases. In this paper, we describe our approach, which leverages deep convolutional neural networks (CNNs). We highlight the comparison of two general state-of-the-art object detection approaches. The first is single-shot detection (SSD); the second is two-step, region-proposal-based CNNs. We therefore compare two different models: YOLOv3 (SSD) and Faster R-CNN with a ResNet-101 backbone. For the second task, we leverage the state-of-the-art Cascade Mask R-CNN with various backbones and compare the results. In order to minimize the generalization error, we apply data augmentation; finally, we use knowledge from the endoscopic domain to further refine our models during post-processing and compare the resulting performances.

1. INTRODUCTION

Endoscopic vision is a procedure which covers many different areas and organs of the human body, such as the bladder, the stomach or the colon, allowing gastroenterologists to potentially discover a wide array of diseases and abscesses, like polyps, cancer and Barrett's esophagus. Naturally, in order to ensure detection of all diseases and to improve the workflow, the application of real-time detection using Deep Learning is becoming more prevalent. There have been previous publications with good results on real-time detection of endoscopic polyps using Single Shot Detector [1] based CNNs [2] as well as an anchor-free approach called AFP-Net [3]. Existing work usually focuses on one disease class, like polyp or cancer detection, mostly due to a lack of annotated data. The Endoscopic Disease Detection Challenge 2020 [4] partially solves this issue by providing endoscopic images of three different organs, namely colon, esophagus and stomach, with five disease classes. Additionally, the organizers provide corresponding bounding boxes for object detection as well as polygonal masks for image segmentation. In this paper we apply and train state-of-the-art Deep Learning models for both tasks using various architectures and compare their performance.

2. DATASETS AND DATA ANALYSIS

In order to choose and prepare the right deep CNN for the task, we start by analyzing the given training data in detail. The EDD2020 challenge [4] provides a training data set for multi-class disease detection, which contains 386 endoscopic images labeled with 684 bounding boxes and 502 segmentation masks. While analyzing the data, we recognize class imbalance. We therefore counted the occurrences of each class throughout the dataset based on the bounding boxes. The dataset has more than 200 images with polyps and BE but fewer than 100 samples for each of the three remaining classes. So, it might be challenging to learn the correct assessment of the classes HGD, suspicious and cancer. This unbalanced sample distribution is one difficulty of the dataset and is therefore considered while choosing our model and its hyperparameters. The second difficulty we recognize is the variation in box sizes. We therefore calculated the area of all the boxes. Most classes have nearly the same mean box area, while the variation of the areas differs enormously, especially for the polyp class, where the standard deviation is significantly larger than within the other classes.
Finally, for the segmentation task, every image comes with masks specifying which regions are of interest, given separately for each class. While most of the images belong to a unique class, some of them have several masks with overlapping regions, which is especially apparent for the "suspicious" class. The latter is often only part of a region of an already existing class. Hence this is a multi-class multi-label segmentation task with independent classes. We randomly split the dataset into a 90% training and a 10% validation set, where the best model is chosen by minimum
[Figure 1: pipeline diagram. The input image is passed to the detection branch (YOLOv3 and Faster R-CNN, followed by post-processing with domain knowledge, step (a)) and to the segmentation branch (Cascade R-CNN, followed by post-processing with domain knowledge, step (b)), which together produce the output.]

Fig. 1: Our final pipeline for the detection and segmentation tasks. At step (a), the YOLOv3 predictions for polyps and HGD and the Faster R-CNN predictions for BE, suspicious, and cancer are combined into the final result. At step (b), the box output of the detection architecture is used to filter the segmentation masks.

validation loss during training.
Additional data: In order to improve generalization, we extend the training dataset by including images from openly accessible databases. We include two datasets from a previous endoscopic vision challenge [5], namely the ETIS-Larib Polyp database [6], which consists of 196 polyp images, and the CVC-ClinicDB [7], which consists of 612 polyp images, as well as the dataset from the Gastrointestinal Image Analysis (GIANA) challenge [8], with 412 polyp images. All three datasets have corresponding segmentation masks. We derive the corresponding bounding boxes from the segmentation masks ourselves. In addition, we include the Kvasir-SEG dataset [9], which consists of 1000 polyp images with both segmentation masks and bounding boxes. Finally, we extract images annotated with esophagitis from the Kvasir2 dataset [10]. Esophagitis and Barrett's esophagus occur at the same position in the esophagus, and some symptoms of esophagitis are very similar to Barrett's esophagus symptoms. Therefore we add images with esophagitis symptoms which look close to Barrett's esophagus and test whether they improve our results. We observe a slight improvement in the BE results and therefore include 103 additional images, for a total of 2323 additional training images. Nevertheless, Barrett's esophagus and esophagitis are different diseases and have to be distinguished in further research if more classes are included in the classification task.

3. METHODS

In this section, we illustrate our approaches for the two sub-tasks. All our models are trained on an Nvidia Tesla P100 GPU. After exploring the data, we decided to choose CNNs for the challenge as they have proven to be very stable in classic multi-class detection tasks like the COCO challenge [11]. In the domain of object detection, we consider two main concepts that have proven successful in multi-class object detection. The first is a two-step method of region proposals and subsequent classification of the proposed regions, like Faster R-CNN. The second is single-shot detection (SSD), which is mostly applicable in real time. We compare the results of the SSD model and Faster R-CNN. To improve our results further, we combine those two algorithms in our final architecture. For the second task, since both bounding boxes and segmentation masks are available, we choose the Cascade Mask R-CNN. Incorporating both types of annotations achieves the best results. For both of these tasks we add a post-processing step based on gastroenterological knowledge. Figure 1 depicts our final architecture for the detection and segmentation task. For training the Faster R-CNN we leverage the open-source Detectron2 framework [12].
By including 2220 additional polyp images, we significantly increase the class imbalance of the training data. Class balance is crucial for training and inference of neural networks. To tackle this problem, we use class weights in the algorithms: the loss of an underrepresented class is multiplied by a weight that balances its contribution to the total loss function. By adding those weights, we observe an enhancement in polyp detection while not losing detection score in the other classes [13].

3.1. Task 1 multi-class bounding box detection:

As mentioned above, we want to compare two common object detection approaches, namely SSD and what we call a classic region proposal approach. Compared to classical approaches, SSD enables real-time detection. In practice, real-time detection is critical. Often, gastroenterological diseases receive treatment directly (e.g., ablation of a polyp). Therefore a low inference time has to be considered in order to apply the models in real practice. In contrast, larger architectures may perform better in tasks such as determining the stage of a disease, which mostly have no real-time restrictions. Nevertheless, a larger architecture may perform well on our challenge task, too. Therefore, we leverage one model from each of these two approaches. The model for SSD we utilize is called the YOLOv3 algorithm [14], which is the
third version of the well-known YOLO architecture [15] and has added residual blocks that allow training deeper networks while preventing the vanishing gradient problem. We use the YOLOv3 algorithm with initial weights pre-trained on the COCO dataset [11]. In the next step, we unfreeze the last two layers of the network and train them for 50 epochs using the Adam optimizer [16]. Afterwards, we unfreeze the whole network and train until early stopping terminates the training, resulting in an additional 33 epochs.
As a classic larger architecture, we use a Faster R-CNN [17] with a RetinaNet backbone of depth 104. We use a batch size of 2 because of the computational expense of this large network. We initialize the network with weights pre-trained on the COCO dataset. We choose a learning rate of 0.00025 for the training.
Post-processing: The YOLOv3 architecture is more successful in classifying polyps and HGD, whereas the classic architecture is better at detecting BE, suspicious and cancer. We therefore ensemble both networks to improve our detection results: YOLOv3 predicts HGD and polyps while the Faster R-CNN algorithm predicts BE, suspicious and cancer. Both algorithms can predict all labels, but we only use the predictions of the specified classes from each algorithm. To further improve our results we use gastroenterological knowledge and knowledge of the data set structure. As the probability that BE and polyps occur in the same image is low, we implement a simple rule: if both polyps and BE are detected, we only produce boxes for the class with the higher probability, i.e., if the probability for polyps is higher than for BE, no bounding boxes are predicted for BE.

3.2. Task 2 region segmentation:

For the image segmentation task, we train two similar architectures with various backbones, namely Mask R-CNN [18] and its successor, Cascade Mask R-CNN [19]. Both architectures are primarily two-stage object detection models based on Faster R-CNN, i.e. a region proposal network first proposes candidate bounding boxes (Regions of Interest, RoI) before the final prediction. They add another branch used to predict segmentation masks, where the proposed RoIs are used to enhance the segmentation mask predictions, in contrast to using fully convolutional networks only. Cascade Mask R-CNN is an extended framework using a cascade-like structure and is essentially an ensemble of several Mask R-CNNs with weight sharing on the backbones.
We choose these types of models for two reasons: First, since we have both bounding boxes and segmentation masks available as training data, we can utilize the Mask R-CNN approach, where the RoI influences the segmentation, to the fullest. Second, since these networks are set up to perform instance segmentation, each class is predicted independently of the others, which is a perfect fit for our multi-class multi-label problem. As this is a semantic task, we treat it as an instance segmentation with only one instance per occurrence per class. As such, we had to adjust some of the ground truth bounding boxes in our data, as shown in Fig. 2.

Fig. 2: In order to train Mask and Cascade Mask R-CNN for semantic segmentation, some bounding boxes had to be adjusted. We transform boxes that enclose several instances (left) into a single instance (right).

For Mask R-CNN we use the ResNeXt-101-32x8d [20] and for Cascade Mask R-CNN the ResNeXt-151-32x8d [20] models as backbones, both of which are CNN classifiers pre-trained on the ImageNet-1k dataset [21]. Additionally, both full architectures are pre-trained on the COCO dataset [11]; we thus utilize transfer learning because of the small size of our training dataset.
The networks are trained using the Detectron2 framework [12], which provides a wide range of pre-trained object detection and segmentation models. As a pre-processing step, we convert our data to the COCO dataset format. Image pre-processing, i.e. padding, resizing, rescaling the pixel values etc., is then performed automatically within the framework. The total loss is the sum of the classification, box-regression and mask losses, $L = L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}}$ [18], where $L_{\mathrm{mask}}$ is the binary cross-entropy for independent segmentation of all masks. The models are trained using stochastic gradient descent with a learning rate of 0.00025 and a batch size of 2. They are trained for up to 10000 iterations with checkpoints every 500 iterations. We then choose the checkpoint with the lowest validation loss as our final model. We also apply data augmentation in the form of random horizontal and vertical flipping as well as random resizing with retained aspect ratio in order to minimize the generalization error.
Post-processing: To further improve our results we use knowledge from gastroenterology and knowledge of the data set structure. As mentioned above, the probability that BE and polyps are present in the same image is very low. We apply the following procedure to the polyp/BE predictions:
• We utilize the predictions from object detection and
only predict masks where bounding boxes from
YOLOv3 and Faster R-CNN are present.

• As an additional criterion, pixels within bounding
boxes of probability < 0.2 are labeled with 0, i.e.
no disease present.

• If both polyps and BE are detected, we only produce
masks for the class with higher probability, as with the
detection model.
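
The post-processing steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the challenge: the `Box` container, the function names `resolve_polyp_be` and `filter_mask`, the half-open pixel convention for box coordinates, and the choice to compare the polyp and BE classes by their highest-scoring box are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # binary mask, mask[y][x] in {0, 1}


@dataclass
class Box:
    label: str    # class name, e.g. "polyp" or "BE"
    score: float  # detector confidence in [0, 1]
    x1: int       # left edge (inclusive)
    y1: int       # top edge (inclusive)
    x2: int       # right edge (exclusive)
    y2: int       # bottom edge (exclusive)


def resolve_polyp_be(boxes: List[Box]) -> List[Box]:
    """If both polyp and BE are detected, drop the lower-scoring class."""
    polyp = [b.score for b in boxes if b.label == "polyp"]
    be = [b.score for b in boxes if b.label == "BE"]
    if not polyp or not be:
        return list(boxes)
    # Assumption: classes are compared by their best box; a tie keeps polyp.
    loser = "BE" if max(polyp) >= max(be) else "polyp"
    return [b for b in boxes if b.label != loser]


def filter_mask(mask: Mask, boxes: List[Box], label: str,
                min_score: float = 0.2) -> Mask:
    """Set to 0 every pixel not covered by a confident box of `label`."""
    keep = [b for b in boxes if b.label == label and b.score >= min_score]
    return [
        [
            v if any(b.x1 <= x < b.x2 and b.y1 <= y < b.y2 for b in keep) else 0
            for x, v in enumerate(row)
        ]
        for y, row in enumerate(mask)
    ]
```

In this sketch, `resolve_polyp_be` would be applied once per image to the detector output, and `filter_mask` once per class to the corresponding predicted mask; the default `min_score=0.2` mirrors the probability threshold named in the list.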

4. RESULTS

In this section, we describe our results of the two subtasks. In
both settings, we highlight the performance of the algorithms
for every single disease. Therefore, we create a validation
set. The validation set consists of 40 images randomly chosen
from the provided data (no additional data is included). We
test the detection as well as the segmentation on the created
validation set.
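
The scores reported in this section rely on overlap measures between predictions and ground truth: box IoU for detection, and Dice and IoU on masks for segmentation. The following sketch shows how such measures can be computed for a single prediction. The function names are hypothetical, and the convention that two empty masks score 1.0 is an assumption chosen to match the remark in Section 5 that empty masks receive perfect scores; the official challenge evaluation may differ.

```python
from typing import List, Tuple

Mask = List[List[int]]                       # binary mask, mask[y][x] in {0, 1}
BoxXYXY = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def _counts(pred: Mask, gt: Mask) -> Tuple[int, int, int]:
    """Return (intersection, predicted, ground-truth) foreground pixel counts."""
    inter = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p and g)
    return inter, sum(map(sum, pred)), sum(map(sum, gt))


def dice(pred: Mask, gt: Mask) -> float:
    """Dice coefficient 2|A and B| / (|A| + |B|) of two binary masks."""
    inter, n_pred, n_gt = _counts(pred, gt)
    if n_pred + n_gt == 0:
        return 1.0  # assumption: two empty masks agree perfectly
    return 2.0 * inter / (n_pred + n_gt)


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks."""
    inter, n_pred, n_gt = _counts(pred, gt)
    union = n_pred + n_gt - inter
    return 1.0 if union == 0 else inter / union  # same empty-mask assumption


def box_iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Under the minimum IoU of 0.5 used for mAP, a predicted box would count as a match for a ground-truth box of the same class when `box_iou(pred_box, gt_box) >= 0.5`.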

4.1. Task 1
Table 1 shows our results on our validation set for the detection task, where YOLOv3 is the described SSD algorithm, Faster R-CNN is the Faster R-CNN algorithm with ResNet-101 backbone, and Ensemble_pp (ensemble with post-processing) is the ensemble of those two combined with the hard-coded rule. We report the mean average precision with a minimum IoU of 0.5 (mAP) [11]. We highlight the performance of the algorithms on each of the five diseases. All of the algorithms perform very well in detecting polyps; this is mostly due to our additional polyp training data (see Section 2). BE is better detected by the Faster R-CNN algorithm, which is why we use this algorithm for detecting BE in the ensemble. Notably, suspicious is one of the harder classes to classify correctly, as YOLOv3 only reaches a detection performance of 10% mAP. As shown in Table 1, cancer is detected quite well by all of the algorithms. All things considered, the ensemble with post-processing is the best algorithm in this task. The post-processing and combination of YOLOv3 and Faster R-CNN (Ensemble_pp) improves the mAP over the single YOLOv3 method by 7.95 percentage points. Figure 3 shows a detection result of the YOLOv3 algorithm and a segmentation result of the Cascade Mask R-CNN. On the EDD2020 challenge [4] test set, the ensemble architecture achieves a detection score of 0.3360 ± 0.0852.

Fig. 3: Exemplary results for both detection with YOLOv3 (upper) and segmentation with Cascade Mask R-CNN (lower).

Table 1: Detection results on the validation data (mAP). The mAP row is the mean average precision over the five classes. Ensemble_pp denotes the ensemble of YOLOv3 and Faster R-CNN with additional post-processing. All values are in %.

          YOLOv3   Faster R-CNN   Ensemble_pp
Polyp     84.19    73.50          84.46
BE        38.25    50.40          50.88
Suspic.   10.00    33.70          33.70
HGD       39.98    28.31          39.98
Cancer    49.99    53.20          53.20
mAP       44.49    37.29          52.44

4.2. Task 2

As in task 1, we evaluate our models on our validation set, a subset of the provided data, using both the Dice coefficient and the intersection over union (IoU). Table 2 summarizes these results. While Mask R-CNN outperforms Cascade Mask R-CNN on both the polyp and BE classes, Cascade Mask R-CNN provides better results overall, especially on the other three classes, which are comparatively underrepresented in our training data. Applying the post-processing steps described in Section 3 further improves the results of Cascade Mask R-CNN, but interestingly worsens the micro (µ) averaged score,
which we discuss below. Our segmentation score on the EDD2020 challenge [4] test set using Cascade Mask R-CNN is then 0.6526 ± 0.3418.

Table 2: Segmentation results on the validation data. R-CNN_M, R-CNN_CM and R-CNN_CMpp denote Mask R-CNN, Cascade Mask R-CNN and Cascade Mask R-CNN with post-processing, respectively. We also computed the micro-averaged scores, denoted by µ mean, in contrast to mean, which is averaged over class scores. All values are in %.

          R-CNN_M          R-CNN_CM         R-CNN_CMpp
          Dice    IoU      Dice    IoU      Dice    IoU
Polyp     69.41   67.03    61.57   60.08    69.07   67.58
BE        46.41   43.84    44.48   41.06    46.56   43.08
Suspic.   27.64   25.94    40.03   38.83    52.53   51.33
HGD       41.83   38.28    63.59   60.25    68.25   65.75
Cancer    53.77   52.14    55.86   54.96    57.24   57.00
mean      47.81   45.45    53.11   51.04    58.73   56.95
µ mean    36.57   27.05    47.66   38.44    45.36   37.17

5. DISCUSSION & CONCLUSION

All of our models in both tasks perform best on the polyp class and worst on the suspicious category. Since data on polyps is abundant in our training set, it is clear why the networks show good results in this area. The suspicious class, however, has a similar number of samples as HGD and cancer, yet, with the exception of Cascade Mask R-CNN, all models perform significantly worse on this class. This is most likely due to the unclear nature of this class, as it often denotes regions belonging to different types of diseases, i.e. in some images it denotes possible cancer, whereas in others it signifies possible BE. Additionally, practicing gastroenterologists often have differing opinions on which areas can be considered suspicious, which adds further noise to our data. The performance of Cascade Mask R-CNN on suspicious and the other less represented classes can be attributed to its ensemble-like structure. The discrepancy in the micro-averaged scores can be explained as follows: our post-processing severely reduces the number of false positives, but also adds some false negatives. This improves the class-based score, since classes with empty masks on an image receive perfect scores this way. With micro-averaging, however, since precision and recall are the same, we essentially look at the per-pixel accuracy of the entire mask, which ultimately worsens this score.
Our model outperforms the best network from [2], namely SSD with an InceptionV3 backbone, which was partially trained using the same polyp databases and showed a precision of 73.6% on the MICCAI 2015 evaluation dataset, compared to our 84.19% with YOLOv3. AFP-Net [3] performs better than our model, with a precision of 88.89% on the ETIS-Larib dataset and 99.36% on the CVC-Clinic-train dataset. However, in both cases, direct comparison is difficult since both the training and the evaluation data differ. Additionally, we perform multi-class prediction, which can be a more difficult task than binary prediction.
We applied state-of-the-art Deep Learning architectures for the detection and semantic segmentation of five different gastroenterological diseases. For detection, we evaluated three architectures: YOLOv3, Faster R-CNN, and our combination of those algorithms. Furthermore, our ensemble includes domain-knowledge-based post-processing, which further enhances our results in the challenge. For segmentation, we evaluated three models: Cascade Mask R-CNN, its predecessor Mask R-CNN, and the Cascade Mask R-CNN combined with post-processing. In the region segmentation task, the Cascade Mask R-CNN with additional post-processing reliably performs as well as or better than the other networks. For future work we intend to improve our results by adding more training data, applying additional forms of data augmentation and further hyperparameter tuning. All in all, we present state-of-the-art results in the EDD challenge with our detection and segmentation applications.

6. REFERENCES

[1] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.

[2] M. Liu, J. Jiang, and Z. Wang. Colonic polyp detection in endoscopic videos with single shot detection based deep convolutional neural network. IEEE Access, 7:75058–75066, 2019.

[3] Dechun Wang, Ning Zhang, Xinzi Sun, Pengfei Zhang, Chenxi Zhang, Yu Cao, and Benyuan Liu. AFP-Net: Realtime anchor-free polyp detection in colonoscopy, 2019.

[4] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. arXiv preprint arXiv:2003.03376, 2020.

[5] J. Bernal, N. Tajkbaksh, F. J. Sánchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham, K. Pogorelov, S. Choi, Q. Debard, L. Maier-Hein, S. Speidel, D. Stoyanov, P. Brandao, H. Córdova, C. Sánchez-Montes, S. R. Gurudu, G. Fernández-Esparrach, X. Dray, J. Liang, and A. Histace. Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge. IEEE Transactions on Medical Imaging, 36(6):1231–1249, June 2017.

[6] J. Silva, A. Histace, O. Romain, et al. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int J CARS, 9:283–293, 2014.

[7] Jorge Bernal, F. Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics, 43:99–111, 2015.

[8] Y. B. Guo and Bogdan J. Matuszewski. GIANA polyp segmentation with fully convolutional dilation neural networks. In VISIGRAPP, 2019.

[9] Debesh Jha, Pia H. Smedsrud, Michael Riegler, Pål Halvorsen, Dag Johansen, Thomas de Lange, and Håvard D. Johansen. Kvasir-SEG: A segmented polyp dataset. In Proceedings of the International Conference on Multimedia Modeling (MMM). Springer, 2020.

[10] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference, MMSys'17, pages 164–169, New York, NY, USA, 2017. ACM.

[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[12] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[13] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375–5384, 2016.

[14] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[15] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.

[19] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. CoRR, abs/1906.09756, 2019.

[20] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.

[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.