ENDOSCOPIC DETECTION AND SEGMENTATION OF GASTROENTEROLOGICAL DISEASES WITH DEEP CONVOLUTIONAL NEURAL NETWORKS

Adrian Krenzer, Amar Hekalo, Frank Puppe
Department of Artificial Intelligence and Knowledge Systems, University of Würzburg, Germany

ABSTRACT

Previous endoscopic computer vision research focused mostly on the detection of a singular disease, e.g. polyps. The Endoscopic Disease Detection Challenge (EDD2020) extends this classification task by providing data for different diseases in various organs. The EDD2020 includes two sub-tasks¹: (1) multi-class disease detection: localization of bounding boxes and class labels for the five disease classes polyp, Barrett's esophagus (BE), suspicious, high grade dysplasia (HGD) and cancer; (2) region segmentation: boundary delineation of detected diseases. In this paper, we describe our approach leveraging deep convolutional neural networks (CNNs). We highlight the comparison of two general state-of-the-art object detection approaches: the first is Single Shot Detection (SSD), the second is two-step region proposal based CNNs. We therefore compare two different models: YOLOv3 (SSD) and Faster R-CNN with a ResNet-101 backbone. For the second task, we leverage the state-of-the-art Cascade Mask R-CNN with various backbones and compare the results. In order to minimize generalization error, we apply data augmentation; finally, we use knowledge from the endoscopic domain to further refine our models during post-processing and compare the resulting performances.

1. INTRODUCTION

Endoscopy is a procedure that covers many different areas and organs of the human body, such as the bladder, the stomach or the colon, allowing gastroenterologists to potentially discover a wide array of diseases and abscesses, like polyps, cancer and Barrett's esophagus. Naturally, in order to assure detection of all diseases and to improve the workflow, the application of real-time detection using Deep Learning is becoming more prevalent. There have been previous publications with good results on real-time detection of endoscopic polyps using Single Shot Detector [1] based CNNs [2] as well as an anchor-free approach called AFP-Net [3]. Existing work usually focuses on one disease class, like polyp or cancer detection, mostly due to a lack of annotated data. The Endoscopic Disease Detection Challenge 2020 [4] partially solves this issue by providing endoscopic images of three different organs, namely colon, esophagus and stomach, with five disease classes. Additionally, they provide corresponding bounding boxes for object detection as well as polygonal masks for image segmentation. In this paper, we apply and train state-of-the-art Deep Learning models for both tasks using various architectures and compare their performance.

¹ https://edd2020.grand-challenge.org
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. DATASETS AND DATA ANALYSIS

In order to choose and prepare the right deep CNN for the task, we start by analyzing the given training data in detail. The EDD2020 challenge [4] provides a training data set for multi-class disease detection, which contains 386 endoscopic images labeled with 684 bounding boxes and 502 segmentation masks. While analyzing the data, we recognize class imbalance. We therefore counted the occurrences of each class throughout the dataset based on the bounding boxes. The dataset has more than 200 samples each for polyps and BE, but fewer than 100 samples for each of the three remaining classes. It might therefore be challenging to learn the correct assessment of the classes HGD, suspicious and cancer. This unbalanced sample distribution is one difficulty of the dataset and is considered while choosing our model and its hyperparameters. The second difficulty we recognize is the variation in box sizes. We therefore calculated the area of all the boxes.
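A minimal sketch of this analysis, assuming the annotations have been converted to a COCO-style JSON file (the file name below is a hypothetical placeholder, and boxes follow the COCO [x, y, width, height] convention):

```python
import json
from collections import defaultdict
from statistics import mean, stdev

# Load COCO-style annotations; "edd2020_train.json" is an assumed path.
with open("edd2020_train.json") as f:
    coco = json.load(f)

id_to_class = {c["id"]: c["name"] for c in coco["categories"]}
areas = defaultdict(list)

# Collect per-class box counts and areas.
for ann in coco["annotations"]:
    _, _, w, h = ann["bbox"]
    areas[id_to_class[ann["category_id"]]].append(w * h)

# Report count, mean area and standard deviation per class.
for cls, a in sorted(areas.items()):
    std = stdev(a) if len(a) > 1 else 0.0
    print(f"{cls:>12} n={len(a):4d} mean_area={mean(a):10.1f} std={std:10.1f}")
```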
The per-class mean areas are similar, but the spread of the areas differs enormously, especially for the polyp class, where the standard deviation is significantly larger than for the other classes.

Finally, for the segmentation task, every image comes with masks specifying which regions are of interest, given separately for each class. While most of the images belong to a unique class, some have several masks with overlapping regions, which is especially apparent for the "suspicious" class. The latter is often only part of a region of an already existing class. Hence this is a multi-class multi-label segmentation task with independent classes. We randomly split the dataset into a 90% training and a 10% validation set, where the best model is chosen by minimum validation loss during training.

Fig. 1: This figure illustrates our final pipeline for the detection and segmentation task. At step (a) the predictions for polyps and HGD of the YOLOv3 algorithm and the predictions for BE, suspicious and cancer of the Faster R-CNN are applied for the final result. At step (b) the box output of the detection architecture is utilized to filter the segmentation masks.

Additional data: In order to improve generalization, we extend the training dataset with images from openly accessible databases. We include two datasets from a previous endoscopic vision challenge [5], namely the ETIS-Larib Polyp database [6], which consists of 196 polyp images, and the CVC-ClinicDB [7], which consists of 612 polyp images, as well as the dataset from the Gastrointestinal Image Analysis (GIANA) challenge [8], with 412 polyp images. All three datasets have corresponding segmentation masks; we derive the corresponding bounding boxes from the segmentation masks ourselves. In addition, we include the Kvasir-SEG dataset [9], which consists of 1000 polyp images with both segmentation masks and bounding boxes. Finally, we extract images annotated with esophagitis from the Kvasir2 dataset [10]. Esophagitis and Barrett's esophagus occur at the same position in the esophagus, and some symptoms of esophagitis are very similar to those of Barrett's esophagus. We therefore add images with esophagitis symptoms that look close to Barrett's esophagus and test whether they improve our results. We observe a slight improvement in the BE results and therefore include 103 additional images, for a total of 2323 additional training images. Nevertheless, Barrett's esophagus and esophagitis are different diseases and have to be distinguished in further research if more classes are included in the classification task.

By including 2220 additional polyp images, we significantly increase the class imbalance of the training data. Class balance, however, is crucial for training and inference of neural networks. To tackle this problem, we use class weights in the algorithms: the loss of an underrepresented class is multiplied by a weight that balances its contribution to the total loss function. By adding those weights, we observe an improvement in polyp detection without losing detection score in the other classes [13].
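The weighting idea can be sketched as follows in PyTorch; the class counts below are illustrative placeholders, not our exact training statistics:

```python
import torch
import torch.nn as nn

# Illustrative per-class sample counts (polyp, BE, suspicious, HGD, cancer);
# placeholders, not the exact numbers from our training set.
class_counts = torch.tensor([2500., 230., 90., 75., 80.])

# Inverse-frequency weights: rare classes contribute more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 5)            # batch of 4 predictions over 5 classes
targets = torch.tensor([0, 2, 3, 1])  # ground-truth class indices
loss = criterion(logits, targets)     # underrepresented classes weigh more
```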
3. METHODS

In this section, we illustrate our approaches for the two sub-tasks. All our models are trained on an Nvidia Tesla P100 GPU. After exploring the data, we decided on CNNs for the challenge, as they have proven to be very stable in classic multi-class detection tasks like the COCO challenge [11]. In the domain of object detection, we consider two main concepts that have proven successful in multi-class object detection: first, a two-step method of region proposals and subsequent classification of the proposed regions, like Faster R-CNN; second, single-shot detection (SSD), which is mostly applicable in real time. We compare the results of the SSD model and Faster R-CNN. To improve our results further, we combine those two algorithms in our final architecture. For the second task, since both bounding boxes and segmentation masks are available, we choose the Cascade Mask R-CNN; incorporating both types of annotations achieves the best results. For both tasks we add post-processing based on gastroenterological knowledge. Figure 1 depicts our final pipeline for the detection and segmentation task. For training the Faster R-CNN we leverage the open source Detectron2 framework [12].

3.1. Task 1 multi-class bounding box detection:

As mentioned above, we want to compare two common object detection approaches, namely SSD and what we call a classic region proposal approach. Compared to classical approaches, SSD enables real-time detection. In practice, real-time detection is critical: gastroenterological diseases often receive treatment directly (e.g., ablation of a polyp), so a low inference time has to be considered to apply the models in practice. On the contrary, larger architectures may perform better in tasks without real-time restrictions, such as determining the stage of a disease. Nevertheless, a larger architecture may perform well on our challenge task, too. Therefore, we evaluate one model from each of these two categories. The SSD model we utilize is the YOLOv3 algorithm [14], the third version of the well-known YOLO architecture [15], which adds residual blocks that allow training deeper networks while preventing the vanishing gradient problem. We use the YOLOv3 algorithm with initial weights pre-trained on the COCO dataset [11]. In the next step, we unfreeze the last two layers of the network and train them using the Adam optimizer [16] for 50 epochs. Then we unfreeze the whole network and train until early stopping terminates training, resulting in an additional 33 epochs.
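A hedged sketch of this two-phase schedule for a generic PyTorch model follows; `model`, `train_one_epoch` and `validate` are hypothetical helpers rather than functions from a specific YOLOv3 codebase, and selecting the "last two layers" is deliberately simplified:

```python
import torch
from torch.optim import Adam

# Phase 1: freeze everything except the head, train 50 epochs.
for p in model.parameters():
    p.requires_grad = False
head_params = list(model.parameters())[-2:]  # crude stand-in for "last two layers"
for p in head_params:
    p.requires_grad = True
optimizer = Adam(head_params)
for epoch in range(50):
    train_one_epoch(model, optimizer)  # hypothetical training helper

# Phase 2: unfreeze the whole network and train with early stopping.
for p in model.parameters():
    p.requires_grad = True
optimizer = Adam(model.parameters())
best_loss, patience, bad_epochs = float("inf"), 5, 0
while bad_epochs < patience:
    train_one_epoch(model, optimizer)
    val_loss = validate(model)         # hypothetical validation helper
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
```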
As a classic larger architecture, we use a Faster R-CNN [17] with a ResNet backbone of depth 101 (ResNet-101). We use a batch size of 2 because of the computational expense of this large network. We initialize the network with weights pre-trained on the COCO dataset and choose a learning rate of 0.00025 for training.

Post-processing: The YOLOv3 architecture is more successful in classifying polyps and HGD, whereas the classic architecture is better at detecting BE, suspicious and cancer. We therefore ensemble both networks to improve our detection results: YOLOv3 predicts HGD and polyps, while Faster R-CNN predicts BE, suspicious and cancer. Both algorithms can predict all labels, but we only use the predictions of the specified classes from each algorithm. To further improve our results we use gastroenterological knowledge and knowledge of the data set structure. As the probability is low that BE and polyps are predicted in the same image, we implement a simple rule: if both polyps and BE are detected, we only produce boxes for the class with the higher probability, i.e., if the probability for polyps is higher than for BE, no bounding boxes are predicted for BE.

3.2. Task 2 region segmentation:

For the image segmentation task, we train two similar architectures with various backbones, namely Mask R-CNN [18] and its successor, Cascade Mask R-CNN [19]. Both architectures are primarily two-stage object detection models based on Faster R-CNN, i.e. a region proposal network first proposes candidate bounding boxes (Regions of Interest, RoI) before the final prediction. They add another branch used to predict segmentation masks, where the proposed RoIs are used to enhance the segmentation mask predictions, in contrast to using fully convolutional networks only. Cascade Mask R-CNN is an extended framework using a cascade-like structure and is essentially an ensemble of several Mask R-CNNs with weight sharing on the backbones.

We choose these types of models for two reasons. First, since we have both bounding boxes and segmentation masks available as training data, we can utilize the Mask R-CNN approach, where the RoI influences the segmentation, to the fullest. Second, since these networks perform instance segmentation, each class is predicted independently from the others, which is a perfect fit for our multi-class multi-label problem. As this is a semantic task, we treat it as instance segmentation with only one instance per occurrence per class. As such, we had to adjust some of the ground truth bounding boxes in our data, as shown in Fig. 2.

Fig. 2: In order to train Mask and Cascade Mask R-CNN for semantic segmentation, some bounding boxes had to be adjusted. We transform the boxes from including several instances (left) to only one instance (right).

For Mask R-CNN we use the ResNeXt-101-32x8d [20] and for Cascade Mask R-CNN the ResNeXt-151-32x8d [20] models as backbones, both of which are CNN classifiers pre-trained on the ImageNet-1k dataset [21]. Additionally, both full architectures are pre-trained on the COCO dataset [11]; hence we utilize transfer learning due to the small size of our training dataset.

The networks are trained using the Detectron2 framework [12], which provides a wide range of pre-trained object detection and segmentation models. As a pre-processing step, we convert our data to the COCO dataset format. Image pre-processing, i.e. padding, resizing, rescaling the pixel values etc., is then performed automatically within the framework. The total loss is the sum of classification, box-regression and mask loss, L = L_cls + L_box + L_mask [18], where L_mask is the binary cross-entropy for independent segmentation of all masks. The models are trained using stochastic gradient descent with a learning rate of 0.00025 and a batch size of 2. They are trained for up to 10000 iterations with checkpoints every 500 iterations. We then choose the checkpoint with the lowest validation loss as our final model.
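The corresponding Detectron2 setup can be sketched as follows; the dataset names, file paths and the exact model-zoo config file are assumptions rather than our exact configuration (Detectron2 ships Cascade Mask R-CNN variants under "Misc/", which may differ from the precise backbone used here):

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the COCO-format data; names and paths are hypothetical.
register_coco_instances("edd2020_train", {}, "edd_train.json", "images/")
register_coco_instances("edd2020_val", {}, "edd_val.json", "images/")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "Misc/cascade_mask_rcnn_X_152_32x8d_FPN_IN5k_gn_dconv.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "Misc/cascade_mask_rcnn_X_152_32x8d_FPN_IN5k_gn_dconv.yaml")
cfg.DATASETS.TRAIN = ("edd2020_train",)
cfg.DATASETS.TEST = ("edd2020_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5   # polyp, BE, suspicious, HGD, cancer
cfg.SOLVER.IMS_PER_BATCH = 2          # batch size 2
cfg.SOLVER.BASE_LR = 0.00025          # learning rate from the paper
cfg.SOLVER.MAX_ITER = 10000           # up to 10000 iterations
cfg.SOLVER.CHECKPOINT_PERIOD = 500    # checkpoint every 500 iterations

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```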
We also apply data augmentation in the form of random horizontal and vertical flipping as well as random resizing with retained aspect ratio in order to minimize the generalization error.

Post-processing: To further improve our results we use knowledge from gastroenterology and knowledge of the data set structure. As mentioned above, the probability that BE and polyps are present in the same image is very low. We apply the following procedure to the polyp/BE predictions (a minimal sketch follows the list):

• We utilize the predictions from object detection and only predict masks where bounding boxes from YOLOv3 and Faster R-CNN are present.
• As an additional criterion, pixels within bounding boxes of probability < 0.2 are labeled with 0, i.e. no disease present.
• If both polyps and BE are detected, we only produce masks for the class with the higher probability, as with the detection model.
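The three rules can be sketched as follows; the input structures (`masks` mapping class names to binary H×W arrays, `boxes` mapping class names to (x1, y1, x2, y2, score) detections) are illustrative assumptions, not our actual data structures:

```python
import numpy as np

def postprocess(masks, boxes):
    out = {}
    # Rule 3: polyp and BE rarely co-occur; keep only the more confident class.
    if boxes.get("polyp") and boxes.get("BE"):
        p_best = max(b[4] for b in boxes["polyp"])
        be_best = max(b[4] for b in boxes["BE"])
        weaker = "BE" if p_best >= be_best else "polyp"
        boxes = {k: v for k, v in boxes.items() if k != weaker}
    for cls, mask in masks.items():
        keep = np.zeros_like(mask)
        # Rules 1 and 2: keep mask pixels only inside detector boxes,
        # and only for boxes with confidence of at least 0.2.
        for (x1, y1, x2, y2, score) in boxes.get(cls, []):
            if score >= 0.2:
                keep[int(y1):int(y2), int(x1):int(x2)] = 1
        out[cls] = mask * keep
    return out
```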
4. RESULTS

In this section, we describe our results for the two sub-tasks. In both settings, we highlight the performance of the algorithms for every single disease. For this purpose, we create a validation set consisting of 40 images randomly chosen from the provided data (no additional data included). We test both detection and segmentation on this validation set.

4.1. Task 1

Table 1 shows the results on our validation set for the detection task, where YOLOv3 is the described SSD algorithm, Faster R-CNN is the Faster R-CNN algorithm with ResNet-101 backbone, and Ensemble_pp is the ensemble of the two combined with the hardcoded post-processing rule. We display the mean average precision with a minimum IoU of 0.5 (mAP) [11], split by the five diseases. All of the algorithms show excellent performance in detecting polyps; this is mostly due to our additional polyp training data (see Section 2). BE is better detected by the Faster R-CNN algorithm, which is why we use this algorithm for detecting BE in the ensembled version. Notably, suspicious is one of the harder classes to classify correctly, as YOLOv3 only shows a detection performance of 10% mAP. As depicted in Table 1, cancer is detected quite well by all of the algorithms. All things considered, the ensemble with post-processing is the best algorithm for this task. The post-processing and combination of YOLOv3 and Faster R-CNN (Ensemble_pp) enhances the performance compared to the single YOLOv3 method by 7.95 percentage points. Figure 3 shows a detection result of the YOLOv3 algorithm and a segmentation result of the Cascade Mask R-CNN. Our detection score on the EDD2020 challenge [4] test set using the ensemble architecture is 0.3360 ± 0.0852.

Table 1: Detection results on the validation data (mAP). mAP is the mean average precision over the five classes. Ensemble_pp denotes the ensemble of YOLOv3 and Faster R-CNN with additional post-processing. All values are in %.

          YOLOv3   Faster R-CNN   Ensemble_pp
Polyp      84.19          73.50         84.46
BE         38.25          50.40         50.88
Suspic.    10.00          33.70         33.70
HGD        39.98          28.31         39.98
Cancer     49.99          53.20         53.20
mAP        44.49          37.29         52.44

Fig. 3: Exemplary results for both detection with YOLOv3 (upper) and segmentation with Cascade Mask R-CNN (lower).

4.2. Task 2

As in task 1, we evaluate our models on our validation set, a subset of the provided data, using both the Dice coefficient and intersection over union (IoU). Table 2 summarizes these results. While Mask R-CNN outperforms Cascade Mask R-CNN on both the polyp and BE classes, Cascade Mask R-CNN provides better results overall, especially on the other three classes, which are comparatively underrepresented in our training data. Applying the post-processing steps described in Section 3 further improves the results of Cascade Mask R-CNN, but interestingly worsens the micro (µ) averaged score, which we discuss below. Our segmentation score on the EDD2020 challenge [4] test set using Cascade Mask R-CNN is 0.6526 ± 0.3418.

Table 2: Segmentation results on the validation data. R-CNN_M, R-CNN_CM and R-CNN_CMpp denote Mask R-CNN, Cascade Mask R-CNN and Cascade Mask R-CNN with post-processing respectively. We also compute the micro averaged scores, denoted by µ mean, in contrast to mean, which is averaged over class scores. All values are in %.

           R-CNN_M         R-CNN_CM        R-CNN_CMpp
          Dice    IoU     Dice    IoU     Dice    IoU
Polyp    69.41  67.03    61.57  60.08    69.07  67.58
BE       46.41  43.84    44.48  41.06    46.56  43.08
Suspic.  27.64  25.94    40.03  38.83    52.53  51.33
HGD      41.83  38.28    63.59  60.25    68.25  65.75
Cancer   53.77  52.14    55.86  54.96    57.24  57.00
mean     47.81  45.45    53.11  51.04    58.73  56.95
µ mean   36.57  27.05    47.66  38.44    45.36  37.17
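To make the distinction between the class-averaged (macro) mean and the micro-averaged µ mean in Table 2 concrete, the following sketch computes both for binary masks; the helper names are ours, not from the challenge's evaluation toolkit:

```python
import numpy as np

def dice(p, g):
    # Dice coefficient for binary masks; an all-empty pair scores 1.0,
    # which is how empty-mask classes receive perfect macro scores.
    inter = np.logical_and(p, g).sum()
    total = p.sum() + g.sum()
    return 2 * inter / total if total else 1.0

def iou(p, g):
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union else 1.0

def macro_micro_dice(preds, gts):
    # preds/gts: dicts mapping class name to a binary mask of equal shape.
    # Macro: average the per-class Dice scores.
    macro = np.mean([dice(preds[c], gts[c]) for c in preds])
    # Micro: pool all pixels across classes, then compute one Dice score.
    p_all = np.concatenate([preds[c].ravel() for c in preds])
    g_all = np.concatenate([gts[c].ravel() for c in preds])
    return macro, dice(p_all, g_all)
```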
5. DISCUSSION & CONCLUSION

All of our models in both tasks perform best on the polyp class and worst on the suspicious category. Since data on polyps is abundant in our training set, it is clear why the networks show good results in this area. The suspicious class, however, has a similar number of samples as HGD and cancer, yet, with the exception of Cascade Mask R-CNN, all models perform significantly worse on this class. This is most likely due to the unclear nature of this class, as it often denotes regions belonging to different types of diseases, i.e. in some images it denotes possible cancer, whereas in others it signifies possible BE. Additionally, performing gastroenterologists often have differing opinions on what areas can be considered suspicious, which adds further noise to our data. The performance of Cascade Mask R-CNN on suspicious and the other less represented classes can be attributed to its ensemble-like structure. The discrepancy of the micro-averaged scores can be explained as follows: our post-processing severely reduces the amount of false positives, but also adds some false negatives. This improves the class-based score, since classes with empty masks on an image receive perfect scores this way. With micro-averaging, however, since precision and recall are computed over the pooled pixels, we essentially look at the per-pixel accuracy of the entire mask, ultimately worsening this score.

Our model outperforms the best network from [2], namely SSD with an InceptionV3 backbone, which was partially trained using the same polyp databases and showed a precision of 73.6% on the MICCAI 2015 evaluation dataset, compared to our 84.19% with YOLOv3. AFP-Net [3] performs better than our model, with a precision of 88.89% on the ETIS-Larib dataset and 99.36% on the CVC-Clinic-train dataset. However, in both cases direct comparison is difficult, since both different training and different evaluation data are used. Additionally, we perform multi-class prediction, which can be a more difficult task than binary prediction.

We applied state-of-the-art Deep Learning architectures for the detection and semantic segmentation of five different gastroenterological diseases. For detection, we evaluated three architectures: YOLOv3, Faster R-CNN, and our combination of those algorithms. Furthermore, our ensemble includes domain knowledge-based post-processing, which further enhances our results in the challenge. For segmentation, we evaluated three models: Cascade Mask R-CNN, its predecessor Mask R-CNN, and the Cascade Mask R-CNN combined with post-processing. In the region segmentation task, the Cascade Mask R-CNN with additional post-processing reliably performs as well as or better than the other networks. For future work we intend to improve our results by adding more training data, applying additional forms of data augmentation and further hyperparameter tuning. All in all, we present state-of-the-art results in the EDD challenge with our detection and segmentation applications.

6. REFERENCES

[1] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. CoRR, abs/1512.02325, 2015.

[2] M. Liu, J. Jiang, and Z. Wang. Colonic polyp detection in endoscopic videos with single shot detection based deep convolutional neural network. IEEE Access, 7:75058-75066, 2019.

[3] Dechun Wang, Ning Zhang, Xinzi Sun, Pengfei Zhang, Chenxi Zhang, Yu Cao, and Benyuan Liu. AFP-Net: Realtime anchor-free polyp detection in colonoscopy, 2019.

[4] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. arXiv preprint arXiv:2003.03376, 2020.

[5] J. Bernal, N. Tajkbaksh, F. J. Sánchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham, K. Pogorelov, S. Choi, Q. Debard, L. Maier-Hein, S. Speidel, D. Stoyanov, P. Brandao, H. Córdova, C. Sánchez-Montes, S. R. Gurudu, G. Fernández-Esparrach, X. Dray, J. Liang, and A. Histace. Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge. IEEE Transactions on Medical Imaging, 36(6):1231-1249, June 2017.

[6] J. Silva, A. Histace, O. Romain, et al. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int J CARS, 9:283-293, 2014.

[7] Jorge Bernal, F. Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics, 43:99-111, 2015.

[8] Y. B. Guo and Bogdan J. Matuszewski. GIANA polyp segmentation with fully convolutional dilation neural networks. In VISIGRAPP, 2019.
[9] Debesh Jha, Pia H. Smedsrud, Michael Riegler, Pål Halvorsen, Dag Johansen, Thomas de Lange, and Håvard D. Johansen. Kvasir-SEG: A segmented polyp dataset. In Proceedings of the International Conference on Multimedia Modeling (MMM). Springer, 2020.

[10] Konstantin Pogorelov, Kristin Ranheim Randel, Carsten Griwodz, Sigrun Losada Eskeland, Thomas de Lange, Dag Johansen, Concetto Spampinato, Duc-Tien Dang-Nguyen, Mathias Lux, Peter Thelin Schmidt, Michael Riegler, and Pål Halvorsen. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference, MMSys'17, pages 164-169, New York, NY, USA, 2017. ACM.

[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

[12] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[13] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5375-5384, 2016.

[14] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[15] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788, 2016.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.

[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.

[19] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. CoRR, abs/1906.09756, 2019.

[20] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.

[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.