TRANSFER LEARNING FOR ENDOSCOPY DISEASE DETECTION AND SEGMENTATION WITH MASK-RCNN BENCHMARK ARCHITECTURE

Shahadate Rezvy 1,4, Tahmina Zebin 2, Barbara Braden 3, Wei Pang 4, Stephen Taylor 5, Xiaohong W Gao 1

1 School of Science and Technology, Middlesex University London, UK
2 School of Computing Sciences, University of East Anglia, UK
3 Translational Gastroenterology Unit, John Radcliffe Hospital, University of Oxford, UK
4 School of Mathematical & Computer Sciences, Heriot-Watt University, UK
5 MRC Weatherall Institute of Molecular Medicine, University of Oxford, UK

ABSTRACT

We propose and implement a disease detection and semantic segmentation pipeline using a modified mask-RCNN architecture on the EDD2020 dataset (https://edd2020.grand-challenge.org). On the images provided for the phase-I test dataset, we achieved an average precision of 51.14% for 'BE' and 50% for 'HGD' and 'polyp'; however, the detection scores for 'suspicious' and 'cancer' were low. For phase-I we achieved a dice coefficient of 0.4562 and an F2 score of 0.4508. We observed that the missed detections and mis-classifications were due to the imbalance between classes. Hence, we applied a selective and balanced augmentation stage in our architecture to provide more accurate detection and segmentation. After balancing the dataset, the detection score on the phase-II images increased to 0.29 from our phase-I score of 0.24, and the semantic segmentation score improved to 0.62 from our phase-I score of 0.52.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

Endoscopy is an extensively used clinical procedure for the early detection of cancers in various organs such as the esophagus, stomach, colon, and bladder [1]. In recent years, deep learning methods have been used in various endoscopic imaging tasks including esophago-gastro-duodenoscopy (EGD), colonoscopy, and capsule endoscopy (CE) [2]. Most of these were inspired by artificial neural network-based solutions; accurate and consistent localization and segmentation of diseased regions-of-interest enables precise quantification and mapping of lesions from clinical endoscopy videos, which in turn supports disease monitoring and surgical planning.

For oesophageal cancer detection, Mendel et al. [3] proposed an automatic approach for early detection of adenocarcinoma in the esophagus using high-definition endoscopic images (50 cancer, 50 Barrett). They adapted and fed the dataset to a deep Convolutional Neural Network (CNN) using a transfer learning approach. The model was evaluated with leave-one-patient-out cross-validation and achieved a sensitivity of 0.94 and a specificity of 0.88. Horie et al. [4] reported AI diagnosis of esophageal cancer, including squamous cell carcinoma (ESCC) and adenocarcinoma (EAC), using CNNs. The CNN correctly detected esophageal cancer cases with a sensitivity of 98%, detected all small cancer lesions of less than 10 mm in size, and reportedly distinguished superficial esophageal cancer from advanced cancer with an accuracy of 98%. Very recently, Gao et al. [5] investigated the feasibility of mask-RCNN (Region-based Convolutional Neural Network) and YOLOv3 architectures to detect various stages of squamous cell carcinoma (SCC) in real time from subtle appearance changes. For the detection of SCC, the reported average accuracies for classification and detection were 85% and 74%, respectively.

For colonoscopy, deep neural network-based solutions were implemented to detect and classify colorectal polyps in the research presented in [6, 7, 8]. For gastric cancer, Wu et al. [9] identified early gastric cancer (EGC) from non-malignancy with an accuracy of 92.5%, a sensitivity of 94.0%, a specificity of 91.0%, a positive predictive value of 91.3%, and a negative predictive value of 93.8%, outperforming all levels of endoscopists.
In real-time unprocessed EGD videos, the same deep CNN achieved automated detection of EGC and monitoring of blind spots. Mori et al. [10] and Min et al. [2] provided comprehensive reviews of the recent literature in this field.

For the Endoscopy Disease Detection and Segmentation (EDD2020) Grand Challenge, we proposed and implemented a disease detection and semantic segmentation pipeline using a modified mask-RCNN architecture. The rest of the paper is organized as follows. Section 2 introduces the dataset for the task. Section 3 presents our proposed architecture with its various settings and procedural stages, with results presented and discussed in Section 4. Finally, conclusions are drawn in Section 5.

2. DATASET DESCRIPTION AND IMAGE AUGMENTATION

The annotated dataset provided for the competition contained 388 frames from 5 different international centers and 3 organs (colon, esophagus, and stomach), targeting multiple populations and varied endoscopy video modalities associated with pre-malignant and diseased regions. The dataset was labeled by medical experts and experienced post-doctoral researchers, and came with object-wise binary masks and bounding box annotations. The class-wise object distribution in the dataset is shown in Table 1. A detailed description of the dataset can be found in [1].

Table 1. Class-wise object distribution [1]

Disease category (class name)             | Objects
Non-dysplastic Barrett's oesophagus (BE)  | 160
Subtle pre-cancerous lesion (Suspicious)  | 88
Suspected dysplasia (HGD)                 | 74
Adenocarcinoma (Cancer)                   | 53
Polyp                                     | 127

We separated a small subset from the original training set, covering the various class labels, as our external validation set. This subset had 25 images and was programmatically chosen to have similar size and resolution to the images in the phase-I test dataset of 24 images. This set, with its ground-truth labels, served as a checkpoint for us to assess the trained model's performance.

We applied image augmentation techniques [11] to the rest of the images and their associated masks. Our observation of the dataset revealed a co-location of 'BE' regions with 'suspicious', 'cancer' and 'HGD' areas. We also noticed an imbalance between classes and between images coming from the various organs. Hence, we opted for an instance cropping stage in our pipeline that produced multiple images from these co-located images, each containing one target object, with the other objects removed by a selective cropping mechanism (example shown in Figure 1). We kept 10% padding around the ground-truth bounding box provided for the instance. This isolated the instances of 'cancer', 'suspicious' and 'HGD' regions from co-localized 'BE' regions. We applied transformations such as rotation, flip and crop to the individual classes and instances to increase our training data. We then used the 'WeightedRandomSampler' from the PyTorch data loader to form the final balanced training set with almost equal class representation; this set included 1670 instances in total. Figure 1 illustrates some of the augmentation methods we applied in our pipeline, and a minimal code sketch of the instance cropping and balanced sampling steps follows the figure caption below.

Fig. 1. Augmentation methods applied on the images, including transformations such as rotation, flip and instance cropping.
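The sketch below illustrates the two augmentation-related steps described above: cropping a single annotated instance with roughly 10% padding around its ground-truth box, and building a class-balanced loader with PyTorch's WeightedRandomSampler. The names instance_crop, balanced_loader, dataset and labels are illustrative placeholders, not the paper's actual pipeline code.

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def instance_crop(image, mask, box, pad_frac=0.10):
    """Crop one annotated instance with ~10% padding around its ground-truth
    bounding box (box = [x1, y1, x2, y2]; image and mask are HxW(xC) arrays)."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    pad_x, pad_y = int(pad_frac * (x2 - x1)), int(pad_frac * (y2 - y1))
    x1, x2 = max(0, x1 - pad_x), min(w, x2 + pad_x)
    y1, y2 = max(0, y1 - pad_y), min(h, y2 + pad_y)
    return image[y1:y2, x1:x2], mask[y1:y2, x1:x2]

def balanced_loader(dataset, labels, batch_size=2):
    """Build a class-balanced DataLoader: each training instance is weighted
    by the inverse frequency of its class and drawn with replacement.
    `dataset` is a detection-style dataset returning (image, target) pairs and
    `labels` holds the per-instance class ids; both are hypothetical inputs."""
    labels = np.asarray(labels)
    weights = 1.0 / np.bincount(labels)[labels]
    sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                    num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      collate_fn=lambda batch: tuple(zip(*batch)))

Sampling with replacement from inverse-frequency weights is what lets the rare 'cancer' and 'HGD' instances appear in training batches roughly as often as the abundant 'BE' instances.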
3. METHODS

We implemented the endoscopic disease detection and semantic segmentation pipeline for the EDD2020 challenge using a modified mask-RCNN [12] architecture trained in the feature-representation transfer learning mode. Mask-RCNN was proposed as an extension of Faster R-CNN, and the architecture has reportedly outperformed the previous state-of-the-art models used for the instance segmentation task on various image datasets. We used the PyTorch, torchvision, imgaug, pycococreator, maskrcnn-benchmark [13], apex, and OpenCV libraries in Python to build the various functions of the pipeline.

3.1. Pre-trained model backbone and network head removal

We removed the network head, i.e. the final layers, of a pre-trained model with a ResNet-101 backbone [12] that was initially trained on the COCO dataset. This stage is crucial, as the pre-trained model was trained for a different classification task. Removing the network head discards the weights and biases associated with the class-score, bounding-box predictor and mask predictor layers; the head is then replaced with new, untrained layers with the desired number of classes for the new data. We configured a six-class network head for the EDD2020 dataset (five assigned classes + background). We fed the augmented dataset and the associated masks into the mask-RCNN model architecture, as illustrated in Figure 2.

Fig. 2. Illustration of the mask-RCNN architecture adapted for transfer learning on the EDD dataset.

3.2. Transfer learning stages

At the initial stage, we froze the weights of the earlier layers of the pre-trained ResNet-101 backbone to extract generic low-level descriptors or patterns from the endoscopy image data; the later layers of the CNN become progressively more specific to the details of the output classes of the new dataset. The newly added network head is then trained to adapt its weights to the patterns and distribution of the new dataset, and is updated and fine-tuned during model training (a minimal setup sketch is given at the end of this section). The training of the model was done offline on an Ubuntu machine with an Intel(R) Core i9-9900X CPU @ 3.50 GHz, 62 GB of memory and a GeForce RTX 2060 GPU. The final model was fine-tuned with an Adam optimizer, a learning rate of 0.0001 and a categorical cross-entropy loss for 50000 epochs. To be noted, the dataset after augmentation is still quite small, so we employed five-fold cross-validation during training to avoid over-fitting of the model.
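As a minimal setup sketch of the head replacement (Section 3.1) and partial freezing with Adam fine-tuning (Section 3.2), the snippet below uses torchvision's Mask R-CNN implementation with a ResNet-50 FPN backbone. The paper's actual pipeline was built on maskrcnn-benchmark with a ResNet-101 backbone, so this is only an illustration of the transfer-learning steps under that substitution, not the exact configuration.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 6  # five EDD2020 disease classes + background

# Start from a Mask R-CNN pre-trained on COCO (torchvision's ResNet-50 FPN
# variant here; the paper used a ResNet-101 backbone via maskrcnn-benchmark).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the network head: the class-score / box-regression predictor ...
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# ... and the mask predictor, both re-initialised for the new classes.
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, NUM_CLASSES)

# Freeze the pre-trained backbone so only the new head (and any unfrozen
# later layers) adapt to the endoscopy data; the paper froze the earlier
# backbone layers rather than the whole backbone.
for param in model.backbone.parameters():
    param.requires_grad = False

# Adam optimiser with the learning rate reported in Section 3.2.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)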
4. RESULTS AND EVALUATION SCORE

Equations (1) to (3) in this section summarise the detection and segmentation metrics we use to evaluate the performance of a model trained on this dataset [1]; a small computation sketch is given at the end of this section. The metric mean average precision (mAP) measures the ability of an object detector to accurately retrieve all instances of the ground-truth bounding boxes; the higher the mAP, the better the performance. In Equation (1), N = 5 and AP_i indicates the average precision of the individual disease class i for this dataset.

\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \qquad (1)

\mathrm{score}_d = 0.6 \times \mathrm{mAP}_d + 0.4 \times \mathrm{IoU}_d \qquad (2)

\mathrm{score}_s = 0.25 \times (\mathrm{precision} + \mathrm{recall} + F_1 + F_2) \qquad (3)

For the detection task, the competition uses a final mean score (score_d), a weighted combination of mAP and IoU whose formula is presented in Equation (2). Here, IoU (intersection over union) measures the overlap between the ground-truth and predicted bounding boxes. For scoring the semantic segmentation task, an average measure (score_s) is calculated as per Equation (3): the average of the F1-score (Dice coefficient), F2-score, precision and recall. A detailed description of these metrics can be found in [1].

4.1. Results on validation dataset

Table 2 summarises the average precision performance on the isolated validation dataset (25 images with ground-truth masks), used to obtain an estimate of test-set performance. Average precision values are presented for two IoU thresholds. For AP(50), only candidates with more than 50% overlap with the ground truth were counted, and we achieved about 36.1% average precision for bounding-box detection and 34.7% for pixel-to-pixel segmentation. For AP(75), only candidates with more than 75% IoU were counted. Average precision values were also computed for large (AP(l)) and medium-sized (AP(m)) objects in the images, ranging from 32.27% to 45%, respectively. To be noted, we omitted AP(s) for small objects (area < 32 pixel^2) due to the absence of such small objects in the test dataset. Such low values are indicative of the model over-fitting, so we applied parameter tuning to the fully connected network layers along with realistic and balanced augmentation. This significantly improved the mAP for both the bounding box and the segmentation mask, to 47.9% and 51.3% respectively (rows 3 and 4 of Table 2).

Table 2. Validation-set bounding-box detection and segmentation scores before and after fine-tuning

Fine-tuning | Task    | mAP   | AP(50) | AP(75) | AP(m) | AP(l)
No          | bbox    | 0.291 | 0.361  | 0.319  | 0.450 | 0.328
No          | segment | 0.254 | 0.347  | 0.252  | 0.250 | 0.292
Yes         | bbox    | 0.479 | 0.689  | 0.600  | 0.675 | 0.493
Yes         | segment | 0.513 | 0.683  | 0.549  | 0.563 | 0.566

4.2. Results on the test dataset: Phase-I and Phase-II

For phase-I, we received 24 images, and Figure 3 shows the detection and segmentation output for some of the images from this test set. From the scores available on the leaderboard, we achieved an average precision of 51.14% for 'BE' and 50% for 'HGD' and 'polyp'; however, the scores for the 'suspicious' and 'cancer' areas were very low. We attained a dice coefficient of 0.4562 and an F2 score of 0.4508. We noticed that the missed detections and mis-classifications were due to the imbalance between classes. Hence, before the phase-II submission, we retrained the model after applying a 'WeightedRandomSampler' for selective and balanced sampling of the augmented dataset. During phase-II, we received 43 images, and we retrained the model with the balanced augmentation dataset. From the leaderboard scores available at this stage, the final detection score (score_d) and semantic segmentation score (score_s) are listed in Table 3. We observed an increase in the detection score to 0.29 when class balancing and instance cropping were applied to the training dataset, compared with our phase-I score of 0.24 obtained with generic augmentation techniques. We also achieved an improved semantic segmentation score of 0.62, up from our phase-I score of 0.52. The final model had a standard deviation of 0.082 in the mAP_d value and of 0.33 in the semantic score.

Table 3. Out-of-sample detection and segmentation scores

Training dataset                          | Test data | score_d | score_s
Original + flip, rotate, crop             | Phase-I   | 0.2460  | 0.5243
Original + instance-crop + class-balance  | Phase-II  | 0.2906  | 0.6264

Fig. 3. Semantic segmentation results on some of the images from the test dataset.
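As a small illustration of how the composite challenge scores in Equations (2) and (3) combine their components, the helpers below reproduce the two formulas; the numeric values in the usage example are placeholders for illustration only, not actual leaderboard components.

def detection_score(map_d: float, iou_d: float) -> float:
    """Composite detection score, Eq. (2): 0.6 * mAP_d + 0.4 * IoU_d."""
    return 0.6 * map_d + 0.4 * iou_d

def segmentation_score(precision: float, recall: float, f1: float, f2: float) -> float:
    """Composite semantic segmentation score, Eq. (3): the mean of
    precision, recall, F1 (Dice coefficient) and F2."""
    return 0.25 * (precision + recall + f1 + f2)

# Placeholder values, purely to show the weighting behaviour.
print(detection_score(0.30, 0.28))                  # -> 0.292
print(segmentation_score(0.60, 0.65, 0.62, 0.63))   # -> 0.625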
5. DISCUSSION & CONCLUSION

As balanced augmentation improved both the detection and segmentation scores in this task, the application of generative adversarial network-based augmentation techniques in the future could contribute to a more generalised and robust model. Additionally, we assumed that the detected object was spread uniformly across a detected region when the patch was classified as a specific disease type (cancer, polyp) on the basis of patch-specific features. However, one uniform region of cancer, polyp or BE is not always the case in practice; very often, multifocal patches of cancer, low-grade and high-grade dysplasia are scattered across the surface of the lesion. Further improvements are also required to deal with bubbles, saturation, instruments and other visible artefacts in the dataset [14, 15]. This will improve the model's performance by avoiding false detections in these regions and will provide a more accurate and realistic solution for endoscopic disease detection.

6. REFERENCES

[1] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. arXiv preprint arXiv:2003.03376, 2020.

[2] Jun Ki Min, Min Seob Kwak, and Jae Myung Cha. Overview of deep learning in gastrointestinal endoscopy. Gut and Liver, 13(4):388, 2019.

[3] Robert Mendel, Alanna Ebigbo, Andreas Probst, et al. Barrett's esophagus analysis using convolutional neural networks. In Image Processing for Medicine 2017, pages 80–85. Springer, 2017.

[4] Yoshimasa Horie, Toshiyuki Yoshio, Kazuharu Aoyama, Yoshimizu, et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointestinal Endoscopy, 89(1):25–32, 2019.

[5] Xiaohong W Gao, Barbara Braden, Stephen Taylor, and Wei Pang. Towards real-time detection of squamous pre-cancers from oesophageal endoscopic videos. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1606–1612, Dec 2019.

[6] Yoriaki Komeda, Hisashi Handa, et al. Computer-aided diagnosis based on convolutional neural network system for colorectal polyp classification: preliminary experience. Oncology, 93:30–34, 2017.

[7] Teng Zhou, Guoqiang Han, Bing Nan Li, et al. Quantitative analysis of patients with celiac disease by video capsule endoscopy: A deep learning method. Computers in Biology and Medicine, 85:1–6, 2017.

[8] Lequan Yu, Hao Chen, Qi Dou, Jing Qin, and Pheng Ann Heng. Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos. IEEE Journal of Biomedical and Health Informatics, 21(1):65–75, 2016.

[9] Lianlian Wu, Wei Zhou, Xinyue Wan, Jun Zhang, et al. A deep neural network improves endoscopic detection of early gastric cancer without blind spots. Endoscopy, 51(06):522–531, 2019.
[10] Yuichi Mori, Tyler M Berzin, and Shin-ei Kudo. Artificial intelligence for early gastric cancer: early promise and the path ahead. Gastrointestinal Endoscopy, 89(4):816–817, 2019.

[11] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[13] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.

[14] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[15] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.