TRANSFER LEARNING FOR ENDOSCOPY DISEASE DETECTION AND SEGMENTATION WITH MASK-RCNN BENCHMARK ARCHITECTURE

Shahadate Rezvy 1,4, Tahmina Zebin 2, Barbara Braden 3, Wei Pang 4, Stephen Taylor 5, Xiaohong W Gao 1

1 School of Science and Technology, Middlesex University London, UK
2 School of Computing Sciences, University of East Anglia, UK
3 Translational Gastroenterology Unit, John Radcliffe Hospital, University of Oxford, UK
4 School of Mathematical & Computer Sciences, Heriot-Watt University, UK
5 MRC Weatherall Institute of Molecular Medicine, University of Oxford, UK

ABSTRACT

We propose and implement a disease detection and semantic segmentation pipeline using a modified mask-RCNN architecture on the EDD2020 dataset (https://edd2020.grand-challenge.org). On the images provided for the phase-I test dataset, we achieved an average precision of 51.14% for 'BE' and 50% for 'HGD' and 'polyp'; however, the detection scores for 'suspicious' and 'cancer' were low. For phase-I we achieved a dice coefficient of 0.4562 and an F2 score of 0.4508. We observed that the missed detections and mis-classifications were due to the imbalance between classes. Hence, we applied a selective and balanced augmentation stage in our architecture to provide more accurate detection and segmentation. After balancing the dataset, the detection score on the phase-II images increased to 0.29 from our phase-I score of 0.24, and the semantic segmentation score improved to 0.62 from our phase-I score of 0.52.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

Endoscopy is an extensively used clinical procedure for the early detection of cancers in various organs such as the esophagus, stomach, colon, and bladder [1]. In recent years, deep learning methods have been used in various endoscopic imaging tasks including esophago-gastro-duodenoscopy (EGD), colonoscopy, and capsule endoscopy (CE) [2]. Most of these were inspired by artificial neural network-based solutions; accurate and consistent localization and segmentation of diseased regions-of-interest enables precise quantification and mapping of lesions from clinical endoscopy videos, which in turn supports disease monitoring and surgical planning.

For oesophageal cancer detection, Mendel et al. [3] proposed an automatic approach for early detection of adenocarcinoma in the esophagus using high-definition endoscopic images (50 cancer, 50 Barrett). They adapted and fed the dataset to a deep Convolutional Neural Network (CNN) using a transfer learning approach. The model was evaluated with leave-one-patient-out cross-validation and achieved a sensitivity of 0.94 and a specificity of 0.88. Horie et al. [4] reported AI diagnosis of esophageal cancer, including squamous cell carcinoma (ESCC) and adenocarcinoma (EAC), using CNNs. The CNN correctly detected esophageal cancer cases with a sensitivity of 98%, detected all small cancer lesions of less than 10 mm in size, and reportedly distinguished superficial esophageal cancer from advanced cancer with an accuracy of 98%. Very recently, Gao et al. [5] investigated the feasibility of mask-RCNN (Region-based Convolutional Neural Network) and YOLOv3 architectures to detect various stages of squamous cell carcinoma (SCC) in real time from subtle appearance changes. For the detection of SCC, the reported average accuracies for classification and detection were 85% and 74%, respectively.

For colonoscopy, deep neural network-based solutions were implemented to detect and classify colorectal polyps in the research presented in [6, 7, 8]. For gastric cancer, Wu et al. [9] identified early gastric cancer (EGC) from non-malignancy with an accuracy of 92.5%, a sensitivity of 94.0%, a specificity of 91.0%, a positive predictive value of 91.3%, and a negative predictive value of 93.8%, outperforming all levels of endoscopists.
In real-time unprocessed EGD videos, the same deep CNN achieved automated detection of EGC and monitoring of blind spots. Mori et al. [10] and Min et al. [2] provided comprehensive reviews of the recent literature in this field.

For the Endoscopy Disease Detection and Segmentation (EDD2020) Grand Challenge, we proposed and implemented a disease detection and semantic segmentation pipeline using a modified mask-RCNN architecture. The rest of the paper is organized as follows. Section 2 introduces the dataset for the task. Section 3 presents our proposed architecture with its various settings and procedural stages, with results presented and discussed in Section 4. Finally, conclusions are drawn in Section 5.

2. DATASET DESCRIPTION AND IMAGE AUGMENTATION

The annotated dataset provided for the competition contained 388 frames from 5 different international centers and 3 organs (colon, esophagus, and stomach), targeting multiple populations and varied endoscopy video modalities associated with pre-malignant and diseased regions. The dataset was labeled by medical experts and experienced post-doctoral researchers, and came with object-wise binary masks and bounding box annotations. The class-wise object distribution in the dataset is shown in Table 1. A detailed description of the dataset can be found in [1].

Table 1. Class-wise object distribution [1]

Disease category (class name)             | Objects
Non-dysplastic Barrett's oesophagus (BE)  | 160
Subtle pre-cancerous lesion (Suspicious)  | 88
Suspected dysplasia (HGD)                 | 74
Adenocarcinoma (Cancer)                   | 53
Polyp                                     | 127

We separated a small subset from the original training set, covering the various class labels, as our external validation set. This subset had 25 images and was programmatically chosen to have similar size and resolution to the images in the phase-I test dataset of 24 images. This set, with its ground-truth labels, served as a checkpoint for us to assess the trained model's performance.

We applied image augmentation techniques [11] to the rest of the images and their associated masks. Our observation of the dataset revealed a co-location of 'BE' regions with 'suspicious', 'cancer' and 'HGD' areas. We also noticed an imbalance between classes and between images coming from the various organs. Hence, we opted for an instance cropping stage in our pipeline that produced multiple images from these co-located images, each containing one target object, with the other objects removed by a selective cropping mechanism (example shown in Figure 1). We kept 10% padding around the ground-truth bounding box provided for the instance. This isolated the instances of 'cancer', 'suspicious' and 'HGD' regions from co-localized 'BE' regions. We applied transformations such as rotation, flip and crop to the individual classes and instances to increase our training data. We then used the 'WeightedRandomSampler' from the PyTorch data loader to form the final balanced training set with almost equal class representation; this set included 1670 instances in total. Figure 1 illustrates some of the augmentation methods we applied in our pipeline, and a minimal code sketch of the instance cropping and balanced sampling steps follows the figure caption below.

Fig. 1. Augmentation methods applied on the images, including transformations such as rotation, flip and instance cropping.
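The sketch below illustrates the two augmentation-related steps described above: cropping a single annotated instance with roughly 10% padding around its ground-truth box, and building a class-balanced loader with PyTorch's WeightedRandomSampler. The names instance_crop, balanced_loader, dataset and labels are illustrative placeholders, not the paper's actual pipeline code.

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def instance_crop(image, mask, box, pad_frac=0.10):
    """Crop one annotated instance with ~10% padding around its ground-truth
    bounding box (box = [x1, y1, x2, y2]; image and mask are HxW(xC) arrays)."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    pad_x, pad_y = int(pad_frac * (x2 - x1)), int(pad_frac * (y2 - y1))
    x1, x2 = max(0, x1 - pad_x), min(w, x2 + pad_x)
    y1, y2 = max(0, y1 - pad_y), min(h, y2 + pad_y)
    return image[y1:y2, x1:x2], mask[y1:y2, x1:x2]

def balanced_loader(dataset, labels, batch_size=2):
    """Build a class-balanced DataLoader: each training instance is weighted
    by the inverse frequency of its class and drawn with replacement.
    `dataset` is a detection-style dataset returning (image, target) pairs and
    `labels` holds the per-instance class ids; both are hypothetical inputs."""
    labels = np.asarray(labels)
    weights = 1.0 / np.bincount(labels)[labels]
    sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                    num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      collate_fn=lambda batch: tuple(zip(*batch)))

Sampling with replacement from inverse-frequency weights is what lets the rare 'cancer' and 'HGD' instances appear in training batches roughly as often as the abundant 'BE' instances.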
3. METHODS

We implemented the endoscopic disease detection and semantic segmentation pipeline for the EDD2020 challenge using a modified mask-RCNN [12] architecture trained in the feature-representation transfer learning mode. Mask-RCNN was proposed as an extension of Faster R-CNN, and the architecture has reportedly outperformed the previous state-of-the-art models used for the instance segmentation task on various image datasets. We used the PyTorch, torchvision, imgaug, pycococreator, maskrcnn-benchmark [13], apex, and OpenCV libraries in Python to build the various functions of the pipeline.

3.1. Pre-trained model backbone and network head removal

We removed the network head, i.e. the final layers, of a pre-trained model with a ResNet-101 backbone [12] that was initially trained on the COCO dataset. This stage is crucial, as the pre-trained model was trained for a different classification task. Removing the network head discards the weights and biases associated with the class-score, bounding-box predictor and mask predictor layers; the head is then replaced with new, untrained layers with the desired number of classes for the new data. We configured a six-class network head for the EDD2020 dataset (five assigned classes + background). We fed the augmented dataset and the associated masks into the mask-RCNN model architecture, as illustrated in Figure 2.

Fig. 2. Illustration of the mask-RCNN architecture adapted for transfer learning on the EDD dataset.

3.2. Transfer learning stages

At the initial stage, we froze the weights of the earlier layers of the pre-trained ResNet-101 backbone to extract generic low-level descriptors or patterns from the endoscopy image data; the later layers of the CNN become progressively more specific to the details of the output classes of the new dataset. The newly added network head is then trained to adapt its weights to the patterns and distribution of the new dataset, and is updated and fine-tuned during model training (a minimal setup sketch is given at the end of this section). The training of the model was done offline on an Ubuntu machine with an Intel(R) Core i9-9900X CPU @ 3.50 GHz, 62 GB of memory and a GeForce RTX 2060 GPU. The final model was fine-tuned with an Adam optimizer, a learning rate of 0.0001 and a categorical cross-entropy loss for 50000 epochs. To be noted, the dataset after augmentation is still quite small, so we employed five-fold cross-validation during training to avoid over-fitting of the model.
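As a minimal setup sketch of the head replacement (Section 3.1) and partial freezing with Adam fine-tuning (Section 3.2), the snippet below uses torchvision's Mask R-CNN implementation with a ResNet-50 FPN backbone. The paper's actual pipeline was built on maskrcnn-benchmark with a ResNet-101 backbone, so this is only an illustration of the transfer-learning steps under that substitution, not the exact configuration.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 6  # five EDD2020 disease classes + background

# Start from a Mask R-CNN pre-trained on COCO (torchvision's ResNet-50 FPN
# variant here; the paper used a ResNet-101 backbone via maskrcnn-benchmark).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the network head: the class-score / box-regression predictor ...
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# ... and the mask predictor, both re-initialised for the new classes.
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, NUM_CLASSES)

# Freeze the pre-trained backbone so only the new head (and any unfrozen
# later layers) adapt to the endoscopy data; the paper froze the earlier
# backbone layers rather than the whole backbone.
for param in model.backbone.parameters():
    param.requires_grad = False

# Adam optimiser with the learning rate reported in Section 3.2.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)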
4. RESULTS AND EVALUATION SCORE

Equations (1) to (3) in this section summarise the detection and segmentation metrics we use to evaluate the performance of a model trained on this dataset [1]; a small computation sketch is given at the end of this section. The metric mean average precision (mAP) measures the ability of an object detector to accurately retrieve all instances of the ground-truth bounding boxes; the higher the mAP, the better the performance. In Equation (1), N = 5 and AP_i indicates the average precision of the individual disease class i for this dataset.

\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \qquad (1)

\mathrm{score}_d = 0.6 \times \mathrm{mAP}_d + 0.4 \times \mathrm{IoU}_d \qquad (2)

\mathrm{score}_s = 0.25 \times (\mathrm{precision} + \mathrm{recall} + F_1 + F_2) \qquad (3)

For the detection task, the competition uses a final mean score (score_d), a weighted combination of mAP and IoU whose formula is presented in Equation (2). Here, IoU (intersection over union) measures the overlap between the ground-truth and predicted bounding boxes. For scoring the semantic segmentation task, an average measure (score_s) is calculated as per Equation (3): the average of the F1-score (Dice coefficient), F2-score, precision and recall. A detailed description of these metrics can be found in [1].

4.1. Results on validation dataset

Table 2 summarises the average precision performance on the isolated validation dataset (25 images with ground-truth masks), used to obtain an estimate of test-set performance. Average precision values are presented for two IoU thresholds. For AP(50), only candidates with more than 50% overlap with the ground truth were counted, and we achieved about 36.1% average precision for bounding-box detection and 34.7% for pixel-to-pixel segmentation. For AP(75), only candidates with more than 75% IoU were counted. Average precision values were also computed for large (AP(l)) and medium-sized (AP(m)) objects in the images, ranging from 32.27% to 45%, respectively. To be noted, we omitted AP(s) for small objects (area < 32 pixel^2) due to the absence of such small objects in the test dataset. Such low values are indicative of the model over-fitting, so we applied parameter tuning to the fully connected network layers along with realistic and balanced augmentation. This significantly improved the mAP for both the bounding box and the segmentation mask, to 47.9% and 51.3% respectively (rows 3 and 4 of Table 2).

Table 2. Validation-set bounding-box detection and segmentation scores before and after fine-tuning

Fine-tuning | Task    | mAP   | AP(50) | AP(75) | AP(m) | AP(l)
No          | bbox    | 0.291 | 0.361  | 0.319  | 0.450 | 0.328
No          | segment | 0.254 | 0.347  | 0.252  | 0.250 | 0.292
Yes         | bbox    | 0.479 | 0.689  | 0.600  | 0.675 | 0.493
Yes         | segment | 0.513 | 0.683  | 0.549  | 0.563 | 0.566

4.2. Results on the test dataset: Phase-I and Phase-II

For phase-I, we received 24 images, and Figure 3 shows the detection and segmentation output for some of the images from this test set. From the scores available on the leaderboard, we achieved an average precision of 51.14% for 'BE' and 50% for 'HGD' and 'polyp'; however, the scores for the 'suspicious' and 'cancer' areas were very low. We attained a dice coefficient of 0.4562 and an F2 score of 0.4508. We noticed that the missed detections and mis-classifications were due to the imbalance between classes. Hence, before the phase-II submission, we retrained the model after applying a 'WeightedRandomSampler' for selective and balanced sampling of the augmented dataset. During phase-II, we received 43 images, and we retrained the model with the balanced augmentation dataset. From the leaderboard scores available at this stage, the final detection score (score_d) and semantic segmentation score (score_s) are listed in Table 3. We observed an increase in the detection score to 0.29 when class balancing and instance cropping were applied to the training dataset, compared with our phase-I score of 0.24 obtained with generic augmentation techniques. We also achieved an improved semantic segmentation score of 0.62, up from our phase-I score of 0.52. The final model had a standard deviation of 0.082 in the mAP_d value and of 0.33 in the semantic score.

Table 3. Out-of-sample detection and segmentation scores

Training dataset                          | Test data | score_d | score_s
Original + flip, rotate, crop             | Phase-I   | 0.2460  | 0.5243
Original + instance-crop + class-balance  | Phase-II  | 0.2906  | 0.6264

Fig. 3. Semantic segmentation results on some of the images from the test dataset.
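As a small illustration of how the composite challenge scores in Equations (2) and (3) combine their components, the helpers below reproduce the two formulas; the numeric values in the usage example are placeholders for illustration only, not actual leaderboard components.

def detection_score(map_d: float, iou_d: float) -> float:
    """Composite detection score, Eq. (2): 0.6 * mAP_d + 0.4 * IoU_d."""
    return 0.6 * map_d + 0.4 * iou_d

def segmentation_score(precision: float, recall: float, f1: float, f2: float) -> float:
    """Composite semantic segmentation score, Eq. (3): the mean of
    precision, recall, F1 (Dice coefficient) and F2."""
    return 0.25 * (precision + recall + f1 + f2)

# Placeholder values, purely to show the weighting behaviour.
print(detection_score(0.30, 0.28))                  # -> 0.292
print(segmentation_score(0.60, 0.65, 0.62, 0.63))   # -> 0.625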
5. DISCUSSION & CONCLUSION

As balanced augmentation improved both the detection and segmentation scores in this task, the application of generative adversarial network-based augmentation techniques in the future could contribute to a more generalised and robust model. Additionally, we assumed that the detected object was spread uniformly across a detected region when the patch was classified as a specific disease type (cancer, polyp) on the basis of patch-specific features. However, one uniform region of cancer, polyp or BE is not always the case in practice; very often, multifocal patches of cancer, low-grade and high-grade dysplasia are scattered across the surface of the lesion. Further improvements are also required to deal with bubbles, saturation, instruments and other visible artefacts in the dataset [14, 15]. This will improve the model's performance by avoiding false detections in these regions and will provide a more accurate and realistic solution for endoscopic disease detection.

6. REFERENCES

[1] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. arXiv preprint arXiv:2003.03376, 2020.

[2] Jun Ki Min, Min Seob Kwak, and Jae Myung Cha. Overview of deep learning in gastrointestinal endoscopy. Gut and Liver, 13(4):388, 2019.

[3] Robert Mendel, Alanna Ebigbo, Andreas Probst, et al. Barrett's esophagus analysis using convolutional neural networks. In Image Processing for Medicine 2017, pages 80–85. Springer, 2017.

[4] Yoshimasa Horie, Toshiyuki Yoshio, Kazuharu Aoyama, Yoshimizu, et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointestinal Endoscopy, 89(1):25–32, 2019.

[5] Xiaohong W Gao, Barbara Braden, Stephen Taylor, and Wei Pang. Towards real-time detection of squamous pre-cancers from oesophageal endoscopic videos. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1606–1612, Dec 2019.

[6] Yoriaki Komeda, Hisashi Handa, et al. Computer-aided diagnosis based on convolutional neural network system for colorectal polyp classification: preliminary experience. Oncology, 93:30–34, 2017.

[7] Teng Zhou, Guoqiang Han, Bing Nan Li, et al. Quantitative analysis of patients with celiac disease by video capsule endoscopy: A deep learning method. Computers in Biology and Medicine, 85:1–6, 2017.

[8] Lequan Yu, Hao Chen, Qi Dou, Jing Qin, and Pheng Ann Heng. Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos. IEEE Journal of Biomedical and Health Informatics, 21(1):65–75, 2016.

[9] Lianlian Wu, Wei Zhou, Xinyue Wan, Jun Zhang, et al. A deep neural network improves endoscopic detection of early gastric cancer without blind spots. Endoscopy, 51(06):522–531, 2019.
[10] Yuichi Mori, Tyler M Berzin, and Shin-ei Kudo. Artificial intelligence for early gastric cancer: early promise and the path ahead. Gastrointestinal Endoscopy, 89(4):816–817, 2019.

[11] Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[13] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.

[14] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[15] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.