 TRANSFER LEARNING FOR ENDOSCOPY DISEASE DETECTION AND SEGMENTATION
              WITH MASK-RCNN BENCHMARK ARCHITECTURE

Shahadate Rezvy 1,4, Tahmina Zebin 2, Barbara Braden 3, Wei Pang 4, Stephen Taylor 5, Xiaohong W Gao 1

1 School of Science and Technology, Middlesex University London, UK
2 School of Computing Sciences, University of East Anglia, UK
3 Translational Gastroenterology Unit, John Radcliffe Hospital, University of Oxford, UK
4 School of Mathematical & Computer Sciences, Heriot-Watt University, UK
5 MRC Weatherall Institute of Molecular Medicine, University of Oxford, UK


                             ABSTRACT

We proposed and implemented a disease detection and semantic segmentation pipeline using a modified mask-RCNN architecture on the EDD2020 dataset (https://edd2020.grand-challenge.org). On the images provided for the phase-I test dataset, we achieved an average precision of 51.14% for 'BE' and 50% for 'HGD' and 'polyp'; however, the detection scores for 'suspicious' and 'cancer' were low. For phase-I we achieved a dice coefficient of 0.4562 and an F2 score of 0.4508. We noticed that the missed detections and mis-classifications were due to the imbalance between classes. Hence, we added a selective and balanced augmentation stage to our pipeline to provide more accurate detection and segmentation. After balancing the dataset, the detection score increased from 0.24 in phase-I to 0.29 on the phase-II images, and the semantic segmentation score improved from our phase-I score of 0.52 to 0.62.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                        1. INTRODUCTION

Endoscopy is an extensively used clinical procedure for the early detection of cancers in various organs such as the esophagus, stomach, colon, and bladder [1]. In recent years, deep learning methods have been applied to various endoscopic imaging tasks including esophago-gastro-duodenoscopy (EGD), colonoscopy, and capsule endoscopy (CE) [2]. Most of these are artificial neural network-based solutions in which accurate and consistent localization and segmentation of diseased regions-of-interest enable precise quantification and mapping of lesions from clinical endoscopy videos. This in turn enables reliable detection techniques for monitoring and surgical planning.

For oesophageal cancer detection, Mendel et al. [3] proposed an automatic approach for early detection of adenocarcinoma in the esophagus using high-definition endoscopic images (50 cancer, 50 Barrett's). They adapted a deep Convolutional Neural Network (CNN) to this dataset using a transfer learning approach. The model was evaluated with leave-one-patient-out cross-validation, achieving a sensitivity of 0.94 and a specificity of 0.88. Horie et al. [4] reported AI diagnosis of esophageal cancer, including squamous cell carcinoma (ESCC) and adenocarcinoma (EAC), using CNNs. The CNN correctly detected esophageal cancer cases with a sensitivity of 98% and could detect all small cancer lesions less than 10 mm in size. It reportedly distinguished superficial esophageal cancer from advanced cancer with an accuracy of 98%. Very recently, Gao et al. [5] investigated the feasibility of mask-RCNN (Region-based Convolutional Neural Network) and YOLOv3 architectures for detecting various stages of squamous cell carcinoma (SCC) in real time from subtle appearance changes. For the detection of SCC, the reported average accuracies for classification and detection were 85% and 74%, respectively.

For colonoscopy, deep neural network-based solutions were implemented to detect and classify colorectal polyps in the research presented in references [6, 7, 8]. For gastric cancer, Wu et al. [9] identified early gastric cancer (EGC) from non-malignancy with an accuracy of 92.5%, a sensitivity of 94.0%, a specificity of 91.0%, a positive predictive value of 91.3%, and a negative predictive value of 93.8%, outperforming all levels of endoscopists. In real-time unprocessed EGD videos, their deep CNN automatically detected EGC and monitored blind spots. Mori et al. [10] and Min et al. [2] provided comprehensive reviews of recent literature in this field.

For the Endoscopy Disease Detection and Segmentation (EDD2020) Grand Challenge, we proposed and implemented a disease detection and semantic segmentation pipeline using a modified mask-RCNN architecture. The rest of the paper is organized as follows. Section 2 introduces the dataset for the task. Section 3 presents our proposed architecture with its various settings and procedural stages, with results presented
and discussed in Section 4. Finally, conclusions are drawn in Section 5.

       2. DATASET DESCRIPTION AND IMAGE AUGMENTATION

The annotated dataset provided for the competition contained 388 frames from 5 different international centers and 3 organs (colon, esophagus, and stomach), targeting multiple populations and varied endoscopy video modalities associated with pre-malignant and diseased regions. The dataset was labeled by medical experts and experienced post-doctoral researchers, and came with object-wise binary masks and bounding box annotations. The class-wise object distribution in the dataset is shown in Table 1. A detailed description of the dataset can be found in [1].

Table 1. Class-wise object distribution [1]
Disease Category (Class name)                 Objects
Non-dysplastic Barrett's oesophagus (BE)        160
Subtle pre-cancerous lesion (Suspicious)         88
Suspected Dysplasia (HGD)                        74
Adenocarcinoma (Cancer)                          53
Polyp                                           127

We separated a small subset from the original training set, covering the various class labels, as our external validation set. This subset had 25 images and was programmatically chosen to have similar size and resolution to the images in the phase-I test dataset of 24 images. This set, with its ground truth labels, served as a checkpoint for assessing the trained model's performance.
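The subset-selection code is not part of this paper; purely as an illustrative sketch (the folder layout, helper names and the median-resolution criterion below are our assumptions), a resolution-matched validation subset could be drawn as follows:

    # Illustrative only: pick a small validation subset whose image sizes are
    # closest to the typical resolution of a reference (test) set.
    import os
    import numpy as np
    from PIL import Image

    def image_sizes(folder):
        """Return {filename: (width, height)} for every image in a folder."""
        sizes = {}
        for name in sorted(os.listdir(folder)):
            with Image.open(os.path.join(folder, name)) as im:
                sizes[name] = im.size
        return sizes

    def pick_validation_subset(train_dir, test_dir, k=25):
        """Pick the k training images whose resolution is closest to the
        median resolution of the test images."""
        train_sizes = image_sizes(train_dir)
        test_sizes = np.array(list(image_sizes(test_dir).values()))
        target = np.median(test_sizes, axis=0)   # median (width, height)
        dist = {name: np.linalg.norm(np.array(size) - target)
                for name, size in train_sizes.items()}
        return sorted(dist, key=dist.get)[:k]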
We applied image augmentation techniques [11] to the rest of the images and their associated masks. Our inspection of the dataset revealed a co-location of 'BE' regions with 'suspicious', 'cancer' and 'HGD' areas. We also noticed an imbalance between the classes and between images coming from the various organs. Hence, we added an instance-cropping stage to our pipeline that produced multiple images from these co-located images, each containing one target object, with the other objects removed by a selective cropping mechanism (example shown in Figure 1). We kept 10% padding around the ground truth bounding box provided for each instance. This isolated instances of 'cancer', 'suspicious' and 'HGD' regions from co-localized 'BE' regions. We then applied transformations such as rotation, flip and crop to the individual classes and instances to increase our training data, and used the 'WeightedRandomSampler' from the PyTorch data loader to form the final balanced training set of almost equal class representation. This set included 1670 instances in total. Figure 1 illustrates some of the augmentation methods applied in our pipeline.

Fig. 1. Augmentation methods applied on the images, including transformations such as rotation, flip and instance cropping.
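A minimal sketch of this instance cropping and of a joint image/mask geometric augmentation is given below, assuming numpy arrays for the image and mask and an (x1, y1, x2, y2) box; the function names and the exact rotation range are illustrative, and the imgaug library named in Section 3 is used for the transformations:

    # Sketch of instance cropping (10% padding around the ground-truth box)
    # and flip/rotate augmentation applied jointly to an image and its mask.
    # Names and parameter values are illustrative, not the released pipeline.
    import numpy as np
    import imgaug.augmenters as iaa
    from imgaug.augmentables.segmaps import SegmentationMapsOnImage

    def crop_instance(image, mask, bbox, pad_ratio=0.10):
        """Crop one object plus ~10% padding; bbox = (x1, y1, x2, y2)."""
        h, w = image.shape[:2]
        x1, y1, x2, y2 = bbox
        pad_x, pad_y = pad_ratio * (x2 - x1), pad_ratio * (y2 - y1)
        x1, y1 = max(0, int(x1 - pad_x)), max(0, int(y1 - pad_y))
        x2, y2 = min(w, int(x2 + pad_x)), min(h, int(y2 + pad_y))
        return image[y1:y2, x1:x2], mask[y1:y2, x1:x2]

    # Geometric augmentation applied to the crop and its mask together.
    augmenter = iaa.Sequential([
        iaa.Fliplr(0.5),               # horizontal flip
        iaa.Flipud(0.2),               # vertical flip
        iaa.Affine(rotate=(-30, 30)),  # random rotation (range assumed)
    ])

    def augment(image, mask):
        segmap = SegmentationMapsOnImage(mask.astype(np.int32), shape=image.shape)
        image_aug, segmap_aug = augmenter(image=image, segmentation_maps=segmap)
        return image_aug, segmap_aug.get_arr()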
                           3. METHODS

We implemented the endoscopic disease detection and semantic segmentation pipeline for the EDD2020 challenge using a modified mask-RCNN [12] architecture trained in a feature-representation transfer learning mode. Mask-RCNN was proposed as an extension of Faster R-CNN, and the architecture has reportedly outperformed previous state-of-the-art models on instance segmentation tasks across various image datasets. We used the PyTorch, torchvision, imgaug, pycococreator, maskrcnn-benchmark [13], apex and OpenCV libraries in Python to build the various functions of the pipeline.

3.1. Pre-trained model backbone and network head removal

We removed the network head, i.e. the final layers, of a pre-trained model with a ResNet-101 backbone [12] that was initially trained on the COCO dataset. This stage is crucial because the pre-trained model was trained for a different classification task. Removing the network head discards the weights and biases associated with the class score, bounding box predictor and mask predictor layers; the head is then replaced with new untrained layers with the desired number of classes for the new data. We configured a six-class network head for the EDD2020 dataset (five assigned classes + background). We fed the augmented dataset and the associated masks into the mask-RCNN model architecture as illustrated in Figure 2.

Fig. 2. Illustration of the mask-RCNN architecture adapted for transfer learning on the EDD dataset.
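The head replacement can be sketched as follows. Our pipeline is built on maskrcnn-benchmark [13], where the class count is changed in the configuration and the incompatible predictor weights are dropped from the COCO checkpoint; for brevity, the snippet below shows the equivalent step with torchvision's Mask R-CNN API, which ships a ResNet-50-FPN backbone rather than the ResNet-101 used in our experiments, so it is an illustrative alternative rather than our exact setup:

    # Replace the COCO-trained network head with a new six-class head
    # (5 EDD2020 classes + background), torchvision variant.
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
    from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

    NUM_CLASSES = 6  # BE, Suspicious, HGD, Cancer, Polyp + background

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # Replace the box classifier/regressor head.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

    # Replace the mask predictor head.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, NUM_CLASSES)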

3.2. Transfer learning stages

At the initial stage, we froze the weights of the earlier layers of the pre-trained ResNet-101 backbone, which extract generic low-level descriptors or patterns from the endoscopy image data; the later layers of the CNN become progressively more specific to the details of the output classes of the new dataset. The newly added network head is then trained to adapt its weights to the patterns and distribution of the new dataset, and is updated and fine-tuned during model training. Training was performed offline on an Ubuntu machine with an Intel(R) Core i9-9900X CPU @ 3.50 GHz, 62 GB of memory and a GeForce RTX 2060 GPU. The final model was fine-tuned with the Adam optimizer, a learning rate of 0.0001 and a categorical cross-entropy loss for 50000 epochs. To be noted, the dataset after augmentation is still quite small, so we employed five-fold cross-validation during training to avoid over-fitting of the model.
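A corresponding sketch of this two-stage training, continuing from the torchvision model object in the previous snippet, is shown below; only the optimizer choice and learning rate (Adam, 0.0001) come from the text, and the specific layers unfrozen for fine-tuning are an assumption:

    # Stage 1: freeze the pre-trained backbone so only the new head adapts.
    import torch

    for param in model.backbone.parameters():
        param.requires_grad = False

    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)

    # Stage 2 (fine-tuning): unfreeze the later backbone stages (assumed to be
    # the last two ResNet blocks) and continue training at the same low rate.
    for name, param in model.backbone.named_parameters():
        if "layer3" in name or "layer4" in name:
            param.requires_grad = True

    optimizer = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)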
              4. RESULTS AND EVALUATION SCORE

Equations (1) to (3) in this section summarise the detection and segmentation metrics used to evaluate the performance of a model trained on this dataset [1]. The metric mean average precision (mAP) measures the ability of an object detector to accurately retrieve all instances of the ground truth bounding boxes; the higher the mAP, the better the performance. In Equation (1), N = 5 and AP_i indicates the average precision of individual disease class i for this dataset.

    mAP = (1/N) × Σ_i AP_i                              (1)

    score_d = 0.6 × mAP_d + 0.4 × IoU_d                 (2)

    score_s = 0.25 × (precision + recall + F1 + F2)     (3)

For the detection task, the competition uses a final mean score (score_d), a weighted combination of mAP and IoU, as given in Equation (2). Here, IoU (intersection over union) measures the overlap between the ground truth and predicted bounding boxes. For the semantic segmentation task, an average measure (score_s) is calculated as per Equation (3): the average of the F1-score (Dice coefficient), F2-score, precision and recall. A detailed description of these metrics can be found in [1].
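Equations (1) to (3) translate directly into code; the helpers below are an illustrative transcription, assuming that the per-class AP values, the mean IoU and the precision/recall/F-scores have already been computed elsewhere:

    # Direct transcription of Equations (1)-(3), plus a box IoU helper.
    def mean_ap(per_class_ap):
        """Eq. (1): mAP over the N = 5 disease classes."""
        return sum(per_class_ap) / len(per_class_ap)

    def detection_score(map_d, iou_d):
        """Eq. (2): weighted detection score."""
        return 0.6 * map_d + 0.4 * iou_d

    def segmentation_score(precision, recall, f1, f2):
        """Eq. (3): average of precision, recall, F1 (Dice) and F2."""
        return 0.25 * (precision + recall + f1 + f2)

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)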
4.1. Results on validation dataset

Table 2 summarises the average precision performance on the isolated validation dataset (25 images with ground-truth masks), used to estimate the test set performance. Class-wise precision values are presented for two IoU thresholds. For AP(50), only candidates with more than 50% overlap with the ground truth region are counted; here we achieved about 36.1% average precision for bounding box detection and 34.7% average precision for pixel-to-pixel segmentation. For AP(75), only candidates with an IoU above 75% are counted. Average precision values were also computed for large (AP(l)) and medium-sized (AP(m)) objects in the images, with accuracy ranging from 32.27% to 45%. To be noted, we omitted AP(s) for small objects (area < 32^2 pixels) due to the absence of such small objects in the test dataset. These low values are indicative of the model over-fitting, so we applied parameter tuning to the fully connected network layers along with realistic and balanced augmentation. This significantly improved the mAP for both the bounding box and the segmentation mask, to 47.9% and 51.3% respectively (rows 3 and 4 of Table 2).

Table 2. Validation set bounding-box detection and segmentation scores before and after fine-tuning
Fine-tuning   Task      mAP      AP(50)   AP(75)   AP(m)   AP(l)
No            bbox      0.291    0.361    0.319    0.450   0.328
No            segment   0.254    0.347    0.252    0.250   0.292
Yes           bbox      0.479    0.689    0.600    0.675   0.493
Yes           segment   0.513    0.683    0.549    0.563   0.566

Table 3. Out-of-sample detection and segmentation scores
Training dataset                             Test data    score_d    score_s
Original + flip, rotate, crop                Phase-I      0.2460     0.5243
Original + instance-crop + class balancing   Phase-II     0.2906     0.6264
Fig. 3. Semantic segmentation results on some of the images from the test dataset.

4.2. Results on the test dataset: Phase-I and Phase-II

For phase-I, we received 24 images; Figure 3 shows detection and segmentation output for some of the images from this test set. From the scores available on the leaderboard, we achieved an average precision of 51.14% for 'BE' and 50% for 'HGD' and 'polyp'. However, the scores for the 'suspicious' and 'cancer' areas were very low. We attained a dice coefficient of 0.4562 and an F2 score of 0.4508. We noticed that the missed detections and mis-classifications were due to the imbalance between classes. Hence, before the phase-II submission, we retrained the model after applying a 'WeightedRandomSampler' for selective and balanced sampling of the augmented dataset. During phase-II, we received 43 images and retrained the model with the balanced augmentation dataset. From the leaderboard scores available at this stage, the final detection score (score_d) and semantic segmentation score (score_s) are listed in Table 3. We observed an increase in the detection score to 0.29 when class balancing and instance cropping were applied to the training dataset, compared with a score of 0.24 in phase-I obtained with generic augmentation techniques. The semantic segmentation score likewise improved to 0.62 from our phase-I score of 0.52. The final model had a standard deviation of 0.082 in the mAP_d value and of 0.33 in the semantic score.
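The class-balanced sampling used for the phase-II retraining can be sketched with PyTorch's WeightedRandomSampler; the dataset object and its get_label helper below are illustrative placeholders, not our released code:

    # Class-balanced sampling sketch: rarer classes receive proportionally
    # larger sampling weights so each batch is roughly class-balanced.
    from collections import Counter
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # One class label per cropped training instance (get_label is hypothetical).
    labels = [dataset.get_label(i) for i in range(len(dataset))]
    class_counts = Counter(labels)
    sample_weights = [1.0 / class_counts[lbl] for lbl in labels]

    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(sample_weights),
                                    replacement=True)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

With replacement=True, rare classes such as 'Cancer' are drawn roughly as often as frequent ones such as 'BE', which is what drives the almost equal class representation described in Section 2.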
                 5. DISCUSSION & CONCLUSION

As balanced augmentation improved both the detection and the segmentation score in this task, the application of generative adversarial network-based augmentation techniques in the future could contribute to a more generalised and robust model. Additionally, we assumed that the detected object was spread uniformly across a detected region, as each patch was classified as a specific disease type (cancer, polyp) depending on patch-specific features. However, one uniform region of cancer, polyp or BE is not always the case in practice: very often, multifocal patches of cancer, low-grade and high-grade dysplasia are scattered across the surface of the lesion. Further improvements are also required to deal with bubbles, saturation, instruments and other visible artefacts in the dataset [14, 15]. This will improve the model's performance by avoiding false detections in these regions and will provide a more accurate and realistic solution for endoscopic disease detection.

                      6. REFERENCES

 [1] Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, and James East. Endoscopy disease detection challenge 2020. arXiv preprint arXiv:2003.03376, 2020.

 [2] Jun Ki Min, Min Seob Kwak, and Jae Myung Cha. Overview of deep learning in gastrointestinal endoscopy. Gut and Liver, 13(4):388, 2019.

 [3] Robert Mendel, Alanna Ebigbo, Andreas Probst, et al. Barrett's esophagus analysis using convolutional neural networks. In Image Processing for Medicine 2017, pages 80–85. Springer, 2017.

 [4] Yoshimasa Horie, Toshiyuki Yoshio, Kazuharu Aoyama, Yoshimizu, et al. Diagnostic outcomes of esophageal cancer by artificial intelligence using convolutional neural networks. Gastrointestinal Endoscopy, 89(1):25–32, 2019.

 [5] Xiaohong W Gao, Barbara Braden, Stephen Taylor, and Wei Pang. Towards real-time detection of squamous pre-cancers from oesophageal endoscopic videos. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1606–1612, Dec 2019.

 [6] Yoriaki Komeda, Hisashi Handa, et al. Computer-aided diagnosis based on convolutional neural network system for colorectal polyp classification: preliminary experience. Oncology, 93:30–34, 2017.

 [7] Teng Zhou, Guoqiang Han, Bing Nan Li, et al. Quan-
     titative analysis of patients with celiac disease by video
     capsule endoscopy: A deep learning method. Comput-
     ers in biology and medicine, 85:1–6, 2017.

 [8] Lequan Yu, Hao Chen, Qi Dou, Jing Qin, and
     Pheng Ann Heng. Integrating online and offline three-
     dimensional deep learning for automated polyp detec-
     tion in colonoscopy videos. IEEE journal of biomedical
     and health informatics, 21(1):65–75, 2016.

 [9] Lianlian Wu, Wei Zhou, Xinyue Wan, Jun Zhang, et al.
     A deep neural network improves endoscopic detection
     of early gastric cancer without blind spots. Endoscopy,
     51(06):522–531, 2019.

[10] Yuichi Mori, Tyler M Berzin, and Shin-ei Kudo. Ar-
     tificial intelligence for early gastric cancer: early
     promise and the path ahead. Gastrointestinal endoscopy,
     89(4):816–817, 2019.

[11] Connor Shorten and Taghi M Khoshgoftaar. A survey
     on image data augmentation for deep learning. Journal
     of Big Data, 6(1):60, 2019.

[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross
     Girshick. Mask R-CNN. In Proceedings of the IEEE in-
     ternational conference on computer vision, pages 2961–
     2969, 2017.

[13] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.

[14] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden,
     Adam Bailey, Stefano Realdon, James East, Georges
     Wagnieres, Victor Loschenov, Enrico Grisan, et al. En-
     doscopy artifact detection (ead 2019) challenge dataset.
     arXiv preprint arXiv:1905.03209, 2019.

[15] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.