DEEP ENCODER-DECODER NETWORKS FOR ARTEFACTS SEGMENTATION IN ENDOSCOPY IMAGES

Yun Bo Guo, Qingshuo Zheng, Bogdan J. Matuszewski

Computer Vision and Machine Learning (CVML) Group
School of Engineering
University of Central Lancashire
{YBGuo1,QZheng5,BMatuszewski1}@uclan.ac.uk

ABSTRACT

Automated analysis of endoscopic images is becoming increasingly significant for the early detection of numerous cancers and for minimally invasive surgical procedures. The paper briefly describes the methodology adopted for the 2020 Endoscopy Artefact Detection and Segmentation (EAD2020) challenge¹. A number of novel variants of the DeepLab V3+ encoder-decoder architecture have been investigated, implemented and tested for the segmentation sub-challenge. Modifications were introduced to improve: selection of image features, segmentation of small objects, and use of the encoder output information. The proposed methods achieved competitive segmentation score results on both the release-I and release-II test datasets. For the detection sub-challenge, three off-the-shelf deep detection networks have been optimised and evaluated on the EAD data.

¹ It refers to the results submitted by the CVML team. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

Automated analysis of endoscopic images has obvious practical clinical importance. For example, colorectal cancer is one of the leading causes of death worldwide: in the United States it is the third largest cause of cancer deaths, whereas in Europe it is the second largest, with 243,000 deaths in 2018 [1]. Colonoscopy is the gold standard for colon screening, with the colon cancer survival rate strongly depending on early detection, i.e. a colonoscopy procedure.

Automation of the analysis of endoscopic images poses significant technical difficulties. As evident from the EAD challenge, the segmentation task is a very demanding problem, with multiple difficult-to-define semantic categories, possibly represented within the same or similar image locations, and structures of significantly different sizes. Additionally, some of these categories (e.g. "bubbles") are difficult to discriminate with respect to appearance and spatial distribution.

Segmentation is one of the key enabling technologies in medical image analysis, with a great variety of methods proposed [2, 3, 4]. More recently, methods based on deep learning have shown significant improvement in segmentation quality, also in the analysis of colonoscopy images [5, 6].

The key architectures used as the baseline for the segmentation methods developed for the EAD challenge are Dilated ResFCN [5], previously proposed by the authors, and the well-known DeepLab V3+ [7]. The summary of the changes made to these baseline architectures is briefly given in section 3. For completeness, the detection sub-task has also been investigated, with the YOLO V3 [8], Faster R-CNN [9] and Cascade R-CNN [10] methods used as the baseline and their design parameters optimised.

2. DATASETS

Only the data which have been made available as part of the EAD2020 challenge [11, 12] have been directly used for the reported methods' development. Some of the networks and/or sub-networks used in the designed architectures have been acquired from GitHub repositories². These are normally pre-trained on open generic image datasets, such as ImageNet or COCO. Apart from such cases, no data other than EAD2020 have been used for training, validation or testing of the developed architectures.

² //github.com{/ultralytics/yolov3, /open-mmlab/mmdetection, /hujie-frank/SENet}.

The original EAD2020 training images are augmented by rotation, colour jitter and elastic deformations. For the segmentation task, all the images have been scaled to 513×513 pixels, with two training data subsets created. The smaller training subset consists of 11,376 images, augmented from the phase-I training dataset. The networks trained on this smaller subset have been validated on the phase-II training dataset and online on the test datasets. This small training subset was predominantly used to quickly verify specific design choices made during the methods' development. The larger training subset consists of 38,195 images augmented from the phase-I and phase-II training datasets. That bigger training set was used to train the architectures which had been judged to provide competitive results when trained on the smaller dataset. The networks trained on the larger training subset were only evaluated online on the EAD2020 test datasets.

For the detection sub-problem, the images have been scaled to 667×400 pixels. As for the segmentation, two augmented training subsets were created. The smaller subset, with images augmented from the phase-I training dataset, consists of 8,800 images, whereas the large subset has 30,372 images augmented from the phase-I and phase-II training datasets.
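The augmentation settings are not specified beyond the transform types listed above; the following is a minimal sketch of how such an offline augmentation step could be implemented with torchvision, assuming paired image/mask samples and the 513×513 target size used for segmentation. The rotation range and jitter strengths are illustrative assumptions, not the authors' settings, and the elastic deformation is omitted (it requires applying an identical displacement field to both image and mask).

```python
# Minimal sketch of paired image/mask augmentation (rotation, colour jitter,
# resizing to 513x513).  Parameter values are illustrative assumptions.
import random
from PIL import Image
import torchvision.transforms.functional as TF
from torchvision.transforms import ColorJitter, InterpolationMode

def augment_pair(image: Image.Image, mask: Image.Image, size=(513, 513)):
    """Apply the same geometric transform to image and mask; photometric
    jitter is applied to the image only."""
    angle = random.uniform(-30.0, 30.0)   # assumed rotation range
    image = TF.rotate(image, angle, interpolation=InterpolationMode.BILINEAR)
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)

    jitter = ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)
    image = jitter(image)                  # colour jitter, image only

    image = TF.resize(image, size, interpolation=InterpolationMode.BILINEAR)
    mask = TF.resize(mask, size, interpolation=InterpolationMode.NEAREST)
    return image, mask
```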
3. METHODS

DeepLab V3+ [7] is an end-to-end trained semantic segmentation network, where a lower down-sampling rate and dilated convolutions are used to maintain the size of the feature maps, and an atrous spatial pyramid pooling (ASPP) module generates the final features based on multiple receptive fields. Finally, these features are up-sampled, and the classifier assigns a unique class label to each pixel.

A number of novel network architectures (here collectively named DeepEAD), based on the DeepLab V3+, have been proposed and validated for the EAD2020 segmentation challenge. In order to segment overlapping objects, the original multi-class classifier is replaced with 5 binary classifiers. Further changes lead to three network architectures:

• Network 1: The original DeepLab V3+ main sub-network is replaced by the SE-ResNeXt-50 [13]. It is expected to provide better image features, as it outperforms both the Xception and ResNet architectures (originally used by different implementations of the DeepLab V3+) on the image classification task.

• Network 2: Based on Network 1, with the global pooling removed from the ASPP and replaced with 3×3 convolutions. The corresponding receptive fields are expected to improve segmentation of small objects. Furthermore, the number of convolution kernels at each resolution is selected to emphasise small objects.

• Network 3: Shown in Fig. 1, it is based on Network 2, with a squeeze-and-excitation module added behind the ASPP module. This is to introduce attention gating at the output of the original encoder, to better utilise the information available in the computed feature maps.

Fig. 1. Flowchart of the Network 3 encoder architecture.
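The exact layer configuration of the modified encoder head is not given in the paper; the PyTorch sketch below is one plausible reading of the changes listed above: 3×3 dilated convolutions in place of the ASPP global-pooling branch (with the dilation rates 2, 4 and 6 motivated in the next paragraph), squeeze-and-excitation gating after the ASPP, and five independent sigmoid classifiers for the overlapping classes. Channel widths, the SE reduction ratio and the omission of the DeepLab V3+ decoder are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel gating (Hu et al. [13])."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)              # re-weight feature maps channel-wise

class ModifiedASPPHead(nn.Module):
    """ASPP without the global-pooling branch, followed by SE gating and
    five independent binary (sigmoid) classifiers for overlapping classes."""
    def __init__(self, in_ch=2048, mid_ch=256, num_classes=5, rates=(2, 4, 6)):
        super().__init__()
        branches = [nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                                  nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))]
        for r in rates:                    # 3x3 dilated convs replace global pooling
            branches.append(nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True)))
        self.branches = nn.ModuleList(branches)
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch * len(branches), mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.se = SEBlock(mid_ch)          # attention gating on the encoder output
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)   # 5 binary logits

    def forward(self, feats):
        x = torch.cat([b(feats) for b in self.branches], dim=1)
        x = self.se(self.project(x))
        return torch.sigmoid(self.classifier(x))   # per-class probability maps
```

The per-class probability maps produced by such a head would still be refined and up-sampled by the standard DeepLab V3+ decoder, which is left unchanged here.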
Following the methodology proposed in [14], Fig. 2 shows the number of active kernel weights of the dilated sub-networks. It can be seen that with a too high dilation rate the 3×3 kernel is effectively reduced to a 1×1 kernel. However, a too small dilation rate results in a small receptive field, having a negative effect on the network performance. The selected dilation rates of 2, 4 and 6 provide an effective compromise, with kernels having between 4 and 9 valid weights.

Fig. 2. The number of valid weights in the dilation kernels shown in Fig. 1.

Since the proposed networks do not have built-in rotation invariance, image rotation augmentation at test time has been investigated as a way of improving the segmentation accuracy. For this purpose, rotated versions of the test image are presented to the network and the corresponding outputs are averaged, to better utilise the generalisation properties of the network. The adopted test time augmentation process is illustrated in Fig. 3. The corresponding results, shown in section 4, demonstrate that the test time augmentation does indeed have a significant impact on the segmentation performance.

Fig. 3. Test time augmentation, with images on the left showing network outputs for the original image and its versions rotated in 30 degree intervals. The image on the right shows the result after augmentation, with the individual results superimposed in the original image reference frame.
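A minimal sketch of this rotation-based test time augmentation is given below, assuming a `model` that maps a 1×3×H×W tensor to per-class probability maps. The 30 degree step follows Fig. 3; simple averaging and the default handling of image corners during rotation are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def predict_with_rotation_tta(model, image, step_deg=30):
    """Average per-class probability maps over rotated copies of the input.

    `image` is a 1xCxHxW tensor; each prediction is rotated back to the
    original reference frame before averaging (cf. Fig. 3).
    """
    angles = range(0, 360, step_deg)
    accumulated = None
    for angle in angles:
        rotated = TF.rotate(image, angle)      # rotate the input image
        probs = model(rotated)                 # per-class probability maps
        probs = TF.rotate(probs, -angle)       # map back to the original frame
        accumulated = probs if accumulated is None else accumulated + probs
    return accumulated / len(angles)
```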
4. RESULTS

This section reports on a sample of results obtained for the segmentation and detection methods described above. Table 1 shows a representative sample of the results obtained for the segmentation task on both the validation and release-I test datasets. The results obtained on the validation data (phase-II training data) are reported in the second column, with all the networks trained only on the augmented images from the phase-I training dataset. The results on the release-I test dataset are reported in the third column. The symbol "*" indicates that the result has been obtained for the network trained on the large training dataset (i.e. images augmented from the phase-I and phase-II training sets); otherwise the results have been obtained for the network trained on the small training dataset (i.e. images augmented from the phase-I training dataset only - see section 2 for more details). It could be concluded that the gradual improvement of the results on the validation data is replicated on the test data. As expected, the use of the large training set also improves performance. This can be seen from the results reported for Network 2, with a segmentation score of 0.50 for the network trained on the smaller training set, and a score of 0.59 for exactly the same network but trained on the large dataset. It seems that the segmentation score of 0.5934 (for Network 2 trained on the larger training set) was a competitive result on the release-I test dataset.

Table 1. The segmentation score results for various segmentation networks, obtained on the validation (second column) and release-I test (third column) data.

Method         sscore (validation)   sscore (test data)
DeepLab V3+    0.45                  0.40
Network 1      0.50                  0.48
Network 2      0.52                  0.50 / 0.59*
Network 3      0.54                  0.52

The best results obtained on the release-II test dataset, with all the networks trained on the larger training dataset, are reported in Table 2. As evident from the table, Network 3 provides the best segmentation results, with the test time augmentation improving the segmentation score by 0.0434, i.e. about 8%. The effects of the test time augmentation are shown in Fig. 4, demonstrating the impact of the augmentation on segmentation of the "instrument" class.

Table 2. Segmentation scores on the release-II test data.

Method                                 sscore (release-II test)
Network 2                              0.5406
Network 3                              0.5488
Network 3 (+ test time augmentation)   0.5922

Fig. 4. The result from Network 3 with (in red) and without (in blue) test time augmentation.

Various post-processing operations have also been tested, including hole filling and removal of objects overlapping the black image boundary. These, though, had a relatively small, and difficult to predict, effect on the segmentation score (a sketch of one possible implementation is given at the end of this section). The segmentation score for the final submission was reported as 0.5916, which was slightly lower than the best result of 0.5922 (see Table 2).

Table 3 shows results obtained for the different detection networks tested on the release-II test data. It can be observed that the R-CNN networks outperform the YOLO network, with the best detection score achieved by the Faster R-CNN. This is different from the results obtained on the release-I test set (not reported here), where the YOLO network achieved a better result. This, though, could possibly be explained by the optimisation of the networks' design parameters during the second phase of testing.

Table 3. Detection scores on the release-II test data.

Method           Detection score (release-II test data)
YOLO V3          0.1992
Faster R-CNN     0.2335
Cascade R-CNN    0.2162
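The post-processing operations mentioned above are only described qualitatively; the sketch below shows one plausible realisation of hole filling and removal of predicted components lying mostly on the dark border of the endoscopic frame, using scipy and scikit-image. The darkness threshold and the 50% overlap criterion are assumptions, not the authors' settings.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.measure import label, regionprops

def postprocess_mask(binary_mask, image_rgb, dark_thresh=10):
    """Fill holes in a binary class mask and drop connected components that
    lie mostly on the dark (black) border region of the endoscopic frame.

    `binary_mask` is HxW (bool/0-1) and `image_rgb` is the matching HxWx3
    uint8 frame.
    """
    mask = binary_fill_holes(binary_mask.astype(bool))

    # Approximate the black frame border as very dark pixels.
    border = image_rgb.max(axis=2) < dark_thresh

    labelled = label(mask)
    cleaned = np.zeros_like(mask)
    for region in regionprops(labelled):
        component = labelled == region.label
        # Keep the component unless most of it overlaps the dark border.
        if (component & border).sum() <= 0.5 * region.area:
            cleaned |= component
    return cleaned
```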
5. DISCUSSION & CONCLUSION

The paper describes novel segmentation networks, highlighting the key characteristics of the proposed deep architectures. The proposed methods achieved segmentation scores of 0.5934 on the release-I test data and 0.5922 on the release-II test data, which seem to be competitive. The overall detection performance also seems comparatively reasonable, with the best detection score of 0.2335 on the release-II test data. However, the statistical significance of these results would need to be investigated. Further improvements could be possible, e.g. with image aspect ratio augmentation to reflect the input format of the adopted networks, or with the use of the segmentation network as a pre-selection tool for detection of small objects (e.g. specularity artefacts).

6. REFERENCES

[1] J. Ferlay, M. Colombet, I. Soerjomataram, T. Dyba, G. Randi, M. Bettio, A. Gavin, O. Visser, and F. Bray. Cancer incidence and mortality patterns in Europe: Estimates for 40 countries and 25 major cancers in 2018. European Journal of Cancer, 103:356-387, 2018.

[2] Aymeric Histace, Bogdan J. Matuszewski, and Yan Zhang. Segmentation of myocardial boundaries in tagged cardiac MRI using active contours: A gradient-based approach integrating texture analysis. Int. J. Biomedical Imaging, 2009:983794:1-983794:8, 2009.

[3] Yan Zhang, Bogdan J. Matuszewski, Aymeric Histace, Frédéric Precioso, Judith Kilgallon, and Christopher J. Moore. Boundary delineation in prostate imaging using active contour segmentation method with interactively defined object regions. Volume 6367 of Lecture Notes in Computer Science, pages 131-142. Springer, 2010.

[4] Yan Zhang, Bogdan J. Matuszewski, Aymeric Histace, and Frédéric Precioso. Statistical model of shape moments with active contour evolution for shape detection and segmentation. Journal of Mathematical Imaging and Vision, 47(1-2):35-47, 2013.

[5] Yun Bo Guo and Bogdan J. Matuszewski. GIANA polyp segmentation with fully convolutional dilation neural networks. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2019, Volume 4: VISAPP, Prague, Czech Republic, February 25-27, 2019, pages 632-641. SciTePress, 2019.

[6] Yun Bo Guo and Bogdan J. Matuszewski. Polyp segmentation with fully convolutional deep dilation neural network. In Medical Image Understanding and Analysis - 23rd Conference, MIUA 2019, Liverpool, UK, July 24-26, 2019, Proceedings, volume 1065 of Communications in Computer and Information Science, pages 377-388. Springer, 2019.

[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801-818, 2018.

[8] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.

[9] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.

[10] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. CoRR, abs/1906.09756, 2019.

[11] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[12] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.

[14] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.