OXENDONET: A DILATED CONVOLUTIONAL NEURAL NETWORK FOR ENDOSCOPIC ARTEFACT SEGMENTATION

Mourad Gridach1,2, Irina Voiculescu2
1 Department of Computer Science, High Institute of Technology, Agadir
2 Department of Computer Science, University of Oxford, UK

ABSTRACT

Medical image segmentation plays a key role in many generic applications such as population analysis and, more accessibly, can be made into a crucial tool in diagnosis and treatment planning. Its output can vary from extracting practical clinical information such as pathologies (detection of cancer) to measuring anatomical structures (kidney volume, cartilage thickness, bone angles). Many prior approaches to this problem are based on one of two main architectures: a fully convolutional network or a U-Net-based architecture. These methods rely on multiple pooling and striding layers to increase the receptive field size of neurons. In a segmentation task, however, these pooling layers reduce the feature map size and lead to the loss of important spatial information. In this paper, we propose a novel neural network, which we call OxEndoNet. Our network uses the pyramid dilated module (PDM), consisting of multiple dilated convolutions stacked in parallel. The PDM eliminates the need for striding layers and has a very large receptive field which maintains spatial resolution. We combine several pyramid dilated modules to form our final OxEndoNet network. The proposed network is able to capture small and complex variations in the challenging problem of Endoscopy Artefact Detection and Segmentation, where objects vary widely in scale and size.*

1. INTRODUCTION

Medical image segmentation [1, 2] is an important step in many medical applications such as population analysis, disease diagnosis, treatment planning and medical intervention, where the goal is to extract useful information such as pathologies, biological organs and structures. In most clinics, segmentation currently relies on the time-consuming task of drawing contours manually, by medical experts such as radiologists, pathologists and ophthalmologists. This can be challenging because features of interest (soft tissue, blood vessels, cancer cells) can have large and complex variations (contrast, blur, noise, artifacts, and distortion). Automating even part of the segmentation task is a good way of reducing time spent on routine activities, as well as improving the handling of larger volumes of data which are increasingly available from a large variety of modern scanners. Any such automated process should, of course, still allow for manual override by a human expert.

Recently, deep neural networks (DNNs) have been successfully used in semantic and biomedical image segmentation. Long et al. [3] proposed a fully convolutional network (FCN) to perform end-to-end semantic image segmentation, which surpassed the existing approaches. Ronneberger et al. [4] developed a U-shaped deep convolutional network called U-Net, consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization. Using this (now widely cited) architecture, U-Net outperforms all the previous models by a significant margin. Based on U-Net, Chen et al. [5] developed a model called DCAN, which won the 2015 MICCAI Gland Segmentation Challenge.

Such approaches suffer from two main limitations. Firstly, with complex and large variations in the size of objects in medical images, an FCN with a single receptive field size fails to deal with such variations. Secondly, as in object detection and classical semantic segmentation, global context is also very important in medical image segmentation. Classical networks such as U-Net and FCN miss some parts of the images because they fail to see the entire image and incorporate global context when producing the segmentation mask. For example, U-Net only has receptive fields that span 68 × 68 pixels [4].

Our goal has therefore been to design a network that is able to integrate global context in order to detect and assess the interdependence of organs in medical images.

To address the issues described above, we propose the novel OxEndoNet, a neural network architecture based on dilated convolutions. This architecture tackles the challenging variations in the size of anatomical features in medical images. OxEndoNet is used to address the problems of the EAD2020 Challenge (multi-class artefact segmentation in video endoscopy).

Our network has a large receptive field built from a novel architectural module called the Pyramid Dilated Module (PDM), which captures robust and dense local and global features that directly influence the final prediction and make it more accurate. The PDM consists of multiple dilated convolutions stacked in parallel. Combining several PDM layers leads to our OxEndoNet network, which is detailed in Section 3.3. Unlike many methods used in other similar challenges, an ensemble of models was not used in this case, which makes OxEndoNet a promising framework for the future.

* Work carried out during a collaborative visit at the University of Oxford.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. DATASETS

In this challenge, we use the Endoscopy Artefact Detection and Segmentation dataset (https://ead2020.grand-challenge.org). Its goal is to capture the wide visual diversity of endoscopic videos acquired in everyday clinical settings. For more details about the dataset, we refer the reader to [6, 7, 8]. The training employed the released data split into two sets: 80% of it was used for training per se, whereas the remaining 20% was kept as validation data. The final architecture is based on the results on the validation data. The metrics around which the learning was based are Accuracy and F1-score, hence our network scoring well in the F1 measure in this challenge.

3. METHODS

Some background information about other networks is necessary in order to describe our proposed architecture.

3.1. Dilated Convolution

Dilated convolution (or atrous convolution) was originally developed in the algorithme à trous for wavelet decomposition [9]. The main idea of dilated convolutions is to insert holes (trous in French) as zeros between pixels in convolutional filters. As a result, features can be computed densely and at higher resolution in convolutional neural networks. More formally, given a 1-D input signal f and y as the output signal at location i of a dilated convolution, we represent dilated convolution in one dimension as follows:

    y[i] = \sum_{s=1}^{S} f[i + d \cdot s] \cdot w[s]        (1)

where w[s] denotes the s-th parameter of the filter, d is the dilation rate, and S is the filter size. When d = 1, dilated convolution corresponds to standard convolution. In other words, dilated convolution is equivalent to convolving the input f with up-sampled filters produced by inserting d − 1 zeros between two consecutive filter values. Therefore, a large dilation rate means a large receptive field. Its main advantage is the ability to enlarge the receptive field size to incorporate context without introducing extra parameters or computational cost. Dilated convolution has been successfully applied in many computer vision applications such as audio generation [10], object detection [11], and semantic segmentation [12].
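To make Eq. (1) and the receptive-field argument concrete, the short sketch below (our illustration, not code from the paper) implements the 1-D dilated convolution directly and shows how the same idea is exposed in 2-D through the dilation argument of PyTorch's nn.Conv2d; the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

def dilated_conv1d(f, w, d):
    """Direct implementation of Eq. (1): y[i] = sum_{s=1..S} f[i + d*s] * w[s]."""
    S = w.numel()
    out_len = f.numel() - d * S          # positions i for which every tap f[i + d*s] exists
    y = torch.zeros(out_len)
    for i in range(out_len):
        for s in range(1, S + 1):        # the filter w is 0-indexed in code, hence w[s - 1]
            y[i] += f[i + d * s] * w[s - 1]
    return y

# With d = 1 this reduces to a standard convolution; a larger d skips d - 1 samples
# between filter taps and enlarges the receptive field without adding weights.
f = torch.randn(32)
w = torch.randn(3)
y = dilated_conv1d(f, w, d=4)

# In 2-D the same effect is obtained via the dilation argument of a standard
# convolution layer: a 3x3 kernel with dilation 4 covers a 9x9 region while
# still having only 9 weights per channel (padding=4 preserves the spatial size).
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, dilation=4, padding=4)
```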
Fig. 1. Pyramid Dilated Module architecture. Four dilated convolutions with dilation rates of 1, 2, 3 and 4 are stacked in parallel, and the results of the convolutions are concatenated.

3.2. Pyramid Dilated Module

In a deep neural network, the size of the receptive field plays an important role in indicating the extent to which context information is used. Previous work uses pooling layers and strided convolution to enlarge the receptive field. These techniques significantly improve performance in applications like image classification and object detection because they require a single prediction per input image. However, in tasks requiring dense per-pixel prediction, such as image segmentation, strided layers often fail to give better results because some of the spatial detail is lost, which affects the pixel-wise prediction. An alternative to strided convolution is to increase the size of the filters; a common limitation of this approach is a severe increase in the number of parameters to optimize and in training time.

3.3. OxEndoNet Network

Motivated by the recent success of dilated convolution, we propose a new pyramid dilated module (PDM), which empirically proves to be a powerful feature extractor for the endoscopy artefact detection and segmentation task. As shown in Figure 1, we stack convolutions with different dilation rates in parallel. In this case, the PDM has four parallel convolutions with 3 × 3 filter size and dilation rates of 1, 2, 3 and 4. The activation function we used is the Rectified Linear Unit (ReLU) [13]. Each dilated convolution produces an output of the same dimensions. To form the final PDM, we concatenate the outputs of the dilated convolutions. By combining dilated convolutions with different dilation rates, the PDM is able to extract useful features for objects of various sizes. All the previous advantages play a remarkable role in medical image segmentation, because medical images often feature organs of different sizes.
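The following is a minimal PyTorch sketch of one PDM layer as described above and in Figure 1: four parallel 3 × 3 convolutions with dilation rates 1–4 and ReLU activations, whose outputs are concatenated. The padding choice and the absence of normalisation layers are our assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class PyramidDilatedModule(nn.Module):
    """Parallel 3x3 dilated convolutions (rates 1, 2, 3, 4) whose outputs are concatenated."""

    def __init__(self, in_channels, filters_per_branch, rates=(1, 2, 3, 4)):
        super().__init__()
        # Setting padding equal to the dilation rate keeps every branch at the same
        # spatial size, so the outputs can be concatenated along the channel axis.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, filters_per_branch, kernel_size=3,
                          dilation=rate, padding=rate),
                nn.ReLU(inplace=True),
            )
            for rate in rates
        ])

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: 16 filters per branch gives 4 * 16 = 64 output channels,
# matching the first PDM layer described in Section 3.3.
pdm = PyramidDilatedModule(in_channels=256, filters_per_branch=16)
out = pdm(torch.randn(1, 256, 32, 32))   # -> shape (1, 64, 32, 32)
```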
Given this PDM, we propose the OxEndoNet network illustrated in Figure 2. For each input image, we use ResNet-50 pretrained on ImageNet [14] as the base network to extract the feature map, followed by multiple PDM layers to form an end-to-end trainable network. By using several layers, we increase the receptive field size, which allows our model to use context information. In the final architecture, we use four PDM layers; each layer uses four parallel dilated convolutions with filter size of 3 × 3 and dilation rates of 1, 2, 3, and 4. We note that the number of PDM layers and the number of parallel dilated convolutions are hyperparameters. The PDM layers have 64, 128, 256, and 128 output channels, using 16, 32, 64 and 32 filters per branch respectively. We feed the final PDM layer to a convolution layer followed by a bilinear interpolation to up-scale the feature map to the original size of the image.

Fig. 2. OxEndoNet architecture. O and r × c × d refer to the output of each PDM layer and its dimensions respectively.

The architecture design followed two key observations. Firstly, recognizing organs in medical images requires a high spatial precision that is lost when applying pooling with striding layers; this is the main issue in FCN- and U-Net-based models. Secondly, complex and large variations in the size of objects in medical images lead to inaccurate predictions when a small or medium-sized receptive field fails to deal with such variations. Therefore, an accurate model should have a large receptive field to handle these complex variations of organs in images. Our OxEndoNet network produces a large receptive field to incorporate larger context without increasing the number of parameters or the amount of computation, while preserving full spatial resolution.
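Putting the pieces together, the sketch below reuses the PyramidDilatedModule class from the previous listing to assemble a network along the lines described: a ResNet-50 backbone, four stacked PDM layers with 64/128/256/128 output channels, and a final convolution followed by bilinear upsampling. The choice of backbone stage (the 2048-channel output of the last residual block) and the 1 × 1 classifier are our assumptions, since the paper does not give these details.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class OxEndoNet(nn.Module):
    """Sketch: ResNet-50 features -> four PDM layers -> 1x1 conv -> bilinear upsampling."""

    def __init__(self, num_classes):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # Keep everything up to the last residual stage; its output has 2048 channels.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Filters per branch of 16/32/64/32 give 64/128/256/128 output channels.
        self.pdm_layers = nn.Sequential(
            PyramidDilatedModule(2048, 16),
            PyramidDilatedModule(64, 32),
            PyramidDilatedModule(128, 64),
            PyramidDilatedModule(256, 32),
        )
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        input_size = x.shape[-2:]
        features = self.pdm_layers(self.backbone(x))
        logits = self.classifier(features)
        # Bilinear interpolation up-scales the prediction to the original image size.
        return F.interpolate(logits, size=input_size, mode="bilinear", align_corners=False)
```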
4. EXPERIMENTS AND RESULTS

We implemented OxEndoNet using the public framework PyTorch [15]. The number of PDM layers, the learning rate and the number of parallel dilated convolutions are the main hyperparameters that influenced our model's performance. During training, we used the Adam optimizer [16] with an initial learning rate of 3 × 10^{-3} and weight decay of 10^{-4}. Furthermore, we used the poly learning rate policy [17] by multiplying the initial rate with (1 − epoch/maxEpochs)^{0.9}, and trained the models for 300 epochs. For the number of PDM layers, we conducted experiments with 3, 4 and 5 layers. Concerning the number of parallel dilated convolutions, we ran experiments with 3, 4 and 5 parallel convolutions. It should also be noted that all the hyperparameters were selected based on performance on the validation data.

We tested the performance of our model on the released test data, named Test Data Phase 1, which consisted of 50% of the overall test data. In Phase 1 the test data contained 80 images, the results of which we submitted to the challenge. Table 1 shows the results of our model on this test data. The overall results will be specified after the workshop.

Model        Overlap   F2-score   score-s
OxEndoNet    0.4901    0.5107     0.5194

Table 1. Results of OxEndoNet on Phase 1 test data.
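The optimiser and poly learning-rate schedule described above can be written down roughly as follows. This is a sketch under stated assumptions (the OxEndoNet class from the earlier listing, a placeholder number of output classes, and a per-epoch schedule update), not the authors' training script.

```python
import torch

NUM_CLASSES = 5    # placeholder; set to the number of artefact classes in the challenge
model = OxEndoNet(num_classes=NUM_CLASSES)

base_lr, max_epochs = 3e-3, 300
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=1e-4)
# Poly policy [17]: lr = base_lr * (1 - epoch / max_epochs) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / max_epochs) ** 0.9)

for epoch in range(max_epochs):
    # ... one pass over the 80% training split goes here: forward pass,
    #     segmentation loss, loss.backward(), optimizer.step() per batch ...
    scheduler.step()
```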
5. DISCUSSION & CONCLUSION

We have described OxEndoNet, a neural network designed to tackle the challenging problem of Endoscopy Artefact Detection and Segmentation, where objects vary widely in scale and size. Its pyramid dilated module consists of parallel dilated convolutions whose concatenated outputs provide additional contextual information. The need for pooling and striding layers, considered a major drawback of other segmentation methods, is fully eliminated. Both PDM and OxEndoNet will be useful frameworks for the community to explore in other computer vision tasks. In the future, we plan to test our model on a wide variety of medical image volumes, as well as on generic semantic image segmentation tasks.

6. ACKNOWLEDGMENT

We would like to thank AfOx for the visiting fellowship which supported Mourad Gridach's visit to the Oxford Department of Computer Science, where this model was designed.

7. REFERENCES

[1] Scott Fernquest, Daniel Park, Marija Marcan, Antony Palmer, Irina Voiculescu, and Sion Glyn-Jones. Segmentation of hip cartilage in compositional magnetic resonance imaging: A fast, accurate, reproducible, and clinically viable semi-automated methodology. Journal of Orthopaedic Research, 36(8):2280–2287, 2018.

[2] Varduhi Yeghiazaryan and Irina D Voiculescu. Family of boundary overlap metrics for the evaluation of medical image segmentation. Journal of Medical Imaging, 5(1):015006, 2018.

[3] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[5] Hao Chen, Xiaojuan Qi, Lequan Yu, and Pheng-Ann Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2487–2496, 2016.

[6] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[7] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[8] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[9] Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.

[10] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[11] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.

[12] Zhengyang Wang and Shuiwang Ji. Smoothed dilated convolutions for improved dense prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2486–2495, 2018.

[13] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.