OXENDONET: A DILATED CONVOLUTIONAL NEURAL NETWORK FOR ENDOSCOPIC ARTEFACT SEGMENTATION

Mourad Gridach1,2, Irina Voiculescu2
1 Department of Computer Science, High Institute of Technology, Agadir
2 Department of Computer Science, University of Oxford, UK

ABSTRACT

Medical image segmentation plays a key role in many generic applications such as population analysis and, more accessibly, can be made into a crucial tool in diagnosis and treatment planning. Its output can vary from extracting practical clinical information such as pathologies (detection of cancer) to measuring anatomical structures (kidney volume, cartilage thickness, bone angles). Many prior approaches to this problem are based on one of two main architectures: a fully convolutional network or a U-Net-based architecture. These methods rely on multiple pooling and striding layers to increase the receptive field size of neurons. In a segmentation task, however, these pooling layers reduce the feature map size and lead to the loss of important spatial information. In this paper, we propose a novel neural network, which we call OxEndoNet. Our network uses the pyramid dilated module (PDM), consisting of multiple dilated convolutions stacked in parallel. The PDM eliminates the need for striding layers and has a very large receptive field which maintains spatial resolution. We combine several pyramid dilated modules to form our final OxEndoNet network. The proposed network is able to capture small and complex variations in the challenging problem of Endoscopy Artefact Detection and Segmentation, where objects vary widely in scale and size.*

1. INTRODUCTION

Medical image segmentation [1, 2] is an important step in many medical applications such as population analysis, disease diagnosis, treatment planning and medical intervention, where the goal is to extract useful information such as pathologies, biological organs and structures. In most clinics, segmentation currently relies on the time-consuming task of drawing contours manually, by medical experts such as radiologists, pathologists and ophthalmologists. This can be challenging because features of interest (soft tissue, blood vessels, cancer cells) can have large and complex variations (contrast, blur, noise, artifacts, and distortion). Automating even part of the segmentation task is a good way of reducing time spent on routine activities, as well as improving the handling of larger volumes of data which are increasingly available from a large variety of modern scanners. Any such automated process should, of course, still allow for manual override by a human expert.

Recently, deep neural networks (DNNs) have been successfully used in semantic and biomedical image segmentation. Long et al. [3] proposed a fully convolutional network (FCN) to perform end-to-end semantic image segmentation, which surpassed the existing approaches. Ronneberger et al. [4] developed a U-shaped deep convolutional network called U-Net, consisting of a contracting path to capture context and a symmetric expanding path that enables precise localization. Using this (now widely cited) architecture, U-Net outperforms all the previous models by a significant margin. Based on U-Net, Chen et al. [5] developed a model called DCAN, which won the 2015 MICCAI Gland Segmentation Challenge.

Such approaches suffer from two main limitations. Firstly, with complex and large variations in the size of objects in medical images, an FCN with a single receptive field size fails to deal with such variations. Secondly, as in object detection and classical semantic segmentation, global context is also very important in medical image segmentation. Classical networks such as U-Net and FCN miss some parts of the images because they fail to see the entire image and incorporate global context when producing the segmentation mask. For example, U-Net only has receptive fields that span 68 × 68 pixels [4].

Our goal has therefore been to design a network that is able to integrate global context in order to detect and assess the interdependence of organs in medical images.

To address the issues described above, we propose the novel OxEndoNet, a neural network architecture based on dilated convolutions. This architecture tackles the challenging variations in the size of anatomical features in medical images. OxEndoNet is used to address the problems of the EAD2020 Challenge (multi-class artefact segmentation in video endoscopy).

Our network has a large receptive field built from a novel architectural module called the Pyramid Dilated Module (PDM), which captures robust and dense local and global features that directly influence the final prediction and make it more accurate. The PDM consists of multiple dilated convolutions stacked in parallel. Combining several PDM layers leads to our OxEndoNet network, which is detailed in Section 3.3. Unlike many methods used in other similar challenges, an ensemble of models was not used in this case, which makes OxEndoNet a promising framework for the future.

* Work carried out during a collaborative visit at the University of Oxford.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. DATASETS

In this challenge, we use the Endoscopy Artefact Detection and Segmentation dataset (https://ead2020.grand-challenge.org). Its goal is to capture the wide visual diversity of endoscopic videos acquired in everyday clinical settings. For more details about the dataset, we refer the reader to [6, 7, 8]. The training employed the released data split into two sets: 80% of it was used for training per se, whereas the remaining 20% was kept as validation data. The final architecture is based on the results on the validation data. The metrics around which the learning was based are Accuracy and F1-score, hence our network scoring well in the F1 measure in this challenge.

3. METHODS

Some background information about other networks is necessary in order to describe our proposed architecture.

3.1. Dilated Convolution

Dilated convolution (or atrous convolution) was originally developed in the algorithme à trous for wavelet decomposition [9]. The main idea of dilated convolutions is to insert holes (trous in French) as zeros between pixels in convolutional filters. As a result, features can be computed densely and at higher resolution in convolutional neural networks. More formally, given a 1-D input signal f and y as the output signal at location i of a dilated convolution, we represent dilated convolution in one dimension as follows:

    y[i] = \sum_{s=1}^{S} f[i + d \cdot s] \cdot w[s]        (1)

where w[s] denotes the s-th parameter of the filter, d is the dilation rate, and S is the filter size. When d = 1, dilated convolution corresponds to standard convolution. In other words, dilated convolution is equivalent to convolving the input f with up-sampled filters produced by inserting d − 1 zeros between two consecutive filter values. Therefore, a large dilation rate means a large receptive field. Its main advantage is the ability to enlarge the receptive field size to incorporate context without introducing extra parameters or computational cost. Dilated convolution has been successfully applied in many computer vision applications such as audio generation [10], object detection [11], and semantic segmentation [12].
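To make Eq. (1) and the receptive-field argument concrete, the short sketch below (our illustration, not code from the paper) implements the 1-D dilated convolution directly and shows how the same idea is exposed in 2-D through the dilation argument of PyTorch's nn.Conv2d; the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

def dilated_conv1d(f, w, d):
    """Direct implementation of Eq. (1): y[i] = sum_{s=1..S} f[i + d*s] * w[s]."""
    S = w.numel()
    out_len = f.numel() - d * S          # positions i for which every tap f[i + d*s] exists
    y = torch.zeros(out_len)
    for i in range(out_len):
        for s in range(1, S + 1):        # the filter w is 0-indexed in code, hence w[s - 1]
            y[i] += f[i + d * s] * w[s - 1]
    return y

# With d = 1 this reduces to a standard convolution; a larger d skips d - 1 samples
# between filter taps and enlarges the receptive field without adding weights.
f = torch.randn(32)
w = torch.randn(3)
y = dilated_conv1d(f, w, d=4)

# In 2-D the same effect is obtained via the dilation argument of a standard
# convolution layer: a 3x3 kernel with dilation 4 covers a 9x9 region while
# still having only 9 weights per channel (padding=4 preserves the spatial size).
conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, dilation=4, padding=4)
```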
Fig. 1. Pyramid Dilated Module architecture. Four dilated convolutions with dilation rates of 1, 2, 3 and 4 are stacked in parallel, and the results of the convolutions are concatenated.

3.2. Pyramid Dilated Module

In a deep neural network, the size of the receptive field plays an important role in indicating the extent to which context information is used. Previous work uses pooling layers and strided convolution to enlarge the receptive field. These techniques significantly improve performance in applications like image classification and object detection because they require a single prediction per input image. However, in tasks requiring dense per-pixel prediction, such as image segmentation, strided layers often fail to give better results because some of the spatial detail is lost, which affects the pixel-wise prediction. An alternative to strided convolution is to increase the size of the filters; a common limitation of this approach is a severe increase in the number of parameters to optimize and in training time.

3.3. OxEndoNet Network

Motivated by the recent success of dilated convolution, we propose a new pyramid dilated module (PDM), which empirically proves to be a powerful feature extractor for the endoscopy artefact detection and segmentation task. As shown in Figure 1, we stack convolutions with different dilation rates in parallel. In this case, the PDM has four parallel convolutions with 3 × 3 filter size and dilation rates of 1, 2, 3 and 4. The activation function we used is the Rectified Linear Unit (ReLU) [13]. Each dilated convolution produces an output of the same dimensions. To form the final PDM, we concatenate the outputs of the dilated convolutions. By combining dilated convolutions with different dilation rates, the PDM is able to extract useful features for objects of various sizes. All the previous advantages play a remarkable role in medical image segmentation, because medical images often feature organs of different sizes.
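The following is a minimal PyTorch sketch of one PDM layer as described above and in Figure 1: four parallel 3 × 3 convolutions with dilation rates 1–4 and ReLU activations, whose outputs are concatenated. The padding choice and the absence of normalisation layers are our assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class PyramidDilatedModule(nn.Module):
    """Parallel 3x3 dilated convolutions (rates 1, 2, 3, 4) whose outputs are concatenated."""

    def __init__(self, in_channels, filters_per_branch, rates=(1, 2, 3, 4)):
        super().__init__()
        # Setting padding equal to the dilation rate keeps every branch at the same
        # spatial size, so the outputs can be concatenated along the channel axis.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, filters_per_branch, kernel_size=3,
                          dilation=rate, padding=rate),
                nn.ReLU(inplace=True),
            )
            for rate in rates
        ])

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: 16 filters per branch gives 4 * 16 = 64 output channels,
# matching the first PDM layer described in Section 3.3.
pdm = PyramidDilatedModule(in_channels=256, filters_per_branch=16)
out = pdm(torch.randn(1, 256, 32, 32))   # -> shape (1, 64, 32, 32)
```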
Given this PDM, we propose the OxEndoNet network illustrated in Figure 2. For each input image, we use ResNet-50 pretrained on ImageNet [14] as the base network to extract the feature map, followed by multiple PDM layers to form an end-to-end trainable network. By using several layers, we increase the receptive field size, which allows our model to use context information. In the final architecture, we use four PDM layers; each layer uses four parallel dilated convolutions with filter size of 3 × 3 and dilation rates of 1, 2, 3, and 4. We note that the number of PDM layers and the number of parallel dilated convolutions are hyperparameters. The PDM layers have 64, 128, 256, and 128 output channels, using 16, 32, 64 and 32 filters per branch respectively. We feed the final PDM layer to a convolution layer followed by a bilinear interpolation to up-scale the feature map to the original size of the image.

Fig. 2. OxEndoNet architecture. O and r × c × d refer to the output of each PDM layer and its dimensions respectively.

The architecture design followed two key observations. Firstly, recognizing organs in medical images requires a high spatial precision that is lost when applying pooling with striding layers; this is the main issue in FCN- and U-Net-based models. Secondly, complex and large variations in the size of objects in medical images lead to inaccurate predictions when a small or medium-sized receptive field fails to deal with such variations. Therefore, an accurate model should have a large receptive field to handle these complex variations of organs in images. Our OxEndoNet network produces a large receptive field to incorporate larger context without increasing the number of parameters or the amount of computation, while preserving full spatial resolution.
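Putting the pieces together, the sketch below reuses the PyramidDilatedModule class from the previous listing to assemble a network along the lines described: a ResNet-50 backbone, four stacked PDM layers with 64/128/256/128 output channels, and a final convolution followed by bilinear upsampling. The choice of backbone stage (the 2048-channel output of the last residual block) and the 1 × 1 classifier are our assumptions, since the paper does not give these details.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class OxEndoNet(nn.Module):
    """Sketch: ResNet-50 features -> four PDM layers -> 1x1 conv -> bilinear upsampling."""

    def __init__(self, num_classes):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # Keep everything up to the last residual stage; its output has 2048 channels.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Filters per branch of 16/32/64/32 give 64/128/256/128 output channels.
        self.pdm_layers = nn.Sequential(
            PyramidDilatedModule(2048, 16),
            PyramidDilatedModule(64, 32),
            PyramidDilatedModule(128, 64),
            PyramidDilatedModule(256, 32),
        )
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        input_size = x.shape[-2:]
        features = self.pdm_layers(self.backbone(x))
        logits = self.classifier(features)
        # Bilinear interpolation up-scales the prediction to the original image size.
        return F.interpolate(logits, size=input_size, mode="bilinear", align_corners=False)
```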
4. EXPERIMENTS AND RESULTS

We implemented OxEndoNet using the public framework PyTorch [15]. The number of PDM layers, the learning rate and the number of parallel dilated convolutions are the main hyperparameters that influenced our model's performance. During training, we used the Adam optimizer [16] with an initial learning rate of 3 × 10^{-3} and weight decay of 10^{-4}. Furthermore, we used the poly learning rate policy [17] by multiplying the initial rate with (1 − epoch/maxEpochs)^{0.9}, and trained the models for 300 epochs. For the number of PDM layers, we conducted experiments with 3, 4 and 5 layers. Concerning the number of parallel dilated convolutions, we ran experiments with 3, 4 and 5 parallel convolutions. It should also be noted that all the hyperparameters were selected based on performance on the validation data.

We tested the performance of our model on the released test data, named Test Data Phase 1, which consisted of 50% of the overall test data. In Phase 1 the test data contained 80 images, the results of which we submitted to the challenge. Table 1 shows the results of our model on this test data. The overall results will be specified after the workshop.

Model        Overlap   F2-score   score-s
OxEndoNet    0.4901    0.5107     0.5194

Table 1. Results of OxEndoNet on Phase 1 test data.
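The optimiser and poly learning-rate schedule described above can be written down roughly as follows. This is a sketch under stated assumptions (the OxEndoNet class from the earlier listing, a placeholder number of output classes, and a per-epoch schedule update), not the authors' training script.

```python
import torch

NUM_CLASSES = 5    # placeholder; set to the number of artefact classes in the challenge
model = OxEndoNet(num_classes=NUM_CLASSES)

base_lr, max_epochs = 3e-3, 300
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=1e-4)
# Poly policy [17]: lr = base_lr * (1 - epoch / max_epochs) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / max_epochs) ** 0.9)

for epoch in range(max_epochs):
    # ... one pass over the 80% training split goes here: forward pass,
    #     segmentation loss, loss.backward(), optimizer.step() per batch ...
    scheduler.step()
```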
5. DISCUSSION & CONCLUSION

We have described OxEndoNet, a neural network designed to tackle the challenging problem of Endoscopy Artefact Detection and Segmentation, where objects vary widely in scale and size. Its pyramid dilated module consists of parallel dilated convolutions whose concatenated outputs provide additional contextual information. The need for pooling and striding layers, considered a major drawback of other segmentation methods, is fully eliminated. Both PDM and OxEndoNet will be useful frameworks for the community to explore in other computer vision tasks. In the future, we plan to test our model on a wide variety of medical image volumes, as well as on generic semantic image segmentation tasks.

6. ACKNOWLEDGMENT

We would like to thank AfOx for the visiting fellowship which supported Mourad Gridach's visit to the Oxford Department of Computer Science, where this model was designed.

7. REFERENCES

[1] Scott Fernquest, Daniel Park, Marija Marcan, Antony Palmer, Irina Voiculescu, and Sion Glyn-Jones. Segmentation of hip cartilage in compositional magnetic resonance imaging: A fast, accurate, reproducible, and clinically viable semi-automated methodology. Journal of Orthopaedic Research, 36(8):2280–2287, 2018.

[2] Varduhi Yeghiazaryan and Irina D Voiculescu. Family of boundary overlap metrics for the evaluation of medical image segmentation. Journal of Medical Imaging, 5(1):015006, 2018.

[3] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[5] Hao Chen, Xiaojuan Qi, Lequan Yu, and Pheng-Ann Heng. DCAN: Deep contour-aware networks for accurate gland segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2487–2496, 2016.

[6] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[7] Sharib Ali, Felix Zhou, Adam Bailey, Barbara Braden, James East, Xin Lu, and Jens Rittscher. A deep learning framework for quality assessment and restoration in video endoscopy. arXiv preprint arXiv:1904.07073, 2019.

[8] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[9] Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.

[10] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[11] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.

[12] Zhengyang Wang and Shuiwang Ji. Smoothed dilated convolutions for improved dense prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2486–2495, 2018.

[13] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.