DEEP ENCODER-DECODER NETWORKS FOR ARTEFACTS SEGMENTATION IN ENDOSCOPY IMAGES

Yun Bo Guo, Qingshuo Zheng, Bogdan J. Matuszewski

Computer Vision and Machine Learning (CVML) Group
School of Engineering
University of Central Lancashire
{YBGuo1,QZheng5,BMatuszewski1}@uclan.ac.uk

ABSTRACT

Automated analysis of endoscopic images is becoming increasingly significant for the early detection of numerous cancers and for minimally invasive surgical procedures. The paper briefly describes the methodology adopted for the 2020 Endoscopy Artefact Detection and Segmentation (EAD2020) challenge¹. A number of novel variants of the DeepLab V3+ encoder-decoder architecture have been investigated, implemented and tested for the segmentation sub-challenge. Modifications were introduced to improve: selection of image features, segmentation of small objects, and use of the encoder output information. The proposed methods achieved competitive segmentation score results on both the release-I and release-II test datasets. For the detection sub-challenge, three off-the-shelf deep detection networks have been optimised and evaluated on the EAD data.

¹ It refers to the results submitted by the CVML team. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

Automated analysis of endoscopic images has obvious practical clinical importance. For example, colorectal cancer is one of the leading causes of death worldwide: in the United States it is the third largest cause of cancer deaths, whereas in Europe it is the second largest, with 243,000 deaths in 2018 [1]. Colonoscopy is the gold standard for colon screening, with the colon cancer survival rate strongly depending on early detection, i.e. a colonoscopy procedure.

Automation of the analysis of endoscopic images poses significant technical difficulties. As evident from the EAD challenge, the segmentation task is a very demanding problem, with multiple difficult-to-define semantic categories, possibly represented within the same or similar image locations, and structures of significantly different sizes. Additionally, some of these categories (e.g. "bubbles") are difficult to discriminate with respect to appearance and spatial distribution.

Segmentation is one of the key enabling technologies in medical image analysis, with a great variety of methods proposed [2, 3, 4]. More recently, methods based on deep learning have shown significant improvement in segmentation quality, also in the analysis of colonoscopy images [5, 6].

The key architectures used as the baseline for the segmentation methods developed for the EAD challenge are Dilated ResFCN [5], previously proposed by the authors, and the well-known DeepLab V3+ [7]. The summary of the changes made to these baseline architectures is briefly given in section 3. For completeness, the detection sub-task has also been investigated, with the YOLO V3 [8], Faster R-CNN [9] and Cascade R-CNN [10] methods used as the baseline and their design parameters optimised.

2. DATASETS

Only the data which have been made available as part of the EAD2020 challenge [11, 12] have been directly used for the reported methods' development. Some of the networks and/or sub-networks used in the designed architectures have been acquired from GitHub repositories². These are normally pre-trained on open generic image datasets, such as ImageNet or COCO. Apart from such cases, no data other than EAD2020 have been used for training, validation or testing of the developed architectures.

² //github.com{/ultralytics/yolov3, /open-mmlab/mmdetection, /hujie-frank/SENet}.

The original EAD2020 training images are augmented by rotation, colour jitter and elastic deformations. For the segmentation task, all the images have been scaled to 513×513 pixels, with two training data subsets created. The smaller training subset consists of 11,376 images, augmented from the phase-I training dataset. The networks trained on this smaller subset have been validated on the phase-II training dataset and online on the test datasets. This small training subset was predominantly used to quickly verify specific design choices made during the methods' development. The larger training subset consists of 38,195 images augmented from the phase-I and phase-II training datasets. That bigger training set was used to train the architectures which had been judged to provide competitive results when trained on the smaller dataset. The networks trained on the larger training subset were only evaluated online on the EAD2020 test datasets.

For the detection sub-problem, the images have been scaled to 667×400 pixels. As for the segmentation, two augmented training subsets were created. The smaller subset, with images augmented from the phase-I training dataset, consists of 8,800 images, whereas the large subset has 30,372 images augmented from the phase-I and phase-II training datasets.
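The augmentation settings are not specified beyond the transform types listed above; the following is a minimal sketch of how such an offline augmentation step could be implemented with torchvision, assuming paired image/mask samples and the 513×513 target size used for segmentation. The rotation range and jitter strengths are illustrative assumptions, not the authors' settings, and the elastic deformation is omitted (it requires applying an identical displacement field to both image and mask).

```python
# Minimal sketch of paired image/mask augmentation (rotation, colour jitter,
# resizing to 513x513).  Parameter values are illustrative assumptions.
import random
from PIL import Image
import torchvision.transforms.functional as TF
from torchvision.transforms import ColorJitter, InterpolationMode

def augment_pair(image: Image.Image, mask: Image.Image, size=(513, 513)):
    """Apply the same geometric transform to image and mask; photometric
    jitter is applied to the image only."""
    angle = random.uniform(-30.0, 30.0)   # assumed rotation range
    image = TF.rotate(image, angle, interpolation=InterpolationMode.BILINEAR)
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)

    jitter = ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)
    image = jitter(image)                  # colour jitter, image only

    image = TF.resize(image, size, interpolation=InterpolationMode.BILINEAR)
    mask = TF.resize(mask, size, interpolation=InterpolationMode.NEAREST)
    return image, mask
```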
3. METHODS

DeepLab V3+ [7] is an end-to-end trained semantic segmentation network, where a lower down-sampling rate and dilated convolutions are used to maintain the size of the feature maps, and an atrous spatial pyramid pooling (ASPP) module generates the final features based on multiple receptive fields. Finally, these features are up-sampled, and the classifier assigns a unique class label to each pixel.

A number of novel network architectures (here collectively named DeepEAD), based on the DeepLab V3+, have been proposed and validated for the EAD2020 segmentation challenge. In order to segment overlapping objects, the original multi-class classifier is replaced with 5 binary classifiers. Further changes lead to three network architectures:

• Network 1: The original DeepLab V3+ main sub-network is replaced by the SE-ResNeXt-50 [13]. It is expected to provide better image features, as it outperforms both the Xception and ResNet architectures (originally used by different implementations of the DeepLab V3+) on the image classification task.

• Network 2: Based on Network 1, with the global pooling removed from the ASPP and replaced with 3×3 convolutions. The corresponding receptive fields are expected to improve segmentation of small objects. Furthermore, the number of convolution kernels at each resolution is selected to emphasise small objects.

• Network 3: Shown in Fig. 1, it is based on Network 2, with a squeeze-and-excitation module added behind the ASPP module. This is to introduce attention gating at the output of the original encoder, to better utilise the information available in the computed feature maps.

Fig. 1. Flowchart of the Network 3 encoder architecture.
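The exact layer configuration of the modified encoder head is not given in the paper; the PyTorch sketch below is one plausible reading of the changes listed above: 3×3 dilated convolutions in place of the ASPP global-pooling branch (with the dilation rates 2, 4 and 6 motivated in the next paragraph), squeeze-and-excitation gating after the ASPP, and five independent sigmoid classifiers for the overlapping classes. Channel widths, the SE reduction ratio and the omission of the DeepLab V3+ decoder are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel gating (Hu et al. [13])."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)              # re-weight feature maps channel-wise

class ModifiedASPPHead(nn.Module):
    """ASPP without the global-pooling branch, followed by SE gating and
    five independent binary (sigmoid) classifiers for overlapping classes."""
    def __init__(self, in_ch=2048, mid_ch=256, num_classes=5, rates=(2, 4, 6)):
        super().__init__()
        branches = [nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                                  nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))]
        for r in rates:                    # 3x3 dilated convs replace global pooling
            branches.append(nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True)))
        self.branches = nn.ModuleList(branches)
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch * len(branches), mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.se = SEBlock(mid_ch)          # attention gating on the encoder output
        self.classifier = nn.Conv2d(mid_ch, num_classes, 1)   # 5 binary logits

    def forward(self, feats):
        x = torch.cat([b(feats) for b in self.branches], dim=1)
        x = self.se(self.project(x))
        return torch.sigmoid(self.classifier(x))   # per-class probability maps
```

The per-class probability maps produced by such a head would still be refined and up-sampled by the standard DeepLab V3+ decoder, which is left unchanged here.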
Following the methodology proposed in [14], Fig. 2 shows the number of active kernel weights of the dilated sub-networks. It can be seen that with a too high dilation rate the 3×3 kernel is effectively reduced to a 1×1 kernel. However, a too small dilation rate results in a small receptive field, having a negative effect on the network performance. The selected dilation rates of 2, 4 and 6 provide an effective compromise, with kernels having between 4 and 9 valid weights.

Fig. 2. The number of valid weights in the dilation kernels shown in Fig. 1.

Since the proposed networks do not have built-in rotation invariance, image rotation augmentation at test time has been investigated as a way of improving the segmentation accuracy. For this purpose, rotated versions of the test image are presented to the network and the corresponding outputs are averaged, to better utilise the generalisation properties of the network. The adopted test time augmentation process is illustrated in Fig. 3. The corresponding results, shown in section 4, demonstrate that the test time augmentation does indeed have a significant impact on the segmentation performance.

Fig. 3. Test time augmentation, with images on the left showing network outputs for the original image and its versions rotated in 30 degree intervals. The image on the right shows the result after augmentation, with the individual results superimposed in the original image reference frame.
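A minimal sketch of this rotation-based test time augmentation is given below, assuming a `model` that maps a 1×3×H×W tensor to per-class probability maps. The 30 degree step follows Fig. 3; simple averaging and the default handling of image corners during rotation are assumptions.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def predict_with_rotation_tta(model, image, step_deg=30):
    """Average per-class probability maps over rotated copies of the input.

    `image` is a 1xCxHxW tensor; each prediction is rotated back to the
    original reference frame before averaging (cf. Fig. 3).
    """
    angles = range(0, 360, step_deg)
    accumulated = None
    for angle in angles:
        rotated = TF.rotate(image, angle)      # rotate the input image
        probs = model(rotated)                 # per-class probability maps
        probs = TF.rotate(probs, -angle)       # map back to the original frame
        accumulated = probs if accumulated is None else accumulated + probs
    return accumulated / len(angles)
```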
4. RESULTS

This section reports on a sample of results obtained for the segmentation and detection methods described above. Table 1 shows a representative sample of the results obtained for the segmentation task on both the validation and release-I test datasets. The results obtained on the validation data (phase-II training data) are reported in the second column, with all the networks trained only on the augmented images from the phase-I training dataset. The results on the release-I test dataset are reported in the third column. The symbol "*" indicates that the result has been obtained for the network trained on the large training dataset (i.e. images augmented from the phase-I and phase-II training sets); otherwise the results have been obtained for the network trained on the small training dataset (i.e. images augmented from the phase-I training dataset only - see section 2 for more details). It could be concluded that the gradual improvement of the results on the validation data is replicated on the test data. As expected, the use of the large training set also improves performance. This can be seen from the results reported for Network 2, with a segmentation score of 0.50 for the network trained on the smaller training set, and a score of 0.59 for exactly the same network but trained on the large dataset. It seems that the segmentation score of 0.5934 (for Network 2 trained on the larger training set) was a competitive result on the release-I test dataset.

Table 1. The segmentation score results for various segmentation networks, obtained on the validation (second column) and release-I test (third column) data.

Method         sscore (validation)   sscore (test data)
DeepLab V3+    0.45                  0.40
Network 1      0.50                  0.48
Network 2      0.52                  0.50 / 0.59*
Network 3      0.54                  0.52

The best results obtained on the release-II test dataset, with all the networks trained on the larger training dataset, are reported in Table 2. As evident from the table, Network 3 provides the best segmentation results, with the test time augmentation improving the segmentation score by 0.0434, i.e. about 8%. The effects of the test time augmentation are shown in Fig. 4, demonstrating the impact of the augmentation on segmentation of the "instrument" class.

Table 2. Segmentation scores on the release-II test data.

Method                                 sscore (release-II test)
Network 2                              0.5406
Network 3                              0.5488
Network 3 (+ test time augmentation)   0.5922

Fig. 4. The result from Network 3 with (in red) and without (in blue) test time augmentation.

Various post-processing operations have also been tested, including hole filling and removal of objects overlapping the black image boundary. These, though, had a relatively small, and difficult to predict, effect on the segmentation score (a sketch of one possible implementation is given at the end of this section). The segmentation score for the final submission was reported as 0.5916, which was slightly lower than the best result of 0.5922 (see Table 2).

Table 3 shows results obtained for the different detection networks tested on the release-II test data. It can be observed that the R-CNN networks outperform the YOLO network, with the best detection score achieved by the Faster R-CNN. This is different from the results obtained on the release-I test set (not reported here), where the YOLO network achieved a better result. This, though, could possibly be explained by the optimisation of the networks' design parameters during the second phase of testing.

Table 3. Detection scores on the release-II test data.

Method           Detection score (release-II test data)
YOLO V3          0.1992
Faster R-CNN     0.2335
Cascade R-CNN    0.2162
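The post-processing operations mentioned above are only described qualitatively; the sketch below shows one plausible realisation of hole filling and removal of predicted components lying mostly on the dark border of the endoscopic frame, using scipy and scikit-image. The darkness threshold and the 50% overlap criterion are assumptions, not the authors' settings.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.measure import label, regionprops

def postprocess_mask(binary_mask, image_rgb, dark_thresh=10):
    """Fill holes in a binary class mask and drop connected components that
    lie mostly on the dark (black) border region of the endoscopic frame.

    `binary_mask` is HxW (bool/0-1) and `image_rgb` is the matching HxWx3
    uint8 frame.
    """
    mask = binary_fill_holes(binary_mask.astype(bool))

    # Approximate the black frame border as very dark pixels.
    border = image_rgb.max(axis=2) < dark_thresh

    labelled = label(mask)
    cleaned = np.zeros_like(mask)
    for region in regionprops(labelled):
        component = labelled == region.label
        # Keep the component unless most of it overlaps the dark border.
        if (component & border).sum() <= 0.5 * region.area:
            cleaned |= component
    return cleaned
```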
5. DISCUSSION & CONCLUSION

The paper describes novel segmentation networks, highlighting the key characteristics of the proposed deep architectures. The proposed methods achieved segmentation scores of 0.5934 on the release-I test data and 0.5922 on the release-II test data, which seem to be competitive. The overall detection performance also seems comparatively reasonable, with the best detection score of 0.2335 on the release-II test data. However, the statistical significance of these results would need to be investigated. Further improvements could be possible, e.g. with image aspect ratio augmentation to reflect the input format of the adopted networks, or with the use of the segmentation network as a pre-selection tool for detection of small objects (e.g. specularity artefacts).

6. REFERENCES

[1] J. Ferlay, M. Colombet, I. Soerjomataram, T. Dyba, G. Randi, M. Bettio, A. Gavin, O. Visser, and F. Bray. Cancer incidence and mortality patterns in Europe: Estimates for 40 countries and 25 major cancers in 2018. European Journal of Cancer, 103:356-387, 2018.

[2] Aymeric Histace, Bogdan J. Matuszewski, and Yan Zhang. Segmentation of myocardial boundaries in tagged cardiac MRI using active contours: A gradient-based approach integrating texture analysis. Int. J. Biomedical Imaging, 2009:983794:1-983794:8, 2009.

[3] Yan Zhang, Bogdan J. Matuszewski, Aymeric Histace, Frédéric Precioso, Judith Kilgallon, and Christopher J. Moore. Boundary delineation in prostate imaging using active contour segmentation method with interactively defined object regions. Volume 6367 of Lecture Notes in Computer Science, pages 131-142. Springer, 2010.

[4] Yan Zhang, Bogdan J. Matuszewski, Aymeric Histace, and Frédéric Precioso. Statistical model of shape moments with active contour evolution for shape detection and segmentation. Journal of Mathematical Imaging and Vision, 47(1-2):35-47, 2013.

[5] Yun Bo Guo and Bogdan J. Matuszewski. GIANA polyp segmentation with fully convolutional dilation neural networks. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2019, Volume 4: VISAPP, Prague, Czech Republic, February 25-27, 2019, pages 632-641. SciTePress, 2019.

[6] Yun Bo Guo and Bogdan J. Matuszewski. Polyp segmentation with fully convolutional deep dilation neural network. In Medical Image Understanding and Analysis - 23rd Conference, MIUA 2019, Liverpool, UK, July 24-26, 2019, Proceedings, volume 1065 of Communications in Computer and Information Science, pages 377-388. Springer, 2019.

[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801-818, 2018.

[8] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.

[9] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.

[10] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. CoRR, abs/1906.09756, 2019.

[11] Sharib Ali, Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li, Maxime Kayser, Roger D. Soberanis-Mukul, Shadi Albarqouni, Xiaokang Wang, Chunqing Wang, Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W. Gao, Stefano Realdon, Maxim Loshchenov, Julia A. Schnabel, James E. East, Georges Wagnieres, Victor B. Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and Jens Rittscher. An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy. Scientific Reports, 10, 2020.

[12] Sharib Ali, Felix Zhou, Christian Daul, Barbara Braden, Adam Bailey, Stefano Realdon, James East, Georges Wagnieres, Victor Loschenov, Enrico Grisan, et al. Endoscopy artifact detection (EAD 2019) challenge dataset. arXiv preprint arXiv:1905.03209, 2019.

[13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.

[14] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.