Automatic Polyp Segmentation via Parallel Reverse Attention Network
Ge-Peng Ji¹,², Deng-Ping Fan¹,*, Tao Zhou¹, Geng Chen¹, Huazhu Fu¹, Ling Shao¹
¹ Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE.
² School of Computer Science, Wuhan University, Hubei, China.

https://github.com/GewelsJI/MediaEval2020-IIAI-Med

ABSTRACT
In this paper, we present a novel deep neural network, termed Parallel Reverse
Attention Network (PraNet), for the task of automatic polyp segmentation at
MediaEval 2020. Specifically, we first aggregate the features in high-level layers
using a parallel partial decoder (PPD). Based on the combined feature, we then
generate a global map as the initial guidance area for the following components.
In addition, we mine the boundary cues using the reverse attention (RA) module,
which is able to establish the relationship between areas and boundary cues.
Thanks to the recurrent cooperation mechanism between areas and boundaries, our
PraNet is capable of calibrating misaligned predictions, improving the
segmentation accuracy and achieving real-time efficiency (∼30fps) on a single
NVIDIA GeForce GTX 1080 GPU.

[Figure 1 diagram: backbone stages Conv1 to Conv5 yield low-level features f1, f2
and high-level features f3, f4, f5; a partial decoder (PD) with a paralleled
connection produces the global map S_g; three reverse attention (RA) modules
produce R3, R4, R5 and the side-outputs S3, S4, S5 under deep supervision.]
Figure 1: Pipeline of our PraNet, which consists of three reverse attention (RA)
modules with a parallel partial decoder (PPD) connection. Please refer to § 2 for
more details.

1    INTRODUCTION
Aiming at developing computer-aided diagnosis systems for automatic polyp
segmentation, and detecting all types of polyps (i.e., irregular polyps, smaller
or flat polyps) with high efficiency and accuracy, the Medico Automatic Polyp
Segmentation Challenge 2020¹ [10] benchmarks semantic segmentation methods for
segmenting polyp regions in colonoscopy images on a publicly available dataset,
emphasizing robustness, speed, and generalization. Following the protocols of
this challenge, we participate in the two required sub-tasks: (i) the polyp
segmentation task and (ii) the algorithm efficiency task. Further task
descriptions can be found in the challenge guidelines.
   Recent years have witnessed promising progress in addressing the task of
automatic polyp segmentation using traditional [12, 16] and deep learning based
[1, 2, 7, 13, 20, 21] methods. However, there are three core challenges in this
field: (a) polyps often vary in appearance, e.g., size, color and texture, even
if they are of the same type; (b) in colonoscopy images, the boundary between a
polyp and its surrounding mucosa is usually blurred and lacks the intense
contrast required for segmentation approaches; (c) the practical applications of
existing algorithms are hindered by their low performance and efficiency.
   Based on these observations, we develop a real-time and accurate framework,
termed Parallel Reverse Attention Network (PraNet²), for the automatic polyp
segmentation task. As can be seen from Fig. 1, PraNet utilizes a parallel partial
decoder (see § 2.1) to generate a high-level semantic global map and a set of
reverse attention modules (see § 2.2) for accurate polyp segmentation from the
colonoscopy images. All components and implementation details are elaborated
below.

¹ https://multimediaeval.github.io/editions/2020/tasks/medico/
² This work is based on our paper [5] published at MICCAI-2020.
*Corresponding Author: Deng-Ping Fan (Email: dengpfan@gmail.com)
Work was done while Ge-Peng Ji was an intern mentored by Deng-Ping Fan.
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, December 14-15 2020, Online

2 APPROACH
2.1 Parallel Partial Decoder (PPD)
Current popular medical image segmentation networks usually rely on a U-Net [15]
or a U-Net shaped network (e.g., U-Net++ [26], ResUNet [25], etc). These models
are essentially encoder-decoder frameworks, which typically aggregate all
multi-level features extracted from convolutional neural networks (CNNs). As
demonstrated by Wu et al. [19], compared with high-level features, low-level
features demand more computational resources due to their larger spatial
resolutions, but contribute less to performance. Motivated by this observation,
we propose to aggregate high-level features with a parallel partial decoder
component. More specifically, for an input polyp image $I$ with size $h \times w$,
five levels of features $\{f_i, i = 1, ..., 5\}$ with resolution
$[h/2^{i-1}, w/2^{i-1}]$ can be extracted from a Res2Net-based [8] backbone
network. Then, we divide the $f_i$ features into low-level features
$\{f_i, i = 1, 2\}$ and high-level features $\{f_i, i = 3, 4, 5\}$. We introduce
the partial decoder $pd(\cdot)$ [19], a new state-of-the-art (SOTA) decoder
component, to aggregate the high-level features with a paralleled connection. As
shown in Fig. 1, the partial decoder feature is computed by
$PD = pd(f_3, f_4, f_5)$, and we can obtain a global map $S_g$.
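
As a small worked example of the resolution formula in § 2.1, the sketch below lists the spatial size of each feature level and splits the levels into the low-level and high-level groups. It only mirrors the formula as stated in the text; the function names are hypothetical, integer division is an assumption, and the released implementation may produce different sizes.

```python
def feature_resolutions(h, w, levels=5):
    """Spatial size of f_i for i = 1..levels, using h / 2**(i-1) as stated in the text.

    Integer division is an assumption made for this illustration.
    """
    return {i: (h // 2 ** (i - 1), w // 2 ** (i - 1)) for i in range(1, levels + 1)}


def split_levels(resolutions, first_high_level=3):
    """Split levels into low-level (i < 3) and high-level (i >= 3) groups."""
    low = {i: r for i, r in resolutions.items() if i < first_high_level}
    high = {i: r for i, r in resolutions.items() if i >= first_high_level}
    return low, high


def spatial_positions(group):
    """Total number of spatial positions over a group of feature maps."""
    return sum(hh * ww for hh, ww in group.values())


# Example with the 352 x 352 input size used in the experiments.
res = feature_resolutions(352, 352)
low, high = split_levels(res)
ratio = spatial_positions(low) / spatial_positions(high)
```

For a 352 × 352 input this gives 352, 176, 88, 44 and 22 pixels per side for levels 1 to 5, so under this formula (and ignoring channel counts) the two low-level maps together cover roughly 15 times as many spatial positions as the three high-level maps that the partial decoder aggregates.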
2.2      Reverse Attention (RA)
In a clinical setting, doctors first roughly locate the polyp region, and then
carefully inspect local tissues to accurately label the polyp. As discussed in
§ 2.1, our global map $S_g$ is derived from the deepest CNN layer, which can only
capture a relatively rough location of the polyp tissues, without structural
details (see Fig. 1). To address this issue, we propose a principled strategy to
progressively mine discriminative polyp regions by erasing foreground objects
[3, 18]. Instead of aggregating features from all levels as in [9, 22, 23], we
propose to adaptively learn the reverse attention in three parallel high-level
features. In other words, our architecture can sequentially mine complementary
regions and details by erasing the existing estimated polyp regions from
high-level side-output features, where the existing estimation is up-sampled from
the deeper layer.
    Specifically, we obtain the output reverse attention features $R_i$ by
multiplying (element-wise $\odot$) the high-level side-output feature
$\{f_i, i = 3, 4, 5\}$ by a reverse attention weight $A_i$, as follows:
    $$R_i = f_i \odot A_i. \tag{1}$$
The reverse attention weight $A_i$ is a de-facto choice for salient object
detection in the computer vision community [3, 24], and can be formulated as:
    $$A_i = \ominus(\sigma(\mathcal{P}(S_{i+1}))), \tag{2}$$
where $\mathcal{P}(\cdot)$ denotes an up-sampling operation, $\sigma(\cdot)$ is
the Sigmoid function, and $\ominus(\cdot)$ is a reverse operation subtracting the
input from matrix $E$, in which all the elements are 1. Fig. 1 (RA) shows the
details of this process. It is worth noting that the erasing strategy driven by
the reverse attention can eventually refine the imprecise and coarse estimation
into an accurate and complete prediction map.

3 EXPERIMENTS
3.1 Learning Strategies
We use a hybrid loss function in the training process, which is defined as
$\mathcal{L} = \mathcal{L}^{w}_{IoU} + \mathcal{L}^{w}_{BCE}$, where
$\mathcal{L}^{w}_{IoU}$ and $\mathcal{L}^{w}_{BCE}$ represent the weighted
Intersection over Union (IoU) loss and binary cross entropy (BCE) loss for the
global restriction and local (pixel-level) restriction, respectively. The
definitions of these losses are the same as in [14, 17] and their effectiveness
has been validated in the field of salient object detection. Here, we adopt deep
supervision for the three side-outputs (i.e., $S_3$, $S_4$, and $S_5$) and the
global map $S_g$. Each map is up-sampled (e.g., $S_3^{up}$) to the same size as
the ground-truth map $G$. Thus the total loss for the proposed PraNet can be
formulated as:
    $$\mathcal{L}_{total} = \mathcal{L}(G, S_g^{up}) + \sum_{i=3}^{5} \mathcal{L}(G, S_i^{up}).$$

3.2      Evaluation Metrics
We employ the metrics widely used in the medical segmentation field, including
mean IoU (mIoU or Jaccard index), Dice coefficient, recall, precision, accuracy
and frames per second (FPS) for a comprehensive evaluation.

3.3      Implementation Details and Datasets
We randomly split the whole Kvasir-SEG³ [11] into a training set (900 images) and
a validation set (100 images). Note that we do not use any extra data in this
challenge. All the inputs of our model are uniformly resized to 352 × 352 and we
augment all the training images using multiple strategies, including random
horizontal flipping, rotating, color enhancement and border cropping. Parameters
of the Res2Net-50 [8] backbone are initialized from the model pre-trained on
ImageNet [4]. Other parameters are initialized using the default PyTorch
settings. The Adam algorithm is used to optimize our model, and training is
accelerated by an NVIDIA TITAN RTX GPU. We set the initial learning rate to 1e-4
and divide it by 10 every 50 epochs. It takes about 40 minutes to train the model
with a mini-batch size of 26 over 100 epochs. Our final prediction map $S_p$ is
generated by $S_3$ after a Sigmoid function. The testing dataset consists of 160
polyp images provided by the organisers. Our code can be found at
https://github.com/GewelsJI/MediaEval2020-IIAI-Med.

³ https://datasets.simula.no/kvasir-seg/

Table 1: Quantitative results for both the polyp segmentation task (task 1) and
the algorithm efficiency task (task 2) of the Medico Automatic Polyp Segmentation
Challenge 2020, measured on a single NVIDIA GeForce GTX 1080 GPU.

Team Name   Jaccard   DSC     Recall   Precision   Accuracy   F2      FPS
IIAI-Med    0.761     0.839   0.830    0.901       0.960      0.828   29.87

3.4     Results and Analysis
Without any bells and whistles, such as extra training data or a model ensemble,
we introduce a new training scheme for addressing the polyp segmentation
challenge based on the previous work [5]. For more hyper-parameters and data
augmentation settings, refer to § 3.3. In the submission phase, we train our
model on the Kvasir-SEG dataset [11] and submit the inference results only once.
Tab. 1 reports the quantitative results of our approach on sub-task 1, which
achieves high performance (Precision=0.901 and Accuracy=0.960 on the test set).
Meanwhile, PraNet also runs at ∼30fps on a single NVIDIA GeForce GTX 1080 GPU,
demonstrating its simplicity and effectiveness. As a robust, general, and
real-time framework, PraNet can help facilitate future academic research and
computer-aided diagnosis for automatic polyp segmentation.

4     CONCLUSION
Automatic polyp segmentation is a challenging problem because of the diversity in
the appearance of polyps, complex, similar-looking environments, and the
requirement for high accuracy and inference speed. We have presented a novel
architecture, PraNet, for automatically segmenting polyps from colonoscopy
images. PraNet efficiently integrates a cascaded mechanism and a reverse
attention module with a parallel connection, which can be trained in an
end-to-end manner. Another advantage is that PraNet is universal and flexible,
meaning that more effective modules can be added to further improve the accuracy.
We hope this study will offer the community an opportunity to explore more
powerful models for related topics such as lung infection segmentation [6], or
even on upstream tasks such as video-based understanding.
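
As an illustration of the reverse attention in § 2.2 (Eqs. 1 and 2), the following minimal sketch computes the weight A_i from a deeper prediction map and applies it to a feature map. It is not the released implementation: it uses plain Python lists for a single-channel feature, nearest-neighbour up-sampling with a factor of 2 stands in for the unspecified up-sampling operation, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, the sigma(.) of Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(m, factor=2):
    """Nearest-neighbour up-sampling of a 2-D map (list of rows).

    Assumption: stands in for the up-sampling P(.), whose type the text does not fix.
    """
    out = []
    for row in m:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def reverse_attention_weight(s_deeper, factor=2):
    """Eq. (2): A_i = E - sigmoid(P(S_{i+1})), with E the all-ones matrix."""
    up = upsample_nearest(s_deeper, factor)
    return [[1.0 - sigmoid(v) for v in row] for row in up]


def reverse_attention(f_i, s_deeper, factor=2):
    """Eq. (1): R_i = f_i (element-wise product) A_i, for one feature channel."""
    a_i = reverse_attention_weight(s_deeper, factor)
    return [[f * a for f, a in zip(f_row, a_row)] for f_row, a_row in zip(f_i, a_i)]


# Toy example: the deeper map is confident about foreground on the left (logit 6.0)
# and undecided on the right (logit 0.0).
s_deeper = [[6.0, 0.0]]
f_i = [[1.0, 1.0, 1.0, 1.0],
       [1.0, 1.0, 1.0, 1.0]]
r_i = reverse_attention(f_i, s_deeper)
```

In this toy case the left half of `r_i` is suppressed to nearly zero, because the deeper estimate already marks it as polyp, while the undecided right half keeps half of its response. This is the erasing behaviour described in the text: regions already explained by the deeper prediction are removed so that the current level can focus on the remaining regions and boundary details.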

REFERENCES
 [1] Mojtaba Akbari, Majid Mohrekesh, Ebrahim Nasr-Esfahani, SM Reza Soroushmehr,
     Nader Karimi, Shadrokh Samavi, and Kayvan Najarian. 2018. Polyp segmentation
     in colonoscopy images using fully convolutional network. In IEEE EMBC. 69–72.
 [2] Patrick Brandao, Evangelos Mazomenos, Gastone Ciuti, Renato Caliò, Federico
     Bianchi, Arianna Menciassi, Paolo Dario, Anastasios Koulaouzidis, Alberto
     Arezzo, and Danail Stoyanov. 2017. Fully convolutional neural networks for
     polyp segmentation in colonoscopy. In Medical Imaging 2017: Computer-Aided
     Diagnosis, Vol. 10134. 101340F.
 [3] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. 2018. Reverse attention for
     salient object detection. In ECCV. 234–250.
 [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009.
     Imagenet: A large-scale hierarchical image database. In IEEE CVPR. 248–255.
 [5] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and
     Ling Shao. 2020. PraNet: Parallel Reverse Attention Network for Polyp
     Segmentation. MICCAI (2020).
 [6] Deng-Ping Fan, Tao Zhou, Ge-Peng Ji, Yi Zhou, Geng Chen, Huazhu Fu, Jianbing
     Shen, and Ling Shao. 2020. Inf-Net: Automatic COVID-19 Lung Infection
     Segmentation from CT Images. IEEE TMI (2020).
 [7] Yuqi Fang, Cheng Chen, Yixuan Yuan, and Kai-yu Tong. 2019. Selective feature
     aggregation network with area-boundary constraints for polyp segmentation. In
     MICCAI. Springer, 302–310.
 [8] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and
     Philip Torr. 2020. Res2Net: A New Multi-scale Backbone Architecture. IEEE
     TPAMI (2020), 1–1. https://doi.org/10.1109/TPAMI.2019.2938758
 [9] Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao,
     Tianyang Zhang, Shenghua Gao, and Jiang Liu. 2019. CE-Net: Context encoder
     network for 2d medical image segmentation. IEEE TMI 38, 10 (2019), 2281–2292.
[10] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard D. Johansen, Dag
     Johansen, Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico
     Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of
     MediaEval 2020 CEUR Workshop.
[11] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange,
     Dag Johansen, and Håvard D Johansen. 2020. Kvasir-SEG: A Segmented Polyp
     Dataset. In MMM. 451–462.
[12] Alexander V Mamonov, Isabel N Figueiredo, Pedro N Figueiredo, and Yen-Hsi
     Richard Tsai. 2014. Automated polyp detection in colon capsule endoscopy. IEEE
     TMI 33, 7 (2014), 1488–1502.
[13] Balamurali Murugesan, Kaushik Sarveswaran, Sharath M Shankaranarayana, Keerthi
     Ram, Jayaraj Joseph, and Mohanasankar Sivaprakasam. 2019. Psi-Net: Shape and
     boundary aware joint multi-task deep network for medical image segmentation.
     In IEEE EMBC. 7223–7226.
[14] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin
     Jagersand. 2019. Basnet: Boundary-aware salient object detection. In IEEE
     CVPR. 7479–7489.
[15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional
     networks for biomedical image segmentation. In MICCAI. Springer, 234–241.
[16] Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang. 2015. Automated polyp
     detection in colonoscopy videos using shape and context information. IEEE TMI
     35, 2 (2015), 630–644.
[17] Jun Wei, Shuhui Wang, and Qingming Huang. 2020. F3Net: Fusion, Feedback and
     Focus for Salient Object Detection. In AAAI.
[18] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and
     Shuicheng Yan. 2017. Object region mining with adversarial erasing: A simple
     classification to semantic segmentation approach. In IEEE CVPR. 1568–1576.
[19] Zhe Wu, Li Su, and Qingming Huang. 2019. Cascaded partial decoder for fast and
     accurate salient object detection. In IEEE CVPR. 3907–3916.
[20] Lequan Yu, Hao Chen, Qi Dou, Jing Qin, and Pheng Ann Heng. 2016. Integrating
     online and offline three-dimensional deep learning for automated polyp
     detection in colonoscopy videos. IEEE JBHI 21, 1 (2016), 65–75.
[21] Ruikai Zhang, Yali Zheng, Carmen CY Poon, Dinggang Shen, and James YW Lau.
     2018. Polyp detection during colonoscopy using a regression-based
     convolutional neural network with a tracker. Pattern Recognition 83 (2018),
     209–219.
[22] Shihao Zhang, Huazhu Fu, Yuguang Yan, Yubing Zhang, Qingyao Wu, Ming Yang,
     Mingkui Tan, and Yanwu Xu. 2019. Attention Guided Network for Retinal Image
     Segmentation. In MICCAI. 797–805.
[23] Zhijie Zhang, Huazhu Fu, Hang Dai, Jianbing Shen, Yanwei Pang, and Ling Shao.
     2019. ET-Net: A generic edge-attention guidance network for medical image
     segmentation. In MICCAI. Springer, 442–450.
[24] Zhao Zhang, Zheng Lin, Jun Xu, Wenda Jin, Shao-Ping Lu, and Deng-Ping Fan.
     2020. Bilateral attention network for rgb-d salient object detection. arXiv
     preprint arXiv:2004.14582 (2020).
[25] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang. 2018. Road extraction by deep
     residual u-net. IEEE Geoscience and Remote Sensing Letters 15, 5 (2018),
     749–753.
[26] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang.
     2019. Unet++: A nested u-net architecture for medical image segmentation. IEEE
     TMI (2019), 3–11.