Automatic Polyp Segmentation via Parallel Reverse Attention Network

Ge-Peng Ji1,2, Deng-Ping Fan1,*, Tao Zhou1, Geng Chen1, Huazhu Fu1, Ling Shao1
1 Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE.
2 School of Computer Science, Wuhan University, Hubei, China.
https://github.com/GewelsJI/MediaEval2020-IIAI-Med

ABSTRACT
In this paper, we present a novel deep neural network, termed Parallel Reverse Attention Network (PraNet), for the task of automatic polyp segmentation at MediaEval 2020. Specifically, we first aggregate the features in high-level layers using a parallel partial decoder (PPD). Based on the combined feature, we then generate a global map as the initial guidance area for the following components. In addition, we mine the boundary cues using the reverse attention (RA) module, which is able to establish the relationship between areas and boundary cues. Thanks to the recurrent cooperation mechanism between areas and boundaries, our PraNet is capable of calibrating misaligned predictions, improving the segmentation accuracy and achieving real-time efficiency (∼30fps) on a single NVIDIA GeForce GTX 1080 GPU.

Figure 1: Pipeline of our PraNet, which consists of three reverse attention (RA) modules with a parallel partial decoder (PPD) connection. Please refer to § 2 for more details.

1 INTRODUCTION
Aiming at developing computer-aided diagnosis systems for automatic polyp segmentation, and detecting all types of polyps (i.e., irregular polyps, smaller or flat polyps) with high efficiency and accuracy, the Medico Automatic Polyp Segmentation Challenge 2020¹ [10] benchmarks semantic segmentation methods for segmenting polyp regions in colonoscopy images on a publicly available dataset, emphasizing robustness, speed, and generalization. Following the protocols of this challenge, we participate in the two required sub-tasks: (i) the polyp segmentation task and (ii) the algorithm efficiency task. For more detailed task descriptions, refer to the challenge guidelines.

Recent years have witnessed promising progress in addressing the task of automatic polyp segmentation using traditional [12, 16] and deep learning based [1, 2, 7, 13, 20, 21] methods. However, there are three core challenges in this field: (a) polyps often vary in appearance, e.g., size, color and texture, even if they are of the same type; (b) in colonoscopy images, the boundary between a polyp and its surrounding mucosa is usually blurred and lacks the intense contrast required for segmentation approaches; (c) the practical applications of existing algorithms are hindered by their low performance and efficiency.

Based on these observations, we develop a real-time and accurate framework, termed Parallel Reverse Attention Network (PraNet²), for the automatic polyp segmentation task. As can be seen from Fig. 1, PraNet utilizes a parallel partial decoder (see § 2.1) to generate a high-level semantic global map and a set of reverse attention modules (see § 2.2) for accurate polyp segmentation from the colonoscopy images. All components and implementation details are elaborated below.

1 https://multimediaeval.github.io/editions/2020/tasks/medico/
2 This work is based on our paper [5] published at MICCAI-2020.
*Corresponding Author: Deng-Ping Fan (Email: dengpfan@gmail.com)
Work was done while Ge-Peng Ji was an intern mentored by Deng-Ping Fan.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'20, December 14-15 2020, Online

2 APPROACH

2.1 Parallel Partial Decoder (PPD)

Current popular medical image segmentation networks usually rely on a U-Net [15] or a U-Net shaped network (e.g., U-Net++ [26], ResUNet [25], etc.). These models are essentially encoder-decoder frameworks, which typically aggregate all multi-level features extracted from convolutional neural networks (CNNs). As demonstrated by Wu et al. [19], compared with high-level features, low-level features demand more computational resources due to their larger spatial resolutions, but contribute less to performance. Motivated by this observation, we propose to aggregate high-level features with a parallel partial decoder component. More specifically, for an input polyp image $I$ with size $h \times w$, five levels of features $\{f_i, i = 1, ..., 5\}$ with resolution $[h/2^{i-1}, w/2^{i-1}]$ can be extracted from a Res2Net-based [8] backbone network. Then, we divide the $f_i$ features into low-level features $\{f_i, i = 1, 2\}$ and high-level features $\{f_i, i = 3, 4, 5\}$. We introduce the partial decoder $p_d(\cdot)$ [19], a new state-of-the-art (SOTA) decoder component, to aggregate the high-level features with a paralleled connection. As shown in Fig. 1, the partial decoder feature is computed by $PD = p_d(f_3, f_4, f_5)$, from which we obtain a global map $S_g$.
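To make the cost argument of § 2.1 concrete, the following minimal sketch tabulates the spatial size of each feature level using the resolution formula quoted in the text, and compares how many spatial positions the low-level and high-level groups contain. It is an illustration only: the helper names `feature_resolutions` and `split_levels` are hypothetical, the 352 × 352 input size is taken from § 3.3, channel counts are ignored, and the strides of an actual Res2Net backbone may differ from the formula.

```python
def feature_resolutions(h, w, levels=5):
    """Spatial size of level i, following the text: (h / 2**(i-1), w / 2**(i-1))."""
    return {i: (h // 2 ** (i - 1), w // 2 ** (i - 1)) for i in range(1, levels + 1)}


def split_levels(resolutions, first_high=3):
    """Split levels into low-level (i < first_high) and high-level (i >= first_high)."""
    low = {i: r for i, r in resolutions.items() if i < first_high}
    high = {i: r for i, r in resolutions.items() if i >= first_high}
    return low, high


# Input size assumed from the implementation details (352 x 352).
res = feature_resolutions(352, 352)
low, high = split_levels(res)

# Number of spatial positions per group (channels are deliberately ignored).
low_positions = sum(h * w for h, w in low.values())
high_positions = sum(h * w for h, w in high.values())

for level, (fh, fw) in res.items():
    group = "low" if level in low else "high"
    print(f"f{level}: {fh} x {fw} ({group}-level)")
print("low/high position ratio:", round(low_positions / high_positions, 2))
```

Under these assumptions the two low-level maps together hold roughly fifteen times as many spatial positions as the three high-level maps, which is the intuition behind restricting the decoder to $f_3$, $f_4$ and $f_5$.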
2.2 Reverse Attention (RA)

In a clinical setting, doctors first roughly locate the polyp region, and then carefully inspect local tissues to accurately label the polyp. As discussed in § 2.1, our global map $S_g$ is derived from the deepest CNN layer, which can only capture a relatively rough location of the polyp tissues, without structural details (see Fig. 1). To address this issue, we propose a principled strategy to progressively mine discriminative polyp regions by erasing foreground objects [3, 18]. Instead of aggregating features from all levels as in [9, 22, 23], we propose to adaptively learn the reverse attention in three parallel high-level features. In other words, our architecture can sequentially mine complementary regions and details by erasing the existing estimated polyp regions from high-level side-output features, where the existing estimation is up-sampled from the deeper layer.

Specifically, we obtain the output reverse attention features $R_i$ by multiplying (element-wise $\odot$) the high-level side-output features $\{f_i, i = 3, 4, 5\}$ by a reverse attention weight $A_i$, as follows:

$$R_i = f_i \odot A_i. \tag{1}$$

The reverse attention weight $A_i$ is the de facto choice for salient object detection in the computer vision community [3, 24], and can be formulated as:

$$A_i = \ominus(\sigma(\mathcal{P}(S_{i+1}))), \tag{2}$$

where $\mathcal{P}(\cdot)$ denotes an up-sampling operation, $\sigma(\cdot)$ is the Sigmoid function, and $\ominus(\cdot)$ is a reverse operation subtracting the input from matrix $E$, in which all the elements are 1. Fig. 1 (RA) shows the details of this process. It is worth noting that the erasing strategy driven by the reverse attention can eventually refine the imprecise and coarse estimation into an accurate and complete prediction map.

3 EXPERIMENTS

3.1 Learning Strategies

We use a hybrid loss function in the training process, which is defined as $\mathcal{L} = \mathcal{L}^{w}_{IoU} + \mathcal{L}^{w}_{BCE}$, where $\mathcal{L}^{w}_{IoU}$ and $\mathcal{L}^{w}_{BCE}$ represent the weighted Intersection over Union (IoU) loss and binary cross entropy (BCE) loss for the global restriction and local (pixel-level) restriction, respectively. The definitions of these losses are the same as in [14, 17] and their effectiveness has been validated in the field of salient object detection. Here, we adopt deep supervision for the three side-outputs (i.e., $S_3$, $S_4$, and $S_5$) and the global map $S_g$. Each map is up-sampled (e.g., $S_3^{up}$) to the same size as the ground-truth map $G$. Thus the total loss for the proposed PraNet can be formulated as:

$$\mathcal{L}_{total} = \mathcal{L}(G, S_g^{up}) + \sum_{i=3}^{5} \mathcal{L}(G, S_i^{up}).$$

3.2 Evaluation Metrics

We employ the metrics widely used in the medical segmentation field, including mean IoU (mIoU or Jaccard index), Dice coefficient, recall, precision, accuracy and frames per second (FPS) for a comprehensive evaluation.

3.3 Implementation Details and Datasets

We randomly split the whole Kvasir-SEG³ [11] dataset into a training set (900 images) and a validation set (100 images). Note that we do not use any extra data in this challenge. All the inputs of our model are uniformly resized to 352 × 352 and we augment all the training images using multiple strategies, including random horizontal flipping, rotating, color enhancement and border cropping. Parameters of the Res2Net-50 [8] backbone are initialized from the model pre-trained on ImageNet [4]. Other parameters are initialized using the default PyTorch settings. The Adam algorithm is used to optimize our model, and training is accelerated by an NVIDIA TITAN RTX GPU. We set the initial learning rate to 1e-4 and divide it by 10 every 50 epochs. It takes about 40 minutes to train the model with a mini-batch size of 26 over 100 epochs. Our final prediction map $S_p$ is generated by $S_3$ after a Sigmoid function. The testing dataset consists of 160 polyp images provided by the organisers. Our code can be found at https://github.com/GewelsJI/MediaEval2020-IIAI-Med.

3 https://datasets.simula.no/kvasir-seg/

3.4 Results and Analysis

Without any bells and whistles, such as extra training data or a model ensemble, we introduce a new training scheme for addressing the polyp segmentation challenge based on the previous work [5]. For the hyper-parameters and data augmentation settings, refer to § 3.3. In the submission phase, we train our model on the Kvasir-SEG dataset [11] and submit the inference results only once. Tab. 1 reports the quantitative results of our approach on sub-task 1, which achieves high performance (Precision=0.901 and Accuracy=0.960 on the test set). Meanwhile, PraNet also runs at ∼30fps on a single NVIDIA GeForce GTX 1080 GPU, demonstrating its simplicity and effectiveness. As a robust, general, and real-time framework, PraNet can help facilitate future academic research and computer-aided diagnosis for automatic polyp segmentation.

Table 1: Quantitative results for both the polyp segmentation (task 1) and algorithm efficiency (task 2) on a single NVIDIA GeForce GTX 1080 GPU of Medico Automatic Polyp Segmentation Challenge 2020.

Team Name   Jaccard   DSC     Recall   Precision   Accuracy   F2      FPS
IIAI-Med    0.761     0.839   0.830    0.901       0.960      0.828   29.87

4 CONCLUSION

Automatic polyp segmentation is a challenging problem because of the diversity in the appearance of polyps, their visually similar surroundings, and the need for both high accuracy and high inference speed. We have presented a novel architecture, PraNet, for automatically segmenting polyps from colonoscopy images. PraNet efficiently integrates a cascaded mechanism and a reverse attention module with a parallel connection, which can be trained in an end-to-end manner. Another advantage is that PraNet is universal and flexible, meaning that more effective modules can be added to further improve the accuracy. We hope this study will offer the community an opportunity to explore more powerful models for related topics such as lung infection segmentation [6], or even upstream tasks such as video-based understanding.

REFERENCES

[1] Mojtaba Akbari, Majid Mohrekesh, Ebrahim Nasr-Esfahani, SM Reza Soroushmehr, Nader Karimi, Shadrokh Samavi, and Kayvan Najarian. 2018. Polyp segmentation in colonoscopy images using fully convolutional network. In IEEE EMBC. 69–72.
[2] Patrick Brandao, Evangelos Mazomenos, Gastone Ciuti, Renato Caliò, Federico Bianchi, Arianna Menciassi, Paolo Dario, Anastasios Koulaouzidis, Alberto Arezzo, and Danail Stoyanov. 2017. Fully convolutional neural networks for polyp segmentation in colonoscopy. In Medical Imaging 2017: Computer-Aided Diagnosis, Vol. 10134. 101340F.
[3] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. 2018. Reverse attention for salient object detection. In ECCV. 234–250.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In IEEE CVPR. 248–255.
[5] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. 2020. PraNet: Parallel Reverse Attention Network for Polyp Segmentation. MICCAI (2020).
[6] Deng-Ping Fan, Tao Zhou, Ge-Peng Ji, Yi Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. 2020. Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Images. IEEE TMI (2020).
[7] Yuqi Fang, Cheng Chen, Yixuan Yuan, and Kai-yu Tong. 2019. Selective feature aggregation network with area-boundary constraints for polyp segmentation. In MICCAI. Springer, 302–310.
[8] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. 2020. Res2Net: A New Multi-scale Backbone Architecture. IEEE TPAMI (2020), 1–1. https://doi.org/10.1109/TPAMI.2019.2938758
[9] Zaiwang Gu, Jun Cheng, Huazhu Fu, Kang Zhou, Huaying Hao, Yitian Zhao, Tianyang Zhang, Shenghua Gao, and Jiang Liu. 2019. CE-Net: Context encoder network for 2d medical image segmentation. IEEE TMI 38, 10 (2019), 2281–2292.
[10] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard D. Johansen, Dag Johansen, Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. 2020. Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation. In Proc. of MediaEval 2020 CEUR Workshop.
[11] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. 2020. Kvasir-SEG: A Segmented Polyp Dataset. In MMM. 451–462.
[12] Alexander V Mamonov, Isabel N Figueiredo, Pedro N Figueiredo, and Yen-Hsi Richard Tsai. 2014. Automated polyp detection in colon capsule endoscopy. IEEE TMI 33, 7 (2014), 1488–1502.
[13] Balamurali Murugesan, Kaushik Sarveswaran, Sharath M Shankaranarayana, Keerthi Ram, Jayaraj Joseph, and Mohanasankar Sivaprakasam. 2019. Psi-Net: Shape and boundary aware joint multi-task deep network for medical image segmentation. In IEEE EMBC. 7223–7226.
[14] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. 2019. Basnet: Boundary-aware salient object detection. In IEEE CVPR. 7479–7489.
[15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer, 234–241.
[16] Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang. 2015. Automated polyp detection in colonoscopy videos using shape and context information. IEEE TMI 35, 2 (2015), 630–644.
[17] Jun Wei, Shuhui Wang, and Qingming Huang. 2020. F3Net: Fusion, Feedback and Focus for Salient Object Detection. In AAAI.
[18] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. 2017. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR. 1568–1576.
[19] Zhe Wu, Li Su, and Qingming Huang. 2019. Cascaded partial decoder for fast and accurate salient object detection. In IEEE CVPR. 3907–3916.
[20] Lequan Yu, Hao Chen, Qi Dou, Jing Qin, and Pheng Ann Heng. 2016. Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos. IEEE JBHI 21, 1 (2016), 65–75.
[21] Ruikai Zhang, Yali Zheng, Carmen CY Poon, Dinggang Shen, and James YW Lau. 2018. Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker. Pattern Recognition 83 (2018), 209–219.
[22] Shihao Zhang, Huazhu Fu, Yuguang Yan, Yubing Zhang, Qingyao Wu, Ming Yang, Mingkui Tan, and Yanwu Xu. 2019. Attention Guided Network for Retinal Image Segmentation. In MICCAI. 797–805.
[23] Zhijie Zhang, Huazhu Fu, Hang Dai, Jianbing Shen, Yanwei Pang, and Ling Shao. 2019. ET-Net: A generic edge-attention guidance network for medical image segmentation. In MICCAI. Springer, 442–450.
[24] Zhao Zhang, Zheng Lin, Jun Xu, Wenda Jin, Shao-Ping Lu, and Deng-Ping Fan. 2020. Bilateral attention network for rgb-d salient object detection. arXiv preprint arXiv:2004.14582 (2020).
[25] Zhengxin Zhang, Qingjie Liu, and Yunhong Wang. 2018. Road extraction by deep residual u-net. IEEE Geoscience and Remote Sensing Letters 15, 5 (2018), 749–753.
[26] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. 2019. Unet++: A nested u-net architecture for medical image segmentation. IEEE TMI (2019), 3–11.
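As a supplementary illustration of Eqs. (1) and (2) in § 2.2, the sketch below applies reverse attention to a toy single-channel feature map in plain Python. It is a simplified reading of the equations, not the released implementation: the function names are hypothetical, nearest-neighbour 2× up-sampling is an assumed choice for the up-sampling operator, and real side-output features have many channels rather than one.

```python
import math


def sigmoid(x):
    """Logistic function used as sigma in Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def upsample2x(m):
    """Nearest-neighbour 2x up-sampling of a 2-D list (assumed interpolation)."""
    out = []
    for row in m:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def reverse_attention_weight(s_deeper):
    """Eq. (2): A_i = 1 - sigmoid(upsample(S_{i+1})), element-wise."""
    up = upsample2x(s_deeper)
    return [[1.0 - sigmoid(v) for v in row] for row in up]


def reverse_attention(f, s_deeper):
    """Eq. (1): R_i = f_i * A_i (element-wise), for a single-channel f_i."""
    a = reverse_attention_weight(s_deeper)
    return [[fv * av for fv, av in zip(f_row, a_row)] for f_row, a_row in zip(f, a)]


# Toy deeper prediction (logits): left cell is confidently polyp, right is background.
s4 = [[8.0, -8.0]]
# Toy side-output feature at twice the resolution, all ones for readability.
f3 = [[1.0] * 4, [1.0] * 4]

r3 = reverse_attention(f3, s4)
for row in r3:
    print([round(v, 4) for v in row])
```

In this toy case the responses over the already-detected region are suppressed towards zero while the remaining positions pass through almost unchanged, which is the erasing behaviour that lets each level focus on regions and boundaries the deeper prediction has not yet covered.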