Boundary-aware Pyramid Transformer for Polyp Segmentation

Jiacheng Wang1, Yuxi Ma1, Ruochen Mu1 and Liansheng Wang1
1 Department of Computer Science, School of Informatics, Xiamen University

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, IC Royal Bengal, Kolkata, India.
jiachengw@stu.xmu.edu.cn (J. Wang); corresponding author: lswang@xmu.edu.cn (L. Wang)

Abstract
According to the World Health Organization (WHO), colorectal cancer (CRC) has shown a growing death rate in recent years, and its major cause is adenomatous polyps. Early polyp diagnosis can help lower the incidence of CRC, and colonoscopy is the gold standard for achieving it. Manual polyp segmentation, however, is still a time-consuming and labor-intensive process. Although deep learning has recently made significant progress in automatic polyp segmentation, existing models have the following drawbacks: (i) small lesions are hard to detect since pooling layers discard detailed context, and (ii) boundaries are sometimes blurry and ambiguous and therefore extremely hard to delineate. In this paper, we propose to equip the Pyramid Vision Transformer with boundary-aware supervision, termed BP-Trans, which builds multi-scale feature maps for dense prediction tasks and attentive boundary knowledge for precise boundary segmentation. We perform five-fold cross-validation on the Endoscopic Computer Vision Challenge 2.0 (EndoCV2022) dataset, and the results on all metrics and folds consistently indicate the advantage of our method.

Keywords
Transformer, Boundary-aware Supervision, Polyp Segmentation

1. Introduction

Colorectal cancer (CRC) is the third most prevalent cause of cancer mortality worldwide, with more than 1.85 million cases and 850,000 deaths per year [1], and its major cause is adenomatous polyps. The numbers give an intuitive sense of the scale: 50%-70% of colon cancers arise from adenomas, and the cancer rate of adenomatous polyps is 2.9%-4% [2]. Colonoscopy is a vital medical screening technique for illnesses of the lower digestive system: it can be used to check for intestinal polyps, bleeding, and intestinal blockage, and to rule out other lesions. With the growing adoption of artificial intelligence, interest has shifted toward healthcare, as deep learning-based polyp segmentation is able to aid clinician diagnosis. CNNs have made significant progress in numerous imaging applications, and automatic polyp segmentation is a popular topic among them. Thanks to their strong and robust feature representation ability, FCN [3], U-Net [4], U-Net++ [5], DoubleU-Net [6], the ResUNet [7] series, etc., achieve good results compared with traditional methods.

However, these methods have certain limitations. In general, polyp lesions have a hue close to the patient's own intestinal environment, so their appearance differs markedly under different environmental conditions. As a result, during training the model tends to over-concentrate on the strong association between color and polyp regions, which is harmful to model training. Wei et al. [8] present the color exchange (CE) operation as a solution to this problem. They also propose the Probability Correction Strategy (PCS), which improves positive-sample prediction while reducing negative-sample interference. Furthermore, the majority of polyp regions are rather small; when a plain CNN is used for feature extraction, these small regions are frequently overlooked. To solve this issue, Wang et al. introduce the Pyramid Vision Transformer (PVT) [9], which yields multi-scale feature maps for dense prediction tasks by combining a pyramid structure with the transformer. Dong et al. [10] extend PVT with additional modules and propose Polyp-PVT for polyp segmentation, which effectively suppresses noise in the features and greatly improves their expressiveness.

Despite the success of Polyp-PVT, it still lacks the ability to handle the tricky situation in which boundaries are too blurry to recognize. To mitigate this problem, we propose a Boundary-aware Pyramid Transformer (BP-Trans) that performs multi-scale feature extraction together with boundary knowledge modeling at multiple levels. BP-Trans extends PVT with a boundary-aware self-attention module, which is supervised by a boundary key-point map and refines features to yield more powerful representations of boundaries. To assess our method, we conduct five-fold cross-validation on the dataset provided by the Endoscopic Computer Vision Challenge 2.0. The experimental results consistently demonstrate that our proposed framework improves segmentation performance significantly.

Figure 1: An overview of the boundary-aware pyramid transformer (BP-Trans) for polyp segmentation.
2. Method

2.1. Overall Architecture

As illustrated in Figure 1, to minimize the influence of background colors on model training, we first utilize the CE [8] module to preprocess the input images. Then, the PVT encoder proposed by Wang et al. [9] is used for coarse feature extraction owing to its superior ability in multi-scale representation. After the extraction, features at four different scales are obtained, and the highest-level feature, x_4, is sent into a Boundary-aware Attention Gate (BAG) [11] to retrieve boundary information, resulting in the feature f_BAG. Finally, we assemble the different levels of features, send them into the prediction head, and predict the segmentation map. During inference, we employ the PCS [8] module to correct the excessive pixel imbalance of tiny polyps.

2.2. Pyramid Feature Extraction

PVT [9] introduces a pyramid structure into the transformer framework to generate multi-scale feature maps for dense prediction tasks. It contains four stages that produce feature maps at different scales, and all stages share a similar architecture. Taking the first stage as an example, the input image of size H x W x C is divided into (H x W) / 4^2 patches of size 4 x 4, so that the output feature map of this stage has size H/4 x W/4 x C_1. Each flattened patch is linearly projected to obtain a patch embedding, which is combined with a position embedding. The resulting tokens are then passed through the transformer encoder of this stage, and the output is reshaped back into a feature map. The computation of PVT is greatly reduced by using a progressively shrinking pyramid that reduces the size of large feature maps.

We remove the decoder and keep only the PVT encoder, which yields four multi-scale feature maps (i.e., {x_i}, i = 1, ..., 4) generated by its different stages. Among these feature maps, x_1 is the lowest-level feature, which contains rich detail but also much noise, whereas x_2, x_3, and x_4 provide high-level semantic cues.
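For illustration only, the following is a minimal PyTorch sketch of a single PVT-style stage as described above: patches are projected to embeddings, combined with a position embedding, passed through a transformer encoder, and reshaped back into a downsampled feature map. The class name, layer sizes, the use of nn.TransformerEncoder, and the additive position embedding are our assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PVTStyleStage(nn.Module):
        """One pyramid stage (illustrative): patchify -> project -> add position
        embedding -> transformer encoder -> reshape to a downsampled feature map."""
        def __init__(self, in_ch, embed_dim, patch=4, depth=2, heads=2, img_size=352):
            super().__init__()
            # Patch embedding via a strided convolution (equivalent to flattening
            # non-overlapping patches and applying a linear projection).
            self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
            num_patches = (img_size // patch) ** 2
            self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                               dim_feedforward=embed_dim * 4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            x = self.proj(x)                       # B x C x H/p x W/p
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)  # B x (h*w) x C
            tokens = tokens + self.pos             # combine with position embedding
            tokens = self.encoder(tokens)
            return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

    # Usage: a 352x352 RGB image yields an 88x88 feature map in the first stage.
    stage1 = PVTStyleStage(in_ch=3, embed_dim=64, patch=4, img_size=352)
    x1 = stage1(torch.randn(1, 3, 352, 352))
    print(x1.shape)  # torch.Size([1, 64, 88, 88])

Each later stage would take the previous stage's map as input with a smaller patch stride, which is how the progressively shrinking pyramid keeps computation manageable.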
2.3. Boundary-aware Knowledge Modeling

The main task of the BAG [11] is to extract enough local detail to handle blurred boundaries. We argue that equipping the transformer with boundary information also gives it more power to address lesions with ambiguous boundaries. At the end of each transformer encoder layer, we add a BAG to enhance the transformed features. In addition, the BAG has a key-patch map generator that takes the refined features as input and outputs a binary patch-wise attention map, in which a value of 1 means that the associated patch lies on the fuzzy border, similar to a classic spatial attention gate. Thanks to this architecture, the BAG learns robust feature representations of fuzzy borders in a variety of ways, which is critical for segmenting lesions with fuzzy boundaries.
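As a rough sketch of the idea (not the BAG implementation from [11]), a key-patch map generator can be realized as a small head that predicts, for every patch, whether it lies on a fuzzy boundary; the resulting map then gates the patch features like a spatial attention gate, and its logits can be supervised by a boundary patch map. The class name, layer choices, and the sigmoid gating below are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class BoundaryAttentionGate(nn.Module):
        """Illustrative boundary-aware attention gate: a key-patch map generator
        predicts which patches lie on a fuzzy border, and the soft map re-weights
        the patch features, like a spatial attention gate."""
        def __init__(self, dim):
            super().__init__()
            self.key_patch_head = nn.Sequential(
                nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

        def forward(self, tokens):
            # tokens: B x N x C patch features from a transformer stage
            logits = self.key_patch_head(tokens)   # B x N x 1 boundary logits
            attn = torch.sigmoid(logits)           # soft patch-wise boundary map
            gated = tokens * (1.0 + attn)          # emphasize boundary patches
            return gated, logits                   # logits get boundary supervision

    # Usage with a binary boundary patch map as the supervision target:
    bag = BoundaryAttentionGate(dim=64)
    feats = torch.randn(2, 7744, 64)
    gated, logits = bag(feats)
    target = torch.randint(0, 2, (2, 7744, 1)).float()
    boundary_loss = nn.BCEWithLogitsLoss()(logits, target)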
2.4. Prediction with Multi-level Fusion

To fuse features of different levels, we employ three sub-modules: the cascaded fusion module (CFM), the camouflage identification module (CIM), and the similarity aggregation module (SAM) [10]. The CFM extracts the semantic and location information of polyps from the high-level features x_2, x_3, and x_4, while the CIM captures information about polyps camouflaged in x_1. The SAM then extends the pixel-level features of the polyp region with high-level semantic location cues over the whole polyp region, effectively integrating the cross-level features. During the test phase, we utilize the Probability Correction Strategy (PCS) [8] to cope with small polyps whose foreground and background pixels are highly imbalanced. The primary idea of PCS is to explicitly adjust the predicted probabilities using logarithmic weighting, which can significantly increase the accuracy of the final prediction.

3. Experiments

3.1. Datasets and Evaluation Metrics

We employ the EndoCV2022 dataset [12, 13, 14] to conduct the experiments, which includes 46 video sequences with a total of 3390 images. Five-fold cross-validation is adopted for a fair and thorough comparison; the statistics of each fold are shown in Table 1.

Table 1: Detailed statistics of each fold in the training data.

fold | Train sequences | Train samples | Test sequences | Test samples
0    | 36              | 2571          | 10             | 719
1    | 37              | 2827          | 9              | 463
2    | 37              | 2684          | 9              | 606
3    | 37              | 2465          | 9              | 825
4    | 37              | 2613          | 9              | 677

The Dice score is used as the evaluation metric, which mainly focuses on the internal consistency of segmented objects:

    Dice(y_true, y_pred) ≜ (2 · Σ(y_true · y_pred) + 1e-15) / (Σ y_true + Σ y_pred + 1e-15)    (1)

We report the mean Dice score, where y_true denotes the label and y_pred the prediction. A smoothing factor is used to avoid collapse during training when the label is empty.
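For concreteness, Eq. (1) corresponds to the following small Python routine. This is our own restatement of the metric; the function name is illustrative, and the 1e-15 smoothing term matches the equation and guards against empty masks.

    import numpy as np

    def dice_score(y_true, y_pred, eps=1e-15):
        """Smoothed Dice score from Eq. (1); y_true and y_pred are binary masks."""
        y_true = np.asarray(y_true, dtype=np.float64)
        y_pred = np.asarray(y_pred, dtype=np.float64)
        inter = (y_true * y_pred).sum()
        return (2.0 * inter + eps) / (y_true.sum() + y_pred.sum() + eps)

    # Example: two identical masks give a Dice score of 1.0.
    mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1
    print(dice_score(mask, mask))  # 1.0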
3.2. Implementation Details

We implement BP-Trans in the PyTorch framework and use an NVIDIA GeForce RTX 3080 Ti GPU to accelerate computation. We adopt a multi-scale training strategy since each polyp has a unique shape and size. The other details are as follows. First, because image size varies from one sequence to the next, we resize the CE-transformed [8] images to a consistent size of 352 x 352. Each image is then flipped horizontally and vertically with probability 0.5, rotated randomly, and blurred with a Gaussian kernel with probability 0.1. We use the AdamW optimizer with an initial learning rate of 1e-4 to update the network parameters; the batch size is set to 8 and training runs for 120 epochs. To supervise model training, we employ a combination of IoU loss and binary cross-entropy with logits as the loss function.
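As an illustration of the training objective described above, a combination of binary cross-entropy with logits and a soft IoU loss can be written as follows. The equal weighting of the two terms, the smoothing constant, and the function name are our assumptions; only the choice of the two loss components comes from the text.

    import torch
    import torch.nn.functional as F

    def bce_iou_loss(logits, target, eps=1e-6):
        """Illustrative combination of BCE-with-logits and a soft IoU loss."""
        bce = F.binary_cross_entropy_with_logits(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(-2, -1))
        union = (prob + target - prob * target).sum(dim=(-2, -1))
        iou = 1.0 - (inter + eps) / (union + eps)   # soft IoU loss per image
        return bce + iou.mean()

    # Usage: logits and target are B x 1 x H x W tensors.
    logits = torch.randn(2, 1, 352, 352)
    target = torch.randint(0, 2, (2, 1, 352, 352)).float()
    print(bce_iou_loss(logits, target))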
3.3. Compared Models

We compare against several models, including PraNet [15], SANet [8], TransFuse [16], and HarDNet-MSEG [17]; the experimental results are shown in Table 2. Among these, PraNet [15] first predicts the rough regions and then implicitly models the boundaries through reverse attention; as a result, it delivers a significant improvement in performance over several traditional models. SANet [8] uses CE to decouple an image's color and content, and shallow attention to reduce data noise for tiny polyps that are difficult to segment, lessening the interference of irrelevant factors on the model. TransFuse [16] combines transformers and CNNs in a parallel style, so that both global dependencies and low-level spatial details can be captured efficiently in a much shallower manner, also achieving good results. In contrast to the above models, HarDNet-MSEG [17] uses a simple encoder-decoder architecture without any attention modules. Its backbone is a low-memory-traffic CNN paired with a decoder that offers high accuracy and fast inference; experiments show that it is 1.3 times faster than PraNet and more than 2 times faster than the other models.

We employ the same encoder-decoder structure as HarDNet-MSEG [17], but instead of the low-memory-traffic CNN backbone we use PVT [9], a versatile backbone for dense prediction, and combine it with the BAG [11] to increase the focus on boundary information. As can be seen in Table 2, the BP-Trans model is superior to the existing methods, demonstrating a better learning ability.

Table 2: Quantitative results of compared methods.

model                       | Dice   | Dice std
DeepLabv3+ [18]             | 0.3019 | 0.455
DeepLabv3+ + PointRend [19] | 0.4884 | 0.4347
PSPNet [20]                 | 0.2933 | 0.455
SANet [8]                   | 0.634  | 0.41
PraNet [15]                 | 0.665  | 0.39
TransFuse [16]              | 0.6755 | 0.3876
HarDNet-MSEG [17]           | 0.7247 | 0.3685
Polyp-PVT [10]              | 0.7321 | 0.3662
BP-Trans (ours)             | 0.7429 | 0.352

Figure 2: Visualized comparison of (a) Image; (b) Ground truth (GT); (c) Result of our model; (d-h) Results of compared methods.

3.4. Ablation Study

We explore the effectiveness of each component in detail.

Effectiveness of PVT: As shown in Table 2, Polyp-PVT [10] achieves the best performance on the training set among the compared models, with HarDNet-MSEG [17] following closely behind. Their combinations with temporal or aggregation modules are compared in Table 3; the experimental results show that PVT [9] outperforms HarDNet-MSEG [17] in all respects.

Effectiveness of BAG: Since the Polyp-PVT [10] model is relatively complex and therefore difficult to make faster than HarDNet-MSEG [17], we instead try to improve the accuracy of the model. Our attempts include combining PVT [9] with ConvLSTM [21] to exploit temporal information, but this does not improve the model's performance. We therefore start from boundary information and use the BAG [11] to extract more of it, which improves the model to a certain extent.

Table 3: Ablation study of the key components.

model                   | Dice   | Dice std
HarDNet-MSEG [17]       | 0.7247 | 0.3685
HarDNet-MSEG + ConvLSTM | 0.6143 | 0.382
PVT [9]                 | 0.7321 | 0.3662
PVT + ConvLSTM          | 0.7146 | 0.369
BP-Trans                | 0.7429 | 0.352

4. Conclusion

This paper introduces a boundary-aware pyramid transformer for polyp segmentation, which leverages abundant boundary knowledge and multi-scale information to boost segmentation performance. We conduct five-fold cross-validation to assess the performance, and the results consistently confirm the advantage of our method. Furthermore, our method achieved first place in the first and second rounds of the official evaluation. In the future, we will improve temporal modeling and generalization ability, and we hope that our findings may inspire new approaches to polyp segmentation.

References

[1] L. H. Biller, D. Schrag, Diagnosis and treatment of metastatic colorectal cancer: a review, JAMA 325 (2021) 669-685.
[2] B. C. Morson, Genesis of colorectal cancer, Clinics in Gastroenterology 5 (1976) 505-525.
[3] M. Akbari, M. Mohrekesh, E. Nasr-Esfahani, S. R. Soroushmehr, N. Karimi, S. Samavi, K. Najarian, Polyp segmentation in colonoscopy images using fully convolutional network, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 69-72.
[4] A. Mohammed, S. Yildirim, I. Farup, M. Pedersen, Ø. Hovde, Y-net: A deep convolutional neural network for polyp detection, arXiv preprint arXiv:1806.01907 (2018).
[5] N. B. Le Duy Huynh, A u-net++ with pre-trained efficientnet backbone for segmentation of diseases and artifacts in endoscopy images and videos, in: CEUR Workshop Proceedings, volume 2595, 2020, pp. 13-17.
[6] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, H. D. Johansen, Doubleu-net: A deep convolutional neural network for medical image segmentation, in: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2020, pp. 558-564.
[7] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, H. D. Johansen, Resunet++: An advanced architecture for medical image segmentation, in: 2019 IEEE International Symposium on Multimedia (ISM), IEEE, 2019, pp. 225-2255.
[8] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, S. Cui, Shallow attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 699-708.
[9] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568-578.
[10] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, L. Shao, Polyp-pvt: Polyp segmentation with pyramid vision transformers, arXiv preprint arXiv:2108.06932 (2021).
[11] J. Wang, L. Wei, L. Wang, Q. Zhou, L. Zhu, J. Qin, Boundary-aware transformers for skin lesion segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 206-216.
[12] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[13] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[14] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[15] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, L. Shao, Pranet: Parallel reverse attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 263-273.
[16] Y. Zhang, H. Liu, Q. Hu, Transfuse: Fusing transformers and cnns for medical image segmentation, in: MICCAI, 2021.
[17] C.-H. Huang, H.-Y. Wu, Y.-L. S. Lin, Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps, arXiv abs/2101.07172 (2021).
[18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, arXiv:1802.02611 [cs] (2018).
[19] A. Kirillov, Y. Wu, K. He, R. Girshick, Pointrend: Image segmentation as rendering, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9796-9805. doi:10.1109/CVPR42600.2020.00982.
[20] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, arXiv:1612.01105 [cs] (2017).
[21] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional lstm network: A machine learning approach for precipitation nowcasting, Advances in Neural Information Processing Systems 28 (2015).