
Boundary-aware Pyramid Transformer for Polyp Segmentation

Jiacheng Wang

Yuxi Ma

Ruochen Mu

Liansheng Wang

Department of Computer Science at School of Informatics, Xiamen University

According to the World Health Organization (WHO), colorectal cancer (CRC) has led to a growing death rate in recent years, and its major cause is adenomatous polyps. Early polyp diagnosis, for which colonoscopy is the gold standard, can help lower the incidence of CRC. However, polyp segmentation is still a time-consuming and labor-intensive process. Although deep learning has recently made significant progress in automatic polyp segmentation, these models have the following drawbacks: (i) small lesions are hard to detect because the pooling layers discard detailed context, and (ii) the boundaries are sometimes blurry and ambiguous, which makes them extremely hard to determine. In this paper, we propose to equip the Pyramid Vision Transformer with boundary-aware supervision, a model we call BP-Trans, which builds multi-scale feature maps for dense prediction and attends to boundary knowledge for precise boundary segmentation. We perform five-fold cross-validation on the dataset of the Endoscopic computer vision challenges 2.0, and the results on all metrics and folds consistently indicate the advantage of our method.

Keywords: Transformer, Boundary-aware Supervision, Polyp Segmentation

1. Introduction

2. Method

2.1. Overall Architecture

As illustrated in Figure 1, to minimize the influence of background colors on model training, we first use the CE module [8] to preprocess the input images. Then, the PVT encoder proposed by Wang et al. [9] performs coarse feature extraction, owing to its superior ability in multi-scale representation. This yields features at four different scales, of which the highest-level feature (level 4) is sent into a Boundary-aware Attention Gate (BAG) [11] to retrieve boundary information, resulting in a boundary-enhanced feature. Finally, we assemble the features of different levels, send them into the prediction head, and predict the segmentation map. During inference, we employ the PCS module [8] to correct the severe pixel imbalance of tiny polyps.

2.2. Pyramid Feature Extraction

PVT [9] introduces a pyramid structure into the transformer framework to generate multi-scale feature maps for dense prediction tasks. It contains four stages that generate feature maps at different scales. All stages share a similar architecture, detailed as follows. Taking the first stage as an example, whose output feature map has size $\frac{H}{4} \times \frac{W}{4} \times C_1$, the input image of size $H \times W \times 3$ is divided into $\frac{H}{4} \times \frac{W}{4}$ patches. Each flattened patch is linearly projected to obtain its embedding, which is concatenated with the position embedding. The resulting vectors are then passed through the transformer encoder of this stage, and the output is reshaped into a feature map. By using a progressively shrinking pyramid to reduce the size of large feature maps, PVT greatly reduces the computation.
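To make the shrinking pyramid concrete, the following minimal sketch computes the size of each stage's output feature map for a given input size. The strides of 4, 8, 16, and 32 follow the PVT design [9]; the channel widths (64, 128, 320, 512) and the function name `pvt_feature_shapes` are assumptions chosen for illustration and are not stated in this paper.

```python
def pvt_feature_shapes(height, width, channels=(64, 128, 320, 512)):
    """Return (H_i, W_i, C_i) for the four pyramid stages.

    The channel widths are an assumed configuration, not a value
    taken from this paper.
    """
    strides = (4, 8, 16, 32)  # downsampling factor of each stage
    shapes = []
    for stride, num_channels in zip(strides, channels):
        if height % stride or width % stride:
            raise ValueError("input size must be divisible by every stride")
        shapes.append((height // stride, width // stride, num_channels))
    return shapes


# Spatial sizes for the 352x352 training resolution used in this work.
for level, shape in enumerate(pvt_feature_shapes(352, 352), start=1):
    print(f"level {level}: {shape}")
```

With a 352x352 input, the spatial resolution falls from 88x88 at level 1 to 11x11 at level 4, which is why the later stages are much cheaper to process than the first.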

We remove the decoder layer and use only the PVT encoder, building on top of the four multi-scale feature maps (levels 1 to 4) generated by its different stages. Among these feature maps, the level-1 map is the lowest-level feature, which contains rich detail but also a lot of noise. In comparison, the maps of levels 2, 3, and 4 provide high-level semantic cues.

2.3. Boundary-aware Knowledge Modeling

The main task of BAG [11] is to extract enough local detail to handle blurred boundaries. We argue that equipping the transformer with boundary information also makes it more capable of handling lesions with ambiguous boundaries. At the end of each transformer encoder layer, we add a BAG to enhance the transformed features. In addition, the BAG has a key-patch map generator that takes the transformed features as input and outputs a binary patch-wise attention map, where the value 1 means that the associated patch lies on the ambiguous boundary, similar to a classic spatial attention gate. Thanks to this architecture, BAG learns a robust feature representation of ambiguous boundaries, which is critical for segmenting lesions with such boundaries.
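As an illustration of what a binary patch-wise map encodes, the sketch below derives such a map from a ground-truth mask. It rests on a simplifying assumption that is not taken from this paper or from [11]: a patch is marked 1 when it contains both foreground and background pixels. The actual generator in BAG is learned from features; the function name `key_patch_map` and the patch size are hypothetical.

```python
def key_patch_map(mask, patch):
    """Mark each patch that straddles the lesion boundary.

    mask  : 2D list of 0/1 integers (ground-truth segmentation)
    patch : side length of the square patches
    Returns a 2D list with one 0/1 entry per patch.
    Assumption: "boundary patch" means the patch holds both classes.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask size must be divisible by the patch size")
    result = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            values = {
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            }
            row.append(1 if values == {0, 1} else 0)
        result.append(row)
    return result


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
print(key_patch_map(example_mask, patch=2))
```

In this example only the bottom-right patch lies fully inside the lesion, so it is the only patch that is not marked as a boundary patch.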

2.4. Prediction with Multi-level Fusion

To fuse features of different levels, we employ three sub-modules: the cascaded fusion module (CFM), the camouflage identification module (CIM), and the similarity aggregation module (SAM) [10]. CFM extracts the semantic and location information of polyps from the high-level features of levels 2, 3, and 4, while CIM captures the information about polyps camouflaged in the level-1 feature. SAM extends the pixel features of the polyp region with high-level semantic position information of the whole polyp region, effectively fusing the cross-level features.

During the test phase, we use the probability correction strategy (PCS) [8] to cope with small polyps whose foreground and background pixels are highly imbalanced. The main idea of PCS is to explicitly correct the predicted probability using logarithmic weighting, which can significantly increase the accuracy of the final prediction.

We implement BP-Trans in PyTorch and use an NVIDIA GeForce RTX 3080 Ti to accelerate the computations. Since polyps vary in shape and size, we adopt a multi-scale training strategy. The other details are as follows. First, because the image size varies from one sequence to the next, we resize the images processed by CE [8] to a uniform size of 352×352. Each image is then flipped horizontally and vertically with probability 0.5, randomly rotated, and blurred with GaussianBlur with probability 0.1. We use the AdamW optimizer with an initial learning rate of 1e-4 to update the network parameters. The batch size is set to 8, and training runs for a total of 120 epochs. To supervise model training, we use a combination of IoU loss and binary cross entropy with logits as the loss function.
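The loss described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the training code of this work: the two terms are summed with equal weight, the IoU term is the common soft (probability-based) variant, and the names `bce_with_logits`, `soft_iou_loss`, and `combined_loss` as well as the `eps` value are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits, targets):
    """Mean binary cross entropy computed directly from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def soft_iou_loss(logits, targets, eps=1e-6):
    """1 - soft IoU, using predicted probabilities instead of hard labels."""
    probs = [sigmoid(z) for z in logits]
    intersection = sum(p * y for p, y in zip(probs, targets))
    union = sum(p + y - p * y for p, y in zip(probs, targets))
    return 1.0 - (intersection + eps) / (union + eps)


def combined_loss(logits, targets):
    """Equal-weight sum of both terms (the weighting is an assumption)."""
    return bce_with_logits(logits, targets) + soft_iou_loss(logits, targets)


# Flattened pixels: a confident correct prediction vs. a confident wrong one.
print(combined_loss([10.0, -10.0], [1.0, 0.0]))
print(combined_loss([-10.0, 10.0], [1.0, 0.0]))
```

Working on logits rather than probabilities keeps the cross-entropy term stable for very confident predictions, which is the usual reason for choosing the "with logits" formulation.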

3. Experiments

3.1. Datasets and Evaluation Metrics

We employ the EndoCV2022 dataset [12][13][14] to conduct the experiments, which includes 46 video sequences with a total of 3290 images. Five-fold cross-validation is adopted for a fair and thorough comparison; the statistics of each fold are shown in Table 1. The Dice score is used as the final evaluation metric, which mainly focuses on the internal consistency of segmented objects.

Table 1
Detailed statistics of each fold in the training data.

Fold   Train sequences   Train samples   Test sequences   Test samples
0      36                2571            10               719
1      37                2827            9                463
2      37                2684            9                606
3      37                2465            9                825
4      37                2613            9                677

$$\text{Dice score}(y_{true}, y_{pred}) \triangleq \frac{2 \cdot \mathrm{sum}(y_{true} \cdot y_{pred}) + 10^{-15}}{\mathrm{sum}(y_{true}) + \mathrm{sum}(y_{pred}) + 10^{-15}} \quad (1)$$

Here, we report the mean value of the Dice score. $y_{true}$ stands for the label and $y_{pred}$ stands for the prediction. A smoothing factor of $10^{-15}$ is used to avoid training collapse when empty labels are encountered.

3.3. Compared Models

We compare against several models: PraNet [15], SANet [8], TransFuse [16], and HarDNet-MSEG [17]. The experimental results are shown in Table 2. Among these, PraNet [15] first predicts the coarse areas and then implicitly models the boundaries by means of reverse attention. As a result, it delivers a significant improvement in performance compared to certain traditional models. SANet [8] proposes CE to decouple the image's color and content, and shallow attention to suppress noise for small polyps that are difficult to segment, reducing the interference of irrelevant factors on the model. TransFuse [16] combines Transformers and CNNs in a parallel style, where both global dependency and low-level spatial details can be efficiently captured in a much shallower manner, also achieving good results. In contrast to the above models, HarDNet-MSEG [17] uses a simple encoder-decoder architecture without any attention modules. The backbone of HarDNet-MSEG [17] is a low-memory-traffic CNN paired with a decoder, which offers outstanding accuracy and fast inference times. Experimental results show that it is 1.3 times faster than PraNet and more than 2 times faster than the other models.

We employ the same encoder-decoder structure as HarDNet-MSEG [17], but instead of using a low-memory-traffic CNN as the backbone, we use PVT [9], a versatile backbone for dense prediction, which we combine with BAG [11] to increase the focus on boundary information. As can be seen in Table 2, the BP-Trans model is superior to the current methods, demonstrating that it has a better learning ability.
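The Dice score of Section 3.1, including its smoothing factor, can be sketched in a few lines. This is an illustrative version that assumes flattened binary masks given as Python lists; the function names `dice_score` and `mean_dice` are hypothetical.

```python
def dice_score(y_true, y_pred, smooth=1e-15):
    """Dice score between two flattened masks of equal length."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def mean_dice(pairs):
    """Average Dice score over (label, prediction) pairs."""
    scores = [dice_score(t, p) for t, p in pairs]
    return sum(scores) / len(scores)


label = [1, 1, 0, 0]
prediction = [1, 0, 0, 0]
print(dice_score(label, prediction))        # partial overlap
print(dice_score([0, 0, 0], [0, 0, 0]))     # empty label, empty prediction
```

The smoothing term matters in the second call: without it, an empty label paired with an empty prediction would produce a 0/0 division, whereas with it the score evaluates to 1.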

3.4. Ablation Study

We explore the effectiveness of each component in detail.

Effectiveness of PVT: As shown in Table 2, compared to

Effectiveness of BAG: Since the Polyp-PVT [10] model is relatively complex, it is difficult to surpass HarDNet-MSEG [17] in terms of speed, so we focus on improving the accuracy of the model. Our attempts include combining PVT [9] with ConvLSTM [21] to exploit temporal information and further improve the model. However, this attempt did not improve the model's performance. Finally, we turned to boundary information and used BAG [11] to capture more of it, which improved the model to a certain extent.

4. Conclusion

[Table: HarDNet-MSEG [17], HarDNet-MSEG+CONVLSTM, PVT [9], PVT+CONVLSTM, BP-Trans]

References

[1] L. H. Biller, D. Schrag, Diagnosis and treatment of metastatic colorectal cancer: a review, Jama 325 (2021) 669–685.

[2] B. C. Morson, Genesis of colorectal cancer, Clinics in gastroenterology 5 (1976) 505–525.

[3] M. Akbari, M. Mohrekesh, E. Nasr-Esfahani, S. R. Soroushmehr, N. Karimi, S. Samavi, K. Najarian, Polyp segmentation in colonoscopy images using fully convolutional network, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 69–72.

[4] A. Mohammed, S. Yildirim, I. Farup, M. Pedersen, Ø. Hovde, Y-net: A deep convolutional neural network for polyp detection, arXiv preprint arXiv:1806.01907 (2018).

[5] N. B. Le Duy Huynh, A u-net++ with pre-trained efficientnet backbone for segmentation of diseases and artifacts in endoscopy images and videos, in: CEUR Workshop Proceedings, volume 2595, 2020, pp. 13–17.

[6] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, H. D. Johansen, Doubleu-net: A deep convolutional neural network for medical image segmentation, in: 2020 IEEE 33rd International symposium on computer-based medical systems (CBMS), IEEE, 2020, pp. 558–564.

[7] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, H. D. Johansen, Resunet++: An advanced architecture for medical image segmentation, in: 2019 IEEE International Symposium on Multimedia (ISM), IEEE, 2019, pp. 225–2255.

[8] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, S. Cui, Shallow attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 699–708.

[9] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.

[10] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, L. Shao, Polyp-pvt: Polyp segmentation with pyramid vision transformers, arXiv preprint arXiv:2108.06932 (2021).

[11] J. Wang, L. Wei, L. Wang, Q. Zhou, L. Zhu, J. Qin, Boundary-aware transformers for skin lesion segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 206–216.

[12] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical image analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.

[13] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.

[14] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.

[15] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, L. Shao, Pranet: Parallel reverse attention network for polyp segmentation, in: International conference on medical image computing and computer-assisted intervention, Springer, 2020, pp. 263–273.

[16] Y. Zhang, H. Liu, Q. Hu, Transfuse: Fusing transformers and cnns for medical image segmentation, in: MICCAI, 2021.

[17] C.-H. Huang, H.-Y. Wu, Y.-L. S. Lin, Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps, ArXiv abs/2101.07172 (2021).

[18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, arXiv:1802.02611 [cs] (2018). arXiv:1802.02611.

[19] A. Kirillov, Y. Wu, K. He, R. Girshick, Pointrend: Image segmentation as rendering, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9796–9805. doi:10.1109/CVPR42600.2020.00982.

[20] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid Scene Parsing Network, arXiv:1612.01105 [cs] (2017). arXiv:1612.01105.

[21] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional lstm network: A machine learning approach for precipitation nowcasting, Advances in neural information processing systems 28 (2015).