Boundary-aware Pyramid Transformer for Polyp Segmentation

Jiacheng Wang1, Yuxi Ma1, Ruochen Mu1 and Liansheng Wang1
1 Department of Computer Science, School of Informatics, Xiamen University

4th International Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2022), in conjunction with the 19th IEEE International Symposium on Biomedical Imaging (ISBI 2022), March 28th, 2022, IC Royal Bengal, Kolkata, India.
jiachengw@stu.xmu.edu.cn (J. Wang); corresponding author: lswang@xmu.edu.cn (L. Wang)

Abstract
According to the World Health Organization (WHO), colorectal cancer (CRC) has shown a growing death rate in recent years, and its major cause is adenomatous polyps. Early polyp diagnosis can help lower the incidence of CRC, and colonoscopy is the gold standard for achieving it. Manual polyp segmentation, however, is still a time-consuming and labor-intensive process. Although deep learning has recently made significant progress in automatic polyp segmentation, existing models have the following drawbacks: (i) small lesions are hard to detect since pooling layers discard detailed context, and (ii) boundaries are sometimes blurry and ambiguous and therefore extremely hard to delineate. In this paper, we propose to equip the Pyramid Vision Transformer with boundary-aware supervision, termed BP-Trans, which builds multi-scale feature maps for dense prediction tasks and attentive boundary knowledge for precise boundary segmentation. We perform five-fold cross-validation on the Endoscopic Computer Vision Challenge 2.0 (EndoCV2022) dataset, and the results on all metrics and folds consistently indicate the advantage of our method.

Keywords
Transformer, Boundary-aware Supervision, Polyp Segmentation

1. Introduction

Colorectal cancer (CRC) is the third most prevalent cause of cancer mortality worldwide, with more than 1.85 million cases and 850,000 deaths per year [1], and its major cause is adenomatous polyps. The numbers give an intuitive sense of the scale: 50%-70% of colon cancers arise from adenomas, and the cancer rate of adenomatous polyps is 2.9%-4% [2]. Colonoscopy is a vital medical screening technique for illnesses of the lower digestive system: it can be used to check for intestinal polyps, bleeding, and intestinal blockage, and to rule out other lesions. With the growing adoption of artificial intelligence, interest has shifted toward healthcare, as deep learning-based polyp segmentation is able to aid clinician diagnosis. CNNs have made significant progress in numerous imaging applications, and automatic polyp segmentation is a popular topic among them. Thanks to their strong and robust feature representation ability, FCN [3], U-Net [4], U-Net++ [5], DoubleU-Net [6], the ResUNet [7] series, etc., achieve good results compared with traditional methods.

However, these methods have certain limitations. In general, polyp lesions have a hue close to the patient's own intestinal environment, so their appearance differs markedly under different environmental conditions. As a result, during training the model tends to over-concentrate on the strong association between color and polyp regions, which is harmful to model training. Wei et al. [8] present the color exchange (CE) operation as a solution to this problem. They also propose the Probability Correction Strategy (PCS), which improves positive-sample prediction while reducing negative-sample interference. Furthermore, the majority of polyp regions are rather small; when a plain CNN is used for feature extraction, these small regions are frequently overlooked. To solve this issue, Wang et al. introduce the Pyramid Vision Transformer (PVT) [9], which yields multi-scale feature maps for dense prediction tasks by combining a pyramid structure with the transformer. Dong et al. [10] extend PVT with additional modules and propose Polyp-PVT for polyp segmentation, which effectively suppresses noise in the features and greatly improves their expressiveness.

Despite the success of Polyp-PVT, it still lacks the ability to handle the tricky situation in which boundaries are too blurry to recognize. To mitigate this problem, we propose a Boundary-aware Pyramid Transformer (BP-Trans) that performs multi-scale feature extraction together with boundary knowledge modeling at multiple levels. BP-Trans extends PVT with a boundary-aware self-attention module, which is supervised by a boundary key-point map and refines features to yield more powerful representations of boundaries. To assess our method, we conduct five-fold cross-validation on the dataset provided by the Endoscopic Computer Vision Challenge 2.0. The experimental results consistently demonstrate that our proposed framework improves segmentation performance significantly.

Figure 1: An overview of the boundary-aware pyramid transformer (BP-Trans) for polyp segmentation.
2. Method

2.1. Overall Architecture

As illustrated in Figure 1, to minimize the influence of background colors on model training, we first utilize the CE [8] module to preprocess the input images. Then, the PVT encoder proposed by Wang et al. [9] is used for coarse feature extraction owing to its superior ability in multi-scale representation. After the extraction, features at four different scales are obtained, and the highest-level feature, x_4, is sent into a Boundary-aware Attention Gate (BAG) [11] to retrieve boundary information, resulting in the feature f_BAG. Finally, we assemble the different levels of features, send them into the prediction head, and predict the segmentation map. During inference, we employ the PCS [8] module to correct the excessive pixel imbalance of tiny polyps.

2.2. Pyramid Feature Extraction

PVT [9] introduces a pyramid structure into the transformer framework to generate multi-scale feature maps for dense prediction tasks. It contains four stages that produce feature maps at different scales, and all stages share a similar architecture. Taking the first stage as an example, the input image of size H x W x C is divided into (H x W) / 4^2 patches of size 4 x 4, so that the output feature map of this stage has size H/4 x W/4 x C_1. Each flattened patch is linearly projected to obtain a patch embedding, which is combined with a position embedding. The resulting tokens are then passed through the transformer encoder of this stage, and the output is reshaped back into a feature map. The computation of PVT is greatly reduced by using a progressively shrinking pyramid that reduces the size of large feature maps.

We remove the decoder and keep only the PVT encoder, which yields four multi-scale feature maps (i.e., {x_i}, i = 1, ..., 4) generated by its different stages. Among these feature maps, x_1 is the lowest-level feature, which contains rich detail but also much noise, whereas x_2, x_3, and x_4 provide high-level semantic cues.
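For illustration only, the following is a minimal PyTorch sketch of a single PVT-style stage as described above: patches are projected to embeddings, combined with a position embedding, passed through a transformer encoder, and reshaped back into a downsampled feature map. The class name, layer sizes, the use of nn.TransformerEncoder, and the additive position embedding are our assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PVTStyleStage(nn.Module):
        """One pyramid stage (illustrative): patchify -> project -> add position
        embedding -> transformer encoder -> reshape to a downsampled feature map."""
        def __init__(self, in_ch, embed_dim, patch=4, depth=2, heads=2, img_size=352):
            super().__init__()
            # Patch embedding via a strided convolution (equivalent to flattening
            # non-overlapping patches and applying a linear projection).
            self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
            num_patches = (img_size // patch) ** 2
            self.pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                               dim_feedforward=embed_dim * 4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            x = self.proj(x)                       # B x C x H/p x W/p
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)  # B x (h*w) x C
            tokens = tokens + self.pos             # combine with position embedding
            tokens = self.encoder(tokens)
            return tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

    # Usage: a 352x352 RGB image yields an 88x88 feature map in the first stage.
    stage1 = PVTStyleStage(in_ch=3, embed_dim=64, patch=4, img_size=352)
    x1 = stage1(torch.randn(1, 3, 352, 352))
    print(x1.shape)  # torch.Size([1, 64, 88, 88])

Each later stage would take the previous stage's map as input with a smaller patch stride, which is how the progressively shrinking pyramid keeps computation manageable.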
2.3. Boundary-aware Knowledge Modeling

The main task of the BAG [11] is to extract enough local detail to handle blurred boundaries. We argue that equipping the transformer with boundary information also gives it more power to address lesions with ambiguous boundaries. At the end of each transformer encoder layer, we add a BAG to enhance the transformed features. In addition, the BAG has a key-patch map generator that takes the refined features as input and outputs a binary patch-wise attention map, in which a value of 1 means that the associated patch lies on the fuzzy border, similar to a classic spatial attention gate. Thanks to this architecture, the BAG learns robust feature representations of fuzzy borders in a variety of ways, which is critical for segmenting lesions with fuzzy boundaries.
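As a rough sketch of the idea (not the BAG implementation from [11]), a key-patch map generator can be realized as a small head that predicts, for every patch, whether it lies on a fuzzy boundary; the resulting map then gates the patch features like a spatial attention gate, and its logits can be supervised by a boundary patch map. The class name, layer choices, and the sigmoid gating below are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class BoundaryAttentionGate(nn.Module):
        """Illustrative boundary-aware attention gate: a key-patch map generator
        predicts which patches lie on a fuzzy border, and the soft map re-weights
        the patch features, like a spatial attention gate."""
        def __init__(self, dim):
            super().__init__()
            self.key_patch_head = nn.Sequential(
                nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

        def forward(self, tokens):
            # tokens: B x N x C patch features from a transformer stage
            logits = self.key_patch_head(tokens)   # B x N x 1 boundary logits
            attn = torch.sigmoid(logits)           # soft patch-wise boundary map
            gated = tokens * (1.0 + attn)          # emphasize boundary patches
            return gated, logits                   # logits get boundary supervision

    # Usage with a binary boundary patch map as the supervision target:
    bag = BoundaryAttentionGate(dim=64)
    feats = torch.randn(2, 7744, 64)
    gated, logits = bag(feats)
    target = torch.randint(0, 2, (2, 7744, 1)).float()
    boundary_loss = nn.BCEWithLogitsLoss()(logits, target)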
2.4. Prediction with Multi-level Fusion

To fuse features of different levels, we employ three sub-modules: the cascaded fusion module (CFM), the camouflage identification module (CIM), and the similarity aggregation module (SAM) [10]. The CFM extracts the semantic and location information of polyps from the high-level features x_2, x_3, and x_4, while the CIM captures information about polyps camouflaged in x_1. The SAM then extends the pixel-level features of the polyp region with high-level semantic location cues over the whole polyp region, effectively integrating the cross-level features. During the test phase, we utilize the Probability Correction Strategy (PCS) [8] to cope with small polyps whose foreground and background pixels are highly imbalanced. The primary idea of PCS is to explicitly adjust the predicted probabilities using logarithmic weighting, which can significantly increase the accuracy of the final prediction.

3. Experiments

3.1. Datasets and Evaluation Metrics

We employ the EndoCV2022 dataset [12, 13, 14] to conduct the experiments, which includes 46 video sequences with a total of 3390 images. Five-fold cross-validation is adopted for a fair and thorough comparison; the statistics of each fold are shown in Table 1.

Table 1: Detailed statistics of each fold in the training data.

fold | Train sequences | Train samples | Test sequences | Test samples
0    | 36              | 2571          | 10             | 719
1    | 37              | 2827          | 9              | 463
2    | 37              | 2684          | 9              | 606
3    | 37              | 2465          | 9              | 825
4    | 37              | 2613          | 9              | 677

The Dice score is used as the evaluation metric, which mainly focuses on the internal consistency of segmented objects:

    Dice(y_true, y_pred) ≜ (2 · Σ(y_true · y_pred) + 1e-15) / (Σ y_true + Σ y_pred + 1e-15)    (1)

We report the mean Dice score, where y_true denotes the label and y_pred the prediction. A smoothing factor is used to avoid collapse during training when the label is empty.
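For concreteness, Eq. (1) corresponds to the following small Python routine. This is our own restatement of the metric; the function name is illustrative, and the 1e-15 smoothing term matches the equation and guards against empty masks.

    import numpy as np

    def dice_score(y_true, y_pred, eps=1e-15):
        """Smoothed Dice score from Eq. (1); y_true and y_pred are binary masks."""
        y_true = np.asarray(y_true, dtype=np.float64)
        y_pred = np.asarray(y_pred, dtype=np.float64)
        inter = (y_true * y_pred).sum()
        return (2.0 * inter + eps) / (y_true.sum() + y_pred.sum() + eps)

    # Example: two identical masks give a Dice score of 1.0.
    mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1
    print(dice_score(mask, mask))  # 1.0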
3.2. Implementation Details

We implement BP-Trans in the PyTorch framework and use an NVIDIA GeForce RTX 3080 Ti GPU to accelerate computation. We adopt a multi-scale training strategy since each polyp has a unique shape and size. The other details are as follows. First, because image size varies from one sequence to the next, we resize the CE-transformed [8] images to a consistent size of 352 x 352. Each image is then flipped horizontally and vertically with probability 0.5, rotated randomly, and blurred with a Gaussian kernel with probability 0.1. We use the AdamW optimizer with an initial learning rate of 1e-4 to update the network parameters; the batch size is set to 8 and training runs for 120 epochs. To supervise model training, we employ a combination of IoU loss and binary cross-entropy with logits as the loss function.
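As an illustration of the training objective described above, a combination of binary cross-entropy with logits and a soft IoU loss can be written as follows. The equal weighting of the two terms, the smoothing constant, and the function name are our assumptions; only the choice of the two loss components comes from the text.

    import torch
    import torch.nn.functional as F

    def bce_iou_loss(logits, target, eps=1e-6):
        """Illustrative combination of BCE-with-logits and a soft IoU loss."""
        bce = F.binary_cross_entropy_with_logits(logits, target)
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(-2, -1))
        union = (prob + target - prob * target).sum(dim=(-2, -1))
        iou = 1.0 - (inter + eps) / (union + eps)   # soft IoU loss per image
        return bce + iou.mean()

    # Usage: logits and target are B x 1 x H x W tensors.
    logits = torch.randn(2, 1, 352, 352)
    target = torch.randint(0, 2, (2, 1, 352, 352)).float()
    print(bce_iou_loss(logits, target))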
3.3. Compared Models

We compare against several models, including PraNet [15], SANet [8], TransFuse [16], and HarDNet-MSEG [17]; the experimental results are shown in Table 2. Among these, PraNet [15] first predicts the rough regions and then implicitly models the boundaries through reverse attention; as a result, it delivers a significant improvement in performance over several traditional models. SANet [8] uses CE to decouple an image's color and content, and shallow attention to reduce data noise for tiny polyps that are difficult to segment, lessening the interference of irrelevant factors on the model. TransFuse [16] combines transformers and CNNs in a parallel style, so that both global dependencies and low-level spatial details can be captured efficiently in a much shallower manner, also achieving good results. In contrast to the above models, HarDNet-MSEG [17] uses a simple encoder-decoder architecture without any attention modules. Its backbone is a low-memory-traffic CNN paired with a decoder that offers high accuracy and fast inference; experiments show that it is 1.3 times faster than PraNet and more than 2 times faster than the other models.

We employ the same encoder-decoder structure as HarDNet-MSEG [17], but instead of the low-memory-traffic CNN backbone we use PVT [9], a versatile backbone for dense prediction, and combine it with the BAG [11] to increase the focus on boundary information. As can be seen in Table 2, the BP-Trans model is superior to the existing methods, demonstrating a better learning ability.

Table 2: Quantitative results of compared methods.

model                       | Dice   | Dice std
DeepLabv3+ [18]             | 0.3019 | 0.455
DeepLabv3+ + PointRend [19] | 0.4884 | 0.4347
PSPNet [20]                 | 0.2933 | 0.455
SANet [8]                   | 0.634  | 0.41
PraNet [15]                 | 0.665  | 0.39
TransFuse [16]              | 0.6755 | 0.3876
HarDNet-MSEG [17]           | 0.7247 | 0.3685
Polyp-PVT [10]              | 0.7321 | 0.3662
BP-Trans (ours)             | 0.7429 | 0.352

Figure 2: Visualized comparison of (a) Image; (b) Ground truth (GT); (c) Result of our model; (d-h) Results of compared methods.

3.4. Ablation Study

We explore the effectiveness of each component in detail.

Effectiveness of PVT: As shown in Table 2, Polyp-PVT [10] achieves the best performance on the training set among the compared models, with HarDNet-MSEG [17] following closely behind. Their combinations with temporal or aggregation modules are compared in Table 3; the experimental results show that PVT [9] outperforms HarDNet-MSEG [17] in all respects.

Effectiveness of BAG: Since the Polyp-PVT [10] model is relatively complex and therefore difficult to make faster than HarDNet-MSEG [17], we instead try to improve the accuracy of the model. Our attempts include combining PVT [9] with ConvLSTM [21] to exploit temporal information, but this does not improve the model's performance. We therefore start from boundary information and use the BAG [11] to extract more of it, which improves the model to a certain extent.

Table 3: Ablation study of the key components.

model                   | Dice   | Dice std
HarDNet-MSEG [17]       | 0.7247 | 0.3685
HarDNet-MSEG + ConvLSTM | 0.6143 | 0.382
PVT [9]                 | 0.7321 | 0.3662
PVT + ConvLSTM          | 0.7146 | 0.369
BP-Trans                | 0.7429 | 0.352

4. Conclusion

This paper introduces a boundary-aware pyramid transformer for polyp segmentation, which leverages abundant boundary knowledge and multi-scale information to boost segmentation performance. We conduct five-fold cross-validation to assess the performance, and the results consistently confirm the advantage of our method. Furthermore, our method achieved first place in the first and second rounds of the official evaluation. In the future, we will improve temporal modeling and generalization ability, and we hope that our findings may inspire new approaches to polyp segmentation.

References

[1] L. H. Biller, D. Schrag, Diagnosis and treatment of metastatic colorectal cancer: a review, JAMA 325 (2021) 669-685.
[2] B. C. Morson, Genesis of colorectal cancer, Clinics in Gastroenterology 5 (1976) 505-525.
[3] M. Akbari, M. Mohrekesh, E. Nasr-Esfahani, S. R. Soroushmehr, N. Karimi, S. Samavi, K. Najarian, Polyp segmentation in colonoscopy images using fully convolutional network, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 69-72.
[4] A. Mohammed, S. Yildirim, I. Farup, M. Pedersen, Ø. Hovde, Y-net: A deep convolutional neural network for polyp detection, arXiv preprint arXiv:1806.01907 (2018).
[5] N. B. Le Duy Huynh, A u-net++ with pre-trained efficientnet backbone for segmentation of diseases and artifacts in endoscopy images and videos, in: CEUR Workshop Proceedings, volume 2595, 2020, pp. 13-17.
[6] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, H. D. Johansen, Doubleu-net: A deep convolutional neural network for medical image segmentation, in: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2020, pp. 558-564.
[7] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, H. D. Johansen, Resunet++: An advanced architecture for medical image segmentation, in: 2019 IEEE International Symposium on Multimedia (ISM), IEEE, 2019, pp. 225-2255.
[8] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, S. Cui, Shallow attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 699-708.
[9] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568-578.
[10] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, L. Shao, Polyp-pvt: Polyp segmentation with pyramid vision transformers, arXiv preprint arXiv:2108.06932 (2021).
[11] J. Wang, L. Wei, L. Wang, Q. Zhou, L. Zhu, J. Qin, Boundary-aware transformers for skin lesion segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 206-216.
[12] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.
[13] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., Polypgen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.
[14] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.
[15] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, L. Shao, Pranet: Parallel reverse attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 263-273.
[16] Y. Zhang, H. Liu, Q. Hu, Transfuse: Fusing transformers and cnns for medical image segmentation, in: MICCAI, 2021.
[17] C.-H. Huang, H.-Y. Wu, Y.-L. S. Lin, Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps, arXiv abs/2101.07172 (2021).
[18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, arXiv:1802.02611 [cs] (2018).
[19] A. Kirillov, Y. Wu, K. He, R. Girshick, Pointrend: Image segmentation as rendering, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9796-9805. doi:10.1109/CVPR42600.2020.00982.
[20] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, arXiv:1612.01105 [cs] (2017).
[21] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional lstm network: A machine learning approach for precipitation nowcasting, Advances in Neural Information Processing Systems 28 (2015).