<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Boundary-aware Pyramid Transformer for Polyp Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiacheng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuxi Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruochen Mu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liansheng Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science at School of Informatics, Xiamen University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>According to the World Health Organization (WHO), the death rate from colorectal cancer (CRC) has been growing in recent years, and adenomatous polyps are its major cause. Early polyp diagnosis, for which colonoscopy is the gold standard, can help to lower the incidence of CRC. Polyp segmentation, however, is still a time-consuming and labor-intensive process. Although deep learning has recently made significant progress in automatic polyp segmentation, existing models have the following drawbacks: (i) small lesions are hard to detect because the pooling layers discard detailed context, and (ii) the boundaries are sometimes blurry and ambiguous, which makes them extremely hard to determine. In this paper, we propose to equip the Pyramid Vision Transformer with boundary-aware supervision. The resulting model, called BP-Trans, builds multi-scale feature maps for dense prediction tasks and attentive boundary knowledge for precise boundary segmentation. We perform five-fold cross-validation on the Endoscopic computer vision challenges 2.0, and the results on all metrics and folds consistently indicate the advantage of our method.</p>
      </abstract>
      <kwd-group>
        <kwd>Transformer</kwd>
        <kwd>Boundary-aware Supervision</kwd>
        <kwd>Polyp Segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Overall Architecture</title>
        <p>As illustrated in Figure 1, to minimize the influence of
background colors on model training, we first preprocess the input
images with the CE [8] module. The PVT encoder proposed by Wang et
al. [9] then performs the coarse feature extraction because of its
superior ability in multi-scale representation. The extraction yields
features at four different scales, of which the highest-level feature
(level 4) is sent into a Boundary-aware Attention Gate (BAG) [11] to
retrieve boundary information, resulting in a boundary-enhanced
feature. Finally, we assemble the features of the different levels,
send them into the prediction head, and predict the segmentation map.
During inference, we employ the PCS [8] module to correct the severe
pixel imbalance of tiny polyps.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Pyramid Feature Extraction</title>
        <sec id="sec-2-2-1">
          <p>PVT [9] introduces a pyramid structure into the transformer
framework to generate multi-scale feature maps for dense prediction
tasks. It contains four stages that generate feature maps at different
scales. All stages share a similar architecture, detailed as follows.
In the first stage, whose output feature map has size
H/4 × W/4 × C<sub>1</sub>, the input image of height H and width W is
divided into HW/4<sup>2</sup> patches. Each flattened patch is linearly
projected to obtain its embedding, which is combined with a position
embedding. The resulting sequence is then passed through the
transformer encoder of this stage, and the output is reshaped into a
feature map. PVT greatly reduces computation by using a progressively
shrinking pyramid to reduce the size of large feature maps.</p>
          <p>We remove the decoder layer and use the four multi-scale
feature maps (levels 1 to 4) generated by the different stages of the
PVT encoder. Among these feature maps, the level-1 feature is the
lowest-level one: it contains a lot of detail but also a lot of noise.
In comparison, the features of levels 2, 3, and 4 provide high-level
semantic cues.</p>
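          <p>To make the pyramid concrete, the following minimal sketch computes the token count and feature-map size of each stage for a given input size. It assumes the overall strides 4, 8, 16, and 32 of the original PVT design [9]. The function name pvt_stage_shapes is illustrative and not part of any library, and channel widths are omitted because they depend on the PVT variant.</p>
          <preformat><![CDATA[
```python
def pvt_stage_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return (stage, tokens, feature_height, feature_width) per stage.

    Assumption: four stages with overall strides 4, 8, 16, 32, and an
    input whose sides are divisible by the largest (last) stride.
    """
    if height % strides[-1] or width % strides[-1]:
        raise ValueError("input sides must be divisible by the largest stride")
    shapes = []
    for stage, stride in enumerate(strides, start=1):
        feat_h, feat_w = height // stride, width // stride
        # One token per patch, so the token count equals the map area.
        shapes.append((stage, feat_h * feat_w, feat_h, feat_w))
    return shapes
```
]]></preformat>
          <p>For an input of 352 × 352 pixels, the sketch yields feature maps of 88 × 88, 44 × 44, 22 × 22, and 11 × 11, which shows how quickly the sequence length shrinks from stage to stage.</p>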
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Boundary-aware Knowledge Modeling</title>
        <p>The main task of BAG [11] is to extract enough local
detail to handle blurred boundaries. We argue that equipping the
transformer with boundary information also makes it better at
addressing lesions with ambiguous boundaries. At the end of each
transformer encoder layer, we add a BAG to enhance the transformed
features. In addition, the BAG has a key-patch map generator that
takes the transformed features as input and outputs a binary
patch-wise attention map, in which the value 1 means that the
associated patch lies on the ambiguous boundary, similar to a classic
spatial attention gate. Thanks to this architecture, BAG learns robust
feature representations of ambiguous boundaries, which is critical for
segmenting lesions with such boundaries.</p>
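        <p>As a rough illustration of what a binary patch-wise map looks like, the sketch below derives one from a ground-truth mask. It assumes that a patch counts as lying on the boundary when it contains both lesion and background pixels. This criterion and the name key_patch_map are assumptions made for illustration and may differ from the rule used in BAG [11].</p>
        <preformat><![CDATA[
```python
def key_patch_map(mask, patch):
    """Mark each patch x patch cell with 1 if it straddles the lesion border.

    mask is a list of rows of 0/1 labels whose height and width are
    multiples of `patch`. A cell is treated as a border cell when it
    contains both classes (an illustrative criterion, see the lead-in).
    """
    rows, cols = len(mask), len(mask[0])
    if rows % patch or cols % patch:
        raise ValueError("mask sides must be multiples of the patch size")
    out = []
    for top in range(0, rows, patch):
        out_row = []
        for left in range(0, cols, patch):
            # Collect the distinct labels that occur inside this cell.
            values = {mask[r][c]
                      for r in range(top, top + patch)
                      for c in range(left, left + patch)}
            out_row.append(1 if len(values) == 2 else 0)
        out.append(out_row)
    return out
```
]]></preformat>
        <p>Cells that are entirely lesion or entirely background receive 0, so only the patches crossed by the contour are highlighted.</p>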
      </sec>
      <sec id="sec-2-5">
        <title>2.4. Prediction with Multi-level Fusion</title>
        <p>To fuse features of different levels, we employ three
sub-modules: the cascaded fusion module (CFM), the camouflage
identification module (CIM), and the similarity aggregation module
(SAM) [10]. CFM extracts the semantic and location information of
polyps from the high-level features of levels 2, 3, and 4, while CIM
captures the information about polyps camouflaged in the level-1
feature. SAM extends the pixel features of the polyp region with the
improved semantic location information of the whole polyp region,
effectively integrating the cross-layer features. During the test
phase, we use the probability correction strategy (PCS) [8] module to
cope with small polyps whose foreground and background pixels are
severely imbalanced. The primary idea of PCS is to explicitly adjust
the predicted probability using logarithmic weighting. This module can
significantly increase the accuracy of the final prediction.</p>
        <p>We implement BP-Trans in PyTorch and use an NVIDIA GeForce RTX
3080 Ti to accelerate the computations. Since polyps vary in shape and
size, we adopt a multi-scale strategy in the training phase. The other
details are as follows. First, because the image size varies from one
sequence to the next, we resize the images processed by CE [8] to a
consistent size of 352 × 352. Each image is then flipped horizontally
and vertically with probability 0.5, rotated randomly, and blurred
with a Gaussian blur with probability 0.1. We use the AdamW optimizer
with an initial learning rate of 1e-4 to update the network
parameters. The batch size is set to 8, and training runs for a total
of 120 epochs. To supervise model training, we employ a combination of
IoU loss and binary cross-entropy with logits as the loss function.</p>
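        <p>A loss that combines IoU with binary cross-entropy on logits can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the two terms are summed with equal weight, the IoU term is a soft IoU with a smoothing constant of 1, and all function names are hypothetical. The actual training code presumably operates on PyTorch tensors and may weight the terms differently.</p>
        <preformat><![CDATA[
```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    total = 0.0
    for x, t in zip(logits, targets):
        # Stable form: max(x, 0) - x * t + log(1 + exp(-|x|)).
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def soft_iou_loss(logits, targets, smooth=1.0):
    """One minus the soft IoU between sigmoid probabilities and targets."""
    probs = [sigmoid(x) for x in logits]
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(probs) + sum(targets) - inter
    return 1.0 - (inter + smooth) / (union + smooth)


def segmentation_loss(logits, targets):
    """Equal-weight sum of both terms (the weighting is an assumption)."""
    return bce_with_logits(logits, targets) + soft_iou_loss(logits, targets)
```
]]></preformat>
        <p>Working on logits instead of probabilities keeps the cross-entropy term finite even for very confident predictions, while the IoU term is less sensitive to the imbalance between foreground and background pixels.</p>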
      </sec>
      <sec id="sec-2-6">
        <title>3. Experiments</title>
        <sec id="sec-2-6-1">
          <title>3.1. Datasets and Evaluation Metrics</title>
          <p>We employ the EndoCV2022 dataset [12][13][14] to conduct the experiments, which includes 46 video sequences with a total of 3390 images. Five-fold cross-validation is adopted for a fair and thorough comparison, and the statistics of each fold are shown in Table 1. The Dice score, which mainly focuses on the internal consistency of segmented objects, is used as the final evaluation metric:</p>
          <disp-formula>
            <label>(1)</label>
            <tex-math>\mathrm{Dice}(y_{\mathrm{true}}, y_{\mathrm{pred}}) \triangleq \frac{2 \sum (y_{\mathrm{true}}\, y_{\mathrm{pred}}) + 10^{-15}}{\sum y_{\mathrm{true}} + \sum y_{\mathrm{pred}} + 10^{-15}}</tex-math>
          </disp-formula>
          <p>Here, y<sub>true</sub> stands for the label and y<sub>pred</sub> stands for the prediction. A smoothing factor is used to avoid training collapse when empty labels are encountered. We report the mean value of the Dice score.</p>
          <table-wrap>
            <label>Table 1</label>
            <caption>
              <p>Detailed statistics of each fold in the training data.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Fold</th><th>Train sequences</th><th>Train samples</th><th>Test sequences</th><th>Test samples</th></tr>
              </thead>
              <tbody>
                <tr><td>0</td><td>36</td><td>2571</td><td>10</td><td>719</td></tr>
                <tr><td>1</td><td>37</td><td>2827</td><td>9</td><td>463</td></tr>
                <tr><td>2</td><td>37</td><td>2684</td><td>9</td><td>606</td></tr>
                <tr><td>3</td><td>37</td><td>2465</td><td>9</td><td>825</td></tr>
                <tr><td>4</td><td>37</td><td>2613</td><td>9</td><td>677</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-2-6-2">
          <title>3.3. Compared Models</title>
          <p>We compare against several models, including PraNet [15], SANet [8], TransFuse [16], and HarDNet-MSEG [17]. The experimental results are shown in Table 2. Among these, PraNet [15] first predicts the rough areas and then implicitly models the boundaries by reverse attention. As a result, it delivers a significant improvement in performance over certain traditional models. SANet [8] proposes CE to decouple the color and content of the image, and shallow attention to suppress the noise affecting small polyps that are difficult to separate, which reduces the interference of irrelevant factors with the model. TransFuse [16] combines Transformers and CNNs in a parallel style, where both global dependency and low-level spatial details can be efficiently captured in a much shallower manner, and it also achieves good results. In contrast to the above models, HarDNet-MSEG [17] uses a simple encoder-decoder architecture without any attention modules. The backbone of HarDNet-MSEG [17] is a low-memory-traffic CNN, paired with a decoder that offers outstanding accuracy and fast inference. Experimental results show that it is 1.3 times faster than PraNet and more than 2 times faster than the other models.</p>
          <p>We employ the same encoder-decoder structure as HarDNet-MSEG [17], but instead of a low-memory-traffic CNN we use PVT [9], a versatile backbone for dense prediction, and combine it with BAG [11] to increase the focus on boundary information. As can be seen in Table 2, the BP-Trans model is superior to the current methods, demonstrating that it has a better learning ability.</p>
        </sec>
      </sec>
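      <p>The Dice score of Eq. (1), used as the evaluation metric, can be sketched as follows for flattened masks. The function names are illustrative, and the smoothing factor of 1e-15 follows the equation. With it, an empty label paired with an empty prediction scores 1 instead of causing a division by zero.</p>
      <preformat><![CDATA[
```python
def dice_score(y_true, y_pred, eps=1e-15):
    """Dice score for flat sequences of 0/1 (or soft) values."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction must have the same length")
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2.0 * intersection + eps) / (sum(y_true) + sum(y_pred) + eps)


def mean_dice(pairs):
    """Average the per-image Dice score over (label, prediction) pairs."""
    scores = [dice_score(t, p) for t, p in pairs]
    return sum(scores) / len(scores)
```
]]></preformat>
      <p>Averaging per image, as in mean_dice, gives every frame the same weight regardless of polyp size. Pooling all pixels before computing the score would instead favor frames with large polyps.</p>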
      <sec id="sec-2-7">
        <title>3.4. Ablation Study</title>
        <sec id="sec-2-7-1">
          <p>We explore the effectiveness of each component in detail.</p>
          <p>Effectiveness of PVT: As shown in Table 2, compared to</p>
        </sec>
        <sec id="sec-2-7-2">
          <p>Effectiveness of BAG: Since the Polyp-PVT [10]
model is relatively complex, it is difficult to surpass
HarDNet-MSEG [17] in terms of speed, so we instead try to
improve the accuracy of the model. Our attempts include
combining PVT [9] with ConvLSTM [21] to exploit temporal
information and further improve the model. However, this
attempt did not improve the model's performance. Finally, we
decided to start from the boundary information and use BAG [11]
to capture more of it. As a result, the model improved to a
certain extent.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>Compared models: HarDNet-MSEG [17], HarDNet-MSEG + ConvLSTM, PVT [9], PVT + ConvLSTM, BP-Trans.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Biller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schrag</surname>
          </string-name>
          ,
          <article-title>Diagnosis and treatment of metastatic colorectal cancer: a review</article-title>
          ,
          <source>JAMA</source>
          <volume>325</volume>
          (<year>2021</year>)
          <fpage>669</fpage>–<lpage>685</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] B. C. Morson, Genesis of colorectal cancer, Clinics in Gastroenterology 5 (1976) 505–525.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Akbari, M. Mohrekesh, E. Nasr-Esfahani, S. R. Soroushmehr, N. Karimi, S. Samavi, K. Najarian, Polyp segmentation in colonoscopy images using fully convolutional network, in: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, 2018, pp. 69–72.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Mohammed, S. Yildirim, I. Farup, M. Pedersen, Ø. Hovde, Y-Net: A deep convolutional neural network for polyp detection, arXiv preprint arXiv:1806.01907 (2018).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] N. B. Le Duy Huynh, A U-Net++ with pre-trained EfficientNet backbone for segmentation of diseases and artifacts in endoscopy images and videos, in: CEUR Workshop Proceedings, volume 2595, 2020, pp. 13–17.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, H. D. Johansen, DoubleU-Net: A deep convolutional neural network for medical image segmentation, in: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2020, pp. 558–564.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, H. D. Johansen, ResUNet++: An advanced architecture for medical image segmentation, in: 2019 IEEE International Symposium on Multimedia (ISM), IEEE, 2019, pp. 225–2255.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, S. Cui, Shallow attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 699–708.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, L. Shao, Polyp-PVT: Polyp segmentation with pyramid vision transformers, arXiv preprint arXiv:2108.06932 (2021).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Wang, L. Wei, L. Wang, Q. Zhou, L. Zhu, J. Qin, Boundary-aware transformers for skin lesion segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 206–216.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Ali, M. Dmitrieva, N. Ghatwary, S. Bano, G. Polat, A. Temizel, A. Krenzer, A. Hekalo, Y. B. Guo, B. Matuszewski, et al., Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy, Medical Image Analysis 70 (2021) 102002. doi:10.1016/j.media.2021.102002.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, K. V. Anonsen, M. A. Riegler, et al., PolypGen: A multi-center polyp detection and segmentation dataset for generalisability assessment, arXiv preprint arXiv:2106.04463 (2021). doi:10.48550/arXiv.2106.04463.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Ali, N. Ghatwary, D. Jha, E. Isik-Polat, G. Polat, C. Yang, W. Li, A. Galdran, M.-Á. G. Ballester, V. Thambawita, et al., Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge, arXiv preprint arXiv:2202.12031 (2022). doi:10.48550/arXiv.2202.12031.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, L. Shao, PraNet: Parallel reverse attention network for polyp segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 263–273.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Zhang, H. Liu, Q. Hu, TransFuse: Fusing transformers and CNNs for medical image segmentation, in: MICCAI, 2021.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] C.-H. Huang, H.-Y. Wu, Y.-L. S. Lin, HarDNet-MSEG: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps, arXiv abs/2101.07172 (2021).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, arXiv:1802.02611 [cs] (2018).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Kirillov, Y. Wu, K. He, R. Girshick, PointRend: Image segmentation as rendering, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9796–9805. doi:10.1109/CVPR42600.2020.00982.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, arXiv:1612.01105 [cs] (2017).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional LSTM network: A machine learning approach for precipitation nowcasting, Advances in Neural Information Processing Systems 28 (2015).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>