<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval 2021: PointRend with Attention Fusion Refinement for Polyps Segmentation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Medico task in MediaEval 2021 explores the challenge of building accurate and high-performance algorithms to detect all types of polyps in endoscopic images. This paper introduces our approach for the automatic segmentation of polyp images. We employ a ResNeXt encoder backbone with a UNet decoder. Further, adding PointRend and Attention Fusion Refinement to the network improves our segmentation performance. The experimental results show the efficiency of the proposed method, which achieves a Jaccard index of 0.7572, an accuracy of 0.9634, and a Dice score of 0.8326.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        The Medico: Transparency in Medical Image Segmentation 2021 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] task
aims to develop automatic systems for segmenting polyps in
images taken from endoscopies that are transparent and
explainable, reducing the chance that diagnosticians overlook a
polyp during a colonoscopy. A modified version of the
segmentation part of HyperKvasir [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is provided, with more than 1000 training
polyp images and their corresponding masks labeled by medical
experts, and 200 test polyp images, challenging the participants
to develop robust, transparent, and efficient algorithms for polyp
segmentation.
      </p>
      <p>
        In recent years, the task of automatic polyp segmentation using
deep learning-based methods [
        <xref ref-type="bibr" rid="ref1 ref3 ref4">1, 3, 4</xref>
        ] has achieved considerable
progress. In particular, the emergence of attention strategies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
effectively improves polyp detection and segmentation performance.
However, challenges remain, including the wide variety of
polyp appearances (size, texture, and color). Moreover, the boundary
between a polyp and its neighboring regions is usually blurred and hard
to segment.
      </p>
      <p>
        In this paper, we propose an accurate and real-time framework,
PointRend with Attention Fusion Refinement (PRAFNet), for
polyp segmentation. Fig. 1 shows an overview of our proposed
framework. PRAFNet utilizes the Attention Fusion Refinement
module to decode effective high-level semantic features, and the
PointRend [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] module to generate high-quality polyp segmentation
from the colonoscopy images. The following section will introduce
our approach and elaborate on the details of our network.
      </p>
    </sec>
    <sec id="sec-1a">
      <title>2 APPROACH</title>
      <p>
        Current popular medical image segmentation networks usually rely
on a U-Net architecture (e.g., U-Net [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], U-Net++ [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], ResUNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
etc.). These models are essentially encoder-decoder frameworks
that aggregate all extracted multi-level features with a simple
decoder, which does not effectively leverage these features. Woo
et al. introduce a Convolutional Block Attention Module (CBAM)
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which applies attention-based feature refinement with two
distinct modules, channel and spatial, to learn what and where to
emphasize or suppress, refining intermediate features effectively.
      </p>
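      <p>For concreteness, the following is a minimal PyTorch sketch of a CBAM block in the spirit of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; the class names and the reduction ratio of 16 are illustrative assumptions, not details of our implementation.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention: learn WHAT to emphasize."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)  # (B, C, 1, 1) channel weights


class SpatialAttention(nn.Module):
    """Spatial attention: learn WHERE to emphasize."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # channel-wise average
        mx, _ = x.max(dim=1, keepdim=True)  # channel-wise max
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)


class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)  # what to emphasize
        x = x * self.sa(x)  # where to emphasize
        return x
      </preformat>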
    </sec>
    <sec id="sec-1b">
      <title>2.1 Attention Fusion Refinement</title>
      <p>
        We propose an Attention Fusion Refinement (AFR) module to
better aggregate high-level features and focus on important regions,
combining high-level features with upsampled features by CBAM
as a core module. More specifically, for an input image, five levels
of features {f_i, i = 1, ..., 5} can be extracted from a ResNeXt [
        <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
        ]
backbone network. We introduce a new decoder component, AFR,
to aggregate the high-level features with upsampled features. As
shown in Fig. 1, an AFR module takes as input a high-level feature f_i together with
the previously upsampled feature u_{i+1}, and we obtain the upsampled
feature u_i.
      </p>
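      <p>A sketch of the AFR block under this description follows (it reuses the CBAM class from the sketch above); the exact layer ordering, the residual fusion block, and the channel widths are our assumptions for illustration, guided by the legend of Fig. 1.</p>
      <preformat>
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """A plain two-convolution residual block (an assumed design)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))


class AFR(nn.Module):
    """Fuse the skip feature f_i with the upsampled decoder feature u_{i+1}."""

    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.cbam = CBAM(enc_ch + dec_ch)  # CBAM sketch defined earlier
        self.fuse = ResidualBlock(enc_ch + dec_ch, out_ch)

    def forward(self, f_i, u_next):
        x = torch.cat([f_i, self.up(u_next)], dim=1)  # concatenate both inputs
        return self.fuse(self.cbam(x))                # upsampled feature u_i
      </preformat>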
    </sec>
    <sec id="sec-1c">
      <title>2.2 PointRend</title>
      <p>
        The U-Net [
        <xref ref-type="bibr" rid="ref13 ref9">9, 13</xref>
        ] model gives decent accuracy. However, it still has
some drawbacks, such as struggling to distinguish classes with very
similar features and failing to predict precise boundaries. We
have used the PointRend [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] module to address these drawbacks.
      </p>
      <p>PointRend constructs point-wise features at selected points by
concatenating two kinds of features: fine-grained features, to render fine
segmentation details, and coarse prediction features, to gain more contextual
and semantic information. We use the features f_2 as our fine-grained
features and select the top N = 3136 uncertain points in each
subdivision step. In general, the uncertain points are located near the
boundaries between classes, so they help refine the polyp's boundary
effectively. As shown in Fig. 1, we use two subdivision steps of
PointRend to obtain the final segmentation, which is the same size
as the input image. We plot the uncertain points used in PointRend
as blue dots in the coarse predictions p_2 and p_1.</p>
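      <p>The following sketch illustrates the uncertain-point selection at the core of PointRend [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for the binary case; N = 3136 matches the number of points we select per subdivision step, while the function name and the surrounding code are illustrative.</p>
      <preformat>
import torch


def select_uncertain_points(coarse_logits, n_points=3136):
    """Return flat indices of the n_points most uncertain pixels per image.

    coarse_logits: (B, 1, H, W) logits of the coarse binary prediction.
    """
    b, _, h, w = coarse_logits.shape
    probs = torch.sigmoid(coarse_logits).view(b, h * w)
    # For a binary mask, uncertainty is highest where the foreground
    # probability is closest to 0.5, typically along the polyp boundary.
    uncertainty = -(probs - 0.5).abs()
    _, idx = uncertainty.topk(n_points, dim=1)  # (B, n_points)
    return idx
      </preformat>
      <p>At each subdivision step, the coarse prediction is upsampled, the selected points are re-classified by a small point head fed with the fine-grained features and the coarse prediction at those points, and the refined values are written back.</p>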
    </sec>
    <sec id="sec-2">
      <title>2.3 Training strategy</title>
      <p>We apply the Bootstrapped Cross Entropy loss to prevent the
models from overfitting on simple pixels and force them to focus on
more challenging cases. With the Bootstrapped Cross Entropy, we
calculate the loss only over the top K percent of pixels with the largest
losses at each step of the training process. We also add a "warm-up"
period to the loss with K = 100 such that the network can learn to
adapt to the easy regions first, and then transition to the harder areas by
gradually decaying K to 15 in a polynomial manner.</p>
      <p>[Figure 1: Overview of the proposed PRAFNet framework. Legend: ResNeXt block; Residual Block; Max-Pooling; Up-sampling; Concat; Attention Fusion Refinement; CBAM (Convolutional Block Attention Module); PointRend.]</p>
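      <p>As an illustration, a minimal implementation of this loss together with a polynomial schedule for K might look as follows; the decay exponent is an assumed value, not our exact hyper-parameter.</p>
      <preformat>
import torch
import torch.nn.functional as F


def bootstrapped_bce(logits, target, k_percent):
    """Average the per-pixel loss over only the top K percent hardest pixels.

    logits, target: (B, 1, H, W); k_percent in (0, 100].
    """
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    per_pixel = per_pixel.flatten(1)                       # (B, H*W)
    k = max(1, int(per_pixel.shape[1] * k_percent / 100.0))
    hardest, _ = per_pixel.topk(k, dim=1)                  # largest losses only
    return hardest.mean()


def k_schedule(step, total_steps, k_start=100.0, k_end=15.0, power=0.9):
    """Polynomial decay of K from k_start (warm-up) down to k_end."""
    t = min(step / total_steps, 1.0)
    return k_end + (k_start - k_end) * (1.0 - t) ** power
      </preformat>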
    </sec>
    <sec id="sec-3">
      <title>3 RESULTS AND ANALYSIS</title>
      <p>
        We performed experiments on six different settings for the two tasks:
Method 1 uses the UNet with ResNeXt50 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] backbone as a baseline
model. Method 2 extends Method 1 with the PointRend. Method 3
extends Method 2 with the Attention Fusion Refinement. Method 4
uses ResNeXt101 as a backbone with the same settings as Method
3. Method 5 uses EfficientNetB6 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as a backbone with the same
settings as Method 3. Method 6 ensembles the results of Method 3,
Method 4, and Method 5 together.
      </p>
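      <p>For illustration, a simple pixel-wise averaging ensemble over the probability maps of the three methods is sketched below; the averaging-and-thresholding rule is an assumed choice, not necessarily the exact scheme we used.</p>
      <preformat>
import torch


def ensemble_masks(prob_maps, threshold=0.5):
    """prob_maps: list of (H, W) foreground probability maps, one per model."""
    mean_prob = torch.stack(prob_maps, dim=0).mean(dim=0)  # average the models
    return (mean_prob > threshold).float()                 # final binary mask
      </preformat>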
      <p>For task 1, we submit five runs, from Method 2 to Method 6. For
task 2, we submit two runs: the first run uses Method 4, and
the second run uses Method 1 for its lightweight architecture.</p>
      <p>Tables 1 and 2 show our results on task 1 and task 2, respectively.
Method 2 is slightly better than method 1 in all metrics, which
shows that PointRend helps improve the results. In method 3, we
use AFR, and the results also improve compared to method 2. With
a stronger backbone (ResNeXt101 instead of ResNeXt50) in method
4, the results improve further, with a Jaccard index of 0.7441. Method 5,
with an EfficientNetB6 backbone, is better than method 4 in several
metrics, except for precision. In method 6, we ensemble methods
3, 4, and 5 to achieve our best result in this task, with a Jaccard index
of 0.7572.</p>
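      <p>For reference, the metrics quoted above can be computed from a binary prediction and its ground-truth mask as in the following sketch (the challenge's official evaluation code may differ in details).</p>
      <preformat>
import torch


def segmentation_metrics(pred, gt, eps=1e-7):
    """pred, gt: binary (H, W) tensors with values in {0, 1}."""
    pred, gt = pred.bool(), gt.bool()
    inter = torch.logical_and(pred, gt).sum().float()
    union = torch.logical_or(pred, gt).sum().float()
    jaccard = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + gt.sum() + eps)
    accuracy = (pred == gt).float().mean()
    return jaccard.item(), dice.item(), accuracy.item()
      </preformat>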
      <p>In task 2, although our method 1 is 1.5 times faster than method 2,
method 2 has higher accuracy while still achieving real-time efficiency (about 48 FPS).</p>
    </sec>
    <sec id="sec-4">
      <title>4 CONCLUSION</title>
      <p>This paper presents a fast and accurate method for automatic polyp
segmentation. The proposed methods use an encoder-decoder
architecture: ResNeXt is used as an encoder backbone with a UNet
decoder. Further, PointRend and Attention Fusion Refinement are
applied to improve the segmentation result. PointRend helps refine
the uncertain points, especially in the boundary regions. The
Attention Fusion Refinement enhances the fusion between
high-level features and upsampled features in the decoder. In the future,
we plan to apply better architectures such as ResUNet++ or PraNet
to our work and further improve the results.</p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was funded by Gia Lam Urban Development and
Investment Company Limited, Vingroup and supported by Vingroup
Innovation Foundation (VINIF) under project code VINIF.2019.DA19.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Mojtaba Akbari, Majid Mohrekesh, Ebrahim Nasr Esfahani, S.M. Reza Soroushmehr, Nader Karimi, Shadrokh Samavi, and Kayvan Najarian. 2018. Polyp Segmentation in Colonoscopy Images Using Fully Convolutional Network. In 2018 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 69-72. https://doi.org/10.1109/EMBC.2018.8512197</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Hanna Borgli, Vajira Thambawita, Pia H. Smedsrud, Steven Hicks, Debesh Jha, Sigrun L. Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, Dag Johansen, Carsten Griwodz, Håkon K. Stensland, Enrique Garcia-Ceja, Peter T. Schmidt, Hugo L. Hammer, Michael A. Riegler, Pål Halvorsen, and Thomas de Lange. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific Data 7, 1 (2020), 283. https://doi.org/10.1038/s41597-020-00622-y</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. 2020. PraNet: Parallel Reverse Attention Network for Polyp Segmentation. (2020). arXiv:eess.IV/2006.11392</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Yuqi Fang, Cheng Chen, Yixuan Yuan, and Kai-yu Tong. 2019. Selective Feature Aggregation Network with Area-Boundary Constraints for Polyp Segmentation. 302-310. https://doi.org/10.1007/978-3-030-32239-7_34</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. (2015). arXiv:cs.CV/1512.03385</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Steven Hicks, Debesh Jha, Vajira Thambawita, Hugo Hammer, Thomas de Lange, Sravanthi Parasa, Michael Riegler, and Pål Halvorsen. 2021. Medico Multimedia Task at MediaEval 2021: Transparency in Medical Image Segmentation. In Proceedings of MediaEval 2021 CEUR Workshop.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Dag Johansen, Thomas De Lange, Pål Halvorsen, and Håvard D. Johansen. 2019. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In 2019 IEEE International Symposium on Multimedia (ISM), 225-2255. https://doi.org/10.1109/ISM46123.2019.00049</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. 2020. PointRend: Image Segmentation as Rendering. (2020). arXiv:cs.CV/1912.08193</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. (2015). arXiv:cs.CV/1505.04597</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Mingxing Tan and Quoc V. Le. 2020. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. (2020). arXiv:cs.LG/1905.11946</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional Block Attention Module. (2018). arXiv:cs.CV/1807.06521</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated Residual Transformations for Deep Neural Networks. (2017). arXiv:cs.CV/1611.05431</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. 2018. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. (2018). arXiv:cs.CV/1807.10165</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>