<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval2021: Polyps Segmentation using TransFuse with Focal Tversky Loss</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nhat-Khang Ngo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuan-Luc Huynh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thanh-Danh Le</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Medico task at MediaEval 2021 aims at developing accurate and high-performance techniques for automatic medical image segmentation. In this work, we describe an approach for tackling Tasks 1 and 2 of the challenge. We retrain TransFuse, a state-of-the-art model in medical image segmentation, with the focal Tversky loss function to segment the polyp regions in endoscopic images. The approach focuses on computational efficiency while also producing high-quality segmentation results. In evaluation, our method achieves reasonable results for both efficiency and accuracy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Medical image segmentation has become more common in recent
years, thanks to important advances in artificial intelligence. Work
in this area mainly focuses on helping experts diagnose life-threatening
cancers by detecting and segmenting polyps in medical images at an early stage.
However, automatic polyp segmentation is challenging due to the
diversity of polyp shapes and positions. Numerous studies leverage
the representation power of deep learning to capture the many
variations of polyps in endoscopic images. The MediaEval 2021 task
Transparency in Medical Image Segmentation calls for researchers
to investigate methods for polyp segmentation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        This paper presents an approach that can efficiently segment
the polyp regions in endoscopic images. We train TransFuse [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a state-of-the-art model in medical image
segmentation, from scratch with a generalized focal Tversky loss function [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
TransFuse combines vision transformers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
convolutional neural networks in a parallel manner [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While the former
learn to model the relations between regions in the images, the
latter extract the local details of these regions. Because the two branches
execute in parallel, TransFuse improves time efficiency
in the inference phase. To combine both kinds of information, Zhang et al.
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] propose the BiFusion module, which consists of several attention
modules and convolution blocks. In addition, Kvasir-seg [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the
given dataset, is small, with only 1360 samples. The dataset
also contains many hard samples in which the polyps are large
and have unusual locations and shapes. To address this problem,
we train TransFuse with the focal Tversky loss function. We train the
models with various hyperparameter settings to assess the efficacy
and failure cases of this approach.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Self-attention is a key mechanism in deep learning. It
enables models to capture the global context between objects
in data. In medical image segmentation, self-attention is used to
model the relationships between regions in the images. Schlemper et
al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] integrate Attention Gates into U-Net to suppress irrelevant
areas and emphasize salient features. To further handle the
global context, Chen et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed TransUNet, in which the
encoders of U-Net are replaced by the encoders of Vision
Transformers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Petit et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed a U-Net architecture featuring
self-attention and cross-attention between the encoder and decoder.
While the preceding methods combine self-attention and CNNs
sequentially, Zhang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] combine them in a parallel manner.
This kind of combination can mitigate the loss of local details in
deep CNNs and reduce the inference time.
      </p>
      <p>
        As illustrated in Figure 1, TransFuse includes three branches:
Transformer, CNN, and BiFusion. The Transformer branch makes use
of the Vision Transformer architecture, in which an image is
split into patches that are embedded and passed through a stack of multi-head
self-attention and multi-layer perceptron modules. The result is
reshaped into several feature maps, which are kept for later fusion.
Simultaneously, the CNN branch downsamples the image into feature
maps of the same size as the corresponding ones in the
Transformer branch. The outputs of the two parallel branches are fused in
the BiFusion module, which contains spatial attention,
channel attention, and residual blocks to perform multi-modal fusion
and self-attention [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Finally, the fused output is upsampled to obtain
the segmentation result. In addition, deep supervision is applied at
the output of the Transformer branch and the final BiFusion module.
In our experiments, we use TransFuse-S proposed by Zhang et al.
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
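      <p>As a rough illustration of the patch-splitting step that precedes the Transformer branch, the following Python sketch cuts a 2D image into non-overlapping square patches. The function name split_into_patches and the default patch size of 16 are assumptions made for illustration and are not taken from the TransFuse implementation. The learned linear embedding applied to each patch is omitted.</p>
      <preformat><![CDATA[
```python
def split_into_patches(image, patch_size=16):
    """Split a 2D image (a list of equally long rows) into flattened,
    non-overlapping square patches, ordered row by row.

    Hypothetical helper for illustration only.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single token.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches
```
]]></preformat>
      <p>Each returned list would correspond to one input token of the Transformer branch before embedding.</p>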
    </sec>
    <sec id="sec-3">
      <title>Focal Tversky Loss</title>
      <p>
        The Tversky score extends the Dice score by flexibly adjusting the
weights of false positive and false negative cases among the classes
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Equation 1 shows how to calculate the Tversky score TI from the
true positives (TP), false negatives (FN), and false positives (FP). In the
equation, α is a hyperparameter that we can tune during
training. High values of α enhance the recall rate in highly imbalanced
datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Consequently, wider polyp regions can be detected in
images. Additionally, ε is a constant that stabilizes the score.
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[TI = \frac{TP + \epsilon}{TP + \alpha\, FN + (1 - \alpha)\, FP + \epsilon}]]></tex-math>
        </disp-formula>
The Tversky loss TL equals 1 − TI. To tackle hard samples, Abraham et
al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] adapt the loss function to a focal version. The loss is written
as FTL = (1 − TI)<sup>1/γ</sup>, where γ ∈ [1, 3] is a hyperparameter. When a
sample with a high Tversky score still has a high number of erroneous predictions,
i.e., FP and FN, the loss decreases dramatically. By using α &gt; 0.5
and γ &gt; 1, the function focuses on misclassified samples.
As a result, the model can widen the segmented polyp regions.
      </p>
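      <p>The focal Tversky loss described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. It assumes flattened per-pixel foreground probabilities and binary ground-truth labels, forms soft counts of true positives, false negatives, and false positives, and weights false negatives by alpha and false positives by 1 − alpha. The function names and the default value of the smoothing constant eps are assumptions.</p>
      <preformat><![CDATA[
```python
def tversky_index(pred, target, alpha=0.7, eps=1e-6):
    """Soft Tversky index for one class.

    pred:   sequence of foreground probabilities in [0, 1]
    target: sequence of ground-truth labels (0 or 1), same length
    alpha:  weight of false negatives; (1 - alpha) weights false positives
    eps:    smoothing constant (assumed default)
    """
    tp = sum(p * g for p, g in zip(pred, target))
    fn = sum((1.0 - p) * g for p, g in zip(pred, target))
    fp = sum(p * (1.0 - g) for p, g in zip(pred, target))
    return (tp + eps) / (tp + alpha * fn + (1.0 - alpha) * fp + eps)


def focal_tversky_loss(pred, target, alpha=0.7, gamma=4.0 / 3.0, eps=1e-6):
    """Focal Tversky loss: (1 - TI) raised to the power 1 / gamma."""
    ti = tversky_index(pred, target, alpha, eps)
    # Clamp at zero so rounding can never produce a negative base.
    return max(0.0, 1.0 - ti) ** (1.0 / gamma)
```
]]></preformat>
      <p>With alpha = 0.5 and eps = 0, tversky_index reduces to the Dice score. Raising alpha above 0.5 increases the penalty for missed polyp pixels relative to spurious ones.</p>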
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTS AND RESULTS</title>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>
        We train TransFuse-S with the focal Tversky loss, varying α across
five runs. In the first four runs, we split the dataset into training
and validation sets with a ratio of 8:2, whereas we train the
model on all samples in Run 5. We use four values of α:
0.3, 0.4, 0.6, and 0.7. In Run 1 and Run 5, α equals 0.7, while α
equals 0.6, 0.4, and 0.3 in Runs 2, 3, and 4, respectively. It is worth
noting that when α = 0.5, the Tversky score reduces to the Dice score.
Thus, we do not use 0.5 in our experiments. In addition, we fix the
value of γ to 4/3, which is reported to be the most effective in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We
use Adam [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to optimize the loss function with a learning rate of
1e−4 and a batch size of 16. Additionally, because we
use deep supervision, there are three losses L<sub>1</sub>, L<sub>2</sub>, and L<sub>3</sub> with the
corresponding weights λ<sub>1</sub> = 0.5, λ<sub>2</sub> = 0.2, and λ<sub>3</sub> = 0.3. Thus,
the final loss L equals 0.5L<sub>1</sub> + 0.2L<sub>2</sub> + 0.3L<sub>3</sub>.
      </p>
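      <p>The experimental settings in this section can be summarized in a short Python sketch. It records the value of alpha for each run, the fixed value of gamma, the learning rate, and the batch size, and it combines three deep-supervision losses with the weights 0.5, 0.2, and 0.3. All identifiers, the random seed, and the shuffling strategy of the 8:2 split are assumptions. The text does not state which network output each of the three losses belongs to, so the sketch treats them as an ordered triple.</p>
      <preformat><![CDATA[
```python
import random

RUN_ALPHAS = {1: 0.7, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.7}  # alpha per run
GAMMA = 4.0 / 3.0            # focal exponent parameter, fixed for all runs
LEARNING_RATE = 1e-4         # used with the Adam optimizer
BATCH_SIZE = 16
LOSS_WEIGHTS = (0.5, 0.2, 0.3)  # weights of the three supervised losses


def combined_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the deep-supervision losses."""
    if len(losses) != len(weights):
        raise ValueError("expected one loss per weight")
    return sum(w * value for w, value in zip(weights, losses))


def split_indices(num_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and validation.

    The seed and the shuffling are illustrative assumptions.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(num_samples * train_fraction))
    return indices[:cut], indices[cut:]
```
]]></preformat>
      <p>For the 1360 samples of the dataset, split_indices(1360) yields 1088 training and 272 validation indices.</p>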
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>Table 1 displays the outcomes of our submissions from Run 1 to
Run 5 in the challenge’s Task 1. Accuracy, Jaccard score, Dice score,
F1-score, recall, and precision are the six metrics used to assess
predictions. In Run 2, where α = 0.6, we attain the highest
Jaccard score of 0.6780. This run also produces the highest Dice
score of 0.7756. All runs have higher recall than precision,
which reflects our approach’s emphasis on reducing false negative
predictions. The greatest recall and accuracy we achieve are 0.8584
and 0.8208, respectively. Table 1 further shows that the accuracy
values of the five runs are nearly the same. In this section, we
additionally present the inference time for Task 2. Table 2 shows the
average inference time and frame rate, as well as the Jaccard score,
recall, and precision of Run 1 in Task 2. On average, the model
makes one prediction in 0.0132 seconds. Besides fast inference,
our technique produces accurate results, with a Jaccard score
of 0.6692, a high recall of 0.8586, and a precision of 0.7572.
Furthermore, Figure 2 depicts the efficacy and failure of focusing
on enhancing the recall rate on the dataset. We paint the predicted polyp
regions green to see whether the borders of these
regions are suitable. The first image demonstrates that high recall
is acceptable, whereas the green area in the second image extends beyond
the polyp region.</p>
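      <p>For reference, the pixel-wise metrics named in this section can be computed from a predicted and a ground-truth binary mask as in the following Python sketch. It is an illustration under stated assumptions and not the official evaluation script of the challenge: masks are flattened sequences of 0 and 1, a ratio with an empty denominator is reported as 0.0, and all function names are hypothetical. For a single binary mask, the F1-score coincides with the Dice score. The helper frames_per_second converts an average inference time in seconds into a frame rate.</p>
      <preformat><![CDATA[
```python
def segmentation_metrics(pred, target):
    """Pixel-wise metrics for two flattened binary masks of equal length."""
    tp = sum(1 for p, g in zip(pred, target) if p and g)
    fp = sum(1 for p, g in zip(pred, target) if p and not g)
    fn = sum(1 for p, g in zip(pred, target) if not p and g)
    tn = sum(1 for p, g in zip(pred, target) if not p and not g)

    def ratio(numerator, denominator):
        # Convention assumed here: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "jaccard": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


def frames_per_second(avg_seconds_per_frame):
    """Frame rate implied by an average per-frame inference time."""
    return 1.0 / avg_seconds_per_frame
```
]]></preformat>
      <p>An average time of 0.0132 seconds corresponds to roughly 75.76 frames per second, which is close to the reported frame rate of 75.7629. The small gap may come from rounding of the reported time.</p>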
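      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Average inference time, frame rate, Jaccard score, recall, and precision of Run 1 in Task 2.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run ID</th>
              <th>Avg-time (s)</th>
              <th>Avg-fps</th>
              <th>Jacc</th>
              <th>Rec</th>
              <th>Prec</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Run 1</td>
              <td>0.0132</td>
              <td>75.7629</td>
              <td>0.6692</td>
              <td>0.8586</td>
              <td>0.7572</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>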
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was funded by Gia Lam Urban Development and
Investment Company Limited, Vingroup and supported by Vingroup
Innovation Foundation (VINIF) under project code VINIF.2019.DA19.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nabila</given-names>
            <surname>Abraham</surname>
          </string-name>
          and Naimul Mefraz Khan.
          <year>2019</year>
          .
          <article-title>A novel focal tversky loss function with improved attention u-net for lesion segmentation</article-title>
          .
          <source>In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019)</source>
          . IEEE,
          <fpage>683</fpage>
          -
          <lpage>687</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hanna</given-names>
            <surname>Borgli</surname>
          </string-name>
          , Vajira Thambawita, Pia H Smedsrud, Steven Hicks, Debesh Jha, Sigrun L Eskeland, Kristin Ranheim Randel, Konstantin Pogorelov, Mathias Lux, Duc Tien Dang Nguyen, Dag Johansen, Carsten Griwodz, Håkon K Stensland,
          Enrique Garcia-Ceja, Peter T Schmidt, Hugo L Hammer, Michael A Riegler, Pål Halvorsen, and Thomas de Lange.
          <year>2020</year>
          .
          <article-title>HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy</article-title>
          .
          <source>Scientific Data</source>
          <volume>7</volume>
          ,
          <issue>1</issue>
          (
          <year>2020</year>
          ),
          <elocation-id>283</elocation-id>
          . https://doi.org/10.1038/s41597-020-00622-y
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jieneng</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and
          <string-name>
            <given-names>Yuyin</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Transunet: Transformers make strong encoders for medical image segmentation</article-title>
          .
          <source>arXiv preprint arXiv:2102.04306</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          , Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and others.
          <year>2020</year>
          .
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          .
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Hicks</surname>
          </string-name>
          , Debesh Jha, Vajira Thambawita, Hugo Hammer, Thomas de Lange, Sravanthi Parasa,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Riegler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Medico Multimedia Task at MediaEval 2021: Transparency in Medical Image Segmentation</article-title>
          .
          <source>In Proceedings of MediaEval 2021 CEUR Workshop</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Diederik P</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Petit</surname>
          </string-name>
          , Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and
          <string-name>
            <given-names>Luc</given-names>
            <surname>Soler</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>U-net transformer: self and cross attention for medical image segmentation</article-title>
          .
          <source>In International Workshop on Machine Learning in Medical Imaging</source>
          . Springer,
          <fpage>267</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jo</given-names>
            <surname>Schlemper</surname>
          </string-name>
          , Ozan Oktay, Michiel Schaap, Mattias Heinrich, Bernhard Kainz, Ben Glocker, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Rueckert</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Attention gated networks: Learning to leverage salient regions in medical images</article-title>
          .
          <source>Medical Image Analysis</source>
          <volume>53</volume>
          (
          <year>2019</year>
          ),
          <fpage>197</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yundong</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Huiye Liu, and
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Transfuse: Fusing transformers and cnns for medical image segmentation</article-title>
          .
          <source>arXiv preprint arXiv:2102.08005</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>