<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SwinLS: Adapting Swin Transformer to Landslide Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dong Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Zang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zining Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dou Quan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Artificial Intelligence, Xidian University</institution>,
          <addr-line>Xi'an, 710071</addr-line>,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Accurate detection of landslides plays an important role in post-disaster search and rescue operations. In this paper, we propose SwinLS for efficient landslide detection in remote sensing images using the swin transformer model. We explore how to efficiently utilize the self-attention mechanism in the swin transformer for landslide detection from two aspects: the first is spectral selection and data augmentation, and the second is reducing imbalance interference. With these improvements, the performance of the swin transformer model is greatly improved, which provides a preliminary exploration of applying visual transformer models to remote sensing landslide detection tasks and even anomaly detection tasks. Finally, the proposed SwinLS achieved 2nd place on the test leaderboard with a 73.99% F1 score, differing from the 1st place score of 74.54% by only 0.55%.</p>
      </abstract>
      <kwd-group>
        <kwd>Landslide detection</kwd>
        <kwd>remote sensing</kwd>
        <kwd>swin transformer</kwd>
        <kwd>multispectral imagery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Landslides have become more frequent due to drastic climate change, surface activity, and accidents, threatening the lives and properties of residents in the affected areas. Accurate detection of landslides plays an important role in post-disaster search and rescue operations. As an efficient and convenient solution, automatic interpretation of landslide areas from remote sensing images has received extensive attention from scholars [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To advance this research, Ghorbanzadeh and Xu et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] released a large-scale landslide detection dataset with pixel-level labels, named Landslide4Sense, and established a related benchmark.
      </p>
      <p>
        The Landslide4Sense dataset contains multi-spectral imagery from multiple regions and cities collected by Sentinel-2 satellites. The data format is pixel blocks of size 128 with 14 spectral bands including RGB, VEG (Vegetation Red Edge), NIR, WV (Water vapour), and SWIR. This dataset is finely marked by experts to pinpoint the location of each landslide. In the Landslide4Sense benchmark, Ghorbanzadeh and Xu et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] tried a series of classic convolution-based semantic segmentation models, such as ResUNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PSPNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], ContextNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DeepLab [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], treating landslide detection as a binary supervised pixel-level classification task. Among these models, they found through experiments that ResUNet achieved the best validation performance on landslide detection tasks, owing to its reasonable utilization of multi-scale features.
      </p>
      <p>
        Nonetheless, we believe that this is not enough, because two important issues of landslide detection are ignored. The first is the spatial correlation of landslide data, and the second is the imbalance problem in landslide detection. For the former, we were motivated by the observation that the spectra after the collapse of the slopes often exhibit strong similarities. For the latter, we are inspired by the category statistics in Ghorbanzadeh and Xu's paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], showing that the proportion of landslides is much smaller than that of non-landslides, which is in line with the anomaly detection problem. To address these issues, we introduce the swin transformer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] model to capture the relationship between landslide regions and design a training strategy for it to solve the imbalance problem.
      </p>
      <p>
        The swin transformer is a recently proposed vision transformer model that has demonstrated strong performance on numerous tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The key technology enabling this model is the self-attention mechanism, which aggregates spatial relationships to extract semantic features. However, it is not a good idea to directly apply this model to multi-spectral remote sensing data for landslide detection, such as the Landslide4Sense dataset, because not all spectral bands in multispectral imagery contain target information. The useless spectra will introduce massive noise into the feature aggregation process of the self-attention mechanism. Therefore, we first performed spectral selection experiments to determine which spectra are suitable for self-attention based feature aggregation. Finally, we use the RGB spectrum to train the swin transformer model. To complement it, we use CUTMIX [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and random rotation data augmentation to prevent overfitting of the larger capacity model.
      </p>
      <p>CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria. † Shuang Wang is the corresponding author. zhaodong01@stu.xidian.edu.cn (D. Zhao); qzang@stu.xidian.edu.cn (Q. Zang); 21171213901@stu.xidian.edu.cn (Z. Wang); quandou@xidian.edu.cn (D. Quan); shwang@mail.xidian.edu.cn (S. Wang); https://github.com/DZhaoXd (D. Zhao). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
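The CutMix-style augmentation mentioned above can be sketched as follows. This is a hedged toy implementation: the function name, the box-sampling details, and the fixed mixing ratio are illustrative simplifications, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cutmix(img_a, lbl_a, img_b, lbl_b, lam=0.5):
    """Paste a random box from sample B into sample A; labels follow pixels."""
    h, w = img_a.shape[:2]
    # Box area covers (1 - lam) of the image, as in common CutMix variants.
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y = rng.integers(0, h - cut_h + 1)
    x = rng.integers(0, w - cut_w + 1)
    out_img, out_lbl = img_a.copy(), lbl_a.copy()
    out_img[y:y + cut_h, x:x + cut_w] = img_b[y:y + cut_h, x:x + cut_w]
    out_lbl[y:y + cut_h, x:x + cut_w] = lbl_b[y:y + cut_h, x:x + cut_w]
    return out_img, out_lbl

# Two toy 128x128 RGB patches with per-pixel labels (0 = background, 1 = landslide).
a, la = np.zeros((128, 128, 3)), np.zeros((128, 128))
b, lb = np.ones((128, 128, 3)), np.ones((128, 128))
img, lbl = cutmix(a, la, b, lb)
```

For segmentation, mixing the label map together with the pixels (rather than mixing scalar class labels) keeps the supervision pixel-aligned.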
      <p>To solve the imbalance problem in landslide detection, we design a two-stage balanced training strategy to make the model better focus on the foreground (landslide) category. In the first stage, we train the feature extractor and classifier with a weighted cross-entropy loss to obtain a better feature representation. In the second stage, we fix the feature extractor and fine-tune the classifier with the ordinary cross-entropy loss to weaken the bias of the classifier. This strategy mitigates the misleading of the classifier caused by the imbalance between the landslide and non-landslide classes.</p>
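As a rough illustration of this schedule (not the paper's implementation), a toy logistic pixel classifier can stand in for the swin encoder and decoder; all names and the data here are our own constructions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Imbalanced toy data: rare positives (~2-3%) mimic the landslide class.
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 2.2).astype(float)

feat_w = rng.normal(scale=0.1, size=(4, 4))   # stand-in "encoder"
clf_w = np.zeros(4)                           # stand-in "classifier"

def forward(X):
    return sigmoid(np.maximum(X @ feat_w, 0) @ clf_w)

def grad_step(X, y, weights, lr, update_encoder):
    global feat_w, clf_w
    h = np.maximum(X @ feat_w, 0)             # encoder features (ReLU)
    p = sigmoid(h @ clf_w)
    err = weights * (p - y)                   # weighted cross-entropy gradient
    clf_w -= lr * h.T @ err / len(y)
    if update_encoder:
        dh = np.outer(err, clf_w) * (h > 0)   # backprop through ReLU
        feat_w -= lr * X.T @ dh / len(y)

# Stage 1: train encoder + classifier with class-weighted cross entropy.
w_pos = (y == 0).sum() / max((y == 1).sum(), 1)
weights = np.where(y == 1, w_pos, 1.0)
for _ in range(300):
    grad_step(X, y, weights, lr=0.5, update_encoder=True)

# Stage 2: freeze the encoder, fine-tune only the classifier with plain CE.
for _ in range(100):
    grad_step(X, y, np.ones_like(y), lr=0.5, update_encoder=False)

pred = (forward(X) > 0.5).astype(float)
```

The point of stage 2 is that the representation learned under re-weighting is kept, while the decision boundary is recalibrated without the re-weighting bias.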
      <p>Finally, the proposed method, called SwinLS, achieved 2nd place on the test leaderboard with a 73.99% F1 score, differing from the 1st place score of 74.54% by only 0.55%.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>As shown in Figure 1, SwinLS is a network with an encoder-decoder structure, and there are skip links between the codecs. Its encoder $E$ is composed of the base structure of the swin transformer, which has a powerful feature representation capability. Its decoder $D$ uses a convolutional structure for decoding and fusing multi-level features (at 2x, 4x, 8x and 16x scales) for output. The $\mathcal{L}_{img}$ loss is an image-level loss applied to the high-level semantic features of the encoder to assist training, which is defined as follows,</p>
      <p>
        $\mathcal{L}_{img} = -\frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathbb{I}(x) \log F_{cls}(E(x))$, (2)
where $\mathbb{I}$ is an indicator function: when there is a pixel standing for a positive sample (landslide) in $x$, its value is 1; otherwise it is 0. $F_{cls}(\cdot)$ is a fully connected layer with a global pooling operation, and $\mathcal{D}$ stands for the total data set. The $\mathcal{L}_{wce}$ loss is defined as follows,
        $\mathcal{L}_{wce} = -\frac{1}{|x|} \sum_{i \in x} \frac{N_n}{N_p} \, y_i \log D(E(x))_i$, (3)
where $N_n$ stands for the number of negative samples (non-landslides) and $N_p$ stands for the number of positive samples (landslides) in any input image $x$. As mentioned in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], this re-weighting method can play a positive role in balancing the feature distribution of positive and negative samples. However, the classifier will still be biased. Therefore, in the second stage, we fix the trained encoder $E$ and use the standard cross-entropy loss $\mathcal{L}_{ce}$ to train the decoder $D$,
      </p>
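A minimal numeric sketch of the image-level auxiliary term in Eq. (2), assuming $F_{cls}$ is approximated by global average pooling followed by a linear head with a sigmoid, and using a binary log-loss form; the function and variable names are our own, not from the paper.

```python
import numpy as np

def image_level_loss(feat, label_map, w, eps=1e-7):
    """feat: (C, H, W) encoder features; label_map: (H, W) binary landslide mask."""
    has_landslide = float(label_map.any())      # indicator I(x) from Eq. (2)
    pooled = feat.mean(axis=(1, 2))             # global average pooling over H, W
    p = 1.0 / (1.0 + np.exp(-pooled @ w))       # linear head + sigmoid, stand-in F_cls
    return -(has_landslide * np.log(p + eps)
             + (1.0 - has_landslide) * np.log(1.0 - p + eps))

feat = np.random.default_rng(0).normal(size=(8, 16, 16))
w = np.zeros(8)
mask = np.zeros((16, 16)); mask[2:4, 2:4] = 1.0
loss_pos = image_level_loss(feat, mask, w)              # image contains landslide pixels
loss_neg = image_level_loss(feat, np.zeros((16, 16)), w)  # background-only image
```

The label map is reduced to a single binary image label ("any landslide pixel present?"), which is the auxiliary supervision the encoder receives.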
      <p>$\arg\min_{D} \mathcal{L}_{ce} + \mathcal{L}_{lov}$. (4)</p>
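The per-image re-weighting of Eq. (3) can be checked numerically. In this sketch (our own names; background pixels receive a standard unweighted term, a common implementation choice), positive pixels are up-weighted by $N_n / N_p$ computed from each label map.

```python
import numpy as np

def weighted_ce(prob, label, eps=1e-7):
    """prob, label: (H, W) arrays; label is 1 on landslide pixels."""
    n_pos = label.sum()
    n_neg = label.size - n_pos
    w = n_neg / max(n_pos, 1)                   # N_n / N_p from Eq. (3)
    pixel_loss = -(w * label * np.log(prob + eps)
                   + (1 - label) * np.log(1 - prob + eps))
    return pixel_loss.mean()

label = np.zeros((8, 8)); label[3:5, 3:5] = 1   # 4 landslide pixels, 60 background
confident = np.where(label == 1, 0.9, 0.1)      # prediction matching the mask
uniform = np.full((8, 8), 0.5)                  # uninformative prediction
```

With only 4 positives among 64 pixels, each landslide pixel contributes 15x the loss of a background pixel, so the rare class is not drowned out during stage-1 training.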
    </sec>
    <sec id="sec-3">
      <title>3. Experiment</title>
      <sec id="sec-3-1">
        <p>In this section, we show the performance of the proposed methods. Due to the limited number of submissions of test data in the final stage, the results reported in our ablation experiments are all measured on the validation set.</p>
        <p>Figure 1: Network structure diagram.</p>
        <p>Spectral selection. Since the Transformer model needs the self-attention mechanism to perform feature aggregation and extract high-level semantic features, irrelevant spectral information that occupies the dominant information will have a significant impact on the performance of the swin transformer. To this end, we perform a set of experiments verifying the effect of different spectral inputs, as shown in Table 1.</p>
        <p>
          For the self-attention mechanism in the swin transformer to work better in landslide detection, we performed spectral selection experiments (see Table 1 and Figure 1). Finally, we selected the RGB spectrum from the multi-spectral input into the model. To alleviate the foreground and background imbalance in landslide detection, we design a two-stage training strategy. In the first stage, the codecs are trained simultaneously. For any input sample $x \in \mathbb{R}^{w \times h \times 3}$, we use the weighted cross-entropy loss $\mathcal{L}_{wce}$ and the Lovasz loss $\mathcal{L}_{lov}$ [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for balanced training as follows,
          $\arg\min_{E,D} \mathcal{L}_{wce} + \mathcal{L}_{lov} + \mathcal{L}_{img}$. (1)
        </p>
        <p>In Table 1, we discovered an interesting phenomenon. With the increase of spectral bands, the performance of the fully convolutional models, such as DeepLabv3 and UNet, shows a gradually increasing trend, while the performance of the swin transformer is severely degraded. We find that this is because the dimensionality enhancement in the fully convolutional model may attenuate the negative effects of irrelevant channels. The swin transformer, on the other hand, uses the dot product to perform the self-attention mechanism. When the spectral content unrelated to the landslide dominates, the attention is seriously dissipated, which makes the aggregated features noisy.</p>
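The dissipation effect can be illustrated with a toy dot-product attention experiment (our own construction, not the paper's code): tokens drawn from two "spectrally similar" groups attend almost exclusively within their own group when only informative channels are present, and that same-group attention mass drops once irrelevant noisy channels are appended.

```python
import numpy as np

rng = np.random.default_rng(1)

def same_group_attention_mass(tokens, labels):
    """Average softmax attention mass each token sends to its own group."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)     # scaled dot-product attention
    np.fill_diagonal(scores, -np.inf)           # ignore self-attention
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    return float((a * same).sum(axis=1).mean())

labels = np.array([0] * 8 + [1] * 8)
centers = np.array([[2.0, 2.0, 2.0], [-2.0, -2.0, -2.0]])
signal = centers[labels] + 0.3 * rng.normal(size=(16, 3))  # 3 informative "bands"
noise = 2.0 * rng.normal(size=(16, 11))                    # 11 irrelevant "bands"
clean = same_group_attention_mass(signal, labels)
diluted = same_group_attention_mass(np.hstack([signal, noise]), labels)
```

With the noise channels appended, the informative dot products are diluted by both the noise terms and the larger $\sqrt{d}$ scaling, so attention leaks across groups.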
        <p>In addition, the swin transformer model has a large capacity and is more prone to memorization and loss of generalization on such simple data. To this end, we designed data augmentation experiments to verify the transformation methods for landslide detection using only RGB spectral information, as shown in Table 2. We also add the UNet model that uses all bands for comparison.</p>
        <p>
          The Lovasz loss [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] directly optimizes the IoU coefficients, which is efficient, and it is used as a first-stage loss for our balanced training. Balanced training achieves the best performance, which further corrects the bias of the classifier. Finally, balanced training improves the F1 score by
        </p>
      </sec>
      <sec id="sec-3-2">
        <p>4.1% on the basis of the baseline. The results of this strategy are visualized in Figure 4.</p>
        <p>Table 4 (Precision (%) / Recall (%) / F1 (%) per row): 73.4/74.7/73.9; 65.2/80.5/72.7; 69.3/79.5/73.7; 72.4/77.1/74.9; 78.2/74.2/76.1.</p>
        <p>In Table 4, we found that when the selection threshold is small, the accuracy rate after self-training degrades seriously, but the recall rate improves significantly. This is because when the threshold is small, the selected landslide area is only located in the center of the landslide, and the pixels in the surrounding area are ignored due to low confidence. This makes the self-trained model tend to predict all surrounding similar blocks as landslides, resulting in increased over-detection of landslides. As the selected landslide area continues to increase, the accuracy of the model continues to rise, and the recall rate begins to decline. This shows that the addition of many inaccurate pseudo-labels plays a strong role in preventing over-detection, and the model can learn more knowledge about the samples to be tested from the noisy training data, which increases the accuracy. We visualize pseudo-labels with different threshold values.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is in part supported by the Key Research and Development Program of Shaanxi (Program No. 2021ZDLGY0106), the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY01-12) and the National Key R&amp;D Program of China under Grant No. 2021ZD0110404.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gawlikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>An advanced dirichlet prior network for out-of-distribution detection in remote sensing</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>60</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ghorbanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ghamisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kreil</surname>
          </string-name>
          ,
          <article-title>Landslide4sense: Reference benchmark data and deep learning models for landslide detection</article-title>
          ,
          <source>arXiv preprint arXiv:2206.00515</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. I.</given-names>
            <surname>Diakogiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Waldner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Caccetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data</article-title>
          ,
          <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>
          <volume>162</volume>
          (
          <year>2020</year>
          )
          <fpage>94</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Pyramid scene parsing network</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2881</fpage>
          -
          <lpage>2890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Poudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Bonde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liwicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zach</surname>
          </string-name>
          ,
          <article-title>Contextnet: Exploring context and detail for semantic segmentation in real-time</article-title>
          ,
          <source>arXiv preprint arXiv:1805.04554</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kokkinos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <article-title>Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>40</volume>
          (
          <year>2017</year>
          )
          <fpage>834</fpage>
          -
          <lpage>848</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>French</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mackiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Finlayson</surname>
          </string-name>
          ,
          <article-title>Semi-supervised semantic segmentation needs strong, high-dimensional perturbations</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Berman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Triki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Blaschko</surname>
          </string-name>
          ,
          <article-title>The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4413</fpage>
          -
          <lpage>4421</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9719</fpage>
          -
          <lpage>9728</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Kak</surname>
          </string-name>
          ,
          <article-title>PCA versus LDA</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>23</volume>
          (
          <year>2001</year>
          )
          <fpage>228</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>DeVries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          ,
          <source>arXiv preprint arXiv:1708.04552</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>