=Paper= {{Paper |id=Vol-3207/paper14 |storemode=property |title=SwinLS: Adapting Swin Transformer to Landslide Detection |pdfUrl=https://ceur-ws.org/Vol-3207/paper14.pdf |volume=Vol-3207 |authors=Dong Zhao,Qi Zang,Zining Wang,Dou Quan,Shuang Wang |dblpUrl=https://dblp.org/rec/conf/cdceo/ZhaoZWQW22 }} ==SwinLS: Adapting Swin Transformer to Landslide Detection== https://ceur-ws.org/Vol-3207/paper14.pdf
SwinLS: Adapting Swin Transformer to Landslide
Detection
Dong Zhao1, Qi Zang1, Zining Wang1, Dou Quan1 and Shuang Wang1,†
1 School of Artificial Intelligence, Xidian University, Xian, 710071, China.


                                       Abstract
Accurate detection of landslides plays an important role in post-disaster search and rescue operations. In this paper, we propose SwinLS for efficient landslide detection in remote sensing images using the swin transformer model. We explore how to efficiently utilize the self-attention mechanism in swin transformer for landslide detection from two aspects. The first is spectral selection and data augmentation. The second is reducing the interference caused by class imbalance. With these changes, the performance of the swin transformer model improves substantially, which provides a preliminary exploration of applying vision transformer models to remote sensing landslide detection and even anomaly detection tasks. Finally, the proposed SwinLS achieved 2nd place on the test leaderboard with a 73.99% F1 score, only 0.55 percentage points behind the 74.54% of the 1st place.

                                       Keywords
                                       Landslide detection, remote sensing, swin transformer, multispectral imagery



CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
† Shuang Wang is the corresponding author.
Email: zhaodong01@stu.xidian.edu.cn (D. Zhao); qzang@stu.xidian.edu.cn (Q. Zang); 21171213901@stu.xidian.edu.cn (Z. Wang); quandou@xidian.edu.cn (D. Quan); shwang@mail.xidian.edu.cn (S. Wang)
https://github.com/DZhaoXd (D. Zhao)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Landslides have become more frequent due to drastic climate change, surface activity, and accidents, threatening the lives and properties of residents in the affected areas. Accurate detection of landslides plays an important role in post-disaster search and rescue operations. As an efficient and convenient solution, automatic interpretation of landslide areas from remote sensing images has received extensive attention from scholars [1]. To advance this research, Ghorbanzadeh and Xu et al. [2] released a large-scale landslide detection dataset with pixel-level labels, named Landslide4Sense, and established a related benchmark.

The Landslide4Sense dataset contains multi-spectral imagery from multiple regions and cities collected by Sentinel-2 satellites. The data come as pixel blocks of size 128 × 128 with 14 spectral bands, including RGB, VEG (Vegetation Red Edge), NIR, WV (Water vapour), and SWIR. The dataset is finely annotated by experts to pinpoint the location of the landslides. In the Landslide4Sense benchmark, Ghorbanzadeh and Xu et al. [2] tried a series of classic convolution-based semantic segmentation models, such as ResUNet [3], PSPNet [4], ContextNet [5] and DeepLab [6], treating landslide detection as a binary supervised pixel-level classification task. Among these models, they found through experiments that ResUNet achieved the best validation performance on landslide detection, which is due to its reasonable utilization of multi-scale features.

Nonetheless, we believe that this is not enough, because two important issues of landslide detection are ignored. The first is the spatial correlation of landslide data and the second is the class imbalance in landslide detection. For the former, we were motivated by the observation that the spectra of slopes after collapse often exhibit strong similarities. For the latter, we are inspired by the category statistics in Ghorbanzadeh and Xu's paper [2], which show that the proportion of landslide pixels is much smaller than that of non-landslide pixels, in line with an anomaly detection problem. To address these issues, we introduce the swin transformer [7] model to capture the relationship between landslide regions and design a training strategy for it to solve the imbalance problem.

The swin transformer is a recently proposed vision transformer model that has demonstrated strong performance on numerous tasks [7]. The key technology enabling this model is the self-attention mechanism, which aggregates spatial relationships to extract semantic features. However, directly applying this model to multi-spectral remote sensing data for landslide detection, such as the Landslide4Sense dataset, is not a good approach, because not all spectral bands in multispectral data contain target information. The uninformative bands introduce massive noise into the feature aggregation process of the self-attention mechanism. Therefore, we first performed spectral selection experiments to determine which bands are suitable for self-attention based feature aggregation. Finally, we use the RGB spectrum to
train the swin transformer model. To complement it, we use CutMix [8] and random rotation data augmentation to prevent the larger-capacity model from overfitting.

To address the imbalance problem in landslide detection, we design a two-stage balanced training strategy that makes the model focus better on the foreground (landslide) category. In the first stage, we train the feature extractor and the classifier with a weighted cross-entropy loss to obtain a better feature representation. In the second stage, we fix the feature extractor and fine-tune the classifier with the ordinary cross-entropy loss to weaken the bias of the classifier. This strategy mitigates the bias that the imbalance between the landslide and non-landslide classes induces in the classifier.

Finally, the proposed method, called SwinLS, achieved 2nd place on the test leaderboard with a 73.99% F1 score, only 0.55 percentage points behind the 74.54% of the 1st place.

2. Methods

As shown in Figure 1, SwinLS is an encoder-decoder network with skip connections between the encoder and the decoder. Its encoder $E$ is built on the base structure of swin transformer, which has a powerful feature representation capability. Its decoder $D$ uses a convolutional structure to decode and fuse multi-level features for output.

[Figure 1 labels: Input RGB image; Encoder (Swin transformer; Stage 1: Training, Stage 2: Fixed); scales 2x, 4x, 8x, 16x; Decoder (Multi-scale Fusion; Stage 1: Training, Stage 2: Training); Pixel-level loss; Image-level loss (Landslide? No: 0, Yes: 1); Model structure]
Figure 1: Network structure diagram.

For the self-attention mechanism in swin transformer to work better in landslide detection, we performed spectral selection experiments (see Table 1 and Figure 2). Finally, we selected the RGB bands of the multispectral data as the input to the model. To alleviate the foreground-background imbalance in landslide detection, we design a two-stage training strategy. In the first stage, the encoder and the decoder are trained jointly. For any input sample $x_i \in \mathbb{R}^{w \times h \times 3}$, we use the weighted cross-entropy loss $L_{wce}$, the Lovasz loss $L_{lov}$ [9] and the image-level loss $L_{ice}$ for balanced training as follows,

$$\arg\min_{E,D} \; L_{wce} + L_{lov} + L_{ice}. \tag{1}$$

The $L_{ice}$ loss is an image-level loss applied to the high-level semantic features of the encoder to assist training, which is defined as follows,

$$L_{ice} = -\frac{1}{|\mathcal{X}|} \sum_{x_i \in \mathcal{X}} \delta(y_i) \log \mathrm{MP}(E(x_i)), \tag{2}$$

where $\delta$ is an indicator function: its value is 1 when the label $y_i$ contains at least one positive (landslide) pixel, and 0 otherwise. $\mathrm{MP}(\cdot)$ is a fully connected layer with a global pooling operation. $\mathcal{X}$ stands for the whole data set. The $L_{wce}$ loss is defined as follows,

$$L_{wce} = -\frac{1}{|\mathcal{X}|} \sum_{x_i \in \mathcal{X}} \frac{N_{neg}}{N_{pos}} \, y_i \log D(E(x_i)), \tag{3}$$

where $N_{neg}$ stands for the number of negative samples (non-landslide) and $N_{pos}$ stands for the number of positive samples (landslide) in any input image $x$. As mentioned in [10], this re-weighting method helps balance the feature distribution of positive and negative samples. However, the classifier will still be biased. Therefore, in the second stage, we fix the trained encoder $E$ and use the standard cross-entropy loss $L_{ce}$ to train the decoder $D$,

$$\arg\min_{D} \; L_{ce} + L_{ice}. \tag{4}$$

3. Experiment

In this section, we evaluate each of the methods proposed above in turn. Because the number of test submissions in the final stage was limited, all results reported in our ablation experiments are measured on the validation set.

Spectral selection. The transformer model performs feature aggregation with the self-attention mechanism to extract high-level semantic features. If irrelevant spectral information dominates the input, it will have a significant impact on the performance of swin transformer. To this end, we perform a set of experiments verifying the effect of different spectral inputs, as shown in Table 1.

In Table 1, we discovered an interesting phenomenon. As the number of spectral bands increases, the performance of the fully convolutional models, such as Deeplabv3 and Unet, shows a gradually increasing trend, while the performance of swin transformer is severely degraded. We attribute this to the dimensionality expansion in the fully convolutional models, which may attenuate the negative effects of irrelevant channels. Swin transformer, on the other hand, uses the dot product to perform the self-attention mechanism. When spectral content unrelated to the landslide dominates, the attention is seriously dissipated, which makes the aggregated features
contain a lot of noise and are less discriminative. Through the above experiments, we selected the RGB bands as the input of swin transformer. Moreover, Figure 2 visualizes how the attention of swin transformer dissipates as the number of spectral bands increases, which further supports the above conclusion.

Table 1
Spectral selection experiments. In this table, RGB denotes the red, green, and blue bands. SWIR denotes the 3-band short-wave infrared in Sentinel-2. NGB denotes the near-infrared, green, and blue bands. NIR denotes the near-infrared band. PCA refers to the use of a dimensionality reduction technique [11] to compress the original 14 bands into 3 bands. The encoder of the Unet model is replaced by resnet-32 and the encoder of the Deeplab model is also resnet-32. The encoder of swin transformer is swin-B. The metrics reported in the table are F1 scores.

 Input spectral bands   Number of bands   Swin   Deeplabv3   Unet
 RGB                    3                 65.6   58.0        59.2
 SWIR                   3                 55.6   50.2        52.1
 NGB                    3                 60.8   59.2        58.9
 PCA [11]               3                 49.5   46.8        52.4
 RGB + NIR              4                 63.3   57.2        59.4
 RGB + SWIR             6                 58.2   55.9        59.8
 RGB + NIR + SWIR       7                 54.8   57.5        60.0
 All bands              14                55.8   57.8        61.1

[Figure 2 panels: a. Image; b. RGB; c. RGB+SWIR; d. RGB+NIR+SWIR; e. Ground Truth]
Figure 2: Visualization of the feature activation maps of the swin transformer for different input spectral bands. We show the features from the last layer of the swin transformer model on the training set. The redder the feature activation map, the greater the response.

In addition, Table 1 also shows that the swin transformer without any enhancements gives a very good baseline performance once the spectrum is properly selected. Therefore, our subsequent implementations build on this strong baseline model to further improve landslide detection performance.

Data augmentation. When landslide detection uses only the RGB bands, the data pattern is relatively simple, which increases the risk of overfitting. In addition, the swin transformer model has a large capacity, so it more easily memorizes such simple data and loses generalization. To this end, we designed data augmentation experiments to identify suitable transformations for landslide detection using only RGB information, as shown in Table 2. We also include the Unet model that uses all bands for comparison.

Table 2
Data augmentation experiments. For both models, we randomly flip the input data as the baseline. The metrics reported in the table are F1 scores.

 Transformation      Swin transformer   Unet
 None (baseline)     65.6               61.1
 color enhancement   62.1               60.3
 cutout [12]         65.9               62.1
 cutmix [8]          66.0               62.6
 rotate and shift    69.8               63.7

Table 2 shows that random color augmentation clearly degrades the performance of swin transformer (65.6 to 62.1), while its effect on the Unet model is small (61.1 to 60.3). Our analysis is that the RGB samples to be tested are also collected from mountainous areas, where the color space is not rich, so color enhancement does not lead to useful generalization. The purpose of the two strategies cutout and cutmix is to disrupt the spatial layout of images so that the model can learn robust representations, and both slightly improve the performance of the two models. For swin transformer, the most effective augmentation is to rotate and translate the data, which directly improves the F1 score by 4.2 points (65.6 to 69.8). This augmentation increases the difficulty of capturing the relationship between landslides, which is very effective for the swin transformer model. For Unet, although this method is also effective, the overall improvement is smaller than that of swin transformer. Overall, after rotation and translation augmentation, the F1 score of swin transformer is 6.1 points higher than that of Unet.

Balanced training. We tried several methods for balanced training and compared their effectiveness, as shown in Table 3. Among them, normal training is a one-stage training method using the cross-entropy loss. For the weighted cross-entropy loss, we use the ratio of negative to positive samples, $N_{neg}/N_{pos}$, as the loss weighting coefficient, as in Eq. (3). Weighting the positive and negative pixels in this way brings some improvement, but the improvement is relatively limited. Focal loss [13] balances easy and hard samples by modifying their gradients during back propagation, and is also used in many imbalanced scenarios. On this task, however, the performance degrades when this loss is added. Our analysis is that it has a great influence on the gradient, and inappropriate hyperparameters greatly affect the performance. Lovasz loss [9] is a loss that directly optimizes the IoU coefficient, which is efficient and is used as the first-stage loss in our balanced training. Balanced training achieves the best performance, as it further corrects the bias of the classifier. Finally, balanced training improves the F1 score by 4.1 points over the baseline. The results of this strategy are visualized in Figure 4.

[Figure 3 panels: a. Test Image; b. λ = 50%; c. λ = 70%; d. λ = 90%; e. λ = 100%]
Figure 3: Visualization of pseudo labels when different lambda values are selected. In the pseudo labels, black represents class 0 (non-landslide), red represents class 1 (landslide), and white represents ignored pixels.

Table 3
Balanced training experiments. We use swin transformer with data augmentation and normal training (only using cross-entropy loss) as the baseline model. The metrics reported in the table are F1 scores.

 Training                 Swin transformer   Unet
 Normal training          69.8               63.7
 Weighted cross entropy   70.8               64.9
 Focal loss [13]          68.2               61.8
 Lovasz loss [9]          72.3               66.4
 Balanced training        73.9               67.7
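
The re-weighting idea behind the weighted cross-entropy row of Table 3 can be illustrated with a small sketch in plain Python. It is not the authors' implementation: the binary form with an unweighted non-landslide term, the fallback weight of 1.0 for images without landslide pixels, and the function names `weighted_bce`, `plain_bce` and `image_label` are assumptions made for illustration.

```python
import math


def weighted_bce(probs, labels, eps=1e-7):
    """Binary cross-entropy where landslide pixels are weighted by N_neg / N_pos.

    probs:  flat list of predicted landslide probabilities
    labels: flat list of 0/1 ground-truth labels (1 = landslide)
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Assumption: use weight 1.0 when the image contains no landslide pixel.
    pos_weight = n_neg / n_pos if n_pos > 0 else 1.0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total -= pos_weight * math.log(p)
        else:
            total -= math.log(1.0 - p)
    return total / len(labels)


def plain_bce(probs, labels, eps=1e-7):
    """Ordinary (unweighted) binary cross-entropy, as used in the second stage."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def image_label(labels):
    """Indicator used by the image-level loss: 1 if any pixel is landslide."""
    return 1 if any(y == 1 for y in labels) else 0


# One landslide pixel out of four: the positive term is weighted by 3.
labels = [1, 0, 0, 0]
probs = [0.5, 0.5, 0.5, 0.5]
print(weighted_bce(probs, labels), plain_bce(probs, labels), image_label(labels))
```

In this toy case the single landslide pixel contributes as much to the loss as the three background pixels together, which is the balancing effect the first training stage relies on.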

   Self-training. We also use self-training techniques to further improve the model performance, as shown in Table 4. We examine how pseudo-labels should be selected for landslide detection. We sorted the output probabilities predicted in the previous stage, selected the top λ% high-confidence pixel-level pseudo-labels and added them to the training data for self-training. The percentage λ of selected pixels is a hyperparameter that needs to be explored.
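
The top-λ% selection described above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes confidence is measured as max(p, 1 - p) for a predicted landslide probability p, that pixels are ranked jointly over both classes, and that unselected pixels receive a hypothetical ignore index of 255.

```python
IGNORE = 255  # hypothetical ignore index for pixels that are not selected


def select_pseudo_labels(probs, lam):
    """Keep the top `lam` fraction of pixels by confidence as pseudo-labels.

    probs: flat list of predicted landslide probabilities for one image
    lam:   fraction of pixels to keep, in (0, 1], e.g. 0.5 for 50%
    """
    # Assumption: confidence is the probability of the predicted class.
    conf = [max(p, 1.0 - p) for p in probs]
    n_keep = int(round(lam * len(probs)))
    # Pixel indices sorted from most to least confident.
    order = sorted(range(len(probs)), key=lambda i: conf[i], reverse=True)
    keep = set(order[:n_keep])
    pseudo = []
    for i, p in enumerate(probs):
        if i in keep:
            pseudo.append(1 if p >= 0.5 else 0)  # hard pseudo-label
        else:
            pseudo.append(IGNORE)  # excluded from the self-training loss
    return pseudo


probs = [0.95, 0.6, 0.1, 0.45]
print(select_pseudo_labels(probs, 0.5))  # [1, 255, 0, 255]
print(select_pseudo_labels(probs, 1.0))  # [1, 1, 0, 0]
```

With a small λ only the most confident pixels keep a label and the uncertain ones are ignored, while λ = 100% turns every prediction into a pseudo-label, including the noisy ones.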

Table 4
Self-training experiments with different λ values. ST denotes self-training.

 λ               Precision (%)   Recall (%)   F1 (%)
 - (Before ST)   73.4            74.7         73.9
 50%             65.2            80.5         72.7
 70%             69.3            79.5         73.7
 90%             72.4            77.1         74.9
 100%            78.2            74.2         76.1

[Figure 4 panels: a. Test Image; b. Normal Training; c. Balance Training; d. Self Training]
Figure 4: Visualization of model output after adding different strategies.

In Table 4, we found that when λ is small, the precision after self-training degrades seriously, but the recall improves significantly. This is because when λ is small, the selected landslide area is located only in the center of the landslide, and the pixels in the surrounding area are ignored due to low confidence. This makes the self-trained model tend to predict all surrounding similar blocks as landslides, resulting in increased over-detection of landslides. As the selected landslide area grows, the precision of the model keeps rising and the recall begins to decline. This shows that even though many inaccurate pseudo-labels are added, they play a strong role in preventing over-detection. The model can also learn more about the samples to be tested from the noisy training data, which increases the precision.

In practical applications, this parameter can be set according to the requirements. When we need to roughly find more areas that may be landslides, we choose a smaller λ. When we need to detect the landslide area more accurately, we choose a larger λ.

Furthermore, we visualize examples of the pseudo-labels selected with different λ values, as shown in Figure 3. We also visualize the output of the self-trained model in Figure 4, which further supports the above conclusion.

Acknowledgments

This work is supported in part by the Key Research and Development Program of Shaanxi (Program No. 2021ZDLGY01-06), the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY01-12) and the National Key R&D Program of China under Grant No. 2021ZD0110404.
References

 [1] J. Gawlikowski, S. Saha, A. Kruspe, X. X. Zhu, An advanced Dirichlet prior network for out-of-distribution detection in remote sensing, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–19.
 [2] O. Ghorbanzadeh, Y. Xu, P. Ghamisi, M. Kopp, D. Kreil, Landslide4Sense: Reference benchmark data and deep learning models for landslide detection, arXiv preprint arXiv:2206.00515 (2022).
 [3] F. I. Diakogiannis, F. Waldner, P. Caccetta, C. Wu, ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data, ISPRS Journal of Photogrammetry and Remote Sensing 162 (2020) 94–114.
 [4] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
 [5] R. P. Poudel, U. Bonde, S. Liwicki, C. Zach, ContextNet: Exploring context and detail for semantic segmentation in real-time, arXiv preprint arXiv:1805.04554 (2018).
 [6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE transactions on pattern analysis and machine intelligence 40 (2017) 834–848.
 [7] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
 [8] G. French, T. Aila, S. Laine, M. Mackiewicz, G. Finlayson, Semi-supervised semantic segmentation needs strong, high-dimensional perturbations (2019).
 [9] M. Berman, A. R. Triki, M. B. Blaschko, The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4413–4421.
[10] B. Zhou, Q. Cui, X.-S. Wei, Z.-M. Chen, BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9719–9728.
[11] A. M. Martinez, A. C. Kak, PCA versus LDA, IEEE transactions on pattern analysis and machine intelligence 23 (2001) 228–233.
[12] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552 (2017).
[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.