<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SwinLS: Adapting Swin Transformer to Landslide Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dong Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Zang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zining Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dou Quan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Artificial Intelligence, Xidian University</institution>,
          <addr-line>Xi'an, 710071</addr-line>,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Accurate detection of landslides plays an important role in post-disaster search and rescue operations. In this paper, we propose SwinLS for efficient landslide detection in remote sensing images using the swin transformer model. We explore how to efficiently utilize the self-attention mechanism in the swin transformer for landslide detection from two aspects: the first is spectral selection and data augmentation, and the second is reducing imbalance interference. With these improvements, the performance of the swin transformer model is greatly improved, which provides a preliminary exploration of applying visual transformer models to remote sensing landslide detection tasks and even anomaly detection tasks. Finally, the proposed SwinLS achieved 2nd place on the test leaderboard with a 73.99% F1 score, differing from the 1st place score of 74.54% by only 0.55%.</p>
      </abstract>
      <kwd-group>
        <kwd>Landslide detection</kwd>
        <kwd>remote sensing</kwd>
        <kwd>swin transformer</kwd>
        <kwd>multispectral imagery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Landslides have become more frequent due to drastic climate change, surface activity, and accidents, threatening the lives and properties of residents in the affected areas. Accurate detection of landslides plays an important role in post-disaster search and rescue operations. As an efficient and convenient solution, automatic interpretation of landslide areas from remote sensing images has received extensive attention from scholars [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To advance this research, Ghorbanzadeh and Xu et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] released a large-scale landslide detection dataset with pixel-level labels, named Landslide4Sense, and established a related benchmark.
      </p>
      <p>
        The Landslide4Sense dataset contains multi-spectral imagery from multiple regions and cities collected by Sentinel-2 satellites. The data format is pixel blocks of size 128 with 14 spectral bands including RGB, VEG (Vegetation Red Edge), NIR, WV (Water vapour), and SWIR. This dataset is finely marked by experts to pinpoint the location of each landslide. In the Landslide4Sense benchmark, Ghorbanzadeh and Xu et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] tried a series of classic convolution-based semantic segmentation models, such as ResUNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PSPNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], ContextNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DeepLab [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], treating landslide detection as a binary supervised pixel-level classification task. Among these models, they found through experiments that ResUNet achieved the best validation performance on landslide detection tasks, owing to its reasonable utilization of multi-scale features.
      </p>
      <p>
        Nonetheless, we believe that this is not enough, because two important issues of landslide detection are ignored. The first is the spatial correlation of landslide data, and the second is the imbalance problem in landslide detection. For the former, we were motivated by the observation that the spectra after the collapse of the slopes often exhibit strong similarities. For the latter, we are inspired by the category statistics in Ghorbanzadeh and Xu's paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], showing that the proportion of landslides is much smaller than that of non-landslides, which is in line with the anomaly detection problem. To address these issues, we introduce the swin transformer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] model to capture the relationship between landslide regions and design a training strategy for it to solve the imbalance problem.
      </p>
      <p>
        The swin transformer is a recently proposed vision transformer model that has demonstrated strong performance on numerous tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The key technology enabling this model is the self-attention mechanism, which aggregates spatial relationships to extract semantic features. However, it is not a good idea to directly apply this model to multi-spectral remote sensing data for landslide detection, such as the Landslide4Sense dataset, because not all spectral bands in multispectral imagery contain target information. The useless spectra will introduce massive noise into the feature aggregation process of the self-attention mechanism. Therefore, we first performed spectral selection experiments to determine which spectra are suitable for self-attention based feature aggregation. Finally, we use the RGB spectrum to train the swin transformer model. To complement it, we use CUTMIX [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and random rotation data augmentation to prevent overfitting of the larger capacity model.
      </p>
      <p>CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria. † Shuang Wang is the corresponding author. zhaodong01@stu.xidian.edu.cn (D. Zhao); qzang@stu.xidian.edu.cn (Q. Zang); 21171213901@stu.xidian.edu.cn (Z. Wang); quandou@xidian.edu.cn (D. Quan); shwang@mail.xidian.edu.cn (S. Wang); https://github.com/DZhaoXd (D. Zhao). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
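The CutMix-style augmentation mentioned above can be sketched as follows. This is a hedged toy implementation: the function name, the box-sampling details, and the fixed mixing ratio are illustrative simplifications, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cutmix(img_a, lbl_a, img_b, lbl_b, lam=0.5):
    """Paste a random box from sample B into sample A; labels follow pixels."""
    h, w = img_a.shape[:2]
    # Box area covers (1 - lam) of the image, as in common CutMix variants.
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y = rng.integers(0, h - cut_h + 1)
    x = rng.integers(0, w - cut_w + 1)
    out_img, out_lbl = img_a.copy(), lbl_a.copy()
    out_img[y:y + cut_h, x:x + cut_w] = img_b[y:y + cut_h, x:x + cut_w]
    out_lbl[y:y + cut_h, x:x + cut_w] = lbl_b[y:y + cut_h, x:x + cut_w]
    return out_img, out_lbl

# Two toy 128x128 RGB patches with per-pixel labels (0 = background, 1 = landslide).
a, la = np.zeros((128, 128, 3)), np.zeros((128, 128))
b, lb = np.ones((128, 128, 3)), np.ones((128, 128))
img, lbl = cutmix(a, la, b, lb)
```

For segmentation, mixing the label map together with the pixels (rather than mixing scalar class labels) keeps the supervision pixel-aligned.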
      <p>To solve the imbalance problem in landslide detection, we design a two-stage balanced training strategy to make the model better focus on the foreground (landslide) category. In the first stage, we train the feature extractor and classifier with a weighted cross-entropy loss to obtain a better feature representation. In the second stage, we fix the feature extractor and fine-tune the classifier with the ordinary cross-entropy loss to weaken the bias of the classifier. This strategy mitigates the misleading of the classifier caused by the imbalance between the landslide and non-landslide classes.</p>
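As a rough illustration of this schedule (not the paper's implementation), a toy logistic pixel classifier can stand in for the swin encoder and decoder; all names and the data here are our own constructions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Imbalanced toy data: rare positives (~2-3%) mimic the landslide class.
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 2.2).astype(float)

feat_w = rng.normal(scale=0.1, size=(4, 4))   # stand-in "encoder"
clf_w = np.zeros(4)                           # stand-in "classifier"

def forward(X):
    return sigmoid(np.maximum(X @ feat_w, 0) @ clf_w)

def grad_step(X, y, weights, lr, update_encoder):
    global feat_w, clf_w
    h = np.maximum(X @ feat_w, 0)             # encoder features (ReLU)
    p = sigmoid(h @ clf_w)
    err = weights * (p - y)                   # weighted cross-entropy gradient
    clf_w -= lr * h.T @ err / len(y)
    if update_encoder:
        dh = np.outer(err, clf_w) * (h > 0)   # backprop through ReLU
        feat_w -= lr * X.T @ dh / len(y)

# Stage 1: train encoder + classifier with class-weighted cross entropy.
w_pos = (y == 0).sum() / max((y == 1).sum(), 1)
weights = np.where(y == 1, w_pos, 1.0)
for _ in range(300):
    grad_step(X, y, weights, lr=0.5, update_encoder=True)

# Stage 2: freeze the encoder, fine-tune only the classifier with plain CE.
for _ in range(100):
    grad_step(X, y, np.ones_like(y), lr=0.5, update_encoder=False)

pred = (forward(X) > 0.5).astype(float)
```

The point of stage 2 is that the representation learned under re-weighting is kept, while the decision boundary is recalibrated without the re-weighting bias.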
      <p>Finally, the proposed method, called SwinLS, achieved 2nd place on the test leaderboard with a 73.99% F1 score, differing from the 1st place score of 74.54% by only 0.55%.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>As shown in Figure 1, SwinLS is a network with an encoder-decoder structure, and there are skip links between the codecs. Its encoder $E$ is composed of the base structure of the swin transformer, which has a powerful feature representation capability. Its decoder $D$ uses a convolutional structure for decoding and fusing multi-level features (at 2x, 4x, 8x and 16x scales) for output. The $\mathcal{L}_{img}$ loss is an image-level loss applied to the high-level semantic features of the encoder to assist training, which is defined as follows,</p>
      <p>
        $\mathcal{L}_{img} = -\frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathbb{I}(x) \log F_{cls}(E(x))$, (2)
where $\mathbb{I}$ is an indicator function: when there is a pixel standing for a positive sample (landslide) in $x$, its value is 1; otherwise it is 0. $F_{cls}(\cdot)$ is a fully connected layer with a global pooling operation, and $\mathcal{D}$ stands for the total data set. The $\mathcal{L}_{wce}$ loss is defined as follows,
        $\mathcal{L}_{wce} = -\frac{1}{|x|} \sum_{i \in x} \frac{N_n}{N_p} \, y_i \log D(E(x))_i$, (3)
where $N_n$ stands for the number of negative samples (non-landslides) and $N_p$ stands for the number of positive samples (landslides) in any input image $x$. As mentioned in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], this re-weighting method can play a positive role in balancing the feature distribution of positive and negative samples. However, the classifier will still be biased. Therefore, in the second stage, we fix the trained encoder $E$ and use the standard cross-entropy loss $\mathcal{L}_{ce}$ to train the decoder $D$,
      </p>
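A minimal numeric sketch of the image-level auxiliary term in Eq. (2), assuming $F_{cls}$ is approximated by global average pooling followed by a linear head with a sigmoid, and using a binary log-loss form; the function and variable names are our own, not from the paper.

```python
import numpy as np

def image_level_loss(feat, label_map, w, eps=1e-7):
    """feat: (C, H, W) encoder features; label_map: (H, W) binary landslide mask."""
    has_landslide = float(label_map.any())      # indicator I(x) from Eq. (2)
    pooled = feat.mean(axis=(1, 2))             # global average pooling over H, W
    p = 1.0 / (1.0 + np.exp(-pooled @ w))       # linear head + sigmoid, stand-in F_cls
    return -(has_landslide * np.log(p + eps)
             + (1.0 - has_landslide) * np.log(1.0 - p + eps))

feat = np.random.default_rng(0).normal(size=(8, 16, 16))
w = np.zeros(8)
mask = np.zeros((16, 16)); mask[2:4, 2:4] = 1.0
loss_pos = image_level_loss(feat, mask, w)              # image contains landslide pixels
loss_neg = image_level_loss(feat, np.zeros((16, 16)), w)  # background-only image
```

The label map is reduced to a single binary image label ("any landslide pixel present?"), which is the auxiliary supervision the encoder receives.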
      <p>$\arg\min_{D} \mathcal{L}_{ce} + \mathcal{L}_{lov}$. (4)</p>
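The per-image re-weighting of Eq. (3) can be checked numerically. In this sketch (our own names; background pixels receive a standard unweighted term, a common implementation choice), positive pixels are up-weighted by $N_n / N_p$ computed from each label map.

```python
import numpy as np

def weighted_ce(prob, label, eps=1e-7):
    """prob, label: (H, W) arrays; label is 1 on landslide pixels."""
    n_pos = label.sum()
    n_neg = label.size - n_pos
    w = n_neg / max(n_pos, 1)                   # N_n / N_p from Eq. (3)
    pixel_loss = -(w * label * np.log(prob + eps)
                   + (1 - label) * np.log(1 - prob + eps))
    return pixel_loss.mean()

label = np.zeros((8, 8)); label[3:5, 3:5] = 1   # 4 landslide pixels, 60 background
confident = np.where(label == 1, 0.9, 0.1)      # prediction matching the mask
uniform = np.full((8, 8), 0.5)                  # uninformative prediction
```

With only 4 positives among 64 pixels, each landslide pixel contributes 15x the loss of a background pixel, so the rare class is not drowned out during stage-1 training.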
    </sec>
    <sec id="sec-3">
      <title>3. Experiment</title>
      <sec id="sec-3-1">
        <p>In this section, we show the performance of the proposed methods. Due to the limited number of submissions of test data in the final stage, the results reported in our ablation experiments are all measured on the validation set.</p>
        <p>Figure 1: Network structure diagram.</p>
        <p>Spectral selection. Since the Transformer model needs the self-attention mechanism to perform feature aggregation and extract high-level semantic features, irrelevant spectral information that occupies the dominant information will have a significant impact on the performance of the swin transformer. To this end, we perform a set of experiments verifying the effect of different spectral inputs, as shown in Table 1.</p>
        <p>
          For the self-attention mechanism in the swin transformer to work better in landslide detection, we performed spectral selection experiments (see Table 1 and Figure 1). Finally, we selected the RGB spectrum from the multi-spectral input into the model. To alleviate the foreground and background imbalance in landslide detection, we design a two-stage training strategy. In the first stage, the codecs are trained simultaneously. For any input sample $x \in \mathbb{R}^{w \times h \times 3}$, we use the weighted cross-entropy loss $\mathcal{L}_{wce}$ and the Lovasz loss $\mathcal{L}_{lov}$ [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for balanced training as follows,
          $\arg\min_{E,D} \mathcal{L}_{wce} + \mathcal{L}_{lov} + \mathcal{L}_{img}$. (1)
        </p>
        <p>In Table 1, we discovered an interesting phenomenon. With the increase of spectral bands, the performance of the fully convolutional models, such as DeepLabv3 and UNet, shows a gradually increasing trend, while the performance of the swin transformer is severely degraded. We find that this is because the dimensionality enhancement in the fully convolutional model may attenuate the negative effects of irrelevant channels. The swin transformer, on the other hand, uses the dot product to perform the self-attention mechanism. When the spectral content unrelated to the landslide dominates, the attention is seriously dissipated, which makes the aggregated features noisy.</p>
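The dissipation effect can be illustrated with a toy dot-product attention experiment (our own construction, not the paper's code): tokens drawn from two "spectrally similar" groups attend almost exclusively within their own group when only informative channels are present, and that same-group attention mass drops once irrelevant noisy channels are appended.

```python
import numpy as np

rng = np.random.default_rng(1)

def same_group_attention_mass(tokens, labels):
    """Average softmax attention mass each token sends to its own group."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)     # scaled dot-product attention
    np.fill_diagonal(scores, -np.inf)           # ignore self-attention
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    return float((a * same).sum(axis=1).mean())

labels = np.array([0] * 8 + [1] * 8)
centers = np.array([[2.0, 2.0, 2.0], [-2.0, -2.0, -2.0]])
signal = centers[labels] + 0.3 * rng.normal(size=(16, 3))  # 3 informative "bands"
noise = 2.0 * rng.normal(size=(16, 11))                    # 11 irrelevant "bands"
clean = same_group_attention_mass(signal, labels)
diluted = same_group_attention_mass(np.hstack([signal, noise]), labels)
```

With the noise channels appended, the informative dot products are diluted by both the noise terms and the larger $\sqrt{d}$ scaling, so attention leaks across groups.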
        <p>In addition, the swin transformer model has a large capacity and is more prone to memorization and loss of generalization on such simple data. To this end, we designed data augmentation experiments to verify the transformation methods for landslide detection using only RGB spectral information, as shown in Table 2. We also add the UNet model that uses all bands for comparison.</p>
        <p>
          The Lovasz loss [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] directly optimizes the IoU coefficients, which is efficient, and it is used as a first-stage loss for our balanced training. Balanced training achieves the best performance, which further corrects the bias of the classifier. Finally, balanced training improves the F1 score by
        </p>
      </sec>
      <sec id="sec-3-2">
        <p>4.1% on the basis of the baseline. The results of this strategy are visualized in Figure 4.</p>
        <p>Table 4 (Precision (%) / Recall (%) / F1 (%) per row): 73.4/74.7/73.9; 65.2/80.5/72.7; 69.3/79.5/73.7; 72.4/77.1/74.9; 78.2/74.2/76.1.</p>
        <p>In Table 4, we found that when the selection threshold is small, the accuracy rate after self-training degrades seriously, but the recall rate improves significantly. This is because when the threshold is small, the selected landslide area is only located in the center of the landslide, and the pixels in the surrounding area are ignored due to low confidence. This makes the self-trained model tend to predict all surrounding similar blocks as landslides, resulting in increased over-detection of landslides. As the selected landslide area continues to increase, the accuracy of the model continues to rise, and the recall rate begins to decline. This shows that the addition of many inaccurate pseudo-labels plays a strong role in preventing over-detection, and the model can learn more knowledge about the samples to be tested from the noisy training data, which increases the accuracy. We visualize pseudo-labels with different threshold values.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is in part supported by the Key Research and Development Program of Shaanxi (Program No. 2021ZDLGY0106), the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY01-12) and the National Key R&amp;D Program of China under Grant No. 2021ZD0110404.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gawlikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>An advanced dirichlet prior network for out-of-distribution detection in remote sensing</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>60</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ghorbanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ghamisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kreil</surname>
          </string-name>
          ,
          <article-title>Landslide4sense: Reference benchmark data and deep learning models for landslide detection</article-title>
          ,
          <source>arXiv preprint arXiv:2206.00515</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. I.</given-names>
            <surname>Diakogiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Waldner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Caccetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data</article-title>
          ,
          <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>
          <volume>162</volume>
          (
          <year>2020</year>
          )
          <fpage>94</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Pyramid scene parsing network</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2881</fpage>
          -
          <lpage>2890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Poudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Bonde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liwicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zach</surname>
          </string-name>
          ,
          <article-title>Contextnet: Exploring context and detail for semantic segmentation in real-time</article-title>
          ,
          <source>arXiv preprint arXiv:1805.04554</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kokkinos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <article-title>Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>40</volume>
          (
          <year>2017</year>
          )
          <fpage>834</fpage>
          -
          <lpage>848</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>French</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mackiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Finlayson</surname>
          </string-name>
          ,
          <article-title>Semi-supervised semantic segmentation needs strong, high-dimensional perturbations</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Berman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Triki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Blaschko</surname>
          </string-name>
          ,
          <article-title>The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>4413</fpage>
          -
          <lpage>4421</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9719</fpage>
          -
          <lpage>9728</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Kak</surname>
          </string-name>
          ,
          <article-title>PCA versus LDA</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>23</volume>
          (
          <year>2001</year>
          )
          <fpage>228</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>DeVries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          ,
          <source>arXiv preprint arXiv:1708.04552</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>