=Paper=
{{Paper
|id=Vol-3207/paper14
|storemode=property
|title=SwinLS: Adapting Swin Transformer to Landslide Detection
|pdfUrl=https://ceur-ws.org/Vol-3207/paper14.pdf
|volume=Vol-3207
|authors=Dong Zhao,Qi Zang,Zining Wang,Dou Quan,Shuang Wang
|dblpUrl=https://dblp.org/rec/conf/cdceo/ZhaoZWQW22
}}
==SwinLS: Adapting Swin Transformer to Landslide Detection==
SwinLS: Adapting Swin Transformer to Landslide Detection

Dong Zhao¹, Qi Zang¹, Zining Wang¹, Dou Quan¹ and Shuang Wang¹ (corresponding author)
¹ School of Artificial Intelligence, Xidian University, Xi'an, 710071, China.
Contact: zhaodong01@stu.xidian.edu.cn (D. Zhao); qzang@stu.xidian.edu.cn (Q. Zang); 21171213901@stu.xidian.edu.cn (Z. Wang); quandou@xidian.edu.cn (D. Quan); shwang@mail.xidian.edu.cn (S. Wang). Code: https://github.com/DZhaoXd (D. Zhao).
Presented at CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Accurate detection of landslides plays an important role in post-disaster search and rescue operations. In this paper, we propose SwinLS for efficient landslide detection in remote sensing images using the swin transformer model. We explore how to use the self-attention mechanism of the swin transformer efficiently for landslide detection from two aspects: spectral selection with data augmentation, and reducing the interference caused by class imbalance. With these improvements, the performance of the swin transformer is greatly increased, providing a preliminary exploration of visual transformer models for remote sensing landslide detection and even anomaly detection tasks. The proposed SwinLS achieved 2nd place on the test leaderboard with a 73.99% F1 score, only 0.55% below the 1st-place score of 74.54%.

Keywords
Landslide detection, remote sensing, swin transformer, multispectral imagery

1. Introduction

Landslides have become more frequent due to drastic climate change, surface activity, and accidents, threatening the lives and property of residents in the affected areas. Accurate detection of landslides plays an important role in post-disaster search and rescue operations. As an efficient and convenient solution, automatic interpretation of landslide areas from remote sensing images has received extensive attention from scholars [1]. To advance this research, Ghorbanzadeh and Xu et al. [2] released a large-scale landslide detection dataset with pixel-level labels, named Landslide4Sense, and established a related benchmark.

The Landslide4Sense dataset contains multi-spectral imagery from multiple regions and cities collected by Sentinel-2 satellites. The data are pixel blocks of size 128 with 14 spectral bands including RGB, VEG (Vegetation Red Edge), NIR, WV (Water Vapour), and SWIR. The dataset is finely annotated by experts to pinpoint the location of each landslide. In the Landslide4Sense benchmark, Ghorbanzadeh and Xu et al. [2] tried a series of classic convolution-based semantic segmentation models, such as ResUNet [3], PSPNet [4], ContextNet [5] and DeepLab [6], treating landslide detection as a binary supervised pixel-level classification task. Among these models, they found through experiments that ResUNet achieved the best validation performance on landslide detection, which they attribute to its reasonable use of multi-scale features.

Nonetheless, we believe that this is not enough, because two important issues of landslide detection are ignored. The first is the spatial correlation of landslide data, and the second is the class imbalance problem in landslide detection. For the former, we were motivated by the observation that the spectra of collapsed slopes often exhibit strong similarities. For the latter, we are inspired by the category statistics in Ghorbanzadeh and Xu's paper [2], which show that the proportion of landslide pixels is much smaller than that of non-landslide pixels, in line with an anomaly detection problem. To address these issues, we introduce the swin transformer [7] to capture the relationship between landslide regions and design a training strategy for it to solve the imbalance problem.

The swin transformer is a recently proposed vision transformer that has demonstrated strong performance on numerous tasks [7].
The key technology enabling this model is the self-attention mechanism, which aggregates spatial relationships to extract semantic features. However, directly applying this model to multi-spectral remote sensing data for landslide detection, such as the Landslide4Sense dataset, is not a good choice, because not all spectral bands contain target information. The irrelevant spectra introduce massive noise into the feature aggregation process of the self-attention mechanism. Therefore, we first performed spectral selection experiments to determine which spectra are suitable for self-attention based feature aggregation; as a result, we use the RGB spectrum to train the swin transformer. To complement it, we use CutMix [8] and random rotation data augmentation to prevent overfitting of this larger-capacity model.

To solve the imbalance problem in landslide detection, we design a two-stage balanced training strategy to make the model focus better on the foreground (landslide) class. In the first stage, we train the feature extractor and classifier with a weighted cross-entropy loss to obtain better feature representations. In the second stage, we fix the feature extractor and fine-tune the classifier with an ordinary cross-entropy loss to weaken the bias of the classifier. This strategy mitigates the misleading of the classifier caused by the imbalance between the landslide and non-landslide classes.

Finally, the proposed method, called SwinLS, achieved 2nd place on the test leaderboard with a 73.99% F1 score, only 0.55% below the 1st-place score of 74.54%.

2. Methods

As shown in Figure 1, SwinLS is an encoder-decoder network with skip connections between the encoder and the decoder. Its encoder E is composed of the base structure of the swin transformer, which has a powerful feature representation capability. Its decoder D uses a convolutional structure to decode and fuse multi-level features for the output.

Figure 1: Network structure diagram. The swin transformer encoder (trained in stage 1, fixed in stage 2) extracts 2x/4x/8x/16x multi-scale features from the input RGB image; the multi-scale fusion decoder (trained in both stages) produces the pixel-level prediction, and an image-level branch (landslide: yes/no) provides an auxiliary image-level loss.

For the self-attention mechanism in the swin transformer to work well for landslide detection, we performed spectral selection experiments (see Table 1 and Figure 1) and finally selected the RGB spectrum from the multi-spectral input to the model. To alleviate the foreground-background imbalance in landslide detection, we design a two-stage training strategy. In the first stage, the encoder and decoder are trained simultaneously. For any input sample x_i \in \mathbb{R}^{w \times h \times 3}, we use the weighted cross-entropy loss L_{wce} and the Lovász loss L_{lov} [9] for balanced training:

\arg\min_{E,D} \; L_{wce} + L_{lov} + L_{img}.   (1)

The L_{img} loss is an image-level loss applied to the high-level semantic features of the encoder to assist training, defined as

L_{img} = -\frac{1}{|\mathcal{X}|} \sum_{x_i \in \mathcal{X}} \delta(y_i) \log f_c(E(x_i)),   (2)

where \delta is an indicator function whose value is 1 when there is at least one positive (landslide) pixel in y_i and 0 otherwise, f_c(\cdot) is a fully connected layer with a global pooling operation, and \mathcal{X} is the whole data set. The L_{wce} loss is defined as

L_{wce} = -\frac{1}{|\mathcal{X}|} \sum_{x_i \in \mathcal{X}} \frac{n_{neg}}{n_{pos}} \, y_i \log D(E(x_i)),   (3)

where n_{neg} is the number of negative samples (non-landslide pixels) and n_{pos} is the number of positive samples (landslide pixels) in an input image x_i. As mentioned in [10], this re-weighting helps balance the feature distribution of positive and negative samples. However, the classifier will still be biased. Therefore, in the second stage, we fix the trained encoder E and use the standard cross-entropy loss L_{ce} to train the decoder D:

\arg\min_{D} \; L_{ce} + L_{img}.   (4)
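To make Eqs. (1)-(4) concrete, the following PyTorch sketch shows one possible way to implement the two-stage balanced training. The `encoder`, `decoder`, and `img_head` modules, the optimizer settings, and the per-batch class-ratio weighting are our own assumptions rather than the authors' released code; the Lovász term of Eq. (1) is only marked by a comment, where a public Lovász-softmax implementation would be plugged in.

```python
# Minimal sketch of the two-stage balanced training in Eqs. (1)-(4).
# Assumptions (not from the paper): encoder/decoder/img_head definitions,
# optimizer settings, and masks encoded as {0: non-landslide, 1: landslide}.
import torch
import torch.nn.functional as F

def weighted_ce(logits, mask):
    """Eq. (3): cross-entropy with the rare landslide class up-weighted by n_neg/n_pos
    (computed over the batch here for simplicity). mask: (B,H,W) long tensor."""
    n_pos = mask.eq(1).sum().clamp(min=1).float()
    n_neg = mask.eq(0).sum().clamp(min=1).float()
    class_weight = torch.tensor([1.0, (n_neg / n_pos).item()], device=logits.device)
    return F.cross_entropy(logits, mask, weight=class_weight)

def image_level_loss(img_logit, mask):
    """Binary image-level loss in the spirit of Eq. (2):
    the patch-level target delta(y_i) is 1 if any landslide pixel is present."""
    target = mask.flatten(1).any(dim=1).float()
    return F.binary_cross_entropy_with_logits(img_logit.squeeze(1), target)

def train_stage1(encoder, decoder, img_head, loader, epochs=50, lr=1e-4):
    """Stage 1 (Eq. 1): train encoder and decoder jointly with the balanced losses."""
    params = list(encoder.parameters()) + list(decoder.parameters()) + list(img_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for rgb, mask in loader:              # rgb: (B,3,H,W) float, mask: (B,H,W) long
            feats = encoder(rgb)
            seg_logits = decoder(feats)       # (B,2,H,W)
            loss = weighted_ce(seg_logits, mask) + image_level_loss(img_head(feats), mask)
            # + lovasz_softmax(seg_logits.softmax(1), mask)   # Lovasz term, external impl.
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_stage2(encoder, decoder, img_head, loader, epochs=10, lr=1e-5):
    """Stage 2 (Eq. 4): freeze the encoder, fine-tune the decoder with plain cross-entropy."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for rgb, mask in loader:
            with torch.no_grad():
                feats = encoder(rgb)
            seg_logits = decoder(feats)
            loss = F.cross_entropy(seg_logits, mask) + image_level_loss(img_head(feats), mask)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

The essential design choice is that stage 2 only updates the decoder, which is what corrects the classifier bias described above.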
3. Experiment

In this section, we show the performance of the methods proposed above. Due to the limited number of test submissions in the final stage, the results of our ablation experiments are all reported on the validation set.

Spectral selection. The transformer model needs to perform feature aggregation with the self-attention mechanism to extract high-level semantic features; if irrelevant spectral information dominates, it has a significant impact on the performance of the swin transformer. To this end, we perform a set of experiments verifying the effect of different spectral inputs, as shown in Table 1.

Table 1: Spectral selection experiments. RGB denotes the red, green, and blue bands. SWIR denotes the 3-band shortwave infrared of Sentinel-2. NGB denotes the near-infrared, green, and blue bands. NIR denotes the near-infrared band. PCA refers to using dimensionality reduction [11] to compress the original 14 bands into 3 bands. The encoders of the UNet and DeepLab models are ResNet-32; the encoder of the swin transformer is Swin-B. The metrics reported in the table are F1 scores.

Input spectral bands | Input bands | Swin | Deeplabv3 | Unet
RGB | 3 | 65.6 | 58.0 | 59.2
SWIR | 3 | 55.6 | 50.2 | 52.1
NGB | 3 | 60.8 | 59.2 | 58.9
PCA [11] | 3 | 49.5 | 46.8 | 52.4
RGB + NIR | 4 | 63.3 | 57.2 | 59.4
RGB + SWIR | 6 | 58.2 | 55.9 | 59.8
RGB + NIR + SWIR | 7 | 54.8 | 57.5 | 60.0
All bands | 14 | 55.8 | 57.8 | 61.1

In Table 1, we discovered an interesting phenomenon. As the number of spectral bands increases, the performance of the fully convolutional models, such as DeepLabv3 and UNet, shows a gradually increasing trend, while the performance of the swin transformer degrades severely. We find that this is because the channel expansion in the fully convolutional models can attenuate the negative effects of irrelevant channels. The swin transformer, on the other hand, uses dot products to perform self-attention. When spectral content unrelated to the landslide dominates, the attention is seriously dissipated, which makes the aggregated features contain a lot of noise and become less discriminative. Based on these experiments, we selected the RGB spectrum as the input of the swin transformer. Moreover, Figure 2 visualizes the dissipation of the swin transformer's attention as more spectra are added, which further verifies the above conclusion.

Figure 2: Visualization of the feature activation maps of the swin transformer with different spectral inputs (a. Image; b. RGB; c. RGB+SWIR; d. RGB+NIR+SWIR; e. Ground Truth). We show the features from the last layer of the swin transformer on the training set; the redder the activation, the greater the response.

In addition, Table 1 shows that the swin transformer without any further enhancement already provides a very good baseline after properly selecting the spectrum. Therefore, our subsequent experiments build on this strong baseline to further improve landslide detection performance.
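As a concrete illustration of the band-selection step, the sketch below slices a spectral subset out of a 14-band Landslide4Sense patch before feeding it to the model. The band indices are assumptions for illustration only and should be taken from the dataset documentation in practice.

```python
# Minimal sketch of selecting a spectral subset from a 14-band patch.
# Assumption: Sentinel-2-style ordering where indices 1..3 hold blue/green/red
# and index 7 holds near-infrared; verify against the Landslide4Sense docs.
import numpy as np

BAND_SUBSETS = {
    "RGB": [3, 2, 1],          # red, green, blue (assumed indices)
    "NGB": [7, 2, 1],          # near-infrared, green, blue (assumed indices)
    "RGB+NIR": [3, 2, 1, 7],
}

def select_bands(patch: np.ndarray, subset: str = "RGB") -> np.ndarray:
    """patch: (14, H, W) multispectral block; returns only the chosen bands."""
    idx = BAND_SUBSETS[subset]
    return patch[idx]          # (len(idx), H, W)

# Usage: feed only the RGB subset to the swin transformer.
patch = np.random.rand(14, 128, 128).astype(np.float32)   # stand-in for a real sample
rgb = select_bands(patch, "RGB")
print(rgb.shape)   # (3, 128, 128)
```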
Data augmentation. When landslide detection only uses the RGB spectrum, the data pattern becomes relatively simple, which increases the risk of overfitting. In addition, the swin transformer has a large capacity and is more prone to memorizing such simple data and losing generalization. To this end, we designed data augmentation experiments to verify suitable transformations for landslide detection using only RGB information, as shown in Table 2. We also include the UNet model that uses all bands for comparison.

Table 2: Data augmentation experiments. For both models, we randomly flip the input data as the baseline. The metrics reported in the table are F1 scores.

Transformation | Swin transformer | Unet
None (baseline) | 65.6 | 61.1
color enhancement | 62.1 | 60.3
cutout [12] | 65.9 | 62.1
cutmix [8] | 66.0 | 62.6
rotate and shift | 69.8 | 63.7

Table 2 shows that random color augmentation degrades the performance of the swin transformer, while it improves the UNet model. We analyze that this is because the RGB samples to be tested are also collected from mountainous areas, where the color space is not rich, so color enhancement leads to invalid generalization. The purpose of cutout and cutmix is to disrupt the spatial layout of images so that the model learns robust representations, and both slightly improve the performance of the two models. For the swin transformer, the most effective augmentation is to rotate and translate the data, which directly improves the F1 score by 4.2%. This augmentation increases the difficulty of capturing the relationships between landslide regions, which is very effective for the swin transformer. For UNet, although this method also helps, the improvement is not as large as for the swin transformer. In general, after rotation and translation augmentation, the F1 score of the swin transformer is 6.1% higher than that of UNet.
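The two augmentations that help the swin transformer most in Table 2 (rotate-and-shift and CutMix) could be implemented roughly as below. The rotation/shift ranges and the box sizes are our illustrative choices, not the authors' exact settings; both transforms are applied jointly to image and mask so that the pixel labels stay aligned.

```python
# Illustrative sketch of the rotate/shift and CutMix augmentations from Table 2,
# applied to an (image, mask) pair. Parameter ranges are assumptions.
import numpy as np

def rotate_and_shift(img, mask, max_shift=16, rng=np.random):
    """Random 90-degree rotation plus a random circular shift of image and mask.
    img: (C, H, W), mask: (H, W)."""
    k = rng.randint(0, 4)                         # number of 90-degree rotations
    img, mask = np.rot90(img, k, axes=(1, 2)), np.rot90(mask, k, axes=(0, 1))
    dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
    img = np.roll(img, (dy, dx), axis=(1, 2))
    mask = np.roll(mask, (dy, dx), axis=(0, 1))
    return img.copy(), mask.copy()

def cutmix(img_a, mask_a, img_b, mask_b, rng=np.random):
    """Paste a random rectangle from sample B into sample A; labels follow the pixels."""
    _, h, w = img_a.shape
    ch, cw = rng.randint(h // 4, h // 2), rng.randint(w // 4, w // 2)
    y, x = rng.randint(0, h - ch), rng.randint(0, w - cw)
    out_img, out_mask = img_a.copy(), mask_a.copy()
    out_img[:, y:y + ch, x:x + cw] = img_b[:, y:y + ch, x:x + cw]
    out_mask[y:y + ch, x:x + cw] = mask_b[y:y + ch, x:x + cw]
    return out_img, out_mask
```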
Balanced training. We tried multiple methods for balanced training to verify their effectiveness, as shown in Table 3. Among them, normal training is a one-stage training method using the cross-entropy loss. For the weighted cross-entropy loss, we use the ratio between negative and positive samples as the loss weighting coefficient, as in Eq. (3). This method achieves a certain improvement by re-weighting the positive and negative pixels, but the improvement is relatively limited. Focal loss [13] balances easy and hard samples by modifying their gradients during back-propagation and is used in many unbalanced scenarios, but on this task the performance degrades when it is added. Our analysis is that it strongly influences the gradients, and inappropriate hyperparameters can greatly affect the performance. The Lovász loss [9] directly optimizes the IoU coefficient; it is effective and is used as the first-stage loss of our balanced training. Balanced training achieves the best performance, further correcting the bias of the classifier, and improves the F1 score by 4.1% over the baseline. The results of this strategy are visualized in Figure 4.

Table 3: Balanced training experiments. We use the swin transformer with data augmentation and normal training (only using cross-entropy loss) as the baseline model. The metrics reported in the table are F1 scores.

Training | Swin transformer | Unet
Normal training | 69.8 | 63.7
Weighted cross entropy | 70.8 | 64.9
Focal loss [13] | 68.2 | 61.8
Lovasz loss [9] | 72.3 | 66.4
Balanced training | 73.9 | 67.7

Self-training. We also use self-training to further improve model performance, as shown in Table 4. We investigate how to select pseudo-labels suitable for landslide detection. We sorted the output probabilities predicted in the previous stage, selected the top λ% high-confidence pixel-level pseudo-labels, and added them to the training data for self-training. The percentage λ needs to be explored.

Table 4: Self-training experiments with different λ values. ST denotes self-training.

λ | Precision (%) | Recall (%) | F1 (%)
- (Before ST) | 73.4 | 74.7 | 73.9
50% | 65.2 | 80.5 | 72.7
70% | 69.3 | 79.5 | 73.7
90% | 72.4 | 77.1 | 74.9
100% | 78.2 | 74.2 | 76.1

In Table 4, we found that when λ is small, precision after self-training degrades seriously, while recall improves significantly. This is because when λ is small, the selected landslide pixels lie only in the center of each landslide, and the pixels in the surrounding area are ignored due to low confidence. This makes the self-trained model tend to predict all surrounding similar blocks as landslides, resulting in increased over-detection. As the selected landslide area grows, the precision of the model continues to rise and the recall begins to decline. This shows that adding many lower-confidence, possibly inaccurate pseudo-labels plays a strong role in preventing over-detection, and the model can learn more about the samples to be tested from the noisy training data, which increases precision.

In practical applications, we can choose this parameter according to the requirements: when we need to roughly find more areas that may be landslides, we choose a smaller λ; when we need to detect the landslide area more accurately, we choose a larger λ. Furthermore, we visualize example pseudo-label maps for different λ values in Figure 3 and the output of the self-trained model in Figure 4, which further supports the above conclusion.

Figure 3: Visualization of pseudo labels for different λ values (a. Test Image; b.-e. pseudo labels for increasing λ). In the pseudo labels, black represents class 0 (non-landslide), red represents class 1 (landslide), and white represents ignored pixels.

Figure 4: Visualization of model outputs after adding different strategies (a. Test Image; b. Normal Training; c. Balanced Training; d. Self-Training).
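The top-λ% pseudo-label selection used to build the maps in Figure 3 could be sketched as follows: pixels whose predicted confidence falls below the λ-quantile are assigned an ignore label so they are excluded from the self-training loss. The per-image thresholding and the ignore value of 255 are our assumptions; the white "ignored" regions in Figure 3 correspond to these dropped pixels.

```python
# Sketch of top-lambda% pseudo-label selection for self-training (Table 4).
# Assumptions: per-image thresholding and ignore_index=255 for dropped pixels.
import numpy as np

def select_pseudo_labels(prob, lam=0.9, ignore_index=255):
    """prob: (2, H, W) softmax output; keep the top `lam` fraction of most confident pixels."""
    confidence = prob.max(axis=0)                 # (H, W) winning-class probability
    pred = prob.argmax(axis=0).astype(np.uint8)   # hard pseudo-label per pixel
    threshold = np.quantile(confidence, 1.0 - lam)
    pseudo = np.where(confidence >= threshold, pred, ignore_index).astype(np.uint8)
    return pseudo                                 # train with CE(ignore_index=255) on this map

# Usage with a dummy prediction map:
prob = np.random.dirichlet([1, 1], size=(128, 128)).transpose(2, 0, 1)   # (2, 128, 128)
pseudo = select_pseudo_labels(prob, lam=0.9)
print((pseudo != 255).mean())   # roughly 0.9 of the pixels are kept
```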
Acknowledgments

This work is in part supported by the Key Research and Development Program of Shaanxi (Program No. 2021ZDLGY01-06), the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY01-12), and the National Key R&D Program of China under Grant No. 2021ZD0110404.

References

[1] J. Gawlikowski, S. Saha, A. Kruspe, X. X. Zhu, An advanced Dirichlet prior network for out-of-distribution detection in remote sensing, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1-19.
[2] O. Ghorbanzadeh, Y. Xu, P. Ghamis, M. Kopp, D. Kreil, Landslide4Sense: Reference benchmark data and deep learning models for landslide detection, arXiv preprint arXiv:2206.00515 (2022).
[3] F. I. Diakogiannis, F. Waldner, P. Caccetta, C. Wu, ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data, ISPRS Journal of Photogrammetry and Remote Sensing 162 (2020) 94-114.
[4] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881-2890.
[5] R. P. Poudel, U. Bonde, S. Liwicki, C. Zach, ContextNet: Exploring context and detail for semantic segmentation in real-time, arXiv preprint arXiv:1805.04554 (2018).
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2017) 834-848.
[7] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012-10022.
[8] G. French, T. Aila, S. Laine, M. Mackiewicz, G. Finlayson, Semi-supervised semantic segmentation needs strong, high-dimensional perturbations (2019).
[9] M. Berman, A. R. Triki, M. B. Blaschko, The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4413-4421.
[10] B. Zhou, Q. Cui, X.-S. Wei, Z.-M. Chen, BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9719-9728.
[11] A. M. Martinez, A. C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 228-233.
[12] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552 (2017).
[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980-2988.