=Paper=
{{Paper
|id=Vol-3207/paper15
|storemode=property
|title=On the Generalization of the Semantic Segmentation Model for Landslide Detection
|pdfUrl=https://ceur-ws.org/Vol-3207/paper15.pdf
|volume=Vol-3207
|authors=Fahong Zhang,Yilei Shi,Qingsong Xu,Zhitong Xiong,Wei Yao,Xiao Xiang Zhu
|dblpUrl=https://dblp.org/rec/conf/cdceo/ZhangSXXYZ22
}}
==On the Generalization of the Semantic Segmentation Model for Landslide Detection==
Fahong Zhang¹, Yilei Shi², Qingsong Xu¹, Zhitong Xiong¹, Wei Yao³ and Xiao Xiang Zhu¹·³

¹ Data Science in Earth Observation, Technical University of Munich (TUM), Munich, Germany
² Chair of Remote Sensing Technology (LMF), Technical University of Munich, Munich, Germany
³ Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Weßling, Germany

CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
Contact: fahong.zhang@tum.de (F. Zhang); yilei.shi@tum.de (Y. Shi); qingsong.xu@tum.de (Q. Xu); zhitong.xiong@tum.de (Z. Xiong); wei.yao@dlr.de (W. Yao); xiaoxiang.zhu@tum.de (X. X. Zhu)

Abstract

The goal of landslide detection is to identify regions affected by landslide events, which is critical for emergency response and disaster monitoring. This study is set in the context of the Landslide4Sense competition, whose goal is to promote effective and innovative algorithms for detecting landslides across different continents using Sentinel-2 and ALOS PALSAR data. Given the global-scale coverage of the data, studying how well a landslide detection model generalizes to unseen regions is an important task. To this end, we propose a self-training method that improves the generalizability of the landslide detection model by exploiting the pseudo labels of unlabeled samples with low uncertainty. Experimental results show that the proposed self-training method is effective in bridging the shift between labeled and unlabeled data, and it achieved 3rd place in the Landslide4Sense competition.

Keywords: Landslide detection, Semantic segmentation, Self-training, Domain adaptation

1. Introduction

With ongoing climate change and rapid urbanization in landslide-prone terrain, landslides have become an increasingly threatening hazard in mountainous areas and affect a growing share of the population. To monitor landslide events around the world accurately and rapidly, satellite data are a promising data source owing to their global coverage and relatively high temporal and spectral resolution.

From a technical point of view, landslide detection based on satellite data can be regarded as a binary semantic segmentation problem, where a learning-based model is required to distinguish landslides from background areas. In the computer vision community, semantic segmentation has long been a popular research topic. From the earlier Fully Convolutional Network (FCN) [1, 2] to the currently dominating transformer-based approaches [3, 4], tremendous improvements have been achieved through developments in network architecture. As reported in [5], several baseline semantic segmentation models have demonstrated promising performance on the task of landslide detection.

In addition to designing more sophisticated and task-specific network architectures, research on the transferability of semantic segmentation models is also of great importance. Due to differing atmospheric conditions, viewing angles and illumination, satellite data from different regions may exhibit large domain shifts [6]. As a result, a semantic segmentation model trained on specific areas may fail to generalize to unseen regions across the world and to different periods of time.

Self-training approaches have been demonstrated to be effective in promoting the generalizability of deep learning models in semi-supervised learning and domain adaptation [7]. They first generate pseudo labels on the unlabeled data with a teacher model pre-trained on the labeled data. The pseudo labels with high confidence are then used to supervise the training of a student model on the unlabeled data. With this in mind, we propose a self-training method based on Monte-Carlo dropout uncertainty [8] and class-balanced thresholding.

The contributions of this paper are as follows:

* We propose a self-training method based on Monte-Carlo dropout uncertainty and class-balanced thresholding for the task of landslide detection. The experimental results demonstrate that the proposed method provides significant improvements over the baseline and helps to improve the generalizability of semantic segmentation models.
* We demonstrate the effectiveness of the proposed method in the Landslide4Sense competition, where we achieve the 3rd prize with a test F1 score of 73.50%.
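To make the teacher-student idea concrete, the following is a minimal, illustrative PyTorch sketch of one self-training update. It is not the authors' implementation: the model objects, the fixed confidence threshold, and the equal loss weighting are assumptions made only for illustration; the method proposed in this paper replaces the simple threshold with Monte-Carlo dropout uncertainty and class-balanced thresholding (Sec. 2.3).

```python
import torch
import torch.nn.functional as F

def self_training_step(teacher, student, labeled_batch, unlabeled_batch,
                       optimizer, conf_threshold=0.9):
    """One illustrative teacher-student update: pseudo-label the unlabeled
    patches with the frozen teacher, keep only confident pixels, and train
    the student on labeled + pseudo-labeled data."""
    x_l, y_l = labeled_batch          # (B, C, H, W), (B, H, W)
    x_u = unlabeled_batch             # (B, C, H, W)

    # 1) The frozen teacher generates pseudo labels on the unlabeled patches.
    with torch.no_grad():
        probs_u = torch.softmax(teacher(x_u), dim=1)   # (B, 2, H, W)
        conf, pseudo = probs_u.max(dim=1)              # confidence map + hard labels

    # 2) Low-confidence pixels are ignored (index -1 is masked out below).
    pseudo[conf < conf_threshold] = -1

    # 3) The student is supervised on true labels and on confident pseudo labels.
    loss_sup = F.cross_entropy(student(x_l), y_l)
    loss_ps = F.cross_entropy(student(x_u), pseudo, ignore_index=-1)
    loss = loss_sup + loss_ps

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```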
Figure 1: Pipeline of the proposed self-training method. In each training step, a batch of labeled and unlabeled data is given to the teacher and student models, where data augmentation and a mix-up operation [9] are applied in the student branch. For labeled data, supervised losses are calculated based on the provided labels. For unlabeled data, we first apply Monte-Carlo dropout [8] to the teacher model to estimate the uncertainty of the predictions, and then generate pseudo labels based on a class-balanced threshold (see Sec. 2.3). The teacher model is kept fixed during training.

2. Methodology

We illustrate the pipeline of the proposed method in Fig. 1. The remainder of this section formulates the landslide detection problem and elaborates the methodology in detail.

2.1. Problem Formulation

In the landslide detection problem, we are given a set of labeled training data D_train = {x_tr, y_tr} and unlabeled test data D_test = {x_te}, where x_tr, y_tr and x_te ∈ R^{H×W} denote a training patch, its label and a test patch, respectively. Our task is to train a semantic segmentation model on D_train and D_test and to optimize its performance on D_test. The overall loss function of the proposed method is

    \mathcal{L} = \mathcal{L}_{sup}^{mix} + \mathcal{L}_{ps}^{mix}.    (1)

The mixed supervised loss L_sup^mix and the mixed pseudo-label loss L_ps^mix are formulated in Sec. 2.4.

2.2. Supervised Losses

We use the cross-entropy loss and the Jaccard loss as supervised losses:

    \mathcal{L}_{sup}(x_{tr}, y_{tr}) = \mathcal{L}_{ce}(x_{tr}, y_{tr}) + \mathcal{L}_{jac}(x_{tr}, y_{tr}).    (2)

2.3. Self-training

As shown in Fig. 1, a teacher model pre-trained on the training data is used to generate pseudo labels for supervising the student model. However, since the raw pseudo labels are usually noisy, a selection strategy is required to filter out misclassified pixels.

First, we use the Monte-Carlo dropout strategy [8] to estimate an uncertainty map for each input test patch. More specifically, we forward the test patch through the teacher model in 10 different runs. In each run, random dropout with a dropping rate of 0.3 is applied to the feature map produced by the first convolution layer. The variance of the 10 output logits is taken as the uncertainty map.

Second, we mask out the uncertain predictions of the teacher model. Inspired by [7], we select for each class a fixed proportion of the pixels with the lowest uncertainty among all the test data. Specifically, 90% of the background pixels and 70% of the landslide pixels are kept, and the remaining pixels are ignored when calculating the losses. Finally, the pseudo-label loss is formulated as

    \mathcal{L}_{ps}(x_{te}, \hat{y}_{te}) = \mathcal{L}_{ce}(x_{te}, \hat{y}_{te}) + \mathcal{L}_{jac}(x_{te}, \hat{y}_{te}),    (3)

where ŷ_te denotes the pseudo labels generated by the teacher model.
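As a rough sketch of Sec. 2.3, the snippet below estimates a Monte-Carlo dropout uncertainty map and then applies class-balanced thresholding to the teacher predictions. The split of the teacher into `stem` (first convolution block) and `head` (remaining layers) is an assumed interface, the uncertainty is computed on softmax outputs rather than raw logits, and the selection here is done per batch, whereas the paper selects the keep ratios over all test data.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_uncertainty(teacher, x, n_runs=10, p_drop=0.3):
    """Forward the patch through the frozen teacher several times with dropout
    active on the first feature map; the per-pixel variance over the runs
    serves as the uncertainty map."""
    probs = []
    for _ in range(n_runs):
        feat = teacher.stem(x)                           # assumed: first conv block
        feat = F.dropout(feat, p=p_drop, training=True)  # dropout forced on at test time
        logits = teacher.head(feat)                      # assumed: remaining layers
        probs.append(torch.softmax(logits, dim=1))
    probs = torch.stack(probs, dim=0)                    # (n_runs, B, 2, H, W)
    prediction = probs.mean(dim=0).argmax(dim=1)         # (B, H, W) hard pseudo labels
    uncertainty = probs.var(dim=0).mean(dim=1)           # (B, H, W) variance over runs
    return prediction, uncertainty

def class_balanced_pseudo_labels(prediction, uncertainty, keep_ratio=(0.9, 0.7),
                                 ignore_index=-1):
    """Keep the lowest-uncertainty fraction of pixels per class (90% background,
    70% landslide in the paper); all other pixels are ignored in the loss."""
    pseudo = prediction.clone()
    for cls, ratio in enumerate(keep_ratio):
        mask = prediction == cls
        if mask.sum() == 0:
            continue
        # Class-wise uncertainty threshold; the paper computes it over the whole
        # test set, here it is computed per batch for simplicity.
        thresh = torch.quantile(uncertainty[mask].float(), ratio)
        pseudo[mask & (uncertainty > thresh)] = ignore_index
    return pseudo
```

The resulting pseudo-label map can then be fed to a standard segmentation loss with `ignore_index=-1`, matching the masking used in Eq. (3).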
2.4. Mix-up Strategy

To prevent the model from overfitting to the training data, a mix-up strategy [9] is applied to both the training and the test data to further increase generalizability. Given a batch of training and test data, the mixed data are generated by

    \hat{x}_{tr} = \lambda x_{tr} + (1-\lambda)\, x'_{tr},
    \hat{x}_{te} = \lambda x_{te} + (1-\lambda)\, x'_{te}.    (4)

Here x'_tr is derived from x_tr by shuffling all image patches within the same batch (and x'_te from x_te accordingly), and λ is a scalar sampled from a predefined beta distribution during training. Denoting by y'_tr and ŷ'_te the labels and pseudo labels of the shuffled patches, the supervised and pseudo-label losses are reformulated as

    \mathcal{L}_{sup}^{mix} = \lambda\,\mathcal{L}_{sup}(\hat{x}_{tr}, y_{tr}) + (1-\lambda)\,\mathcal{L}_{sup}(\hat{x}_{tr}, y'_{tr}),
    \mathcal{L}_{ps}^{mix} = \lambda\,\mathcal{L}_{ps}(\hat{x}_{te}, \hat{y}_{te}) + (1-\lambda)\,\mathcal{L}_{ps}(\hat{x}_{te}, \hat{y}'_{te}).    (5)
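A possible implementation of the mix-up of Eqs. (4)-(5) is sketched below. The beta-distribution parameter `alpha` is an assumption; the paper only states that λ is drawn from a predefined beta distribution.

```python
import torch
import torch.nn.functional as F  # only needed for the illustrative usage below

def mixup_batch(x, alpha=0.2):
    """Mix each patch with a randomly shuffled patch from the same batch, Eq. (4).
    Returns the mixed batch, the shuffle permutation, and lambda."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, perm, lam

def mixup_loss(criterion, logits, target, perm, lam):
    """Mixed loss of Eq. (5): the criterion against the original and the
    shuffled targets, weighted by lambda and (1 - lambda)."""
    return lam * criterion(logits, target) + (1.0 - lam) * criterion(logits, target[perm])

# Illustrative usage on a labeled batch:
#   x_hat, perm, lam = mixup_batch(x_tr)
#   loss_sup_mix = mixup_loss(F.cross_entropy, student(x_hat), y_tr, perm, lam)
```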
2.5. Post-processing

We apply a dense conditional random field (DenseCRF) [10] as a post-processing technique to better match the predicted landslide contours to the ground truth.

3. Experiments

3.1. Datasets

The proposed method is developed and evaluated on the Landslide4Sense competition data [5]. The provided data consist of 12 Sentinel-2 bands and 2 topographic bands (slope and DEM), the latter derived from ALOS PALSAR. Each band is resampled to 10 m resolution per pixel, and the data are cropped into 128 × 128 patches. In total, 3799, 245 and 800 patches are provided for training, validation and testing, respectively.

3.2. Implementation Details

For the overall training setting, we use the SGD optimizer with Nesterov acceleration, where the momentum and weight decay are set to 0.9 and 5 × 10^-4, respectively. The batch size is set to 16, and training lasts for 60,000 iterations. For data pre-processing, we normalize the first 12 bands by linearly scaling them to the range [0, 1]. For data augmentation, we perform random flipping, random resizing and cropping, and finally resize each patch to 256 × 256.

The Landslide4Sense competition comprises a validation phase and a test phase. During the validation phase, only the validation data are released. During the test phase, the test data become available, but the number of submissions for evaluation is limited. With this as background, the workflow for training our final model is as follows.

* Model 1. We first train a base model using solely the training data, i.e. the teacher branch in Fig. 1 is blocked. ResNet50 [11] and DeepLab V3+ [12] are used as the backbone and the decoder, respectively. The ResNet50 backbone is initialized with ImageNet pretrained weights. Training lasts for only 30,000 iterations to avoid overfitting.
* Model 2. This model is developed during the validation phase, where Model 1 serves as the teacher model and the validation data are used as unlabeled data. The architecture is based on HRNet [13].
* Model 3. Compared to Model 2, the only difference is that Model 3 uses a ResNeXt50 [14] backbone with a DeepLab V3+ [12] architecture.
* Final Model. The final model uses all the validation and test data as unlabeled data. Following Fig. 1, its student model is pre-trained on Model 3, and Model 2 is used as the teacher model.
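The optimizer and input normalization described in Sec. 3.2 could be configured roughly as follows. This is a sketch under stated assumptions: the learning rate and the per-band scaling statistics are not reported in the paper and are placeholders here.

```python
import torch

def build_optimizer(model, lr=0.01):
    """SGD with Nesterov momentum, momentum 0.9 and weight decay 5e-4 (Sec. 3.2).
    The learning rate is not reported in the paper; 0.01 is a placeholder."""
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                           weight_decay=5e-4, nesterov=True)

def normalize_bands(patch, band_min, band_max):
    """Linearly rescale the first 12 (Sentinel-2) bands of a (14, H, W) patch
    to [0, 1]. The per-band min/max statistics are assumed to be precomputed,
    e.g. from the training set; the paper does not specify the scaling bounds."""
    x = patch.clone().float()
    scale = (band_max - band_min).clamp(min=1e-6)
    x[:12] = (x[:12] - band_min[:, None, None]) / scale[:, None, None]
    x[:12] = x[:12].clamp(0.0, 1.0)   # keep the normalized bands within [0, 1]
    return x
```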
3.3. Results

The final results on the test leaderboard are shown in Tab. 1. For our method, we report the results of the Final Model and Model 2; due to the limited number of submissions, the other models were not evaluated. Comparing Model 2 with the Final Model, one can observe that pre-training on a different architecture (Model 3) helps to improve the performance of the Final Model.

Some qualitative results on the test data are shown in Fig. 2. The proposed method can successfully distinguish road areas from landslides, which are similar to each other in RGB appearance. However, some small landslides that fall on roads are also missed (see the first two rows). Comparing the raw predictions with the post-processed results, we notice that DenseCRF removes some isolated landslide predictions, but helps to shrink the remaining ones so that they better fit the spatial topology (see the red rectangles in the last column).

Figure 2: Qualitative results on test data. From left to right, the columns show the RGB, DEM and slope channels of the data, the MC-dropout-based uncertainty maps, the predictions of the network, and the post-processed results of DenseCRF.

Table 1: F1 score (%) during the test phase.

Team Name            F1
kingdrone            74.54
seek                 73.99
ours (Final Model)   73.50
ours (Model 2)       72.50
sikui                71.87
sklgp                71.29
bao18                70.15

3.4. Ablation Study

We perform an ablation study on the validation data and list the results in Tab. 2. Both Model 2 and Model 3 are superior to Model 1 by a large margin. In addition, if the self-training branch is blocked, the performance decreases. This demonstrates the effectiveness of the proposed self-training method.

Table 2: Ablation study results during the validation phase (%). "w/o ST" means the self-training (teacher model) branch in Fig. 1 is blocked. "+ CRF" means DenseCRF is applied as the post-processing method.

Model               Precision  Recall  F1
Model 1             69.70      82.60   75.60
Model 1 + CRF       76.82      80.48   78.61
Model 2 (w/o ST)    66.96      81.23   73.41
Model 2             75.60      82.21   78.76
Model 2 + CRF       82.45      78.36   80.35
Model 3 (w/o ST)    65.63      82.31   73.03
Model 3             73.89      82.34   77.88
Model 3 + CRF       80.19      78.94   79.56

4. Conclusions

This paper studies the landslide detection problem and proposes a self-training method to improve the generalizability of semantic segmentation models. The experimental results on the Landslide4Sense dataset demonstrate that the proposed method helps to bridge the gap between labeled and unlabeled data.

Acknowledgments

This work is sponsored by the China Scholarship Council.

References

[1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[2] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, 2015, pp. 234–241.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[4] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, SegFormer: Simple and efficient design for semantic segmentation with transformers, Advances in Neural Information Processing Systems 34 (2021) 12077–12090.
[5] O. Ghorbanzadeh, Y. Xu, P. Ghamisi, M. Kopp, D. Kreil, Landslide4Sense: Reference benchmark data and deep learning models for landslide detection, arXiv preprint arXiv:2206.00515 (2022).
[6] O. Tasar, A. Giros, Y. Tarabalka, P. Alliez, S. Clerc, DAugNet: Unsupervised, multisource, multitarget, and life-long domain adaptation for semantic segmentation of satellite images, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 1067–1081.
[7] Y. Zou, Z. Yu, B. Kumar, J. Wang, Unsupervised domain adaptation for semantic segmentation via class-balanced self-training, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 289–305.
[8] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[9] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412 (2017).
[10] P. Krähenbühl, V. Koltun, Efficient inference in fully connected CRFs with Gaussian edge potentials, Advances in Neural Information Processing Systems 24 (2011).
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587 (2017).
[13] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2020) 3349–3364.
[14] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.