On the Generalization of the Semantic Segmentation Model
for Landslide Detection
Fahong Zhang¹, Yilei Shi², Qingsong Xu¹, Zhitong Xiong¹, Wei Yao³ and Xiao Xiang Zhu¹,³
¹ Data Science in Earth Observation, Technical University of Munich (TUM), Munich, Germany
² Chair of Remote Sensing Technology (LMF), Technical University of Munich, Munich, Germany
³ Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Weßling, Germany


Abstract
The goal of landslide detection is to detect regions affected by landslide events, which is critical for emergency response and disaster monitoring. This study is carried out in the context of the Landslide4Sense competition, whose goal is to promote effective and innovative algorithms for detecting landslides across different continents using Sentinel-2 and ALOS PALSAR data. Considering its global-scale coverage, studying the generalization performance of landslide detection models on unseen regions becomes an important task. To this end, we propose a self-training method that improves the generalizability of the landslide detection model by exploiting the pseudo labels of unlabeled samples with low uncertainty. According to the experimental results, the proposed self-training method is effective in bridging the shift between labeled and unlabeled data, and achieves 3rd place in the Landslide4Sense competition.

Keywords
Landslide detection, Semantic segmentation, Self-training, Domain adaptation



1. Introduction

With ongoing climate change and rapid urbanization in landslide-prone terrain, landslides have become an increasingly threatening hazard in mountainous areas and have started to affect a large population. In order to accurately and rapidly monitor landslide events occurring around the world, satellite data are considered a promising data source owing to their global coverage and their relatively high temporal and spectral resolution.

From a technical point of view, landslide detection based on satellite data can be regarded as a binary semantic segmentation problem, where a learning-based model is required to distinguish landslides from background areas. In the computer vision community, semantic segmentation has long been a popular research topic. From the earlier Fully Convolutional Network (FCN) [1, 2] to the currently dominating transformer-based approaches [3, 4], tremendous improvements have been witnessed along with the development of network architectures. As reported in [5], several baseline semantic segmentation models have demonstrated promising performance on the task of landslide detection.

In addition to designing more sophisticated and task-specific network architectures, research on the transferability of semantic segmentation models is also of great importance. Due to different atmospheric conditions, acquisition angles and illumination, satellite data from different regions may exhibit large domain shifts [6]. As a result, a semantic segmentation model trained on specific areas may fail to generalize to unseen regions across the world and to different periods of time.

Self-training approaches have been demonstrated to be effective in promoting the generalizability of deep learning models in the fields of semi-supervised learning and domain adaptation [7]. They first generate pseudo labels for the unlabeled data using a teacher model pre-trained on the labeled data. The pseudo labels with high confidence are then used to supervise the training of a student model on the unlabeled data. With this in mind, we propose a self-training method based on Monte-Carlo dropout uncertainty [8] and class-balanced thresholding. The contributions of this paper are as follows:

    • We propose a self-training method based on Monte-Carlo dropout uncertainty and class-balanced thresholding for the task of landslide detection. The experimental results demonstrate that the proposed method provides significant improvements over the baseline and helps to improve the generalizability of semantic segmentation models.
    • We demonstrate the effectiveness of the proposed method on the Landslide4Sense competition, where we achieve the 3rd prize with a test F1 score of 73.50%.

CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
fahong.zhang@tum.de (F. Zhang); yilei.shi@tum.de (Y. Shi); qingsong.xu@tum.de (Q. Xu); zhitong.xiong@tum.de (Z. Xiong); wei.yao@dlr.de (W. Yao); xiaoxiang.zhu@tum.de (X. X. Zhu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
[Figure 1: block diagram. An unlabeled image is passed through the frozen teacher model to obtain a softmax output and an uncertainty map, which are thresholded into pseudo labels; the labeled and unlabeled images are mixed up, augmented and passed through the student model, whose target prediction is supervised by the pseudo label loss and whose source prediction is supervised by the supervised loss against the provided labels.]

Figure 1: Pipeline of the proposed self-training method. In each training step, a batch of labeled and unlabeled data is fed to the teacher and the student models, where data augmentation and a mix-up operation [9] are applied in the student branch. For labeled data, the supervised losses are calculated from the provided labels. For unlabeled data, we first apply Monte-Carlo dropout [8] to the teacher model to estimate the uncertainty of the predictions, and then generate the pseudo labels based on a class-balanced threshold (see Sec. 2.3). The teacher model is kept fixed during training.


2. Methodology

We illustrate the pipeline of the proposed method in Fig. 1. The remaining parts of this section formulate the landslide detection problem and elaborate on the methodology in detail.

2.1. Problem Formulation

In the landslide detection problem, we are given a set of labeled training data $\mathcal{D}_{train} = \{x_{tr}, y_{tr}\}$ and unlabeled test data $\mathcal{D}_{test} = \{x_{te}\}$, where $x_{tr}$, $y_{tr}$, and $x_{te} \in \mathbb{R}^{H \times W}$ denote a training patch, its training label, and a test patch, respectively. Our task is to train a semantic segmentation model on $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$, and to optimize its performance on $\mathcal{D}_{test}$. The overall loss function of the proposed method is

    \mathcal{L} = \mathcal{L}_{sup}^{mix} + \mathcal{L}_{pse}^{mix}.    (1)

The mixed supervised loss $\mathcal{L}_{sup}^{mix}$ and the pseudo label loss $\mathcal{L}_{pse}^{mix}$ are formulated in Sec. 2.4.

2.2. Supervised Losses

We use the cross-entropy loss and the Jaccard loss as the supervised losses:

    \mathcal{L}_{sup}(x_{tr}, y_{tr}) = \mathcal{L}_{ce}(x_{tr}, y_{tr}) + \mathcal{L}_{jac}(x_{tr}, y_{tr}).    (2)

2.3. Self-training

As shown in Fig. 1, a teacher model pre-trained on the training data is used to generate pseudo labels for supervising the student model. However, since the raw pseudo labels are usually noisy, a selection strategy is required to filter out misclassified pixels.

First, we use the Monte-Carlo dropout strategy [8] to estimate an uncertainty map for each input test patch. More specifically, we forward the test patch through the teacher (source) model in 10 different runs. In each run, random dropout with a dropping rate of 0.3 is applied to the feature map produced by the first convolution layer. The variance of the 10 output logits is taken as the uncertainty map.

Second, we mask out the uncertain predictions of the teacher model. Inspired by [7], we select, for each class, a certain proportion of the pixels with the lowest uncertainty among all the test data: 90% of the background pixels and 70% of the landslide pixels are kept, and the remaining pixels are ignored when calculating the losses. Finally, the pseudo label loss is formulated as

    \mathcal{L}_{pse}(x_{te}, \hat{y}_{te}) = \mathcal{L}_{ce}(x_{te}, \hat{y}_{te}) + \mathcal{L}_{jac}(x_{te}, \hat{y}_{te}),    (3)

where $\hat{y}_{te}$ denotes the pseudo labels generated by the teacher model.
2.4. Mix-up Strategy

To prevent the model from overfitting to the training data, a mix-up strategy [9] is applied to both the training and test data to further increase the generalizability. Given a batch of training and test data, the mixed data are generated by

    \tilde{x}_{tr} = \lambda x_{tr} + (1 - \lambda) x'_{tr},
    \tilde{x}_{te} = \lambda x_{te} + (1 - \lambda) x'_{te},    (4)

where $x'_{tr}$ is derived from $x_{tr}$ by shuffling all the image patches within the same batch, and $\lambda$ is a scalar randomly sampled from a predefined beta distribution during training. The supervised and pseudo label losses can then be reformulated as

    \mathcal{L}_{sup}^{mix} = \lambda \mathcal{L}_{sup}(\tilde{x}_{tr}, y_{tr}) + (1 - \lambda) \mathcal{L}_{sup}(\tilde{x}_{tr}, y'_{tr}),
    \mathcal{L}_{pse}^{mix} = \lambda \mathcal{L}_{pse}(\tilde{x}_{te}, \hat{y}_{te}) + (1 - \lambda) \mathcal{L}_{pse}(\tilde{x}_{te}, \hat{y}'_{te}).    (5)
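To make Eqs. (2), (4) and (5) concrete, the following PyTorch-style sketch performs one optimization step of the student model. It is illustrative rather than our exact implementation: `student` is any network producing (B, 2, H, W) logits, the Beta parameter `alpha=0.2` is a placeholder (the paper only states that a predefined beta distribution is used), and pixels marked with `ignore_index` (pseudo labels discarded in Sec. 2.3) are excluded from both losses.

```python
# Sketch of Eqs. (2), (4) and (5): mixed cross-entropy + Jaccard supervision.
import torch
import torch.nn.functional as F


def jaccard_loss(logits, target, ignore_index=255, eps=1e-6):
    """Soft IoU loss on the landslide (foreground) channel."""
    valid = (target != ignore_index).float()
    prob = F.softmax(logits, dim=1)[:, 1]          # P(landslide)
    tgt = (target == 1).float()
    inter = (prob * tgt * valid).sum()
    union = ((prob + tgt) * valid).sum() - inter
    return 1.0 - (inter + eps) / (union + eps)


def seg_loss(logits, target, ignore_index=255):
    """Eq. (2)/(3): cross-entropy plus Jaccard loss."""
    ce = F.cross_entropy(logits, target, ignore_index=ignore_index)
    return ce + jaccard_loss(logits, target, ignore_index)


def mixup_step(student, optimizer, x_tr, y_tr, x_te, y_pse, alpha=0.2):
    """One optimization step with mixed labeled and pseudo-labeled batches."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm_tr = torch.randperm(x_tr.size(0))         # defines x'_tr in Eq. (4)
    perm_te = torch.randperm(x_te.size(0))
    x_tr_mix = lam * x_tr + (1 - lam) * x_tr[perm_tr]
    x_te_mix = lam * x_te + (1 - lam) * x_te[perm_te]

    pred_tr, pred_te = student(x_tr_mix), student(x_te_mix)
    loss_sup = lam * seg_loss(pred_tr, y_tr) \
        + (1 - lam) * seg_loss(pred_tr, y_tr[perm_tr])          # Eq. (5)
    loss_pse = lam * seg_loss(pred_te, y_pse) \
        + (1 - lam) * seg_loss(pred_te, y_pse[perm_te])
    loss = loss_sup + loss_pse                                  # Eq. (1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

In our experiments the student is optimized with SGD with Nesterov acceleration, momentum 0.9 and weight decay 5 × 10⁻⁴ (see Sec. 3.2); in the sketch this corresponds to, e.g., `torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4, nesterov=True)`, where the learning rate is a placeholder since it is not reported here.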
2.5. Post-processing

We apply the dense conditional random field (DenseCRF) [10] as a post-processing step to better match the predicted landslide contours to the ground truth.
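As a rough illustration of this post-processing step, the sketch below shows typical DenseCRF usage with the `pydensecrf` package. The pairwise kernel parameters are generic defaults for illustration, not necessarily the values used in our experiments.

```python
# Sketch of DenseCRF refinement with pydensecrf (assumed available).
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax


def refine_with_densecrf(probs, rgb, iterations=5):
    """probs: (2, H, W) softmax output; rgb: (H, W, 3) uint8 image."""
    height, width = rgb.shape[:2]
    crf = dcrf.DenseCRF2D(width, height, 2)
    unary = unary_from_softmax(probs.astype(np.float32))   # -log(p) unaries
    crf.setUnaryEnergy(np.ascontiguousarray(unary))
    crf.addPairwiseGaussian(sxy=3, compat=3)               # smoothness kernel
    crf.addPairwiseBilateral(sxy=60, srgb=10, compat=5,
                             rgbim=np.ascontiguousarray(rgb))  # appearance kernel
    q = np.array(crf.inference(iterations)).reshape(2, height, width)
    return q.argmax(0).astype(np.uint8)                    # refined landslide mask
```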

3. Experiments

3.1. Datasets

The proposed method is developed and evaluated on the data of the Landslide4Sense competition [5]. The provided data consist of 12 Sentinel-2 bands and 2 topographic bands, SLOP and DEM, both derived from ALOS PALSAR. Each band is resampled to a resolution of 10 m per pixel, and the data are cropped into 128 × 128 patches. In total, 3799, 245 and 800 patches are provided for training, validation and testing, respectively.
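For concreteness, the sketch below reads a single patch and applies the band scaling described in Sec. 3.2. The HDF5 layout assumed here (an `img` dataset of shape 128 × 128 × 14 with the 12 Sentinel-2 bands first, followed by the topographic bands, and a `mask` dataset for the label) and the per-patch min–max scaling are assumptions made for illustration, not taken from the official competition documentation.

```python
# Sketch of loading and normalizing one Landslide4Sense patch (assumed HDF5 layout).
import h5py
import numpy as np


def load_patch(img_path, mask_path=None):
    with h5py.File(img_path, "r") as f:
        img = np.asarray(f["img"], dtype=np.float32)        # (128, 128, 14)
    # Linearly scale the 12 Sentinel-2 bands to [0, 1] per band (Sec. 3.2).
    for band in range(12):
        values = img[..., band]
        img[..., band] = (values - values.min()) / (values.max() - values.min() + 1e-8)
    mask = None
    if mask_path is not None:
        with h5py.File(mask_path, "r") as f:
            mask = np.asarray(f["mask"], dtype=np.int64)    # (128, 128), 1 = landslide
    return img, mask
```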
3.2. Implementation Details

For the overall training setting, we use the SGD optimizer with Nesterov acceleration, with momentum and weight decay set to 0.9 and 5 × 10⁻⁴, respectively. The batch size is set to 16, and the training lasts for 60,000 iterations. For data pre-processing, we normalize the first 12 bands by linearly scaling them to the range [0, 1]. For data augmentation, we perform random flipping, random resizing and cropping, and finally resize the patches to a size of 256 × 256.

The Landslide4Sense competition consists of a validation phase and a test phase. During the validation phase, only the validation data are released. During the test phase, the test data become available, but the number of submissions for evaluation is limited. With this as background, the workflow for training our final model is as follows.

    • Model 1. We first train a base model using solely the training data, which means the teacher branch in Fig. 1 is blocked. ResNet50 [11] and DeepLab V3+ [12] are used as the backbone and the decoder, respectively. The ResNet50 backbone is initialized with ImageNet pre-trained weights. The training lasts for only 30,000 iterations to avoid overfitting.
    • Model 2. This model is developed during the validation phase, where we use Model 1 as the teacher model and the validation data as the unlabeled data. The architecture is based on HRNet [13].
    • Model 3. Compared to Model 2, the only difference is that Model 3 uses a ResNeXt50 [14] backbone with a DeepLab V3+ [12] decoder.
    • Final Model. The final model uses all the validation and test data as unlabeled data. Following Fig. 1, its student model is initialized from Model 3, and Model 2 is used as the teacher model.

3.3. Results

The final results on the test leaderboard are shown in Tab. 1. For our method, we report the results of the Final Model and Model 2; due to the limited number of submissions, the other models were not evaluated. By comparing the results of Model 2 to those of the Final Model, one can observe that pre-training on a different architecture (Model 3) helps to improve the performance of the Final Model.

Some qualitative results on the test data are shown in Fig. 2. According to the results, the proposed method can successfully distinguish road areas from landslides, which are similar to each other in RGB appearance. However, some small landslides that fall onto roads are also missed (see the first two rows). By comparing the raw predictions with the post-processed results, we notice that DenseCRF removes some isolated landslide predictions, but helps to shrink the remaining ones to better fit the spatial topology (see the red rectangles in the last column).

3.4. Ablation Study

We perform the ablation study on the validation data and list the results in Tab. 2. It can be observed that both Model 2 and Model 3 are superior to Model 1 by a large margin. In addition, if the self-training branch is blocked, the performance decreases. This demonstrates the effectiveness of the proposed self-training method.
Figure 2: Qualitative results on test data. From left to right, the columns show the RGB, DEM and SLOP channels of the data, the MC-dropout-based uncertainty maps, the predictions of the network, and the post-processed results of DenseCRF.


Table 1
F1 score (%) during the test phase.

    Team Name             F1
    kingdrone             74.54
    seek                  73.99
    ours (Final Model)    73.50
    ours (Model 2)        72.50
    sikui                 71.87
    sklgp                 71.29
    bao18                 70.15

Table 2
Ablation study results during the validation phase (%). "w/o ST" means the self-training (teacher model) branch in Fig. 1 is blocked. "CRF" means DenseCRF is applied as the post-processing method.

    Model                 Precision    Recall    F1
    Model 1               69.70        82.60     75.60
    Model 1 + CRF         76.82        80.48     78.61
    Model 2 (w/o ST)      66.96        81.23     73.41
    Model 2               75.60        82.21     78.76
    Model 2 + CRF         82.45        78.36     80.35
    Model 3 (w/o ST)      65.63        82.31     73.03
    Model 3               73.89        82.34     77.88
    Model 3 + CRF         80.19        78.94     79.56
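The scores in Tab. 1 and Tab. 2 are the precision, recall and F1 of the landslide class. For reference, the minimal sketch below shows how such scores can be computed from binary masks; it is not the official competition evaluation script.

```python
# Sketch of the reported metrics for the landslide (positive) class.
import numpy as np


def precision_recall_f1(pred, target, eps=1e-8):
    """pred, target: binary arrays where 1 marks landslide pixels."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1
```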


4. Conclusions

This paper studies the landslide detection problem and proposes a self-training method to improve the generalizability of the semantic segmentation model. The experimental results on the Landslide4Sense dataset demonstrate that the proposed method helps to bridge the gap between labeled and unlabeled data.

Acknowledgments

This work is sponsored by the China Scholarship Council.
References

 [1] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
 [2] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention - MICCAI, volume 9351, 2015, pp. 234–241.
 [3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weis-
     senborn, X. Zhai, T. Unterthiner, M. Dehghani,
     M. Minderer, G. Heigold, S. Gelly, et al., An image is
     worth 16x16 words: Transformers for image recog-
     nition at scale, arXiv preprint arXiv:2010.11929
     (2020).
 [4] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Al-
     varez, P. Luo, Segformer: Simple and efficient de-
     sign for semantic segmentation with transformers,
     Advances in Neural Information Processing Sys-
     tems 34 (2021) 12077–12090.
 [5] O. Ghorbanzadeh, Y. Xu, P. Ghamisi, M. Kopp,
     D. Kreil, Landslide4sense: Reference benchmark
     data and deep learning models for landslide detec-
     tion, arXiv preprint arXiv:2206.00515 (2022).
 [6] O. Tasar, A. Giros, Y. Tarabalka, P. Alliez, S. Clerc,
     Daugnet: Unsupervised, multisource, multitarget,
     and life-long domain adaptation for semantic seg-
     mentation of satellite images, IEEE Transactions
     on Geoscience and Remote Sensing 59 (2020) 1067–
     1081.
 [7] Y. Zou, Z. Yu, B. Kumar, J. Wang, Unsupervised
     domain adaptation for semantic segmentation via
     class-balanced self-training, in: Proceedings of the
     European conference on computer vision (ECCV),
     2018, pp. 289–305.
 [8] Y. Gal, Z. Ghahramani, Dropout as a bayesian ap-
     proximation: Representing model uncertainty in
     deep learning, in: international conference on ma-
     chine learning, PMLR, 2016, pp. 1050–1059.
 [9] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz,
     mixup: Beyond empirical risk minimization, arXiv
     preprint arXiv:1710.09412 (2017).
[10] P. KrΓ€henbΓΌhl, V. Koltun, Efficient inference in
     fully connected crfs with gaussian edge potentials,
     Advances in neural information processing systems
     24 (2011).
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learn-
     ing for image recognition, in: Proceedings of the
     IEEE conference on computer vision and pattern
     recognition, 2016, pp. 770–778.
[12] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587 (2017).
[13] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al., Deep high-resolution representation learning for visual recognition, IEEE transactions on pattern analysis and machine intelligence 43 (2020) 3349–3364.
[14] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.