A Center-Masked Convolutional Transformer for
Hyperspectral Image Classification
Yifan Wang1, Shuguo Jiang1, Meng Xu1, Shuyu Zhang1 and Sen Jia1,*
1 College of Computer Science and Software Engineering, Shenzhen University, China


Abstract
Hyperspectral images (HSIs) have a wide field of view and rich spectral information, where each pixel represents a small area of the earth's surface. The pixel-level classification task of HSI has become one of the research hotspots in hyperspectral image processing and analysis. More and more deep learning methods have been proposed in recent years, among which the convolutional neural network (CNN) is the most influential. However, it is difficult for CNN-based models to obtain a global receptive field in the HSI classification task. Besides, most self-supervised training methods are based on sample reconstruction, and it is not easy to achieve effective use of unlabeled samples. In this paper, we propose a novel convolutional embedding module, combined with Transformer blocks, which successfully improves context-awareness while retaining the local feature extraction capability. Moreover, a new self-supervised task is designed to make more efficient use of unlabeled data. Our proposed pre-training task only masks the central token and reconstructs the central pixel from a learnable vector. It allows the model to capture the patterns between the central object and surrounding objects without labels.

Keywords
Deep learning, Masked autoencoder, Transformer, Hyperspectral image classification.



CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
* Corresponding author.
2070276050@email.szu.edu.cn (Y. Wang); shuguoj@foxmail.com (S. Jiang); m.xu@szu.edu.cn (M. Xu); shuyu-zhang@szu.edu.cn (S. Zhang); senjia@szu.edu.cn (S. Jia)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Hyperspectral images are generally composed of dozens to hundreds of bands and are characterized by low spatial resolution and high spectral resolution. The spectral information makes it possible to distinguish the corresponding land covers, which has spawned various research fields. Among them, pixel-level hyperspectral image classification is the one that has received the most attention in the community. Its main task is to assign a class label to each pixel, somewhat like semantic segmentation in the computer vision (CV) field. Different from RGB images, hyperspectral images are high-dimensional data. To avoid the curse of dimensionality, principal component analysis (PCA) [1] and independent component analysis [2] are widely used for redundancy elimination.

So far, many hyperspectral image classification methods have been proposed, and deep learning methods have taken the lead. According to the techniques used, they can be divided into traditional methods and deep learning-based methods. In early research, a single pixel and all of its spectral information were mostly selected as the training sample, and classification relied on traditional classifiers such as logistic regression [3], decision tree [4], random forest [5], and support vector machine (SVM) [6] to classify the ground objects through spectral information. However, the imaging distance of HSI is long and many interference factors arise in the process, so the spectral curves of different surface objects are not always easy to distinguish. This makes it difficult for these methods to achieve good performance in complex scenes. In recent years, deep learning methods have gradually become popular, among which CNN-based methods are dominant. Hu et al. [7] made a preliminary attempt in which several 1-D convolutional layers are stacked to extract local spectral information, and many classical data augmentation methods from CV were introduced. Roy et al. [8] combined 3D-CNN and 2D-CNN to achieve hierarchical feature learning. In addition, other neural networks have also achieved good performance. Zhou et al. [9] designed a two-branch Long Short-Term Memory (LSTM) network to extract spectral information and spatial information respectively. He et al. [10] proposed a pure multilayer perceptron (MLP) network, proving that the MLP network still has potential. Hong et al. [11] designed a mini-batch graph neural network. It is worth mentioning that the recently prevalent Transformer model has also been introduced. Hu et al. [12] used a 1-D convolution as an embedding layer combined with Transformer blocks. Hong et al. [13] analyzed the difference between the Transformer and other classical neural networks in detail and proposed a ViT-based SpectralFormer for spectral information learning. Zhong et al. [14] proposed a spatial-spectral Transformer network and a model structure search framework. Dang et al. [15] combined a spectral-spatial attention module with densely connected Transformer blocks. Besides, a self-attention network has also been used to address adversarial attacks that may be encountered in hyperspectral classification tasks [16].
However, limited by the size of the receptive field, it is difficult for CNN models to capture global relationships. Meanwhile, deep learning models are data-driven, which means that more labeled data leads to better model performance. However, obtaining such a large number of labeled samples in practical applications is expensive, so effectively using unlabeled data has become an urgent need. Self-supervised pre-training for the HSI classification task is still stuck in autoencoder-based sample reconstruction [17]. This article proposes a band-grouping-based 3D convolutional Transformer (BG3DCT) and a new self-supervised task for model pre-training. The main contributions are listed as follows:

    • A novel band-grouping-based 3D convolutional Transformer is designed for HSI classification. We replace the commonly used linear embedding module with a well-designed 3D convolutional embedding module, combined with a spectral segmentation strategy, to achieve efficient spatial-spectral feature embedding in each sub-band.

    • According to the characteristics of hyperspectral data, a new pre-training task is proposed. In the process of masking and reconstructing the center pixel, the model's ability to capture the relationship between the center pixel and surrounding pixels is improved. Compared with the overall sample reconstruction task, the center-masked pre-training task is more efficient for the representation of the center area in the pre-training stage.

    • A series of comparative experiments and ablation experiments demonstrate the effectiveness of our proposed pre-training method and the BG3DCT network. In particular, our proposed pre-training method can alleviate the instability of results caused by random sampling in the limited-training-samples scenario.

The rest of this paper is organized as follows. Our proposed method is introduced in detail in Section 2. The comparative experiments and result analysis are provided in Section 3, and Section 4 presents the conclusions.


2. Methodology

2.1. BG3DCT Network

The BG3DCT network has three parts: the band-grouping-based 3D convolutional embedding (BG3DCE) module, the Transformer encoder, and the MLP head. The specific design is as follows.

2.1.1. BG3DCE Module

Considering the differences between RGB images and HSIs, the ViT network is not well suited to hyperspectral images. When the training samples are limited, the linear embedding module cannot sufficiently characterize the spatial-spectral features. Meanwhile, a CNN-based module is more adaptable to this situation, while also being able to capture local texture information. So we design a band-grouping-based 3D convolutional embedding module for HSI embedding. Firstly, we apply PCA to the input samples and employ a spectral partition strategy to divide the spectra into several sub-bands of equal length. Because the spectral curves of objects often have local differences, 3D-CNN extraction on sub-bands is more efficient than on the full band. Then, parallel 3D convolutions are performed twice on each sub-band, each followed by a 3D batch normalization operation to unify the feature deviations generated from each sub-band. Finally, we concatenate the features to maintain the relative positional relationship between sub-bands, and a lightweight 2D-CNN is used for feature fusion and compression. In particular, the parallel 3D convolution has a simple implementation, since the convolution function includes a grouped convolution option. The detailed description of the BG3DCE module is listed in Table 1.

Table 1
The Detailed Description of the BG3DCE Module

    Layer          Kernel Size   Output Shape      Groups   Param
    Input          -             (8, 13, 13, 10)   -        -
    Conv3D-1       (3, 3, 3)     (32, 11, 11, 8)   8        896
    BatchNorm3d-1  -             -                 -        64
    Conv3D-2       (3, 3, 3)     (32, 9, 9, 6)     8        3,488
    BatchNorm3d-2  -             -                 -        64
    Reshape        -             (192, 9, 9)       -        -
    Conv2D         (1, 1)        (128, 9, 9)       8        3,200
    Reshape        -             (81, 128)         -        -
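For concreteness, the layer configuration of Table 1 can be realized with grouped convolutions roughly as in the following minimal PyTorch sketch. This is our reading of the table rather than the released implementation: the module name, the (sub-band, height, width, bands-per-sub-band) input layout, and the ReLU activations are assumptions made for illustration, while the layer shapes and parameter counts match Table 1.

```python
import torch
import torch.nn as nn

class BG3DCE(nn.Module):
    """Band-grouping 3D convolutional embedding (illustrative sketch of Table 1)."""
    def __init__(self, n_groups=8, bands_per_group=10, embed_dim=128):
        super().__init__()
        # Grouped 3D convolutions: each sub-band (channel group) is convolved
        # independently, realizing the parallel per-sub-band extraction.
        self.conv1 = nn.Conv3d(n_groups, 32, kernel_size=3, groups=n_groups)  # 896 params
        self.bn1 = nn.BatchNorm3d(32)                                         # 64 params
        self.conv2 = nn.Conv3d(32, 32, kernel_size=3, groups=n_groups)        # 3,488 params
        self.bn2 = nn.BatchNorm3d(32)                                         # 64 params
        # Lightweight grouped 1x1 2D convolution for feature fusion/compression.
        self.fuse = nn.Conv2d(32 * (bands_per_group - 4), embed_dim,
                              kernel_size=1, groups=n_groups)                 # 3,200 params
        self.act = nn.ReLU(inplace=True)  # activation not specified in Table 1; assumed here

    def forward(self, x):
        # x: (batch, sub_bands=8, 13, 13, bands_per_sub_band=10), as in Table 1
        x = self.act(self.bn1(self.conv1(x)))                   # -> (B, 32, 11, 11, 8)
        x = self.act(self.bn2(self.conv2(x)))                   # -> (B, 32, 9, 9, 6)
        b, c, h, w, d = x.shape
        x = x.permute(0, 1, 4, 2, 3).reshape(b, c * d, h, w)    # -> (B, 192, 9, 9)
        x = self.fuse(x)                                        # -> (B, 128, 9, 9)
        return x.flatten(2).transpose(1, 2)                     # -> (B, 81, 128) token sequence

# Quick shape check on a random patch.
tokens = BG3DCE()(torch.randn(2, 8, 13, 13, 10))
print(tokens.shape)  # torch.Size([2, 81, 128])
```

The resulting 81 tokens of dimension 128 are what is fed to the Transformer encoder described next.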
Figure 1: Architecture of the proposed band-grouping-based 3D convolutional Transformer for HSI classification.




Figure 2: Architecture of the proposed center-mask pre-training task.



2.1.2. Transformer Encoder

The context-awareness of a CNN usually requires the model to go deeper, but HSI data are limited, so it is difficult to simply stack modules as models in CV tasks do. In contrast, the multi-head attention module can make up for this shortcoming of CNNs and effectively model the relationships between ground objects. So the combination of CNN and Transformer is complementary and powerful. A standard Transformer mainly comprises positional encoding, multi-head attention, and feedforward layers. Since the convolutional features already contain position information, positional encoding is not used here, and the embedded spatial-spectral features are directly input into the Transformer blocks. Finally, we add an average pooling layer to obtain the global representation and get the classification results through an MLP layer.

2.2. Center-mask Pre-training Task

Today, most hyperspectral image classification methods are patch-based. The model's input is not only the spectral curve of the centre pixel but also its neighbouring region, which is generally a square area and makes the input more distinctive. Inspired by this form of training sample, we propose the center-masked pre-training task, which is similar to MAE [18] but easier to implement.

The flowchart of our proposed pre-training task is shown in Fig. 2. The encoder is a BG3DCT network without the average pooling layer and MLP layer. The decoder consists of two layers of standard Transformer encoders, which are only used in the pre-training stage. Given an input sample X and its center pixel vector 𝑣𝑐, the latent representation of the input sample is E (embedded by the BG3DCE module). Unlike self-supervised pre-training in the CV field, where RGB images offer no direct way to identify the areas that need to be focused on, the neighbourhood areas of an HSI sample all serve its central pixel. So our masking target can be the essential part of the training sample, namely the center pixel. Therefore, we replace the token in the middle of the sequence E with a learnable vector. Then, the masked sequence is input into the decoder, and pixel-level reconstruction is performed by an MLP head to obtain the reconstruction result 𝑣̂𝑐 of the center pixel. The target of the center-masked pre-training task is to reconstruct the centre pixel as accurately as possible, so that the encoder can better learn the relationship between the centre pixel and the neighbouring pixels without labels. The reconstruction target can be formulated as:

                    𝑇(𝑣𝑐, 𝑣̂𝑐) = min ‖𝑣𝑐 − 𝑣̂𝑐‖²                    (1)

where 𝑇 is the similarity function. In the deep learning framework, the function 𝑇 is equivalent to the mean squared error (MSE) loss function.
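To make the masking procedure concrete, the following is a rough PyTorch sketch of one pre-training step under our reading of Fig. 2: the central token of the embedded sequence E is replaced by a learnable vector, a two-layer standard Transformer decoder processes the masked sequence, and a reconstruction head regresses the center-pixel spectrum with the MSE loss of Eq. (1). The class name, the single linear reconstruction head, the number of attention heads, and the use of nn.TransformerEncoderLayer are assumptions for this sketch, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterMaskPretrainer(nn.Module):
    """Center-masked pre-training (sketch): mask the central token of the
    embedded sequence and reconstruct the spectrum of the center pixel."""
    def __init__(self, encoder, embed_dim=128, n_tokens=81, n_bands=80):
        super().__init__()
        # encoder: BG3DCE embedding + Transformer blocks, without pooling/MLP head;
        # it is expected to map an input patch to a (B, n_tokens, embed_dim) sequence.
        self.encoder = encoder
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))   # learnable masking vector
        decoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                                   batch_first=True)
        self.decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)  # two standard layers
        self.recon_head = nn.Linear(embed_dim, n_bands)          # reconstruction head
        self.center_idx = n_tokens // 2                          # token 40 of an 81-token sequence

    def forward(self, x, v_c):
        # x: input patch; v_c: spectrum of the central pixel, shape (B, n_bands)
        tokens = self.encoder(x).clone()                # sequence E, (B, 81, 128)
        tokens[:, self.center_idx] = self.mask_token    # replace central token with learnable vector
        decoded = self.decoder(tokens)                  # decode the masked sequence
        v_hat = self.recon_head(decoded[:, self.center_idx])   # reconstructed center pixel
        return F.mse_loss(v_hat, v_c)                   # Eq. (1): MSE between v_c and its reconstruction
```

After pre-training, the decoder and reconstruction head are discarded, and the encoder weights are used to initialize the BG3DCT backbone for supervised fine-tuning, consistent with the description above.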
3. Experiment

To fully evaluate our proposed pre-training method and BG3DCT network, we conduct comparative and ablation experiments on two public datasets, Salinas and Yellow River Estuary (YRE). The detailed information and the partition of the training and testing sets are shown in Table 2 and Table 3, respectively.

Table 2
Number of Training and Testing Samples on the Salinas Dataset

    Class   Class Name                  Training   Testing
      1     Brocoli green weeds 1           5        2004
      2     Brocoli green weeds 2           5        3721
      3     Fallow                          5        1971
      4     Fallow rough plow               5        1389
      5     Fallow smooth                   5        2673
      6     Stubble                         5        3954
      7     Celery                          5        3574
      8     Grapes untrained                5       11266
      9     Soil vinyard develop            5        6198
     10     Corn senesced green weeds       5        3273
     11     Lettuce romaine 4wk             5        1063
     12     Lettuce romaine 5wk             5        1922
     13     Lettuce romaine 6wk             5         911
     14     Lettuce romaine 7wk             5        1065
     15     Vinyard untrained               5        7263
     16     Vinyard vertical trellis        5        1802
            Total                          80       54129

Table 3
Number of Training and Testing Samples on the YRE Dataset

    Class   Class Name                       Training   Testing
      1     Building                            10         523
      2     River                               10        5366
      3     Salt Marsh                          10        4985
      4     Shallow Sea                         10       17540
      5     Deep Sea                            10       18667
      6     Intertidal Saltwater Marsh          10        2333
      7     Tidal Flat                          10        1782
      8     Pond                                10        1777
      9     Sorghum                             10         636
     10     Corn                                10        1499
     11     Lotus Root                          10        2709
     12     Aquaculture                         10        8009
     13     Rice                                10        5498
     14     Tamarix Chinensis                   10        1210
     15     Freshwater Herbaceous Marsh         10        1407
     16     Suaeda Salsa                        10         864
     17     Spartina Alterniflora               10         570
     18     Reed                                10        1960
     19     Floodplain                          10         337
     20     Locus                               10          65
            Total                              200       77737



We use three metrics to evaluate the classification results: overall accuracy (OA), classwise average accuracy (AA), and the kappa coefficient (κ). All the experiments are conducted on a computer with an Intel Xeon Platinum 8260 CPU, 64-GB RAM, and an NVIDIA Tesla P100-16GB GPU. The model structures and parameter settings of the comparison methods follow the open-source codes or the corresponding papers. For our proposed model, the patch size is set to 13, the spectral dimension is 80 after PCA, and the number of sub-bands is 10. The embedding size of each token is set to 128. The learning rate is set to 0.001, and Adam is adopted as the gradient descent optimizer. Meanwhile, all the experiments are repeated ten times to smooth out errors caused by random sampling. The setting of the center-masked pre-training is the same.
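For reference, the three metrics can be computed from the predicted and reference labels as in the sketch below; this is the standard formulation using scikit-learn, given here for clarity rather than as the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def classification_metrics(y_true, y_pred):
    """Overall accuracy (OA), classwise average accuracy (AA), and kappa (κ)."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()                   # fraction of correctly classified pixels
    per_class_acc = np.diag(cm) / cm.sum(axis=1)   # recall of each class
    aa = per_class_acc.mean()                      # mean of the per-class accuracies
    kappa = cohen_kappa_score(y_true, y_pred)      # agreement corrected for chance
    return oa, aa, kappa
```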
3.1. Datasets Description

3.1.1. Salinas Dataset

The Salinas dataset, collected by the AVIRIS sensor over the Salinas Valley, USA, has an image size of 512 × 217 and a spatial resolution of 3.7 meters. After noise band removal, 204 bands remain. There are 16 kinds of ground objects in the dataset, with 56,975 samples that can be used for pixel-level classification.

3.1.2. YRE Dataset

The YRE dataset is a large-scene dataset captured by the Gaofen-5 satellite over the Yellow River estuary region of Shandong Province, China. Its size is 1400 × 1400, and the spatial resolution of each pixel is 30 meters, leaving 180 bands after removing noise bands. The surface objects are mainly wetland vegetation, there are 20 kinds of objects, and the total number of labeled samples is 77,937.

3.2. Comparative Experiment

To demonstrate the superiority of our proposed method, we select five state-of-the-art methods on the two public datasets, Salinas and YRE, including four CNN-based methods and one classical Transformer network. They are CNNHSI [19], FC3D [20], HybridSN [21], TwoCNN [22], and Vision Transformer (ViT) [23]. Among them, the 2D-CNN-based methods differ in the size of the convolution kernel and the structure design. CNNHSI stacks several 2-D convolutional layers with a 1×1 kernel size. TwoCNN is a dual-branch CNN with a 2D-CNN and a 1D-CNN to extract spatial information and spectral information, respectively. FC3D is a pure 3D-CNN network, and HybridSN uses 3D convolution and 2D convolution successively for hierarchical feature extraction. ViT divides the input samples into equal-sized patches, obtains the embedded tokens through a linear embedding module, and then inputs them into the Transformer encoder.

The results of the comparative experiments are shown in Table 4 and Table 5. Our method obtains obvious advantages and achieves the best or second-best results in each class, reflecting our approach's superiority and robustness. Under the setting of training with limited samples, CNNHSI achieves excellent classification results due to its lightweight network structure. Limited by their large model sizes, HybridSN, FC3D, and TwoCNN fail to obtain superior classification results.
Table 4
Classification Accuracy (%) and Kappa Measure for the Salinas Dataset

    Class     ViT    HybridSN    FC3D    CNNHSI   TwoCNN     Ours
      1      97.83     99.93     99.95    93.64     93.44    99.98
      2      97.21     98.31     92.92    79.30     89.86    99.92
      3      82.83     97.04     97.84    84.62     90.63    99.99
      4      92.91     93.18     96.66    99.40     97.45    99.11
      5      85.29     98.08     86.54    77.31     95.23    97.59
      6      98.58     98.46     94.08    98.45     99.74    99.95
      7      98.01     99.76     99.78    99.13    100.00   100.00
      8      69.92     59.81     53.82    65.56     74.59    63.87
      9      97.38     96.84     95.46    94.70     99.67    99.77
     10      82.71     93.33     90.32    46.59     95.17    95.08
     11      90.61     97.65     89.21    90.40     95.71    99.96
     12      98.10     94.88     91.28    97.85     99.20    99.43
     13      90.71     84.63     87.82    99.19     99.28    97.08
     14      96.15     98.97     92.99    92.94     96.39    99.31
     15      62.93     71.60     65.73    40.12     66.80    83.49
     16      94.19     79.97     88.53    82.69     96.38    99.08
    OA (%)   84.68     85.24     81.65    76.41     87.99    89.66
    AA (%)   89.71     91.40     88.93    83.87     93.10    95.85
      κ      83.01     83.70     79.78    73.74     86.65    88.54

Table 5
Classification Accuracy (%) and Kappa Measure for the YRE Dataset

    Class     ViT    HybridSN    FC3D    CNNHSI   TwoCNN     Ours
      1      49.78     82.35     78.84    90.66     70.48    84.23
      2      95.45    100.00     99.93    98.83     99.96    97.36
      3      55.45     61.34     72.20    74.55     85.93    78.15
      4      78.22     71.04     72.78    73.08     90.44    92.36
      5      85.89     76.74     90.49    84.31     97.59    99.55
      6      79.48     81.77     83.37    81.59     82.65    85.85
      7      56.38     51.53     57.30    59.95     63.63    60.30
      8      73.16     73.94     73.68    78.55     60.17    73.23
      9      75.10     85.90     86.11    86.57     82.70    90.24
     10      57.06     70.82     62.15    88.20     72.22    88.49
     11      63.65     82.55     83.38    88.73     75.71    90.84
     12      72.82     76.36     79.69    77.62     73.76    76.96
     13      71.27     87.47     84.49    92.63     87.79    89.78
     14      67.11     75.32     76.50    87.81     88.63    79.28
     15      59.32     64.65     74.65    82.00     96.32    71.77
     16      74.27     93.02     89.89    92.66     92.20    95.08
     17      81.29     93.98     89.47    95.08     93.72    94.40
     18      44.74     58.11     62.74    65.86     67.59    70.65
     19      60.53     82.89     68.25    88.72     68.84    71.60
     20      87.69     91.28     71.79    85.84     64.00    92.15
    OA (%)   75.58     76.14     80.84    84.61     87.45    89.01
    AA (%)   69.43     78.05     77.88    83.26     80.72    84.11
      κ      71.87     72.96     78.09    82.30     85.48    87.29

The performance of the ViT model is not stable. When the distribution of ground objects in the dataset is more complex, the linear embedding module drags down the model performance. This proves the necessity of a well-designed embedding layer for the Transformer network in HSI classification. In addition, none of the methods can discriminate the Vinyard untrained class well, which may be caused by the large variability of this land cover. It is a problem we need to solve in the future. The classification results of each method on the YRE dataset are similar to those on the Salinas dataset. The YRE dataset is a large-scene dataset, so the classification task is more complicated. Hence, the classification performance of each method is slightly lower than that on the Salinas dataset. It is worth mentioning that TwoCNN achieves good classification results, which may benefit from its spectral feature extraction branch.

3.3. Ablation Study

In this section, we only conduct ablation experiments on our proposed pre-training task, considering that the comparison with the ViT model already intuitively demonstrates the effectiveness of our proposed BG3DCT module. As shown in Table 6, the OA of the model fine-tuned from the pre-trained weights outperforms the model trained without pre-training by 2.77% and 1.33% on the YRE and Salinas datasets, respectively. This undoubtedly proves the superiority and robustness of our pre-training task.

Table 6
Ablation Study Results of the Center-Masked Pre-training Task on the Salinas and YRE Datasets

                                  YRE                    Salinas
    Case                   OA     AA     κ        OA     AA     κ
    BG3DCT w/ pretrain    89.01  84.11  87.30    89.99  95.40  88.87
    BG3DCT w/o pretrain   86.24  82.19  84.15    88.60  85.42  86.46

4. Conclusion

In this article, we propose a band-grouping-based convolutional embedding module to extract spatial-spectral information in each sub-band. The Transformer module is used to model the global relationships between surface objects. Additionally, for effective use of unlabeled data, we design a new unsupervised pre-training task for hyperspectral classification. Through the masking and reconstruction of the token generated from the central area, the model can initialize the backbone network without labeled data and provide more stable model performance. To fully evaluate our proposed methods, we conducted a series of comparative experiments and ablation experiments on two public datasets, Salinas and YRE. The experimental results prove the effectiveness and superiority of our method.

5. Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 41971300 and Grant 61901278; in part by the Key Project of Department of Education of Guangdong Province under Grant 2020ZDZX3045; in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515011290; in part by the Natural Science Foundation of Guangdong Province under Grant 2021A1515011413; in part by the Shenzhen Scientific Research and Development Funding Program under Grant 20200803152531004.
References

[1] M. D. Farrell, R. M. Mersereau, On the impact of PCA dimension reduction for hyperspectral detection of difficult targets, IEEE Geoscience and Remote Sensing Letters 2 (2005) 192–195.
[2] S. Moussaoui, H. Hauksdottir, F. Schmidt, C. Jutten, J. Chanussot, D. Brie, S. Douté, J. A. Benediktsson, On the decomposition of Mars hyperspectral data by ICA and Bayesian positive source separation, Neurocomputing 71 (2008) 2194–2208.
[3] Y. Qian, M. Ye, J. Zhou, Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features, IEEE Transactions on Geoscience and Remote Sensing 51 (2012) 2276–2291.
[4] S. Kuching, The performance of maximum likelihood, spectral angle mapper, neural network and decision tree classifiers in hyperspectral image analysis, Journal of Computer Science 3 (2007) 419–423.
[5] J. Xia, P. Ghamisi, N. Yokoya, A. Iwasaki, Random forest ensembles and extended multiextinction profiles for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 56 (2017) 202–216.
[6] M. Chi, R. Feng, L. Bruzzone, Classification of hyperspectral remote-sensing data with primal SVM for small-sized training dataset problem, Advances in Space Research 41 (2008) 1793–1799.
[7] W. Hu, Y. Huang, L. Wei, F. Zhang, H. Li, Deep convolutional neural networks for hyperspectral image classification, Journal of Sensors 2015 (2015).
[8] S. K. Roy, G. Krishna, S. R. Dubey, B. B. Chaudhuri, HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 17 (2019) 277–281.
[9] F. Zhou, R. Hang, Q. Liu, X. Yuan, Hyperspectral image classification using spectral-spatial LSTMs, Neurocomputing 328 (2019) 39–47.
[10] X. He, Y. Chen, Modifications of the multi-layer perceptron for hyperspectral image classification, Remote Sensing 13 (2021) 3547.
[11] D. Hong, L. Gao, J. Yao, B. Zhang, A. Plaza, J. Chanussot, Graph convolutional networks for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 5966–5978.
[12] X. Hu, W. Yang, H. Wen, Y. Liu, Y. Peng, A lightweight 1-D convolution augmented Transformer with metric learning for hyperspectral image classification, Sensors 21 (2021) 1751.
[13] D. Hong, Z. Han, J. Yao, L. Gao, B. Zhang, A. Plaza, J. Chanussot, SpectralFormer: Rethinking hyperspectral image classification with transformers, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15. doi:10.1109/TGRS.2021.3130716.
[14] Z. Zhong, Y. Li, L. Ma, J. Li, W.-S. Zheng, Spectral-spatial transformer network for hyperspectral image classification: A factorized architecture search framework, IEEE Transactions on Geoscience and Remote Sensing (2021).
[15] L. Dang, L. Weng, W. Dong, S. Li, Y. Hou, Spectral-spatial attention transformer with dense connection for hyperspectral image classification, Computational Intelligence and Neuroscience 2022 (2022).
[16] Y. Xu, B. Du, L. Zhang, Self-attention context network: Addressing the threat of adversarial attacks for hyperspectral image classification, IEEE Transactions on Image Processing 30 (2021) 8671–8685. doi:10.1109/TIP.2021.3118977.
[17] Y. Chen, Z. Lin, X. Zhao, G. Wang, Y. Gu, Deep learning-based classification of hyperspectral data, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (2014) 2094–2107.
[18] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, arXiv preprint arXiv:2111.06377 (2021).
[19] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
[20] M. Ahmad, A. M. Khan, M. Mazzara, S. Distefano, M. Ali, M. S. Sarfraz, A fast and compact 3-D CNN for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters (2020).
[21] S. K. Roy, G. Krishna, S. R. Dubey, B. B. Chaudhuri, HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 17 (2020) 277–281.
[22] J. Yang, Y.-Q. Zhao, J. C.-W. Chan, Learning and transferring deep joint spectral–spatial features for hyperspectral classification, IEEE Transactions on Geoscience and Remote Sensing 55 (2017) 4729–4742.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).