A Center-Masked Convolutional Transformer for Hyperspectral Image Classification

Yifan Wang¹, Shuguo Jiang¹, Meng Xu¹, Shuyu Zhang¹ and Sen Jia¹,*
¹ College of Computer Science and Software Engineering, Shenzhen University, China

CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
* Corresponding author.
2070276050@email.szu.edu.cn (Y. Wang); shuguoj@foxmail.com (S. Jiang); m.xu@szu.edu.cn (M. Xu); shuyu-zhang@szu.edu.cn (S. Zhang); senjia@szu.edu.cn (S. Jia)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Hyperspectral images (HSIs) have a wide field of view and rich spectral information, where each pixel represents a small area of the earth's surface. Pixel-level classification of HSIs has become one of the research hotspots in hyperspectral image processing and analysis. Many deep learning methods have been proposed in recent years, among which the convolutional neural network (CNN) is the most influential. However, it is difficult for CNN-based models to obtain a global receptive field in the HSI classification task. Besides, most self-supervised training methods are based on whole-sample reconstruction, which makes it hard to use unlabeled samples effectively. In this paper, we propose a novel convolutional embedding module, combined with Transformer blocks, which improves context awareness while retaining local feature extraction capability. Moreover, a new self-supervised task is designed to make more efficient use of unlabeled data: the pre-training task masks only the central token and reconstructs the central pixel from a learnable vector, allowing the model to capture the patterns between the central object and surrounding objects without labels.

Keywords
Deep learning, Masked autoencoder, Transformer, Hyperspectral image classification.

1. Introduction

Hyperspectral images are generally composed of dozens to hundreds of bands and are characterized by low spatial resolution and high spectral resolution. The spectral information makes it possible to distinguish the corresponding land covers, which has spawned various research fields. Among them, pixel-level hyperspectral image classification receives the most attention in the community. Its main task is to assign a class label to each pixel, somewhat like semantic segmentation in the computer vision (CV) field. Different from RGB images, hyperspectral images are high-dimensional data. To avoid the curse of dimensionality, principal component analysis (PCA) [1] and independent component analysis [2] are widely used for redundancy elimination.

So far, many hyperspectral image classification methods have been proposed, but deep learning methods have taken the lead. According to the techniques used, they can be divided into traditional methods and deep learning-based methods. In early research, people mostly selected a single pixel and all of its spectral information as the training sample and relied on traditional classifiers, such as logistic regression [3], decision tree [4], random forest [5], and support vector machine (SVM) [6], to classify the ground objects through spectral information. However, HSIs are acquired over a long imaging distance with many interference factors, so the spectral curves of different surface objects are not always easy to distinguish. This makes it difficult for these methods to achieve good performance in complex scenes.

In recent years, deep learning methods have gradually become popular, with CNN-based methods dominant. Hu et al. [7] made a preliminary attempt in which several 1-D convolutional layers are stacked to extract local spectral information, and many classical data augmentation methods from CV were introduced. Roy et al. [8] combined 3D-CNN and 2D-CNN to achieve hierarchical feature learning. Other neural networks have also achieved good performance. Zhou et al. [9] designed a two-branch Long Short-Term Memory (LSTM) network to extract spectral information and spatial information respectively. He et al. [10] proposed a pure multilayer perceptron (MLP) network, proving that the MLP network still has potential. Hong et al. [11] designed a mini-batch graph neural network. It is worth mentioning that the recently prevalent Transformer model has also been introduced. Hu et al. [12] used 1-D convolution as an embedding layer combined with Transformer blocks. Hong et al. [13] analyzed the differences between the Transformer and other classical neural networks in detail and proposed a ViT-based SpectralFormer for spectral information learning. Zhong et al. [14] proposed a spatial-spectral Transformer network and a model structure search framework. Dang et al. [15] combined a spectral-spatial attention module with densely connected Transformer blocks. Besides, a self-attention network has also been used to address adversarial attacks that may be encountered in hyperspectral classification tasks [16]. However, limited by the size of the receptive field, it is difficult for CNN models to capture global relationships. Meanwhile, deep learning models are data-driven, meaning that more labeled data leads to better model performance, but obtaining a large number of labeled samples is expensive in practical applications. How to effectively use unlabeled data has become an urgent need, yet self-supervised pre-training for the HSI classification task is still stuck in autoencoder-based sample reconstruction [17]. This article proposes a band-grouping-based 3D convolutional Transformer (BG3DCT) and a new self-supervised task for model pre-training. The main contributions are listed as follows:
• A novel band-grouping-based 3D convolutional Transformer is designed for HSI classification. We replace the commonly used linear embedding module with a well-designed 3D convolutional embedding module, combined with a spectral segmentation strategy, to achieve efficient spatial-spectral feature embedding in each sub-band.

• According to the characteristics of hyperspectral data, a new pre-training task is proposed. By masking and reconstructing the center pixel, the model's ability to capture the relationship between the center pixel and the surrounding pixels is improved. Compared with whole-sample reconstruction, the center-masked pre-training task represents the center area more efficiently in the pre-training stage.

• A series of comparative and ablation experiments demonstrates the effectiveness of the proposed pre-training method and BG3DCT network. In particular, the proposed pre-training method can alleviate the instability of results caused by random sampling in the limited-training-samples scenario.

The rest of this paper is organized as follows. Our proposed method is introduced in detail in Section 2. The comparative experiments and result analysis are provided in Section 3, and Section 4 presents the conclusions.

2. Methodology

2.1. BG3DCT Network

The BG3DCT network has three parts: the band-grouping-based 3D convolutional embedding (BG3DCE) module, the Transformer encoder, and the MLP head. The specific design is as follows.

Figure 1: Architecture of the proposed band-grouping-based 3D convolutional Transformer for HSI classification.

2.1.1. BG3DCE Module

Considering the differences between RGB images and HSIs, the ViT network is not well suited to hyperspectral images. When training samples are limited, the linear embedding module cannot sufficiently characterize the spatial-spectral features, whereas a CNN-based module is more adaptable to this situation and can also capture local texture information. We therefore design a band-grouping-based 3D convolutional embedding module for HSI embedding. Firstly, we apply PCA to the input samples and employ a spectral partition strategy to divide the spectra into several sub-bands of equal length. Because the spectral curves of objects often differ locally, 3D-CNN extraction on sub-bands is more efficient than on the full band. Then, parallel 3D convolutions are applied to each sub-band twice, each followed by 3D batch normalization to unify the feature deviations generated in each sub-band. Finally, we concatenate the features to maintain the relative positional relationship between sub-bands, and a lightweight 2D-CNN is used for feature fusion and compression. Notably, the parallel 3D convolutions have a simple implementation, since the convolution function includes a grouped-convolution option. The BG3DCE module is detailed in Table 1.

Table 1
The Detailed Description of the BG3DCE Module

Layer           Kernel Size   Output Shape      Groups   Param
Input           -             (8, 13, 13, 10)   -        -
Conv3D-1        (3, 3, 3)     (32, 11, 11, 8)   8        896
BatchNorm3d-1   -             -                 -        64
Conv3D-2        (3, 3, 3)     (32, 9, 9, 6)     8        3,488
BatchNorm3d-2   -             -                 -        64
Reshape         -             (192, 9, 9)       -        -
Conv2D          (1, 1)        (128, 9, 9)       8        3,200
Reshape         -             (81, 128)         -        -
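As a concrete illustration, the PyTorch sketch below reproduces the layer stack of Table 1, with grouped convolutions implementing the parallel sub-band branches. The input layout is our assumption: eight parallel groups of ten PCA components each, following the groups=8 entries in Table 1 (the text of Section 3 instead quotes ten sub-bands, so the grouping direction is ambiguous in the source). The ReLU activations are also an assumption, and shapes are written in PyTorch's (channels, depth, height, width) order, so Table 1's (32, 11, 11, 8) appears here as (32, 8, 11, 11). The parameter counts of the layers match Table 1 (896, 64, 3,488, 64, 3,200).

```python
import torch
import torch.nn as nn

class BG3DCE(nn.Module):
    """Band-grouping 3D convolutional embedding, sketched from Table 1.

    Assumed input: a 13x13 patch with 80 PCA components, regrouped into
    8 sub-bands of 10 bands each and laid out as (N, 8, 10, 13, 13).
    groups=8 runs the eight sub-band branches in parallel.
    """

    def __init__(self, n_groups: int = 8, embed_dim: int = 128):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(n_groups, 32, kernel_size=3, groups=n_groups)  # 896 params
        self.bn1 = nn.BatchNorm3d(32)                                            # 64 params
        self.conv3d_2 = nn.Conv3d(32, 32, kernel_size=3, groups=n_groups)        # 3,488 params
        self.bn2 = nn.BatchNorm3d(32)                                            # 64 params
        # After two valid 3x3x3 convs: 32 channels x 6 spectral steps = 192 maps.
        self.conv2d = nn.Conv2d(192, embed_dim, kernel_size=1, groups=n_groups)  # 3,200 params

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.bn1(self.conv3d_1(x)))   # (N, 32, 8, 11, 11)
        x = torch.relu(self.bn2(self.conv3d_2(x)))   # (N, 32, 6, 9, 9)
        n, c, d, h, w = x.shape
        x = x.reshape(n, c * d, h, w)                # (N, 192, 9, 9): fold spectral steps into channels
        x = self.conv2d(x)                           # (N, 128, 9, 9): fusion and compression
        return x.flatten(2).transpose(1, 2)          # (N, 81, 128): one token per spatial position
```

As a quick shape check, `BG3DCE()(torch.randn(2, 8, 10, 13, 13))` returns a (2, 81, 128) token sequence, matching the final Reshape row of Table 1.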
2.1.2. Transformer Encoder

To become context-aware, a CNN usually needs to go deeper, but HSI data is limited, so we cannot simply stack modules the way models in CV tasks do. In contrast, the multi-head attention module compensates for this shortcoming of CNNs and effectively models the relationships between ground objects; the combination of CNN and Transformer is therefore complementary and powerful. A standard Transformer mainly comprises positional encoding, multi-head attention, and feedforward layers. Since the convolutional features already contain positional information, no positional encoding is used here, and the embedded spatial-spectral features are input directly into the Transformer blocks. Finally, we add an average pooling layer to obtain the global representation and produce the classification results through an MLP layer.
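A minimal sketch of the resulting classifier is shown below, with PyTorch's stock Transformer encoder standing in for the paper's blocks. The depth and head count are illustrative assumptions; Section 3 only fixes the embedding size at 128.

```python
import torch.nn as nn

class BG3DCT(nn.Module):
    """Sketch of the classification network: BG3DCE embedding -> Transformer
    blocks (no positional encoding) -> average pooling -> MLP head.
    depth and n_heads are assumptions; the paper does not state them.
    """

    def __init__(self, n_classes: int, embed_dim: int = 128,
                 depth: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = BG3DCE(embed_dim=embed_dim)       # from the sketch above
        block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Linear(embed_dim, n_classes)    # MLP head

    def forward(self, x):
        tokens = self.embed(x)            # (N, 81, 128); conv features carry position
        tokens = self.encoder(tokens)     # global context via multi-head attention
        return self.head(tokens.mean(1))  # average pooling over tokens, then logits
```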
Figure 2: Architecture of the proposed center-mask pre-training task.

2.2. Center-mask Pre-training Task

Today, most hyperspectral image classification methods are patch-based: the model's input is not only the spectral curve of the center pixel but also its neighboring region, generally a square area, which makes the input more distinctive. Inspired by this form of training sample, we propose the center-masked pre-training task, which is similar to MAE [18] but easier to implement.

The flowchart of our proposed pre-training task is shown in Fig. 2. The encoder is a BG3DCT network with the average pooling layer and MLP layer removed. The decoder consists of two layers of standard Transformer encoders, which are used only in the pre-training stage. Given an input sample X and a center pixel vector v_c, the latent representation of the input sample is E (embedded by the BG3DCE module). Unlike self-supervised pre-training in the CV field, where an RGB image offers no directly identifiable area to focus on, the neighborhood area of an HSI sample serves the central pixel. Our masking target can therefore be the essential part of the training sample, namely the center pixel: we replace the token in the middle of the sequence E with a learnable vector. The masked sequence is then input into the decoder, and pixel-level reconstruction is performed by an MLP head to obtain the reconstruction v̂_c of the center pixel. The goal of the center-masked pre-training task is to reconstruct the center pixel as accurately as possible, so that the encoder learns the relationship between the center pixel and the neighboring pixels without labels. The reconstruction target can be formulated as

$$ T(v_c, \hat{v}_c) = \min \lVert v_c - \hat{v}_c \rVert^2 \qquad (1) $$

where T is the similarity function. In the deep learning framework, T is equivalent to the mean squared error (MSE) loss function.
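The sketch below wires the pieces of Fig. 2 together. It assumes an 81-token sequence whose middle index corresponds to the center pixel, and reconstructs an 80-dimensional PCA-reduced spectrum; the class name, reconstruction width, and decoder head count are our assumptions, as the paper does not state them.

```python
import torch
import torch.nn as nn

class CenterMaskPretrainer(nn.Module):
    """Sketch of the center-mask pre-training task of Fig. 2: mask only the
    central token with a learnable vector, decode with two standard
    Transformer encoder layers, and reconstruct the center pixel (Eq. 1).
    """

    def __init__(self, encoder: nn.Module, embed_dim: int = 128, n_bands: int = 80):
        super().__init__()
        # encoder: the BG3DCT network minus its pooling and MLP head,
        # i.e. BG3DCE embedding followed by the Transformer blocks.
        self.encoder = encoder
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))   # learnable vector
        block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=2)
        self.recon_head = nn.Linear(embed_dim, n_bands)          # pixel-level MLP head

    def forward(self, x: torch.Tensor, v_c: torch.Tensor) -> torch.Tensor:
        E = self.encoder(x).clone()        # (N, 81, 128) token sequence
        center = E.shape[1] // 2           # index 40 = middle of the 9x9 token grid
        E[:, center] = self.mask_token     # mask only the central token
        v_hat = self.recon_head(self.decoder(E)[:, center])  # reconstructed center pixel
        return nn.functional.mse_loss(v_hat, v_c)            # T(v_c, v̂_c) of Eq. (1)
```

After pre-training, the decoder and reconstruction head are discarded; only the encoder weights are carried over to the classification stage.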
3. Experiment

To fully evaluate the proposed pre-training method and BG3DCT network, we conduct comparative and ablation experiments on two public datasets, Salinas and Yellow River Estuary (YRE). Their detailed information and the partition of the training and testing sets are shown in Table 2 and Table 3, respectively. We use three metrics to evaluate the classification results: overall accuracy (OA), classwise average accuracy (AA), and the kappa coefficient (κ). All experiments are conducted on a computer with an Intel Xeon Platinum 8260 CPU, 64 GB of RAM, and an NVIDIA Tesla P100-16GB GPU. The model structures and parameter settings of the comparison methods follow their open-source code or the corresponding papers. For our proposed model, the patch size is set to 13, the spectral dimension is 80 after PCA, and the number of sub-bands is 10. The embedding size of each token is set to 128. The learning rate is set to 0.001, and Adam is adopted as the optimizer. All experiments are repeated ten times to smooth out errors caused by random sampling. The settings for center-masked pre-training are the same.
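For reference, the three reported metrics can be computed from a confusion matrix as in the sketch below; this is a standard formulation, not code from the paper.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Overall accuracy (OA), classwise average accuracy (AA), and kappa (κ)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                   # rows: truth, cols: prediction
    oa = np.trace(cm) / cm.sum()                        # fraction of correctly labeled pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))          # mean per-class recall
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2  # chance agreement
    kappa = (oa - p_e) / (1.0 - p_e)                    # agreement beyond chance
    return oa, aa, kappa
```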
Table 2
Number of Training and Testing Samples on the Salinas Dataset

Class   Class Name                  Training   Testing
1       Brocoli green weeds 1       5          2004
2       Brocoli green weeds 2       5          3721
3       Fallow                      5          1971
4       Fallow rough plow           5          1389
5       Fallow smooth               5          2673
6       Stubble                     5          3954
7       Celery                      5          3574
8       Grapes untrained            5          11266
9       Soil vinyard develop        5          6198
10      Corn senesced green weeds   5          3273
11      Lettuce romaine 4wk         5          1063
12      Lettuce romaine 5wk         5          1922
13      Lettuce romaine 6wk         5          911
14      Lettuce romaine 7wk         5          1065
15      Vinyard untrained           5          7263
16      Vinyard vertical trellis    5          1802
Total                               80         54129

Table 3
Number of Training and Testing Samples on the YRE Dataset

Class   Class Name                     Training   Testing
1       Building                       10         523
2       River                          10         5366
3       Salt Marsh                     10         4985
4       Shallow Sea                    10         17540
5       Deep Sea                       10         18667
6       Intertidal Saltwater Marsh     10         2333
7       Tidal Flat                     10         1782
8       Pond                           10         1777
9       Sorghum                        10         636
10      Corn                           10         1499
11      Lotus Root                     10         2709
12      Aquaculture                    10         8009
13      Rice                           10         5498
14      Tamarix Chinensis              10         1210
15      Freshwater Herbaceous Marsh    10         1407
16      Suaeda Salsa                   10         864
17      Spartina Alterniflora          10         570
18      Reed                           10         1960
19      Floodplain                     10         337
20      Locus                          10         65
Total                                  200        77737

3.1. Datasets Description

3.1.1. Salinas Dataset

The Salinas dataset, collected by the AVIRIS sensor over the Salinas Valley, USA, has an image size of 512 × 217 and a spatial resolution of 3.7 meters. After noise band removal, 204 bands remain. There are 16 kinds of ground objects in the dataset, with 56,975 samples that can be used for pixel-level classification.

3.1.2. YRE Dataset

The YRE dataset is a large-scene dataset captured by the Gaofen-5 satellite over the Yellow River estuary region of Shandong Province, China. Its size is 1400 × 1400, and the spatial resolution of each pixel is 30 meters, leaving 180 bands after removing noise bands. The surface objects are mainly wetland vegetation; there are 20 kinds of objects, and the total number of labeled samples is 77,937.

3.2. Comparative Experiment

To demonstrate the superiority of our proposed method, we select five state-of-the-art methods on the two public datasets, Salinas and YRE, including four CNN-based methods and one classical Transformer network: CNNHSI [19], FC3D [20], HybridSN [21], TwoCNN [22], and Vision Transformer (ViT) [23]. Among them, the 2D-CNN-based methods differ in convolution kernel size and structural design. CNNHSI stacks several 2-D convolution layers with 1×1 kernels. TwoCNN is a dual-branch CNN with a 2D-CNN and a 1D-CNN to extract spatial information and spectral information, respectively. FC3D is a pure 3D-CNN network, and HybridSN uses 3D convolution and 2D convolution successively for hierarchical feature extraction. ViT divides the input samples into equal-sized patches, obtains embedded tokens through a linear embedding module, and then inputs them into the Transformer encoder.

The results of the comparative experiments are shown in Table 4 and Table 5. Our method obtains clear advantages and achieves the best or second-best results in each class, reflecting the approach's superiority and robustness. Under the limited-sample training setting, CNNHSI achieves excellent classification results due to its lightweight network structure, while HybridSN, FC3D, and TwoCNN, limited by their large model sizes, fail to obtain superior classification results. The performance of the ViT model is not stable: when the distribution of ground objects in the dataset is more complex, the linear embedding module drags down the model performance. This underscores the necessity of a well-designed embedding layer for the Transformer network in HSI classification. In addition, none of the methods can discriminate the Vinyard untrained class well, which may be caused by the large variability of this land cover; it is a problem we need to solve in the future. The classification results of each method on the YRE dataset are similar to those on the Salinas dataset. The YRE dataset covers a large scene, so its classification task is more complicated; hence, the classification performance of each method is slightly lower than on the Salinas dataset. It is worth mentioning that TwoCNN achieves good classification results, which may benefit from its spectral feature extraction branch.

Table 4
Classification Accuracy (%) and Kappa Measure for the Salinas Dataset

Class    ViT     HybridSN   FC3D    CNNHSI   TwoCNN   Ours
1        97.83   99.93      99.95   93.64    93.44    99.98
2        97.21   98.31      92.92   79.30    89.86    99.92
3        82.83   97.04      97.84   84.62    90.63    99.99
4        92.91   93.18      96.66   99.40    97.45    99.11
5        85.29   98.08      86.54   77.31    95.23    97.59
6        98.58   98.46      94.08   98.45    99.74    99.95
7        98.01   99.76      99.78   99.13    100.00   100.00
8        69.92   59.81      53.82   65.56    74.59    63.87
9        97.38   96.84      95.46   94.70    99.67    99.77
10       82.71   93.33      90.32   46.59    95.17    95.08
11       90.61   97.65      89.21   90.40    95.71    99.96
12       98.10   94.88      91.28   97.85    99.20    99.43
13       90.71   84.63      87.82   99.19    99.28    97.08
14       96.15   98.97      92.99   92.94    96.39    99.31
15       62.93   71.60      65.73   40.12    66.80    83.49
16       94.19   79.97      88.53   82.69    96.38    99.08
OA (%)   84.68   85.24      81.65   76.41    87.99    89.66
AA (%)   89.71   91.40      88.93   83.87    93.10    95.85
κ        83.01   83.70      79.78   73.74    86.65    88.54

Table 5
Classification Accuracy (%) and Kappa Measure for the YRE Dataset

Class    ViT     HybridSN   FC3D    CNNHSI   TwoCNN   Ours
1        49.78   82.35      78.84   90.66    70.48    84.23
2        95.45   55.45      99.93   98.83    99.96    97.36
3        100.00  61.34      72.20   74.55    85.93    78.15
4        78.22   71.04      72.78   73.08    90.44    92.36
5        85.89   76.74      90.49   84.31    97.59    99.55
6        79.48   81.77      83.37   81.59    82.65    85.85
7        56.38   51.53      57.30   59.95    63.63    60.30
8        73.16   73.94      73.68   78.55    60.17    73.23
9        75.10   85.90      86.11   86.57    82.70    90.24
10       57.06   70.82      62.15   88.20    72.22    88.49
11       63.65   82.55      83.38   88.73    75.71    90.84
12       72.82   76.36      79.69   77.62    73.76    76.96
13       71.27   87.47      84.49   92.63    87.79    89.78
14       67.11   75.32      76.50   87.81    88.63    79.28
15       59.32   64.65      74.65   82.00    96.32    71.77
16       74.27   93.02      89.89   92.66    92.20    95.08
17       81.29   93.98      89.47   95.08    93.72    94.40
18       44.74   58.11      62.74   65.86    67.59    70.65
19       60.53   82.89      68.25   88.72    68.84    71.60
20       87.69   91.28      71.79   85.84    64.00    92.15
OA (%)   75.58   76.14      80.84   84.61    87.45    89.01
AA (%)   69.43   78.05      77.88   83.26    80.72    84.11
κ        71.87   72.96      78.09   82.30    85.48    87.29

3.3. Ablation Study

In this section, we conduct ablation experiments only on our proposed pre-training task, since the comparison with the ViT model already demonstrates the effectiveness of the proposed BG3DCT module. As shown in Table 6, the OA of the model fine-tuned from the pre-trained weights exceeds that of the model without pre-training by 2.77% and 1.39% on the YRE and Salinas datasets, respectively. This demonstrates the superiority and robustness of our pre-training task. (A sketch of the fine-tuning step is given after Table 6.)

Table 6
Ablation Study Results for the Center-Masked Pre-training Task on the YRE and Salinas Datasets

                       YRE                      Salinas
Case                   OA      AA      κ        OA      AA      κ
BG3DCT w/ pretrain     89.01   84.11   87.30    89.99   95.40   88.87
BG3DCT w/o pretrain    86.24   82.19   84.15    88.60   85.42   86.46
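A minimal sketch of the fine-tuning behind the "BG3DCT w/ pretrain" row, reusing the BG3DCT sketch above and assuming the pre-trained encoder weights are stored in a checkpoint whose file name and key layout are hypothetical:

```python
import torch

# Hypothetical checkpoint produced by the center-mask pre-training stage.
ckpt = torch.load("center_mask_encoder.pt")

model = BG3DCT(n_classes=20)                     # YRE has 20 classes
model.embed.load_state_dict(ckpt["embed"])       # assumed checkpoint key layout
model.encoder.load_state_dict(ckpt["encoder"])   # pooling + MLP head stay randomly initialized

# Fine-tune the whole network with the labeled samples (settings from Section 3).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```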
4. Conclusion

In this article, we propose a band-grouping-based convolutional embedding module to extract spatial-spectral information in each sub-band, while the Transformer module models the global relationships between surface objects. Additionally, to make effective use of unlabeled data, we design a new unsupervised pre-training task for hyperspectral classification: through masking and reconstructing the token generated from the central area, the model can initialize the backbone network without labeled data and deliver more stable performance. To fully evaluate the proposed methods, we conducted a series of comparative and ablation experiments on two public datasets, Salinas and YRE. The experimental results prove the effectiveness and superiority of our method.

5. Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grants 41971300 and 61901278; in part by the Key Project of the Department of Education of Guangdong Province under Grant 2020ZDZX3045; in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515011290; in part by the Natural Science Foundation of Guangdong Province under Grant 2021A1515011413; and in part by the Shenzhen Scientific Research and Development Funding Program under Grant 20200803152531004.

References

[1] M. D. Farrell, R. M. Mersereau, On the impact of PCA dimension reduction for hyperspectral detection of difficult targets, IEEE Geoscience and Remote Sensing Letters 2 (2005) 192–195.
[2] S. Moussaoui, H. Hauksdottir, F. Schmidt, C. Jutten, J. Chanussot, D. Brie, S. Douté, J. A. Benediktsson, On the decomposition of Mars hyperspectral data by ICA and Bayesian positive source separation, Neurocomputing 71 (2008) 2194–2208.
[3] Y. Qian, M. Ye, J. Zhou, Hyperspectral image classification based on structured sparse logistic regression and three-dimensional wavelet texture features, IEEE Transactions on Geoscience and Remote Sensing 51 (2012) 2276–2291.
[4] S. Kuching, The performance of maximum likelihood, spectral angle mapper, neural network and decision tree classifiers in hyperspectral image analysis, Journal of Computer Science 3 (2007) 419–423.
[5] J. Xia, P. Ghamisi, N. Yokoya, A. Iwasaki, Random forest ensembles and extended multiextinction profiles for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 56 (2017) 202–216.
[6] M. Chi, R. Feng, L. Bruzzone, Classification of hyperspectral remote-sensing data with primal SVM for small-sized training dataset problem, Advances in Space Research 41 (2008) 1793–1799.
[7] W. Hu, Y. Huang, L. Wei, F. Zhang, H. Li, Deep convolutional neural networks for hyperspectral image classification, Journal of Sensors 2015 (2015).
[8] S. K. Roy, G. Krishna, S. R. Dubey, B. B. Chaudhuri, HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 17 (2020) 277–281.
[9] F. Zhou, R. Hang, Q. Liu, X. Yuan, Hyperspectral image classification using spectral-spatial LSTMs, Neurocomputing 328 (2019) 39–47.
[10] X. He, Y. Chen, Modifications of the multi-layer perceptron for hyperspectral image classification, Remote Sensing 13 (2021) 3547.
[11] D. Hong, L. Gao, J. Yao, B. Zhang, A. Plaza, J. Chanussot, Graph convolutional networks for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 5966–5978.
[12] X. Hu, W. Yang, H. Wen, Y. Liu, Y. Peng, A lightweight 1-D convolution augmented Transformer with metric learning for hyperspectral image classification, Sensors 21 (2021) 1751.
[13] D. Hong, Z. Han, J. Yao, L. Gao, B. Zhang, A. Plaza, J. Chanussot, SpectralFormer: Rethinking hyperspectral image classification with transformers, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15. doi:10.1109/TGRS.2021.3130716.
[14] Z. Zhong, Y. Li, L. Ma, J. Li, W.-S. Zheng, Spectral-spatial Transformer network for hyperspectral image classification: A factorized architecture search framework, IEEE Transactions on Geoscience and Remote Sensing (2021).
[15] L. Dang, L. Weng, W. Dong, S. Li, Y. Hou, Spectral-spatial attention Transformer with dense connection for hyperspectral image classification, Computational Intelligence and Neuroscience 2022 (2022).
[16] Y. Xu, B. Du, L. Zhang, Self-attention context network: Addressing the threat of adversarial attacks for hyperspectral image classification, IEEE Transactions on Image Processing 30 (2021) 8671–8685. doi:10.1109/TIP.2021.3118977.
[17] Y. Chen, Z. Lin, X. Zhao, G. Wang, Y. Gu, Deep learning-based classification of hyperspectral data, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (2014) 2094–2107.
[18] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, arXiv preprint arXiv:2111.06377 (2021).
[19] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
[20] M. Ahmad, A. M. Khan, M. Mazzara, S. Distefano, M. Ali, M. S. Sarfraz, A fast and compact 3-D CNN for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters (2020).
[21] S. K. Roy, G. Krishna, S. R. Dubey, B. B. Chaudhuri, HybridSN: Exploring 3-D–2-D CNN feature hierarchy for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 17 (2020) 277–281.
[22] J. Yang, Y.-Q. Zhao, J. C.-W. Chan, Learning and transferring deep joint spectral-spatial features for hyperspectral classification, IEEE Transactions on Geoscience and Remote Sensing 55 (2017) 4729–4742.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).