=Paper=
{{Paper
|id=Vol-3052/short5
|storemode=property
|title=Uncertainty-Aware Graph-Based Multimodal Remote Sensing Detection of Out-Of-Distribution Samples
|pdfUrl=https://ceur-ws.org/Vol-3052/short5.pdf
|volume=Vol-3052
|authors=Iain Rolland,Andrea Marinoni,Sivasakthy Selvakumaran
|dblpUrl=https://dblp.org/rec/conf/cikm/RollandMS21
}}
==Uncertainty-Aware Graph-Based Multimodal Remote Sensing Detection of Out-Of-Distribution Samples==
Iain Rolland (1), Andrea Marinoni (1,2) and Sivasakthy Selvakumaran (1)

(1) Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, United Kingdom
(2) Department of Physics and Technology, UiT the Arctic University of Norway, P.O. box 6050 Langnes, NO-9037, Tromsø, Norway

CDCEO 2021: 1st Workshop on Complex Data Challenges in Earth Observation, November 1, 2021, Virtual Event, QLD, Australia.
imr27@cam.ac.uk (Iain Rolland); andrea.marinoni@uit.no (Andrea Marinoni); ss683@cam.ac.uk (Sivasakthy Selvakumaran)
ORCID: 0000-0002-4137-5605 (Iain Rolland); 0000-0001-6789-0915 (Andrea Marinoni); 0000-0002-8591-0702 (Sivasakthy Selvakumaran)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

Having the ability to quantify prediction confidence or uncertainty will greatly assist the successful integration of deep learning methods into high-stake decision making processes. Graph-based convolutional neural networks can be trained to perform classification of multimodal remote sensing data using a model output which represents a Dirichlet distribution parameterization. This parameterization can then also be used to obtain measures of prediction uncertainty. By making a correspondence between a multinomial opinion, as described by subjective logic, and a Dirichlet distribution parameterization, a direct mapping between the two can be performed. A multinomial opinion of this kind can produce quantified measures of uncertainty and distinguish uncertainty due to a lack of evidence (vacuity) and uncertainty due to conflicting evidence (dissonance). With an appropriately chosen loss function, the graph-based classifier will converge to provide accurate estimates of uncertainty. The results presented in this paper show that the measures of uncertainty provided by such models are capable of better distinguishing out-of-distribution data samples than probabilistic measures of uncertainty produced by equivalent deterministic neural networks.

Keywords: Multimodal remote sensing, uncertainty estimates, graph convolutional networks, subjective logic, land cover classification

1. Introduction

The capability of algorithms to provide accurate measures of confidence and uncertainty is important if they are to be adopted in real-world scenarios where the stakes can be high [1]. Although deep learning methods are often capable of producing high-accuracy predictions [2, 3], they are generally criticized for being unable to express when to have confidence in a prediction and when the prediction should be presented as uncertain. If deep learning models are to be integrated reliably into real-world decision making processes, it is of vital importance that the methods being used are capable of accurately expressing uncertainty [4].

With remotely-sensed data being available with ever-greater temporal and spatial resolutions, the development of computational processing methods which are capable of robustly handling such large volumes of data will assist countless earth-monitoring applications [5]. Specifically, with data now being captured using a wide range of techniques with complementary strengths, the ability to combine this data into a multimodal analysis will allow each data mode to interact synergistically to provide better results than any individual data mode would produce in isolation. Each data capturing technique will naturally have its own strengths and weaknesses, inherent to the physical properties of the sensing mode [6, 7]. Deterministic classification, while useful, is held back by its inability to express uncertainty. Adoption of such techniques will always be limited by the adopter's trust in the predictions. Uncertainty estimates, however, will greatly assist human trust in models, as they provide a quantification of confidence that might indicate when a prediction is not to be trusted, and more importantly, when a prediction is given with great certainty [8].

In this paper, we have analyzed how well different measures of model uncertainty perform the task of identifying data points which belong to a distribution other than those observed during training (out of distribution detection). To do so, we have used graph-based neural network architectures that are adapted to provide subjective opinions (as described in the field of belief or evidence theory [9]) through the use of Dirichlet distribution parameterizations [10, 11]. The subjective opinions can be used to produce two intuitive measures of uncertainty: vacuity and dissonance. Vacuity is a measure of the uncertainty related to an absence of observed evidence, i.e. a higher measure of vacuity suggests a lack of supporting evidence for a prediction. Dissonance is a measure of prediction uncertainty arising due to the presence of conflicting evidence. This approach (using graph-based neural networks within a subjective-logic framework) is, to the best of our knowledge, as-yet untested as a method for performing classification of multimodal remote sensing data. The performance of the adopted technique represents a promising avenue in the search for meaningful uncertainty estimates for this task.

The remainder of this paper is organized as follows: Section 2 describes the uncertainty framework adopted in the methods presented, Section 3 details the construction of the graph-based neural networks used, Section 4 presents an analysis of results and Section 5 summarizes and draws conclusions as well as suggests areas for future work.

2. Uncertainty framework

The proposed uncertainty-aware framework relies on the definition of uncertainty metrics, which in turn are based on subjective logic and a Dirichlet mapping [11]. These steps are detailed in this section, and have been properly adapted to the task of multimodal remote sensing classification.

2.1. Subjective Logic

Subjective Logic (SL) takes an evidence-based approach to decision making [12]. Expressing an opinion using measured quantities of belief allows the distinction to be made between uncertainty due to a lack of evidence (vacuity) and uncertainty due to the presence of conflicting evidence (dissonance). A multinomial opinion, ω, can be expressed as ω = (b, u, a), where b is a belief mass vector, the scalar u is the uncertainty mass and a is the base rate vector. For a K-class classification problem, y, a and b are all vectors of dimension K. A projection of ω onto a probability distribution can be made according to

    P(y = k) = b_k + a_k u.    (1)

It follows that, since ∑_{k=1}^{K} a_k = 1 for the base rate vector, an additivity requirement is described by

    u + ∑_{k=1}^{K} b_k = 1.    (2)

2.2. Dirichlet mapping

If p is a K-dimensional random vector containing the probability of belonging to each output class, and α is the strength vector which parameterizes a Dirichlet distribution, the probability density function of the Dirichlet is given by

    Dir(p | α) = ( Γ(∑_{k=1}^{K} α_k) / ∏_{k=1}^{K} Γ(α_k) ) ∏_{k=1}^{K} p_k^{α_k − 1},    (3)

where Γ() is the gamma function. The distribution's expected value is given by

    E[Dir(p_k | α)] = α_k / ∑_{k=1}^{K} α_k.    (4)

If we allow the uncertainty mass and base rates to be given by

    u = K / ∑_{k=1}^{K} α_k = K / S    (5)

and

    a_k = 1/K, ∀k    (6)

respectively, where S refers to the Dirichlet strength, then by equating the probability projection of (1) with the expected value of the Dirichlet distribution given by (4), the expression for the belief mass can be obtained as

    b_k = (α_k − 1) / S.    (7)

This provides us with everything needed in order to map from a Dirichlet distribution to a SL opinion and vice versa.

2.3. Uncertainty measures

From the definitions of the evidential uncertainties presented in [9], the measures of vacuity and dissonance have been adopted. The measure of vacuity uncertainty is simply given by the uncertainty mass, i.e.

    vac(ω) ≡ u = K / S,    (8)

and the measure of dissonance uncertainty is given by

    diss(ω) = ∑_{i=1}^{K} ( b_i ∑_{j≠i} b_j Bal(b_j, b_i) / ∑_{j≠i} b_j ),    (9)

where Bal() is a function which gives the relative balance between two belief masses, defined by

    Bal(b_j, b_i) = 1 − |b_i − b_j| / (b_i + b_j), if b_i + b_j ≠ 0; 0, otherwise.    (10)

The entropy of the node-level multinomial distributions provided by the models is also computed to represent a form of uncertainty. This is done in order to provide a comparative metric against which the evidential uncertainties can be compared.

3. Graph network architecture

The multimodal data can be represented using a graph, where each of the N nodes in the graph represents a pixel in the image. The graph's adjacency matrix, A ∈ R^{N×N}, is used to represent edges between nodes deemed similar. A set of features, X ∈ R^{N×C}, is used to assign a vector description to each graph node, where C denotes the number of input features. The graph's degree matrix, D ∈ R^{N×N}, is a diagonal matrix with elements given by D_ii = ∑_j A_ij.
The graph convolutional networks (GCNs) used are of the form proposed by [10], where the graph convolutional layer is given by

    Z^{(l+1)} = σ( D̃^{−1/2} Ã D̃^{−1/2} Z^{(l)} W^{(l)} ),    (11)

where Z^{(l)}, Z^{(l+1)} and W^{(l)} are the inputs, outputs and weights of the lth layer respectively, and σ() is a nonlinear activation function. For brevity, the tilde operator is used to represent the inclusion of self-connection edges in the graph, i.e. Ã = A + I and D̃ = D + I.

3.1. Subjective models

An adaptation to the GCN architecture used by [10] must be made in order to obtain the subjective opinions that will be used to obtain measures of vacuity and dissonance uncertainty. The adaptation made means that the model will output node-level Dirichlet distribution parameters, such that the output will provide a probability distribution over multinomial class probabilities for each node. To do so, the softmax output activation function used in the output layer of the GCN is substituted for a ReLU function. In this way, the model is trained to output non-negative evidence contributions, E ∈ R^{N×K}, where E_i = α_i − 1 and α_i refers to the K-dimensional concentration parameters of the ith node. In order to train such a model, the loss function is made up of two components: a squared error term, which is minimized in order to classify a greater proportion of the nodes correctly, and a variance term, which is minimized to incentivize the model to provide confident predictions where possible. This loss, ℒ(θ), is given by

    ℒ(θ) = ∑_{i∈L} ∑_k [ (p_ik − y_ik)² + Var(p_ik) ]
         = ∑_{i∈L} ∑_k [ (p_ik − y_ik)² + α_ik (S_i − α_ik) / ( S_i² (S_i + 1) ) ],    (12)

where i ∈ L refers to the fact that the loss is computed using a sum only over nodes in the training set, L. Models trained with such an output activation and loss function will be denoted using the 'S-' prefix in order to indicate they provide subjective predictions, e.g. S-GCN.

3.2. Convergence assistance techniques

In order to assist the convergence of subjective models, two additional assistance techniques have been used: teacher knowledge distillation and the use of a Dirichlet prior. These have been shown to allow subjective models to provide better uncertainty estimates [11].

3.2.1. Teacher knowledge distillation

By training a non-subjective model in advance, its outputs, p̂_ik, can be used in order to encourage the subjective model to converge to node Dirichlet distributions with E[p_ik] which are close to the teacher's deterministic estimates. This is achieved using an additional term in the loss function,

    ℒ_T(θ) = ∑_i ∑_k p̂_ik log( p̂_ik / E[p_ik] ),    (13)

which corresponds to the summation of Kullback-Leibler (KL) divergence terms between the teacher output probability and the expected value of the subjective model's Dirichlet distribution for each node. Using D_KL(·‖·) to compute the KL divergence, this is stated equivalently as ∑_i D_KL( p̂_i ‖ E[p_i] ). Notice that this sum is computed over all nodes as opposed to just the nodes in L. Models trained using a teacher are denoted using the '-T' suffix, e.g. a S-BGCN-T model would indicate that a pre-trained GCN was used as a teacher in order to assist the training convergence of a subjective graph convolutional model.

3.2.2. Dirichlet prior

A second convergence assistance technique which can be used involves the use of a Dirichlet prior, α̂. The exact method chosen to provide α̂ will depend on the nature of the problem, but we will assume nodes which are nearby in the graph are more likely to belong to the same output class than nodes which are far apart, a property known as homophily [13]. Using this assumption, we can use the computed distances on the graph to assign contributions of evidence from observed node labels to the other nodes in the graph using a function of our choosing. If d_ij denotes the shortest path distance between a given node, indexed by i, and an observed node, indexed by j, then the amount of evidence contributed to suggest that the ith node belongs to the kth class is given by

    h_ik(y_j, d_ij) = ( 1 / (2πσ²)^{1/2} ) exp( −d_ij² / (2σ²) ), if y_jk = 1; 0, otherwise,    (14)

where σ is a scale parameter which controls the order of distance magnitude over which evidence will propagate in the prior. The total evidence to suggest the ith node belongs to the kth class, e_ik, can be found by summing these contributions over the nodes in the training set, such that the element in the prior is given by

    α̂_ik = 1 + e_ik = 1 + ∑_{j∈L} h_ik(y_j, d_ij).    (15)

The KL divergence between the Dirichlet distribution of the prior and the model output is given by the term

    ℒ_K(θ) = ∑_i D_KL( Dir(p_i | α_i) ‖ Dir(p̂_i | α̂_i) ),    (16)

which can, in turn, be incorporated into the total loss function. Models trained using a prior are denoted using the '-K' suffix.

Table 1: Loss function components and their weighting coefficients for different model types

Model name | ℒ_total(θ)
S-BGCN | ℒ(θ)
S-BGCN-T | ℒ(θ) + λ_T ℒ_T(θ)
S-BGCN-K | ℒ(θ) + λ_K ℒ_K(θ)
S-BGCN-T-K | ℒ(θ) + λ_T ℒ_T(θ) + λ_K ℒ_K(θ)

Figure 1: Ground truth data with colors depicting land cover classes. This data represents a subset of the 2018 IEEE GRSS Data Fusion Challenge dataset.
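As an illustration, the subjective loss of Eq. (12) and the distance-based prior of Eqs. (14)-(15) might be sketched as below. This is not the authors' code: the array layout and function names are assumptions, and the shortest-path distance matrix `dist` is taken as precomputed.

```python
import numpy as np

def subjective_loss(alpha, y, train_mask):
    """Squared-error plus Dirichlet-variance loss of Eq. (12),
    summed over training nodes only."""
    S = alpha.sum(axis=1, keepdims=True)              # node Dirichlet strengths
    p = alpha / S                                     # expected probabilities, Eq. (4)
    var = alpha * (S - alpha) / (S**2 * (S + 1.0))    # Var(p_ik)
    per_node = ((p - y)**2 + var).sum(axis=1)
    return per_node[train_mask].sum()

def prior_alpha(dist, y_onehot, train_idx, sigma):
    """Distance-weighted evidence prior of Eqs. (14)-(15), where
    dist[i, j] holds the shortest-path distance between nodes i and j."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)
    n, K = dist.shape[0], y_onehot.shape[1]
    alpha_hat = np.ones((n, K))                       # the "+1" of Eq. (15)
    for j in train_idx:
        h = norm * np.exp(-dist[:, j]**2 / (2.0 * sigma**2))  # Eq. (14)
        alpha_hat += np.outer(h, y_onehot[j])         # evidence only for class y_j
    return alpha_hat
```

With concentration parameters that place large evidence on the correct classes, the loss approaches zero, while a vacuous α = 1 incurs the full squared-error and variance penalty; the prior concentrates evidence on nodes that lie close, on the graph, to labelled nodes of the same class.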
Table 1 shows how these convergence assistance techniques can be weighted and combined in various permutations to provide a total loss function, ℒ_total(θ), as well as the model name abbreviations used to denote which combination has been used. The 'B' in the model names of Table 1 refers to the fact that dropout inference has been used as a Bayesian approximation. The coefficients λ_T and λ_K are used to control the relative importance of the teacher network and the Dirichlet prior respectively against the importance of the subjective loss function given in (12). These have been considered as hyperparameters which are to be tuned during training.

4. Results and analysis

4.1. Data

A subsection of the 2018 IEEE GRSS Data Fusion Challenge dataset [14] was selected for the purposes of validating the described methods. The ground truth labels in this dataset describe 20 different urban land cover/land use classes (i.e. K = 20) as well as an unlabelled state, described as Unclassified. The modes of input data represent measurements from three sensor types: LiDAR, optical and hyperspectral (HS). The LiDAR data was provided at 0.5 m resolution, the same resolution as the ground truth labels (GT). In order to simplify analysis, the optical data (which was provided at 0.05 m resolution) and the HS data (which was provided at 1.0 m resolution) were bilinearly resampled to obtain 0.5 m resolution across inputs and outputs.

The graph was constructed with each 0.5 m × 0.5 m pixel representing a node in the graph. Each node has a 52-dimensional feature vector describing it (produced by stacking 3 optical channels, 48 HS channels and 1 LiDAR channel). The graph edges are computed using a k-nearest neighbors algorithm, with two nodes receiving an edge connecting them if either node was one of the k nodes which were nearest the other. This produces a graph which is both undirected and unweighted. The graph, which contains approximately 2.16 million nodes, was computed with k = 15.

In order to measure an uncertainty output's ability to separate OOD nodes, a receiver operating characteristic (ROC) curve and a precision-recall (PR) curve can be computed. The area under the ROC curve and PR curve (AUROC and AUPR respectively) can be used as a single numerical representation of the detection performance, where an area of 1.0 would represent a perfect discriminator for both metrics.

4.2. Network training and hyperparameters

Models were implemented and trained using the TensorFlow library [15] on a personal laptop computer with an Intel Core i7 CPU and 16 GB of RAM. In order to handle the imbalance of classes in the dataset, sample weighting was used. Samples were given weights which were inversely proportional to the total number of samples of each class in the training set. This allows the losses related to nodes from under-represented classes to have an increased influence over parameter updates and vice versa.

All GCN-based models were constructed using a dropout layer (dropout probability 0.5), a graph convolutional layer, as described in (11), a second dropout layer (dropout probability 0.5) and a second graph convolutional layer with the relevant output activation function. The kernel weights of the first graph convolutional layer were regularized using an L2 penalization. Where dropout inference has been used, the number of samples taken was 100.

Hyperparameters including the learning rate, the L2 regularization coefficient and the number of GCN layer output features, F, were selected via a grid-search method. Where used, λ_T and λ_K were also found using a grid-search.

Learning was performed for a maximum of 400 epochs, but was stopped early if the validation loss failed to decrease further for 60 consecutive epochs. If stopped early, model weights were returned to the settings which provided the lowest validation set loss upon the termination of training.

Each test was performed for different random dataset splits and model weight initializations to obtain mean and standard deviation measures of performance. A benchmark has been provided by training 'standard' GCNs which provide prediction entropy as a form of uncertainty estimate.

Table 2: OOD detection: Ability of each uncertainty type to detect OOD nodes (measured by the AUROC and AUPR metrics). Values shown represent the mean ± standard deviation.

Model | AUROC Vacuity | AUROC Dissonance | AUROC Entropy | AUPR Vacuity | AUPR Dissonance | AUPR Entropy
S-BGCN-T-K | 0.882 ± 0.085 | 0.605 ± 0.197 | 0.878 ± 0.089 | 0.318 ± 0.289 | 0.132 ± 0.184 | 0.316 ± 0.306
S-BGCN-T | 0.588 ± 0.147 | 0.664 ± 0.133 | 0.578 ± 0.186 | 0.128 ± 0.187 | 0.137 ± 0.192 | 0.143 ± 0.208
S-BGCN | 0.586 ± 0.147 | 0.666 ± 0.132 | 0.580 ± 0.191 | 0.127 ± 0.186 | 0.139 ± 0.190 | 0.145 ± 0.209
S-GCN | 0.580 ± 0.145 | 0.650 ± 0.120 | 0.586 ± 0.181 | 0.125 ± 0.185 | 0.130 ± 0.191 | 0.143 ± 0.207
S-MLP | 0.767 ± 0.152 | 0.805 ± 0.114 | 0.787 ± 0.125 | 0.245 ± 0.214 | 0.233 ± 0.170 | 0.219 ± 0.201
GCN | - | - | 0.538 ± 0.188 | - | - | 0.116 ± 0.179
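The evaluation pipeline of Sections 4.1-4.2 can be sketched as follows: a union-symmetrised k-nearest-neighbour adjacency matrix, plus rank-based AUROC and average-precision AUPR for scoring how well an uncertainty value separates OOD nodes. This is an illustrative re-implementation under our own naming, not the paper's code; a brute-force distance matrix is used here, whereas the 2.16-million-node graph would require an approximate neighbour search.

```python
import numpy as np

def knn_union_graph(X, k):
    """Adjacency matrix with an edge between two nodes if either is
    among the k nearest neighbours of the other (undirected, unweighted)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :])**2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-neighbours
    A = np.zeros((n, n))
    nearest = np.argsort(d2, axis=1)[:, :k]
    for i in range(n):
        A[i, nearest[i]] = 1.0
    return np.maximum(A, A.T)                    # union symmetrisation

def auroc(scores, is_ood):
    """AUROC via the probability that an OOD node outscores an
    in-distribution node (Mann-Whitney identity)."""
    scores, is_ood = np.asarray(scores, float), np.asarray(is_ood, bool)
    pos, neg = scores[is_ood], scores[~is_ood]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def aupr(scores, is_ood):
    """Area under the precision-recall curve as average precision."""
    scores, is_ood = np.asarray(scores, float), np.asarray(is_ood, bool)
    order = np.argsort(-scores)                  # most uncertain first
    hits = is_ood[order]
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision[hits].mean()
```

A perfect discriminator, where every OOD node receives a higher uncertainty than every in-distribution node, scores 1.0 under both metrics, matching the interpretation given above.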
4.3. Out of distribution detection

It would be reasonable to expect that uncertainty should be higher when the model is asked to make a prediction using an input which does not resemble the inputs upon which it was trained. The relative inability of neural networks to successfully extrapolate beyond the support of the training data is a well-known weakness of these methods [16]. By training models using only a subset of the classes provided by the GT, with the other classes acting as out of distribution (OOD) samples, the OOD detection ability of the uncertainty metrics can be measured. The AUROC and AUPR can be calculated for each uncertainty output provided by each model type, in order to determine the relative performance of the respective metrics for this task.

In the results presented, two classes were randomly selected to act as OOD. This was repeated 10 times, with two new randomly sampled classes selected for each training and evaluation loop, in order that the variation in OOD detection performance due to the nature of the classes selected as OOD could be averaged out and the mean and standard deviation computed. Each model type was assessed over the same 10 sampled OOD class pairs for fairness. The AUROC and AUPR values measured can be found in Table 2.

For the task of OOD detection, the S-BGCN-T-K model is the highest ranked model. Its measure of vacuity uncertainty provided the best distinguishing metric, with mean AUROC and AUPR of 0.882 and 0.318 respectively, closely followed by performance from the measure of entropy (AUROC and AUPR of 0.878 and 0.316 respectively). The performance of the S-BGCN-T-K model stands out above the performance of the other models trained. This highlights the importance of the convergence assistance techniques used, particularly the use of a meaningful prior.

The fact that vacuity is the uncertainty measure which best distinguishes OOD nodes reflects intuition. Since vacuity measures the absence of evidence for a prediction, it is natural to expect that it would better distinguish OOD nodes, for which the model ought to have little evidence to support its classification.

5. Conclusion

In this paper we have adapted a novel classification method capable of providing uncertainty estimates to the task of multi-class classification of multimodal remote sensing data. The adopted framework, based upon the theory of Subjective Logic, provides measures of vacuity and dissonance uncertainty. Of the types of uncertainty assessed, the measure of vacuity was the best metric to perform identification of OOD samples. Experimental results have shown the performance of the S-BGCN-T-K model in the task of OOD detection to be improved against baseline methods. This represents a promising avenue for uncertainty-aware learning in the task of multimodal remote sensing classification.

The presented results illustrate the importance of convergence assistance techniques as a means for improving the quality of uncertainty estimates, particularly through the use of a prior. This can be seen by comparing the S-BGCN-T-K OOD detection performance with equivalent models which do not use a prior, e.g. S-BGCN-T.

Future work should consider the generalisation potential of this method by assessing performance on other challenging remote sensing classification datasets. The analysis could also be extended to assess whether the presented uncertainty measures could be used to detect model misclassifications. Additionally, there is scope for research into how the choice of method for computing the α̂ prior affects the quality of uncertainty estimates, either by varying the scale parameter, σ, or considering different prior computation methods entirely.

Acknowledgments

This work is funded in part by the Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), the Automatic Multisensor remote sensing for Sea Ice Characterization (AMUSIC) Framsenteret 'Polhavet' flagship project 2020, the Isaac Newton Trust, and Newnham College, Cambridge, UK.

References

[1] B. Goodman, S. Flaxman, European Union regulations on algorithmic decision-making and a 'right to explanation', AI Mag. 38 (2017) 50–57. doi:10.1609/aimag.v38i3.2741.
[2] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
[3] L. Zhang, L. Zhang, B. Du, Deep learning for remote sensing data: A technical tutorial on the state of the art, IEEE Geosci. Remote Sens. Mag. 4 (2016) 22–40.
[4] J. D. Lee, K. A. See, Trust in automation: Designing for appropriate reliance, Human Factors 46 (2004) 50–80. doi:10.1518/hfes.46.1.50_30392.
[5] M. Chi, A. Plaza, J. A. Benediktsson, Z. Sun, J. Shen, Y. Zhu, Big data for remote sensing: Challenges and opportunities, Proceedings of the IEEE 104 (2016) 2207–2219.
[6] S. Chlaily, M. D. Mura, J. Chanussot, C. Jutten, P. Gamba, A. Marinoni, Capacity and limits of multimodal remote sensing: Theoretical aspects and automatic information theory-based image selection, IEEE Trans. Geosci. Remote Sens. 59 (2021) 5598–5618. doi:10.1109/TGRS.2020.3014138.
[7] A. Marinoni, S. Chlaily, E. Khachatrian, T. Eltoft, S. Selvakumaran, M. Girolami, C. Jutten, Enhancing ensemble learning and transfer learning in multimodal data analysis by adaptive dimensionality reduction, CoRR abs/2105.03682 (2021). arXiv:2105.03682.
[8] S. Chakraborty, et al., Interpretability of deep learning models: A survey of results, in: IEEE Smart World Congr. DAIS - Work. Distrib. Anal. Infrastruct. Algorithms Multi-Organization Fed., 2017, pp. 1–6. doi:10.1109/UIC-ATC.2017.8397411.
[9] A. Josang, J.-H. Cho, F. Chen, Uncertainty characteristics of subjective opinions, in: 2018 21st Int. Conf. Inf. Fusion, Cambridge, U.K., 2018, pp. 1998–2005.
[10] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th Int. Conf. Learn. Representations, Toulon, France, 2017.
[11] X. Zhao, F. Chen, S. Hu, J.-H. Cho, Uncertainty aware semi-supervised learning on graph data, in: Advances Neural Inf. Process. Syst., volume 33, 2020, pp. 12827–12836.
[12] A. Jøsang, Subjective Logic - A Formalism for Reasoning Under Uncertainty, Artificial Intelligence: Foundations, Theory, and Algorithms, Springer, 2016. doi:10.1007/978-3-319-42337-1.
[13] Q. Huang, H. He, A. Singh, S.-N. Lim, A. Benson, Combining label propagation and simple models out-performs graph neural networks, in: 9th Int. Conf. Learn. Representations, 2021.
[14] S. Prasad, B. Le Saux, N. Yokoya, R. Hansch, 2018 IEEE GRSS Data Fusion Challenge - Fusion of Multispectral LiDAR and Hyperspectral Data, 2018. doi:10.21227/jnh9-nz89.
[15] M. Abadi, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
[16] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Inf. Process. Syst., volume 30, Long Beach, CA, USA, 2017, pp. 6402–6413.