CEUR Workshop Proceedings, Vol-3052 (short5) — https://ceur-ws.org/Vol-3052/short5.pdf
Uncertainty-Aware Graph-Based Multimodal Remote
Sensing Detection of Out-Of-Distribution Samples
Iain Rolland^1, Andrea Marinoni^1,2 and Sivasakthy Selvakumaran^1
^1 Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, United Kingdom
^2 Department of Physics and Technology, UiT the Arctic University of Norway, P.O. box 6050 Langnes, NO-9037, Tromsø, Norway


Abstract
Having the ability to quantify prediction confidence or uncertainty will greatly assist the successful integration of deep learning methods into high-stakes decision-making processes. Graph-based convolutional neural networks can be trained to perform classification of multimodal remote sensing data using a model output which represents a Dirichlet distribution parameterization. This parameterization can then also be used to obtain measures of prediction uncertainty. By making a correspondence between a multinomial opinion, as described by subjective logic, and a Dirichlet distribution parameterization, a direct mapping between the two can be performed. A multinomial opinion of this kind can produce quantified measures of uncertainty and distinguish uncertainty due to a lack of evidence (vacuity) from uncertainty due to conflicting evidence (dissonance). With an appropriately chosen loss function, the graph-based classifier will converge to provide accurate estimates of uncertainty. The results presented in this paper show that the measures of uncertainty provided by such models are capable of better distinguishing out-of-distribution data samples than probabilistic measures of uncertainty produced by equivalent deterministic neural networks.

Keywords
Multimodal remote sensing, uncertainty estimates, graph convolutional networks, subjective logic, land cover classification



CDCEO 2021: 1st Workshop on Complex Data Challenges in Earth Observation, November 1, 2021, Virtual Event, QLD, Australia.
imr27@cam.ac.uk (Iain Rolland); andrea.marinoni@uit.no (Andrea Marinoni); ss683@cam.ac.uk (Sivasakthy Selvakumaran)
ORCID: 0000-0002-4137-5605 (Iain Rolland); 0000-0001-6789-0915 (Andrea Marinoni); 0000-0002-8591-0702 (Sivasakthy Selvakumaran)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The capability of algorithms to provide accurate measures of confidence and uncertainty is important if they are to be adopted in real-world scenarios where the stakes can be high [1]. Although deep learning methods are often capable of producing high-accuracy predictions [2, 3], they are generally criticized for being unable to express when to have confidence in a prediction and when the prediction should be presented as uncertain. If deep learning models are to be integrated reliably into real-world decision making processes, it is of vital importance that the methods being used are capable of accurately expressing uncertainty [4].

With remotely-sensed data being available with ever-greater temporal and spatial resolutions, the development of computational processing methods which are capable of robustly handling such large volumes of data will assist countless earth-monitoring applications [5]. Specifically, with data now being captured using a wide range of techniques with complementary strengths, the ability to combine this data into a multimodal analysis will allow each data mode to interact synergistically to provide better results than any individual data mode would produce in isolation. Each data capturing technique will naturally have its own strengths and weaknesses, inherent to the physical properties of the sensing mode [6, 7]. Deterministic classification, while useful, is held back by its inability to express uncertainty. Adoption of such techniques will always be limited by the adopter's trust in the predictions. Uncertainty estimates, however, will greatly assist human trust in models, as they provide a quantification of confidence that might indicate when a prediction is not to be trusted and, more importantly, when a prediction is given with great certainty [8].

In this paper, we have analyzed how well different measures of model uncertainty perform the task of identifying data points which belong to a distribution other than those observed during training (out of distribution detection). To do so, we have used graph-based neural network architectures that are adapted to provide subjective opinions (as described in the field of belief or evidence theory [9]) through the use of Dirichlet distribution parameterizations [10, 11]. The subjective opinions can be used to compute two intuitive measures of uncertainty: vacuity and dissonance. Vacuity is a measure of the uncertainty related to an absence of observed evidence, i.e. a higher measure of vacuity suggests a lack of supporting evidence for a prediction. Dissonance is a measure of prediction uncertainty arising due to the presence of conflicting evidence. This approach (using graph-based neural networks within a subjective-logic framework) is, to the best of our knowledge, as-yet untested as a method for performing classification of multimodal remote sensing
data. The performance of the adopted technique represents a promising avenue in the search for meaningful uncertainty estimates for this task.

The remainder of this paper is organized as follows: Section 2 describes the uncertainty framework adopted in the methods presented, Section 3 details the construction of the graph-based neural networks used, Section 4 presents an analysis of results and Section 5 summarizes and draws conclusions as well as suggests areas for future work.

2. Uncertainty framework

The proposed uncertainty-aware framework relies on the definition of uncertainty metrics, which in turn are based on subjective logic and a Dirichlet mapping [11]. These steps are detailed in this section, and have been properly adapted to the task of multimodal remote sensing classification.

2.1. Subjective Logic

Subjective Logic (SL) takes an evidence-based approach to decision making [12]. Expressing an opinion using measured quantities of belief allows the distinction to be made between uncertainty due to a lack of evidence (vacuity) and uncertainty due to the presence of conflicting evidence (dissonance). A multinomial opinion, ω, can be expressed as ω = (b, u, a), where b is a belief mass vector, the scalar u is the uncertainty mass and a is the base rate vector. For a K-class classification problem, y, a and b are all vectors of dimension K. A projection of ω onto a probability distribution can be made according to

    P(y = k) = b_k + a_k u.    (1)

It follows that, since Σ_{k=1}^K a_k = 1 for the base rate vector, an additivity requirement is described by

    u + Σ_{k=1}^K b_k = 1.    (2)

2.2. Dirichlet mapping

If p is a K-dimensional random vector containing the probability of belonging to each output class, and α is the strength vector which parameterizes a Dirichlet distribution, the probability density function of the Dirichlet is given by

    Dir(p | α) = ( Γ(Σ_{k=1}^K α_k) / Π_{k=1}^K Γ(α_k) ) Π_{k=1}^K p_k^{α_k − 1},    (3)

where Γ() is the gamma function. The distribution's expected value is given by

    E[Dir(p_k | α)] = α_k / Σ_{k=1}^K α_k.    (4)

If we allow the uncertainty mass and base rates to be given by

    u = K / Σ_{k=1}^K α_k = K / S    (5)

and

    a_k = 1/K, ∀k,    (6)

respectively, where S refers to the Dirichlet strength, then by equating the probability projection of (1) with the expected value of the Dirichlet distribution given by (4), the expression for the belief mass can be obtained as

    b_k = (α_k − 1) / S.    (7)

This provides us with everything needed in order to map from a Dirichlet distribution to a SL opinion and vice versa.

2.3. Uncertainty measures

From the definitions of the evidential uncertainties presented in [9], the measures of vacuity and dissonance have been adopted. The measure of vacuity uncertainty is simply given by the uncertainty mass, i.e.

    vac(ω) ≡ u = K / S,    (8)

and the measure of dissonance uncertainty is given by

    diss(ω) = Σ_{i=1}^K b_i ( Σ_{j≠i} b_j Bal(b_j, b_i) / Σ_{j≠i} b_j ),    (9)

where Bal() is a function which gives the relative balance between two belief masses, defined by

    Bal(b_j, b_i) = 1 − |b_i − b_j| / (b_i + b_j)   if b_i + b_j ≠ 0,
                  = 0                               otherwise.    (10)

The entropy of the node-level multinomial distributions provided by the models is also computed to represent a form of uncertainty. This is done in order to provide a comparative metric against which the evidential uncertainties can be compared.

3. Graph network architecture

The multimodal data can be represented using a graph, where each of the N nodes in the graph represents a pixel in the image. The graph's adjacency matrix, A ∈ R^{N×N},
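To make the mapping above concrete, the following sketch (an illustration, not the authors' code) converts a node's Dirichlet concentration parameters into a multinomial opinion via Eqs. (5)-(7) and evaluates the vacuity and dissonance measures of Eqs. (8)-(10); the function names are our own.

```python
import numpy as np

def dirichlet_to_opinion(alpha):
    """Map Dirichlet concentration parameters to a subjective-logic
    multinomial opinion (b, u, a), following Eqs. (5)-(7)."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    S = alpha.sum()               # Dirichlet strength
    u = K / S                     # uncertainty mass, Eq. (5)
    b = (alpha - 1.0) / S         # belief masses, Eq. (7)
    a = np.full(K, 1.0 / K)       # uniform base rates, Eq. (6)
    return b, u, a

def balance(bj, bi):
    """Relative balance between two belief masses, Eq. (10)."""
    return 1.0 - abs(bi - bj) / (bi + bj) if (bi + bj) != 0 else 0.0

def vacuity(alpha):
    """Vacuity uncertainty, Eq. (8)."""
    _, u, _ = dirichlet_to_opinion(alpha)
    return u

def dissonance(alpha):
    """Dissonance uncertainty, Eq. (9)."""
    b, _, _ = dirichlet_to_opinion(alpha)
    K = b.size
    total = 0.0
    for i in range(K):
        others = [j for j in range(K) if j != i]
        denom = sum(b[j] for j in others)
        if denom > 0:
            total += b[i] * sum(b[j] * balance(b[j], b[i]) for j in others) / denom
    return total
```

A vacuous opinion (α_k = 1 for all k, i.e. no evidence) gives vacuity 1 and dissonance 0, while strong but conflicting evidence such as α = (11, 11, 1) gives low vacuity but high dissonance; note that the additivity requirement of Eq. (2) holds by construction.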
is used to represent edges between nodes deemed similar. A set of features, X ∈ R^{N×C}, is used to assign a vector description to each graph node, where C denotes the number of input features. The graph's degree matrix, D ∈ R^{N×N}, is a diagonal matrix with elements given by D_ii = Σ_j A_ij.

The graph convolutional networks (GCNs) used are of the form proposed by [10], where the graph convolutional layer is given by

    Z^(l+1) = σ( D̃^{−1/2} Ã D̃^{−1/2} Z^(l) W^(l) ),    (11)

where Z^(l), Z^(l+1) and W^(l) are the inputs, outputs and weights of the lth layer respectively, and σ() is a non-linear activation function. For brevity, the tilde operator is used to represent the inclusion of self-connection edges in the graph, i.e. Ã = A + I and D̃ = D + I.

3.1. Subjective models

An adaptation to the GCN architecture used by [10] must be made in order to obtain the subjective opinions that will be used to obtain measures of vacuity and dissonance uncertainty. The adaptation made means that the model will output node-level Dirichlet distribution parameters, such that the output will provide a probability distribution over multinomial class probabilities for each node. To do so, the softmax output activation function used in the output layer of the GCN is substituted for a ReLU function. In this way, the model is trained to output non-negative evidence contributions, E ∈ R^{N×K}, where E_i = α_i − 1 and α_i refers to the K-dimensional concentration parameters of the ith node. In order to train such a model, the loss function is made up of two components: a squared error term, which is minimized in order to classify a greater proportion of the nodes correctly, and a variance term, which is minimized to incentivize the model to provide confident predictions where possible. This loss, ℒ(θ), is given by

    ℒ(θ) = Σ_{i∈L} Σ_k [ (p_ik − y_ik)² + Var(p_ik) ]
         = Σ_{i∈L} Σ_k [ (p_ik − y_ik)² + α_ik (S_i − α_ik) / (S_i² (S_i + 1)) ],    (12)

where i ∈ L refers to the fact that the loss is computed using a sum only over nodes in the training set, L. Models trained with such an output activation and loss function will be denoted using the 'S-' prefix in order to indicate they provide subjective predictions, e.g. S-GCN.

3.2. Convergence assistance techniques

In order to assist the convergence of subjective models, two additional assistance techniques have been used: teacher knowledge distillation and the use of a Dirichlet prior. These have been shown to allow subjective models to provide better uncertainty estimates [11].

3.2.1. Teacher knowledge distillation

By training a non-subjective model in advance, its outputs, p̂_ik, can be used in order to encourage the subjective model to converge to node Dirichlet distributions with E[p_ik] which are close to the teacher's deterministic estimates. This is achieved using an additional term in the loss function,

    ℒ_T(θ) = Σ_i Σ_k p̂_ik log( p̂_ik / E[p_ik] ),    (13)

which corresponds to the summation of Kullback-Leibler (KL) divergence terms between the teacher output probability and the expected value of the subjective model's Dirichlet distribution for each node. Using D_KL(· ‖ ·) to compute the KL divergence, this is stated equivalently as Σ_i D_KL(p̂_i ‖ E[p_i]). Notice that this sum is computed over all nodes as opposed to just the nodes in L. Models trained using a teacher are denoted using the '-T' suffix, e.g. a S-BGCN-T model would indicate that a pre-trained GCN was used as a teacher in order to assist the training convergence of a subjective graph convolutional model.

3.2.2. Dirichlet prior

A second convergence assistance technique which can be used involves the use of a Dirichlet prior, α̂. The exact method chosen to provide α̂ will depend on the nature of the problem, but we will assume that nodes which are nearby in the graph are more likely to belong to the same output class than nodes which are far apart, a property known as homophily [13]. Using this assumption, we can use the computed distances on the graph to assign contributions of evidence from observed node labels to the other nodes in the graph using a function of our choosing. If d_ij denotes the shortest path distance between a given node, indexed by i, and an observed node, indexed by j, then the amount of evidence contributed to suggest that the ith node belongs to the kth class is given by

    h_ik(y_j, d_ij) = (2πσ²)^{−1/2} exp( −d_ij² / (2σ²) )   if y_jk = 1,
                    = 0                                     otherwise,    (14)

where σ is a scale parameter which controls the order of distance magnitude over which evidence will propagate in the prior. The total evidence to suggest the ith node belongs to the kth class, e_ik, can be found by summing these contributions over the nodes in the training set, such that the element in the prior is given by

    α̂_ik = 1 + e_ik = 1 + Σ_{j∈L} h_ik(y_j, d_ij).    (15)
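As an illustration of how the subjective loss of Eq. (12) can be evaluated, the NumPy sketch below (our own translation, not the authors' TensorFlow implementation) takes p_ik to be the expected probability α_ik / S_i and uses the standard variance of a Dirichlet-distributed probability for the second term; the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def subjective_loss(alpha, y_onehot, labelled_idx):
    """Loss of Eq. (12): squared error on expected class probabilities
    plus the Dirichlet variance of each probability, summed over the
    labelled node set L.
    alpha: (N, K) concentration parameters output by the model.
    y_onehot: (N, K) one-hot ground-truth labels.
    labelled_idx: indices of the training nodes (the set L)."""
    alpha = np.asarray(alpha, dtype=float)[labelled_idx]
    y = np.asarray(y_onehot, dtype=float)[labelled_idx]
    S = alpha.sum(axis=1, keepdims=True)            # Dirichlet strengths S_i
    p_hat = alpha / S                               # expected probabilities, Eq. (4)
    var = alpha * (S - alpha) / (S**2 * (S + 1))    # Var(p_ik) under Dir(alpha_i)
    return float(((p_hat - y) ** 2 + var).sum())
```

Because the variance term shrinks as the Dirichlet strength grows, the model is only rewarded for concentrating evidence on nodes it can also classify correctly.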
Table 1
Loss function components and their weighting coefficients for
different model types
           Model name             ℒtotal (𝜃)
            S-BGCN                 ℒ(𝜃)
            S-BGCN-T          ℒ(𝜃) + 𝜆T ℒT (𝜃)
            S-BGCN-K         ℒ(𝜃) + 𝜆K ℒK (𝜃)
           S-BGCN-T-K   ℒ(𝜃) + 𝜆T ℒT (𝜃) + 𝜆K ℒK (𝜃)


                                                                Figure 1: Ground truth data with colors depicting land cover
The KL divergence between the Dirichlet distribution of         classes. This data represents a subset of the 2018 IEEE GRSS
the prior and the model output is given by the term             Data Fusion Challenge dataset.

             ∑︁
   ℒK (𝜃) =      𝐷KL (Dir(p𝑖 |𝛼𝑖 ) ‖ Dir(p̂𝑖 |𝛼
                                              ˆ 𝑖 )), (16)
               𝑖                                              𝑘-nearest neighbors algorithm with two nodes receiving
                                                              an edge connecting them if either node was one of the
which can, in turn, be incorporated into the total loss
                                                              𝑘 nodes which were nearest the other. This produces a
function. Models trained using a prior are denoted using
                                                              graph which is both undirected and unweighted. The
the ‘-K’ suffix.
                                                              graph, which contains approximately 2.16 million nodes,
   Table 1 shows how these convergence assistance tech-
                                                              was computed with 𝑘 = 15.
niques can be weighted and combined in various permu-
                                                                 In order to measure an uncertainty output’s ability
tations to provide a total loss function, ℒtotal (𝜃), as well
                                                              to separate OOD nodes, a receiver operating character-
as the model name abbreviations used to denote which
                                                              istic (ROC) curve and a precision-recall (PR) curve can
combination has been used. The ‘B’ in the model names
                                                              be computed. The area under the ROC curve and PR
of Table 1 refers to the fact that dropout inference has
                                                              curve (AUROC and AUPR respectively) can be used as a
been used as a Bayesian approximation. The coefficients
                                                              single numerical representation of the detection perfor-
𝜆T and 𝜆K are used to control the relative importance of
                                                              mance, where an area of 1.0 would represent a perfect
the teacher network and the Dirichlet prior respectively
                                                              discriminator for both metrics.
against the importance of the subjective loss function
given in (12). These have been considered as hyperpa-
rameters which are to be tuned during training.               4.2. Network training and
                                                                      hyperparameters
4. Results and analysis                                         Models were implemented and trained using the Tensor-
                                                                Flow library [15] on a personal laptop computer with
4.1. Data                                                       Intel Core i7 CPU and 16 GB of RAM. In order to handle
                                                                the imbalance of classes in the dataset, sample weight-
A subsection of the 2018 IEEE GRSS Data Fusion Chal-            ing was used. Samples were given weights which were
lenge dataset [14] ws selected for the purposes of validat-     inversely proportional to the number of total samples
ing the described methods. The ground truth labels in           of each class in the training set. This allows the losses
this dataset describe 20 different urban land cover/land        related to nodes from under-represented classes to have
use classes (i.e. 𝐾 = 20) as well as an unlabelled state,       an increased influence over parameter updates and vice
described as Unclassified. The modes of input data repre-       versa.
sent measurements from three sensor types: LiDAR, opti-            All GCN-based models were constructed using a
cal and hyperspectral (HS). The LiDAR data was provided         dropout layer (dropout probability 0.5), a graph con-
at 0.5 m resolution, the same resolution as the ground          volutional layer, as described in (11), a second dropout
truth labels (GT). In order to simplify analysis, the optical   layer (dropout probability 0.5) and a second graph con-
data (which was provided at 0.05 m resolution) and the          volutional layer with the relevant output activation func-
HS data (which was provided at 1.0 m resolution) were           tion. The kernel weights of the first graph convolutional
bilinearly resampled to obtain 0.5 m resolution across          layer were regularized using an 𝐿2 penalization. Where
inputs and outputs.                                             dropout inference has been used, the number of samples
   The graph was constructed with each 0.5 m × 0.5 m            taken was 100.
pixel representing a node in the graph. Each node has              Hyperparameters including the learning rate, the
a 52-dimensional feature vector describing it (produced         𝐿2 regularization coefficient and the number of GCN
by stacking 3 optical channels, 48 HS channels and 1            layer output features, 𝐹 , were selected via a grid-search
LiDAR channel). The graph edges are computed using a
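The AUROC and AUPR metrics described above can be computed directly from rankings, without drawing the curves. The sketch below (an illustration with our own function names, not the paper's evaluation code) treats each uncertainty value as a score for ranking nodes, with OOD as the positive class.

```python
import numpy as np

def auroc(scores, is_ood):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen OOD node receives a higher
    uncertainty score than a randomly chosen in-distribution node."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    pos, neg = scores[is_ood], scores[~is_ood]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # half credit for ties
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def aupr(scores, is_ood):
    """Area under the precision-recall curve, computed as average
    precision over the score-sorted ranking (OOD positive)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(is_ood, dtype=bool)[order]
    tp = np.cumsum(labels)                          # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)  # precision at each rank
    return precision[labels].sum() / labels.sum()   # mean precision at each hit
```

A perfect discriminator scores 1.0 on both metrics; unlike AUROC, AUPR depends on the OOD prevalence, which is why the AUPR values in Table 2 are much lower than the AUROC values for the same models.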
Table 2
OOD detection: Ability of each uncertainty type to detect OOD nodes (measured by the AUROC and AUPR metrics). Values
shown represent the mean ± standard deviation.

                                         AUROC                                            AUPR
         Model
                         Vacuity        Dissonance       Entropy          Vacuity        Dissonance       Entropy
      S-BGCN-T-K     0.882 ± 0.085    0.605 ± 0.197   0.878 ± 0.089   0.318 ± 0.289    0.132 ± 0.184   0.316 ± 0.306
       S-BGCN-T       0.588 ± 0.147   0.664 ± 0.133   0.578 ± 0.186    0.128 ± 0.187   0.137 ± 0.192   0.143 ± 0.208
        S-BGCN        0.586 ± 0.147   0.666 ± 0.132   0.580 ± 0.191    0.127 ± 0.186   0.139 ± 0.190   0.145 ± 0.209
        S-GCN         0.580 ± 0.145   0.650 ± 0.120   0.586 ± 0.181    0.125 ± 0.185   0.130 ± 0.191   0.143 ± 0.207
        S-MLP         0.767 ± 0.152   0.805 ± 0.114   0.787 ± 0.125    0.245 ± 0.214   0.233 ± 0.170   0.219 ± 0.201
         GCN                -               -         0.538 ± 0.188          -               -         0.116 ± 0.179




method. Where used, 𝜆T and 𝜆K were also found using a grid-search.

Learning was performed for a maximum of 400 epochs, but was stopped early if the validation loss failed to decrease further for 60 consecutive epochs. If stopped early, model weights were returned to the settings which provided the lowest validation set loss upon the termination of training.

Each test was performed for different random dataset splits and model weight initializations to obtain mean and standard deviation measures of performance.

A benchmark has been provided by training ‘standard’ GCNs which provide prediction entropy as a form of uncertainty estimate.

4.3. Out of distribution detection

It would be reasonable to expect that uncertainty should be higher when the model is asked to make a prediction using an input which does not resemble the inputs upon which it was trained. The relative inability of neural networks to successfully extrapolate beyond the support of the training data is a well-known weakness of these methods [16]. By training models using only a subset of the classes provided by the GT, with the other classes acting as out of distribution (OOD) samples, the OOD detection ability of the uncertainty metrics can be measured. The AUROC and AUPR can be calculated for each uncertainty output provided by each model type, in order to determine the relative performance of the respective metrics for this task.

In the results presented, two classes were randomly selected to act as OOD. This was repeated 10 times, with two new randomly sampled classes selected for each training and evaluation loop, in order that the variation in OOD detection performance due to the nature of the classes selected as OOD could be averaged out and the mean and standard deviation computed. Each model type was assessed over the same 10 sampled OOD class pairs for fairness. The AUROC and AUPR values measured can be found in Table 2.

For the task of OOD detection, the S-BGCN-T-K model is the highest ranked model. Its measure of vacuity uncertainty provided the best distinguishing metric, with mean AUROC and AUPR of 0.882 and 0.318 respectively, closely followed by performance from the measure of entropy (AUROC and AUPR of 0.878 and 0.316 respectively). The performance of the S-BGCN-T-K model stands out above the performance of other models trained. This highlights the importance of the convergence assistance techniques used, particularly the use of a meaningful prior.

The fact that vacuity is the uncertainty measure which best distinguishes OOD nodes reflects intuition. Since vacuity measures the absence of evidence for a prediction, it is natural to expect that it would better distinguish OOD nodes, for which the model ought to have little evidence to support its classification.

5. Conclusion

In this paper we have adapted a novel classification method capable of providing uncertainty estimates to the task of multi-class classification of multimodal remote sensing data. The adopted framework, based upon the theory of Subjective Logic, provides measures of vacuity and dissonance uncertainty. Of the types of uncertainty assessed, the measure of vacuity was the best metric to perform identification of OOD samples. Experimental results have shown the performance of the S-BGCN-T-K model in the task of OOD detection to be improved against baseline methods. This represents a promising avenue for uncertainty-aware learning in the task of multimodal remote sensing classification.

The presented results illustrate the importance of convergence assistance techniques as a means for improving the quality of uncertainty estimates, particularly through the use of a prior. This can be seen by comparing the S-BGCN-T-K OOD detection performance with equivalent models which do not use a prior, e.g. S-BGCN-T.

Future work should consider the generalisation potential of this method by assessing performance on other challenging remote sensing classification datasets. The analysis could also be extended to assess whether the presented uncertainty measures could be used to detect model misclassifications. Additionally, there is scope for research into how the choice of method for computing the 𝛼̂ prior affects the quality of uncertainty estimates, either by varying the scale parameter, 𝜎, or considering different prior computation methods entirely.

Acknowledgments

This work is funded in part by the Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), the Automatic Multisensor remote sensing for Sea Ice Characterization (AMUSIC) Framsenteret ‘Polhavet’ flagship project 2020, the Isaac Newton Trust, and Newnham College, Cambridge, UK.

References

 [1] B. Goodman, S. Flaxman, European Union regulations on algorithmic decision-making and a ‘right to explanation’, AI Mag. 38 (2017) 50–57. doi:10.1609/aimag.v38i3.2741.
 [2] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
 [3] L. Zhang, L. Zhang, B. Du, Deep learning for remote sensing data: A technical tutorial on the state of the art, IEEE Geosci. Remote Sens. Mag. 4 (2016) 22–40.
 [4] J. D. Lee, K. A. See, Trust in automation: Designing for appropriate reliance, Human Factors 46 (2004) 50–80. doi:10.1518/hfes.46.1.50_30392.
 [5] M. Chi, A. Plaza, J. A. Benediktsson, Z. Sun, J. Shen, Y. Zhu, Big data for remote sensing: Challenges and opportunities, Proceedings of the IEEE 104 (2016) 2207–2219.
 [6] S. Chlaily, M. D. Mura, J. Chanussot, C. Jutten, P. Gamba, A. Marinoni, Capacity and limits of multimodal remote sensing: Theoretical aspects and automatic information theory-based image selection, IEEE Trans. Geosci. Remote Sens. 59 (2021) 5598–5618. doi:10.1109/TGRS.2020.3014138.
 [7] A. Marinoni, S. Chlaily, E. Khachatrian, T. Eltoft, S. Selvakumaran, M. Girolami, C. Jutten, Enhancing ensemble learning and transfer learning in multimodal data analysis by adaptive dimensionality reduction, CoRR abs/2105.03682 (2021). arXiv:2105.03682.
 [8] S. Chakraborty, et al., Interpretability of deep learning models: A survey of results, in: IEEE Smart World Congr. DAIS - Work. Distrib. Anal. Infrastruct. Algorithms Multi-Organization Fed., 2017, pp. 1–6. doi:10.1109/UIC-ATC.2017.8397411.
 [9] A. Jøsang, J.-H. Cho, F. Chen, Uncertainty characteristics of subjective opinions, in: 2018 21st Int. Conf. Inf. Fusion, Cambridge, U.K., 2018, pp. 1998–2005.
[10] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th Int. Conf. Learn. Representations, Toulon, France, 2017.
[11] X. Zhao, F. Chen, S. Hu, J.-H. Cho, Uncertainty aware semi-supervised learning on graph data, in: Advances Neural Inf. Process. Syst., volume 33, 2020, pp. 12827–12836.
[12] A. Jøsang, Subjective Logic - A Formalism for Reasoning Under Uncertainty, Artificial Intelligence: Foundations, Theory, and Algorithms, Springer, 2016. doi:10.1007/978-3-319-42337-1.
[13] Q. Huang, H. He, A. Singh, S.-N. Lim, A. Benson, Combining label propagation and simple models out-performs graph neural networks, in: 9th Int. Conf. Learn. Representations, 2021.
[14] S. Prasad, B. Le Saux, N. Yokoya, R. Hansch, 2018 IEEE GRSS Data Fusion Challenge - Fusion of Multispectral LiDAR and Hyperspectral Data, 2018. doi:10.21227/jnh9-nz89.
[15] M. Abadi, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/.
[16] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Advances in Neural Inf. Process. Syst., volume 30, Long Beach, CA, USA, 2017, pp. 6402–6413.
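As an illustrative sketch of this evaluation (not the authors' code), AUROC can be computed threshold-free by treating OOD membership as the positive label and the uncertainty value as the score; the rank-based form below is equivalent to standard implementations such as scikit-learn's `roc_auc_score`.

```python
def auroc(uncertainty, is_ood):
    """AUROC of an uncertainty score as an OOD detector: the probability
    that a randomly chosen OOD sample receives higher uncertainty than a
    randomly chosen in-distribution sample (ties count one half).
    O(n_ood * n_id) pairwise form, written for clarity rather than speed.
    """
    ood = [u for u, o in zip(uncertainty, is_ood) if o]
    ind = [u for u, o in zip(uncertainty, is_ood) if not o]
    wins = sum((o > i) + 0.5 * (o == i) for o in ood for i in ind)
    return wins / (len(ood) * len(ind))
```

A perfect detector, assigning every OOD sample higher uncertainty than every in-distribution sample, scores 1.0; a random one scores 0.5.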
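The repeated-sampling protocol above can be sketched as follows: draw the held-out class pairs once, up front, with a seeded generator, so that every model type is evaluated against the same pairs. This is a hypothetical sketch of the setup, not the authors' implementation; `num_classes` and the seed are assumptions.

```python
import random

def sample_ood_pairs(num_classes, n_repeats=10, seed=0):
    """Draw `n_repeats` pairs of distinct class indices to hold out as
    OOD. Seeding the generator makes the pairs reproducible, so each
    model type can be assessed on the same pairs for fairness.
    """
    rng = random.Random(seed)
    return [tuple(rng.sample(range(num_classes), 2))
            for _ in range(n_repeats)]
```

For each pair, the two classes are withheld from training, the model is trained and evaluated, and the resulting AUROC/AUPR values are averaged over the pairs to obtain the mean and standard deviation.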