CEUR Workshop Proceedings Vol-2808, Paper 7. PDF: https://ceur-ws.org/Vol-2808/Paper_7.pdf. DBLP: https://dblp.org/rec/conf/aaai/HuangLWCZD21
Feature Space Singularity for Out-of-Distribution Detection

Haiwen Huang1, Zhihan Li2, Lulu Wang3, Sishuo Chen2, Bin Dong2,4,5, Xinyu Zhou3

1 Department of Computer Science, University of Oxford
2 Peking University
3 MEGVII Technology
4 Beijing International Center for Mathematical Research
5 Institute for Artificial Intelligence and Center for Data Science

haiwen.huang2@cs.ox.ac.uk, zxy@megvii.com, dongbin@math.pku.edu.cn

Abstract

Out-of-Distribution (OoD) detection is important for building safe artificial intelligence systems. However, current OoD detection methods still cannot meet the performance requirements for practical deployment. In this paper, we propose a simple yet effective algorithm based on a novel observation: in a trained neural network, OoD samples with bounded norms concentrate tightly in the feature space. We call the center of the OoD features the Feature Space Singularity (FSS), and denote the distance of a sample's feature to the FSS as the FSSD. OoD samples can then be identified by thresholding the FSSD. Our analysis of this phenomenon reveals why the algorithm works. We demonstrate that our algorithm achieves state-of-the-art performance on various OoD detection benchmarks. Moreover, FSSD is robust to slight corruption of the test data and can be further enhanced by ensembling. These properties make FSSD a promising algorithm for real-world deployment. We release our code at https://github.com/megvii-research/FSSD_OoD_Detection.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Introduction

Empirical risk minimization fits a statistical model on a training set that is independently sampled from the data distribution. As a result, the resulting model is expected to generalize to in-distribution data drawn from the same distribution. However, in real applications, it is inevitable that a model will make predictions on Out-of-Distribution (OoD) data rather than the in-distribution data on which it was trained. This can lead to fatal errors such as over-confident or nonsensical predictions (Hein, Andriushchenko, and Bitterwolf 2018; Rabanser, Günnemann, and Lipton 2019). It is therefore crucial to understand the uncertainty of models and to detect OoD data automatically. In applications such as autonomous driving and medical services, if the model knows what it does not know, human intervention can be sought and safety can be significantly improved.

Consider one particular example of OoD detection: some high-quality human face images are given as in-distribution data (the training set for the OoD detector), and we want to filter out non-faces and low-quality faces from a large pool of data in the wild (the test set) in order to ensure reliable prediction. One natural solution is to remove test samples that are far from the training data under some designated distance (Lee et al. 2018; van Amersfoort et al. 2020). However, computing the distance to the whole training set requires a formidable amount of computation unless the features and architecture are specially designed, e.g., by training an RBF network (van Amersfoort et al. 2020). In this paper, we present a simple yet effective distance-based solution that neither computes distances to the training data nor requires any training beyond that of a standard classifier.

Our approach is based on a novel observation about OoD samples:

    In a trained neural network, OoD samples with bounded norms concentrate tightly in the feature space of the neural network.

In Figure 1, we show an example where OoD features from ImageNet (Russakovsky et al. 2015) concentrate in a neural network trained on the facial dataset MS-1M (Guo et al. 2016). Figures 2 and 3 provide more examples of this phenomenon. In fact, we find the phenomenon to be universal across most training configurations and most datasets.

To be more precise, for a given feature extractor $F_\theta$ trained on in-distribution data, the observation states that there exists a point $F^*$ in the output space of $F_\theta$ such that $\|F_\theta(x) - F^*\|$ is small for $x \in X_{\mathrm{OoD}}$, where $X_{\mathrm{OoD}}$ is the set of OoD samples. We call $F^*$ the Feature Space Singularity (FSS). Moreover, we discover that the FSS Distance (FSSD)

$$\mathrm{FSSD}(x) := \| F_\theta(x) - F^* \| \quad (1)$$

reflects the degree of OoD-ness, and thus can readily be used as a metric for OoD detection.

Our analysis demonstrates that this phenomenon can be explained by the training dynamics. The key observation is that FSSD can be seen as an approximation of the movement of $F_{\theta_t}(x)$ during training, where $F^*$ is the initial concentration point of the features. The difference in the moving speed $\frac{dF_{\theta_t}(x)}{dt}$ stems from the different similarities to the training data, measured by the inner product of the gradients. Moreover, this similarity measure varies with the architecture of the feature extractor.

We demonstrate the effectiveness of our proposed method with multiple neural networks (LeNet (LeCun and Cortes
[Figure 1 appears here: left, exemplar images ordered by increasing face quality (panels A–D); right, a scatter plot with the legend In-distribution, OoD, FSS F*.]
Figure 1: Left: Histogram of the FSS Distance (FSSD) of MS1M (in-distribution) and ImageNet (OoD). Exemplar images are shown at different FSSDs. We can see that FSSD reflects the OoD degree: as the FSSD increases, images change from non-faces and pseudo-faces to low-quality faces and high-quality faces. Right: Principal components of features from the penultimate layer. The spatial relationship among the FSS, OoD data, and in-distribution data is shown.
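Once $F^*$ is estimated, computing the FSSD of Equation (1) and thresholding it takes only a few lines. Below is a minimal numpy sketch (the feature matrix and threshold are made-up illustrations, not values from the paper); note that, per the observation above, small FSSD indicates OoD:

```python
import numpy as np

def fssd(features, fss):
    """FSS Distance (Equation (1)): Euclidean distance of each feature to F*."""
    return np.linalg.norm(features - fss, axis=-1)

def is_ood(features, fss, threshold):
    """OoD features concentrate near the FSS, so a small FSSD flags OoD."""
    return fssd(features, fss) < threshold

# Toy illustration: OoD samples sit near F*, while in-distribution
# samples have been pulled away from it during training.
fss = np.zeros(2)                     # hypothetical FSS in a 2-D feature space
feats = np.array([[0.1, -0.2],        # near F*  -> flagged as OoD
                  [5.0,  4.0]])       # far away -> in-distribution
print(is_ood(feats, fss, threshold=1.0))  # [ True False]
```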


2010), ResNet (He et al. 2016), ResNeXt (Xie et al. 2017), and LSTM (Hochreiter and Schmidhuber 1997)) trained on various datasets for classification (FashionMNIST (Xiao, Rasul, and Vollgraf 2017), CIFAR10 (Krizhevsky 2009), ImageNet (Russakovsky et al. 2015), CelebA (Liu et al. 2015), MS-1M (Guo et al. 2016), and the bacteria genome dataset (Ren et al. 2019)) with varying training set sizes. We show that FSSD achieves state-of-the-art performance on almost all the considered benchmarks. Moreover, the performance margin between FSSD and other methods increases as the size of the training set increases. In particular, on large-scale benchmarks (CelebA and MS-1M), FSSD advances the AUROC by about 5%. We also evaluate the robustness of our algorithm when test images are corrupted and find that it can still reliably detect OoD samples under this circumstance. Finally, we investigate the effects of ensembling FSSDs from different layers of a single neural network and from multiple trained networks.

We summarize our contributions as follows.

• We observe that in the feature spaces of trained networks, OoD samples concentrate near a point (FSS), and the distance from a sample's feature to the FSS (FSSD) measures the degree of OoD (Section 1).

• We analyze the concentration phenomenon by analyzing the dynamics of in-distribution and OoD features during training (Section 2).

• We introduce the FSSD algorithm (Section 3), which achieves state-of-the-art performance on various OoD detection benchmarks with considerable robustness (Section 4).

Analyzing and Understanding the Concentration Phenomenon

In this section, we analyze the concentration phenomenon. The key observation is that during training, the features of the training data are supervised to move away from the initial point, and the moving speeds of the features of other data depend on their similarity to the training data. Specifically, this similarity is measured by the inner product of the gradients. Therefore, data that are dissimilar to the training data will move little and concentrate in the feature space. This is how FSSD identifies OoD data.

To see this, we derive the training dynamics of the feature vectors. We denote by $F_\theta: \mathbb{R}^a \to \mathbb{R}^b$ the feature extractor, which maps inputs to features, and by $G_\phi: \mathbb{R}^b \to \mathbb{R}^c$ the map from features to outputs. The two mappings are parameterized by $\theta$ and $\phi$, respectively. The corresponding loss function can be denoted as $\mathcal{L}_\phi(F_\theta(x_1), \cdots, F_\theta(x_M))$. A popular choice is $\mathcal{L}_\phi(F_\theta(x_1), \cdots, F_\theta(x_M)) = \sum_{m=1}^{M} \ell(G_\phi(F_\theta(x_m)), y_m)/M$, where $\ell$ is the cross entropy loss or the mean squared error. Then, the gradient descent dynamics of $\theta$ is

$$\frac{d\theta_t}{dt} = -\frac{\partial \mathcal{L}_\phi}{\partial \theta_t}\left(F_{\theta_t}(x_1), \cdots, F_{\theta_t}(x_M)\right) = -\sum_{m=1}^{M} \left(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\right)^{T} \partial_m \mathcal{L}_\phi, \quad (2)$$

where $\partial_m \mathcal{L}_\phi = \frac{\partial \mathcal{L}_\phi}{\partial F_{\theta_t}(x_m)} \in \mathbb{R}^b$ is the backward propagation gradient and the subscript $t$ denotes the training time. The dynamics of the feature extractor $F_\theta$ as a function is therefore

$$\frac{dF_{\theta_t}(x)}{dt} = \frac{\partial F_{\theta_t}(x)}{\partial \theta_t} \frac{d\theta_t}{dt} = -\sum_{m=1}^{M} \frac{\partial F_{\theta_t}(x)}{\partial \theta_t} \left(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\right)^{T} \partial_m \mathcal{L}_\phi. \quad (3)$$
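Equation (3) can be checked numerically in the simplest setting: for a linear feature extractor $F_\theta(x) = Wx$ with the mean squared error loss, the gradient inner product reduces to $\Theta(x, x_m) = \langle x, x_m \rangle I$, so a single (discrete) gradient step moves the feature by $-\eta \sum_m \langle x, x_m \rangle\, \partial_m \mathcal{L}_\phi$. A small numpy sketch (dimensions and step size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, M, eta = 4, 3, 5, 0.1    # input dim, feature dim, train set size, step size

W = rng.normal(size=(b, a))    # linear feature extractor F_theta(x) = W x
X = rng.normal(size=(M, a))    # training inputs x_1, ..., x_M
Y = rng.normal(size=(M, b))    # regression targets
x = rng.normal(size=a)         # an arbitrary probe input

# Backward gradients dL/dF(x_m) for L = sum_m ||W x_m - y_m||^2 / (2M)
dL = (X @ W.T - Y) / M         # row m is the gradient for sample m

# One gradient-descent step on the parameters W
W_new = W - eta * dL.T @ X

# Equation (3): the feature of x moves by -eta * sum_m Theta(x, x_m) dL_m,
# where Theta(x, x_m) = <x, x_m> * I for this linear extractor.
predicted = -eta * (X @ x) @ dL
actual = W_new @ x - W @ x
assert np.allclose(predicted, actual)
```

The probe input x need not be a training point: the kernel weights <x, x_m> alone decide how much its feature moves, which is exactly the mechanism behind the concentration of OoD features.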
[Figure 2 appears here: five panels at training steps 0, 150, 1000, 2000, and 12000 for (a) the dynamics of the features $F_{\theta_t}(x)$ (PCA scatter plots with In-distribution (MNIST), OoD (FMNIST), and FSS F* marked) and (b) histograms of the norm of the derivative, i.e., the "moving speed" ‖dF(x)/dt‖, of the last-layer feature vector at the same time steps.]

Figure 2: We show the first two principal components of the feature vector and the L2 norm of the derivatives (Equation (3)). Features and derivatives are calculated from the last fully-connected layer of a LeNet trained on MNIST (in-distribution). We feed in FashionMNIST data as OoD samples. At initialization, the features of both in-distribution and OoD samples concentrate near the FSS F*. After training, features of in-distribution samples are pulled away from the FSS F*, while features of OoD samples remain close to it. Similar dynamics of the softmax layer on in-distribution data were observed by (Li, Zhao, and Scheidegger 2020).
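The speed gap in Figure 2(b) suggests viewing the gradient inner product as a similarity kernel, which is made precise below (Equation (4)). As a toy numerical illustration of that view, the sketch below uses an RBF kernel as a hypothetical stand-in for the tangent-kernel similarity, with synthetic data throughout: points far from the training set receive near-zero kernel weight and hence a small FSSD.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(x, xm, gamma=1.0):
    # Hypothetical stand-in for the (scalar) tangent-kernel similarity:
    # it decays as x moves away from the training point x_m.
    return np.exp(-gamma * np.sum((x - xm) ** 2))

X_train = rng.normal(size=(20, 2))   # synthetic in-distribution inputs
nu = rng.normal(size=(20, 3))        # synthetic accumulated gradients nu_m

def fssd_kernel(x):
    # Kernel-regression form of FSSD: || sum_m Theta(x, x_m) nu_m ||
    weights = np.array([rbf(x, xm) for xm in X_train])
    return np.linalg.norm(weights @ nu)

in_dist = X_train[0]                 # lies among the training data
ood = np.full(2, 10.0)               # far from all training data
assert fssd_kernel(ood) < fssd_kernel(in_dist)
```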


From Equation (3), we can see that the relative moving speed of the feature $F_{\theta_t}(x)$ depends on the inner product of the gradients with respect to the parameters between $x$ and the training data $x_m$. Note that $\partial_m \mathcal{L}_\phi$ here is the same for all $x$. Since the FSSD defined in Equation (1) can be seen as the integral of $\frac{dF_{\theta_t}(x)}{dt}$ when the initial value $F_{\theta_0}(x)$ is $F^*$ for all $x$, $\mathrm{FSSD}(x)$ will also be small when the derivative, i.e., the moving speed, is small.

In Figure 2, we show both the features and their moving speeds for in-distribution and OoD data at different steps during training. We can see that although in-distribution and OoD data are indistinguishable at step 0, they are quickly separated, since the moving speeds of in-distribution data are larger than those of OoD data (Figure 2(b)), and thus the accumulated movements of in-distribution data are also larger than those of OoD data (Figure 2(a)). In Figure 3, we show examples of the initial concentration of features in LeNet and ResNet-34 for the MNIST vs. FashionMNIST and CIFAR10 vs. SVHN dataset pairs, respectively. Empirically, we find the concentration of both in-distribution and OoD features at the initial stage to be the common case for most popular architectures using random initialization. We show more examples on our Github page.

As mentioned above, Equation (3) demonstrates that the difference in the moving speed of $F_{\theta_t}(x)$ stems from the difference in $\Theta_t(x, x_m) := \frac{\partial F_{\theta_t}(x)}{\partial \theta_t} \left(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\right)^{T}$. We want to further point out that $\Theta_t(x, x_m)$ effectively acts as a kernel that measures the similarity between $x$ and $x_m$. In fact, when the network width is infinite, $\Theta_t(x, x_m)$ converges to a time-independent term $\Theta(x, x_m)$, which is called the neural tangent kernel (NTK) (Jacot, Gabriel, and Hongler 2018; Li and Liang 2018; Cao and Gu 2020). In this way, using Equation (1) and the approximation $F^* \approx F_{\theta_0}(x)$, FSSD can be seen as a kernel regression result:

$$\mathrm{FSSD}(x) \approx \left\| F_{\theta_T}(x) - F_{\theta_0}(x) \right\| = \left\| \sum_{m=1}^{M} \int_0^T \Theta_t(x, x_m)\, \partial_m \mathcal{L}_\phi \, dt \right\| \approx \left\| \sum_{m=1}^{M} \Theta(x, x_m)\, \nu_m \right\|, \quad (4)$$

where $\nu_m = \int_0^T \partial_m \mathcal{L}_\phi \, dt$ and $T$ is the final training time.

This indicates that the similarity described by the inner product $\Theta_t(x, x_m)$ might enjoy similar properties to commonly used kernels such as the RBF kernel, which diminishes as the distance between $x$ and $x_m$ increases. Moreover, since the neural tangent kernel depends on the neural architecture, this kernel interpretation also suggests that feature extractors of different architectures, including different layers, can have different properties and measure different aspects of the similarity between $x$ and $x_m$. We will see this more clearly later in the investigation of FSSD in different layers.

Our Algorithm

Based on this phenomenon, we can now construct our OoD detection algorithm. Since the uniform noise input can be
[Figure 3 appears here: PCA of features at initialization ("Initial") and after training ("Trained"), with the legend In-distribution, OoD, FSS F*; panel (a) shows MNIST vs. FMNIST, and a second Initial/Trained panel pair follows.]

Algorithm 1: Computation of FSSD-Ensem

Input: Test samples $x = \{x_n^{\mathrm{test}}\}_{n=1}^{N}$, noise samples $\{x_s^{\mathrm{noise}}\}_{s=1}^{S}$, ensemble weights $\alpha_k$, perturbation magnitude $\epsilon$, feature extractors $\{F_{(k)}\}_{k=1}^{K}$

for each feature extractor $F_{(k)}$, $k = 1, \cdots, K$ do
  1. Estimate the FSS $F_{(k)}^* = \sum_{s=1}^{S} F_{(k)}(x_s^{\mathrm{noise}})/S$, where $x_s^{\mathrm{noise}} \sim U[0, 1]$, $s = 1, \cdots, S$
  2. Add a perturbation to the test sample: $\tilde{x} = x + \epsilon\, \mathrm{sign}(\nabla_x \| F_{(k)}(x) - F_{(k)}^* \|)$
  3. Calculate $\mathrm{FSSD}_{(k)}(x) = \| F_{(k)}(\tilde{x}) - F_{(k)}^* \|$
end

Return $\mathrm{FSSD\text{-}Ensem}(x) = \sum_{k=1}^{K} \alpha_k\, \mathrm{FSSD}_{(k)}(x)$
                                                                                                                          Experiments
                                                                                                   In this section, we investigate the performance of our FSSD
                                (b) CIFAR10 vs. SVHN
                                                                                                   algorithm on various OoD detection benchmarks.
Figure 3: Both in-distribution and OoD samples are clus-
tered in the feature space of Fθ0 (x) at initialization. More-
                                                                                                   Setup
over, F ∗ ≈ Fθ0 (x) for x ∈ XOoD ∪ Xin-dist .                                                      Benchmarks To conduct a thorough test of our method,
                                                                                                   we consider a wide variety of OoD detection benchmarks.
                                                                                                   In particular, we consider different scales of datasets and dif-
                                                                                                   ferent types of data. We consider different scales of datasets
considered to possess the highest degree of OoD, we use the                                        because large scale datasets tend to have more classes which
center of their features as the FSS F ∗ . The FSSD can then                                        can introduce more ambiguous data. The ambiguous data
be calculated using Equation (1). Note this calculation of                                         are of high classification uncertainty, but are not out-of-
FSS F ∗ is independent from the choice of in-distribution and                                      distribution. We list the benchmarks in Table 1.
OoD datasets. When such natural choice of uniform noise                                               We first consider two common benchmarks from previ-
is unavailable, we can choose FSS F ∗ to be the center of                                          ous OoD detection literature (van Amersfoort et al. 2020;
features of OoD validation data instead.                                                           Ren et al. 2019): (A) FMNIST (Xiao, Rasul, and Vollgraf
    Since a single forward-pass computation through the net-                                       2017) vs. MNIST (LeCun and Cortes 2010) and (B) CI-
work can give us features from each layer, it is also conve-                                       FAR10 (Krizhevsky 2009) vs. SVHN (Netzer et al. 2011).
nient to calculate FSSDs from different layers and ensem-                                          They are known to be challenging for many methods (Ren
                                   PK
ble them as FSSD-Ensem (x) = k=1 αk FSSD(k) (x). The                                               et al. 2019; Nalisnick et al. 2019a). (C) We also construct
ensemble weights αk can be determined using logistic re-                                           ImageNet (dogs), a subset of ImageNet (Russakovsky et al.
gression on some validation data as in (Lee et al. 2018) (see                                      2015) , as in-distribution data. The OoD data are non-dog
Evaluation section in Experiments). In later experiments, if                                       images from ImageNet.
not specified, we use the ensembled FSSD from all layers.                                             For large-scale problems, we consider three benchmarks.
We note that it is also possible to ensemble FSSDs from dif-                                       (D) We train models on ImageNet and detect corrupted im-
ferent architectures or multiple training snapshots (Xie, Xu,                                      ages from the ImageNet-C dataset (Hendrycks and Diet-
and Zhang 2013; Huang et al. 2017). This may further en-                                           terich 2019). We test each method on 80 sets of corruptions
hance the performance of OoD detection. We investigate the                                         (16 types and 5 levels). (E) We train models on face im-
effect of ensembling in the next section.                                                          ages without the “blurry” attribute from CelebA (Liu et al.
                                                                                                   2015) and detect face images with the “blurry” attribute. (F)
    Beside, we also adopt input pre-processing as in (Liang,                                       We train models on web images of celebrities from MS-
Li, and Srikant 2018; Lee et al. 2018) . The idea is to                                            Celeb-1M (MS-1M) (Guo et al. 2016) and detect video cap-
add small perturbations to the test samples in order to in-                                        tures from IJB-C (Maze et al. 2018) which in general have
crease the in-distribution score. It is shown in (Liang, Li,                                       lower quality due to pose, illumination, and resolution is-
and Srikant 2018; Kamoi and Kobayashi 2020) that in-                                               sues. We also consider (G) the bacteria genome benchmark
distribution data are more sensitive to such perturbation and                                      introduced by (Ren et al. 2019), which consists of sequence
it can therefore enlarge the score gap between in-distribution                                     data.
and OoD samples. In particular, we perturb as x̃ = x +                                                To train models on in-distribution datasets, we follow pre-
 sign (∇x FSSD (x)) and take FSSD (x̃) as the final score.                                        vious works (Lee et al. 2018) to train LeNet on FMNIST
    We present the pseudo-code of computing                                                        and ResNet with 34 layers on CIFAR10, ImageNet, and Ima-
FSSD-Ensem (x) in Algorithm 1.                                                                     geNet (dogs). For two face recognition datasets (CelebA and
Table 1: OoD detection benchmarks used in our experiments.

         In-distribution                                        OoD
         Dataset                #Classes       #Samples         Dataset                #Samples   Data type
                                (Train/Test)   (Train/Test)                            (Test)
    A    FMNIST                 10/10          60k/10k          MNIST                  10k        Image
    B    CIFAR10                10/10          50k/10k          SVHN                   26k        Image
    C    ImageNet (dogs)        50/50          50k/10k          ImageNet (non-dogs)    10k        Image
    D    ImageNet               1000/1000      1281.2k/50k      ImageNet-C             50k        Image
    E    CelebA (not blurry)    10122/10122    153.8k/38.5k     CelebA (blurry)        10.3k      Image
    F    MS-1M                  64736/16184    2923.6k/50k      IJB-C                  50k        Image
    G    Genome (before 2011)   10/10          1000k/1000k      Genome (after 2016)    6000k      Sequence



MS-1M), we train ResNeXt with 50 layers. For the genome sequence dataset, we use a character embedding layer and two bidirectional LSTM layers (Schuster and Paliwal 1997).

Compared methods We compare our method with the following six common methods for OoD detection. Base: the baseline method using the maximum softmax probability p(ŷ|x) (Hendrycks and Gimpel 2017). ODIN: temperature scaling on logits and input pre-processing (Liang, Li, and Srikant 2018). Maha: the Mahalanobis distance of the sample feature to the closest class-conditional Gaussian distribution estimated from the training data (Lee et al. 2018); in our experiments, we follow (Lee et al. 2018) to use both feature (layer) ensembling and input pre-processing. DE: Deep Ensemble, which averages the softmax probability predictions from multiple independently trained classifiers (Lakshminarayanan, Pritzel, and Blundell 2017); in our experiments, we take the average of 5 classifiers by default. MCD: Monte-Carlo Dropout, which uses dropout during both training and inference (Gal and Ghahramani 2016); we follow (Ovadia et al. 2019) to apply dropout to convolutional layers, and for OoD detection we calculate both the mean and the variance of 32 independent predictions and report the better of the two. OE: Outlier Exposure, which explicitly enforces uniform probability predictions on an auxiliary dataset of outliers (Hendrycks, Mazeika, and Dietterich 2019). For the choice of auxiliary datasets, we use KMNIST (Clanuwat et al. 2018) for benchmark A, CelebA (Liu et al. 2015) for benchmark C, and ImageNet-1K (Russakovsky et al. 2015) for benchmarks B, E, and F. We do not evaluate OE on the sequence benchmark, since we cannot find a reasonable auxiliary dataset. We remark that Base, ODIN, and FSSD can be deployed directly with a trained neural network, MCD needs a trained neural network with dropout layers, and DE needs multiple trained classifiers. Besides, Maha needs access to the training data during OoD detection on test data, and OE trains a neural network either from scratch or by fine-tuning to utilize the auxiliary dataset.

Evaluation We follow (Ren et al. 2019; Hendrycks, Mazeika, and Dietterich 2019) and use the following metrics to assess OoD detection performance. AUROC: Area Under the Receiver Operating Characteristic curve. AUPRC: Area Under the Precision-Recall Curve. FPR80: the False Positive Rate when the true positive rate is 80%.

For hyper-parameter tuning, we follow (Lee et al. 2018; Ren et al. 2019; Liang, Li, and Srikant 2018) and use a separate validation set, which consists of 1,000 images from each in-distribution and OoD data pair. Ensemble weights α_k for FSSDs from different layers are extracted from a logistic regression model, which is trained using nested cross-validation within the validation set as in (Lee et al. 2018; Ma et al. 2018). The same procedure is performed on Maha for a fair comparison. The perturbation magnitude ε of input pre-processing for ODIN, Maha, and FSSD is searched from 0 to 0.2 with step size 0.01. The temperature T of ODIN is chosen from 1, 10, 100, and 1000, and the dropout rate of MCD is chosen from 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5.

Main results

The main results are presented in Table 2 and Figure 4. In Table 2, we can see that larger datasets entail greater difficulty in OoD detection. Notably, the advantage of FSSD over other methods increases with the dataset size. Other methods such as Maha and OE perform well on some small benchmarks but have large variance across datasets. In comparison, FSSD maintains strong performance across these benchmarks. On the genome sequence dataset, we also observe that FSSD outperforms the other methods. These results show that FSSD is a promising and effective method for a wide range of applications.

Inspired by (Ovadia et al. 2019), we also evaluate the methods on their ability to detect distributional dataset shift such as Gaussian noise and JPEG artifacts. Figure 4 shows the means and quartiles of the AUROC of the compared methods over 16 types of corruptions at 5 corruption levels. We observe that, for each method, OoD detection performance increases with the level of corruption, while FSSD enjoys the highest AUROC and much less variation across corruption types. The CelebA benchmark also evaluates the methods on detecting the dataset shift of the attribute "blurry". However, no method, including FSSD, performs very well there. There are two possible reasons: (1) the "blurry" attribute of CelebA may not be annotated clearly enough; (2) blurs in the wild may be harder to detect than the simulated blurs in ImageNet-C. Overall, we can see that FSSD detects different kinds of distributional shift more reliably.
Table 2: Main results. All values are in %.

                       Datasets (Architecture)          Metrics   Base   ODIN   Maha   DE     MCD    OE     FSSD
   Small-scale         FMNIST vs. MNIST                 AUROC     77.3   96.9   99.6   83.9   81.7   99.6   99.6
   benchmarks          (LeNet)                          AUPRC     79.2   93.0   99.7   83.3   85.3   99.6   99.7
                                                        FPR80     43.5    2.5    0.0   27.5   36.8    0.0    0.0
                       CIFAR10 vs. SVHN                 AUROC     89.9   96.7   99.1   93.7   96.7   90.4   99.5
                       (ResNet34)                       AUPRC     85.4   92.5   98.1   90.6   93.9   89.8   99.5
                                                        FPR80     10.1    4.7    0.3    3.7    2.4   12.5    0.4
                       ImageNet dogs vs. non-dogs       AUROC     88.5   90.8   83.3   89.0   67.2   92.5   93.1
                       (ResNet34)                       AUPRC     86.1   88.6   83.0   89.0   66.9   92.6   92.5
                                                        FPR80     19.5   15.2   30.1   18.8   59.2    7.9   10.2
   Large-scale         CelebA non-blurry vs. blurry     AUROC     71.7   73.3   73.9   74.5   69.8   71.5   78.3
   benchmarks          (ResNeXt50)                      AUPRC     89.9   91.4   90.9   91.4   88.7   90.7   92.8
                                                        FPR80     52.0   50.3   46.0   47.1   53.2   54.2   39.2
                       MS-1M vs. IJB-C                  AUROC     60.0   61.3   82.5   63.0   65.5   52.6   86.7
                       (ResNeXt50)                      AUPRC     53.3   55.9   80.6   56.1   59.4   46.6   86.1
                                                        FPR80     61.8   59.4   29.6   56.7   58.8   64.2   22.1
   Sequence            Bacteria Genome                  AUROC     69.6   70.6   70.4   70.0   69.3   NA     74.8
   benchmark           (LSTM)                           AUPRC     69.9   71.9   69.3   56.0   70.2   NA     75.8
                                                        FPR80     57.4   55.9   53.7   30.0   58.3   NA     47.4

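To make the scoring pipeline of Algorithm 1 concrete, the following is a small self-contained sketch, not the authors' implementation: fixed random linear maps stand in for the K feature extractors F^(k), the FSS is estimated from uniform noise, and the perturbation gradient is computed analytically for this toy choice. All dimensions, sample counts, and the step size ε are illustrative assumptions.

```python
# Illustrative sketch of Algorithm 1 (FSSD-Ensem); toy stand-ins, not real networks.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT, K = 64, 16, 3

# Hypothetical per-layer feature extractors: fixed random linear maps.
WS = [rng.normal(size=(D_FEAT, D_IN)) / np.sqrt(D_IN) for _ in range(K)]

def feature(k, x):
    return WS[k] @ x

def estimate_fss(k, n_noise=256):
    """Step 1: the FSS F*_(k) is the mean feature of uniform-noise inputs."""
    noise = rng.uniform(0.0, 1.0, size=(n_noise, D_IN))
    return (noise @ WS[k].T).mean(axis=0)

def fssd(k, x, fss, eps=0.01):
    """Steps 2-3: perturb x along sign(grad ||F(x) - F*||), then score the
    perturbed input by its feature-space distance to the FSS."""
    diff = feature(k, x) - fss
    grad = WS[k].T @ (diff / (np.linalg.norm(diff) + 1e-12))  # analytic gradient
    x_tilde = x + eps * np.sign(grad)
    return np.linalg.norm(feature(k, x_tilde) - fss)

def fssd_ensem(x, alphas, fss_list, eps=0.01):
    """Weighted per-layer ensemble: sum_k alpha_k * FSSD^(k)(x)."""
    return sum(a * fssd(k, x, f, eps)
               for k, (a, f) in enumerate(zip(alphas, fss_list)))

fss_list = [estimate_fss(k) for k in range(K)]
x_test = rng.uniform(size=D_IN)
score = fssd_ensem(x_test, [1.0 / K] * K, fss_list)
```

Since ‖Wx − F*‖ is convex in x, the sign-gradient step can only increase each per-layer score here; on a real network the gradient would come from automatic differentiation instead.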

Figure 4: Comparison of AUROC on ImageNet vs. ImageNet-C. FSSD enjoys the highest mean and the least variance across all corruption levels. [Line plot: AUROC (0.5–1.0) against corruption level (1–5) for Base, MCD, ODIN, Maha, and FSSD.]

[Figure 5 panels: AUROC against Gaussian noise (0.8%–3.0%) and impulse noise (0.3%–1.2%) on CIFAR10 vs. SVHN and ImageNet dogs vs. non-dogs, for ODIN, OE, Maha, and FSSD.]

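The threshold-free metrics from the Evaluation section have simple score-based forms. The sketch below is our own illustrative code (plain NumPy, assuming a larger score means "more in-distribution"): AUROC via its rank-statistic interpretation and FPR80 via a score threshold; AUPRC can be computed analogously from precision-recall pairs and is omitted.

```python
# Illustrative implementations of AUROC and FPR80 for score-based OoD detection.
import numpy as np

def auroc(in_scores, ood_scores):
    """AUROC as a rank statistic: the probability that a random in-distribution
    sample outscores a random OoD sample (ties count half)."""
    in_s = np.asarray(in_scores, dtype=float)
    ood_s = np.asarray(ood_scores, dtype=float)
    wins = (in_s[:, None] > ood_s[None, :]).sum()
    ties = (in_s[:, None] == ood_s[None, :]).sum()
    return (wins + 0.5 * ties) / (in_s.size * ood_s.size)

def fpr_at_tpr(in_scores, ood_scores, tpr=0.80):
    """FPR80: fraction of OoD samples above the score threshold that keeps
    `tpr` of the in-distribution samples above it."""
    thresh = np.quantile(np.asarray(in_scores, dtype=float), 1.0 - tpr)
    return float(np.mean(np.asarray(ood_scores, dtype=float) >= thresh))

# Perfectly separated scores give the ideal values:
assert auroc([3.0, 4.0, 5.0], [0.0, 1.0, 2.0]) == 1.0
assert fpr_at_tpr([3.0, 4.0, 5.0], [0.0, 1.0, 2.0]) == 0.0
```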
Figure 5: Comparison of OoD detection robustness among methods on slightly corrupted test data.

Robustness

In practice, it is possible that the test data are slightly corrupted or shifted due to a change of data source, e.g., from lab to real world. We evaluate the ability to distinguish in-distribution data from OoD data when the test data (both in-distribution and OoD) are slightly corrupted. Note that we still use non-corrupted data during network training and hyper-parameter tuning. We apply Gaussian noise and impulse noise, two typical corruptions, at varying levels. Test results on CIFAR10 vs. SVHN and ImageNet dogs vs. non-dogs are shown in Figure 5. We can see that FSSD is robust to corruptions present in test images, while other methods may degrade.

Effects of ensemble

During our experiments, we find that the ensemble plays an important role in enhancing the performance of FSSD. Previous studies show that an important issue for ensemble-based algorithms is enforcing diversity (Lakshminarayanan, Pritzel, and Blundell 2017). In our case, we find that FSSD has high diversity across different layers and benefits from such diversity to reach higher performance. In Figure 6, we find that FSSDs from different layers work differently. This can be explained by previous work on understanding neural networks by visualizing the different representations learned by low and deep layers of a neural network (Zeiler and Fergus 2014; Zhou et al. 2015). Generally, FSSDs from deep layers reflect more high-level features, while FSSDs from early layers reflect more low-level statistics. ImageNet (dogs) and ImageNet (non-dogs) are from the same dataset (ImageNet) and are therefore similar in terms of low-level statistics, while the differences between CIFAR10 and SVHN lie at all levels. From the perspective of the kernel interpretation, this means that the neural tangent kernels of different layers diversify well and allow the ensemble of FSSD to capture different aspects of the dis-
[Figure 6: per-layer (Layer 0–4) histograms of FSSD for in-distribution vs. OoD samples. (a) ImageNet (dogs) vs. ImageNet (non-dogs), followed by a second row of panels.]
0.00                                                                 0.000                                                                      0.00                                                          0.0                                                     0.00
               10          20           30           40                              25         30            35         40          45                  16    18    20    22      24      26    28     30            6        8    10      12    14        16   18             10   15        20   25       30    35   40         45
                                FSSD                                                                    FSSD                                                                 FSSD                                                         FSSD                                                       FSSD

                                                                                                                                                           (b) CIFAR10 vs. SVHN

Figure 6: FSSDs from different layers behave differently. Each row contains FSSD histograms extracted from different layers of a trained neural network. FSSDs of ImageNet (dogs) and ImageNet (non-dogs) are similar in early layers, while FSSDs of CIFAR10 and SVHN differ in all layers. This can be explained by the fact that ImageNet (dogs) and ImageNet (non-dogs) are similar in low-level statistics, since they are sampled from the same dataset, and that FSSDs in early layers capture more of the difference in low-level statistics.
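As a rough illustration of how per-layer FSSD histograms like those in Figure 6 could be produced, the sketch below scores each sample by the distance between its layer features and an approximate feature-space singularity. Approximating the singularity F* by the mean feature of uninformative noise inputs, and the Gaussian toy features themselves, are assumptions made here for illustration, not the paper's exact pipeline.

```python
import numpy as np

def fssd_per_layer(feats, feats_noise):
    """FSSD score for one layer: distance from each sample's feature
    vector to an approximate singularity F* (here, the mean feature of
    uninformative noise inputs -- an illustrative approximation).

    feats:       (n, d) array of layer features for test inputs
    feats_noise: (m, d) array of the same layer's features for noise inputs
    """
    f_star = feats_noise.mean(axis=0)            # approximate singularity F*
    return np.linalg.norm(feats - f_star, axis=1)

# Toy features: in-distribution features sit far from F*,
# while OoD-like features concentrate near it.
rng = np.random.default_rng(0)
feats_id = rng.normal(loc=5.0, size=(100, 8))    # far from F*
feats_ood = rng.normal(loc=0.5, size=(100, 8))   # close to F*
feats_noise = rng.normal(loc=0.0, scale=0.1, size=(20, 8))

s_id = fssd_per_layer(feats_id, feats_noise)
s_ood = fssd_per_layer(feats_ood, feats_noise)
print(s_id.mean() > s_ood.mean())  # the two score histograms separate
```

Repeating this for features extracted at several layers yields one histogram pair per layer, which is what each panel row of Figure 6 visualizes.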


crepancy between the test data and training data. We show more examples of FSSDs in different layers on our GitHub page.

Related works

Out-of-distribution detection

According to different understandings of OoD samples, previous OoD detection methods can be summarized into four categories.

(1) Some methods regard OoD samples as those with uniform probability predictions across classes (Hein, Andriushchenko, and Bitterwolf 2018; Hendrycks and Gimpel 2017; Liang, Li, and Srikant 2018; Meinke and Hein 2020) and treat test samples with high entropy or low maximum prediction probability as OoD data. Since these methods are based on predictions, they run the risk of misclassifying ambiguous data as OoD samples, e.g., when there are thousands of classes in a large-scale dataset.

(2) OoD samples can also be characterized as samples with high epistemic uncertainty, which reflects the lack of information on these samples (Lakshminarayanan, Pritzel, and Blundell 2017; Gal and Ghahramani 2016). Specifically, we can propagate the uncertainty of the model to the uncertainty of its predictions, which characterizes the level of OoD. MC Dropout (MCD) and Deep Ensembles (DE) are two popular choices of this type. However, it has been reported that current epistemic uncertainty estimates may noticeably degrade under dataset distributional shift (Ovadia et al. 2019). Our experiments on detecting ImageNet-C from ImageNet (Figure 4) confirm this.

(3) When the density of the data can be approximated, e.g., using generative models (Kingma and Dhariwal 2018; Salimans et al. 2017), OoD samples can be classified as those with low density. Recent works provide many inspiring insights on how to improve this idea (Ren et al. 2019; Nalisnick et al. 2019b; Serrà et al. 2020). However, these methods typically incur extra training difficulty from the large generative models involved.

(4) There are also works designing non-Euclidean metrics to compare test samples to training samples, regarding those at higher distances from the training samples as OoD (Lee et al. 2018; van Amersfoort et al. 2020; Kamoi and Kobayashi 2020; Lakshminarayanan et al. 2020). Our approach resembles this type most. However, instead of comparing test samples to training samples, we compare the features of the test samples to the center of OoD features.

Conclusion

In this work, we propose a new OoD detection algorithm based on the novel observation that OoD samples concentrate in the feature space of a trained neural network. We analyze and explain this concentration phenomenon by studying the training dynamics both theoretically and empirically, and further interpret the algorithm through the neural tangent kernel. We demonstrate that our algorithm achieves state-of-the-art detection performance and is robust to measurement noise. Our further investigation of the effect of ensembling reveals diversity among layer ensembles and shows promising performance for network ensembles. In summary, we hope that our work provides new insights into the properties of neural networks and adds a simple and effective alternative OoD detection method to the safe-AI deployment toolkit.

Acknowledgement

Bin Dong is supported in part by Beijing Natural Science Foundation (No. 180001), National Natural Science Foundation of China (NSFC) grant No. 11831002, and Beijing Academy of Artificial Intelligence (BAAI).
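To make the prediction-based scores of category (1) in the Related works section concrete, here is a minimal sketch of the two standard scores, predictive entropy and maximum softmax probability. The function names and toy logits are illustrative, not code from any of the cited works.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    """Maximum softmax probability: low values suggest OoD."""
    return softmax(logits).max(axis=-1)

def entropy_score(logits):
    """Predictive entropy: high values suggest OoD."""
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

confident = np.array([[8.0, 0.0, 0.0]])   # peaked prediction
ambiguous = np.array([[1.0, 1.0, 1.0]])   # uniform prediction
print(msp_score(confident)[0] > msp_score(ambiguous)[0])          # True
print(entropy_score(ambiguous)[0] > entropy_score(confident)[0])  # True
```

The toy example also shows the failure mode noted in category (1): a genuinely ambiguous in-distribution input produces the same uniform prediction as an OoD input, so both scores would flag it as OoD.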
References

Cao, Y.; and Gu, Q. 2020. Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 3349–3356. AAAI Press. URL https://aaai.org/ojs/index.php/AAAI/article/view/5736.

Clanuwat, T.; Bober-Irizar, M.; Kitamoto, A.; Lamb, A.; Yamamoto, K.; and Ha, D. 2018. Deep Learning for Classical Japanese Literature.

Gal, Y.; and Ghahramani, Z. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Volume 48 of Proceedings of Machine Learning Research, 1050–1059. New York, New York, USA: PMLR. URL http://proceedings.mlr.press/v48/gal16.html.

Guo, Y.; Zhang, L.; Hu, Y.; He, X.; and Gao, J. 2016. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In Leibe, B.; Matas, J.; Sebe, N.; and Welling, M., eds., Computer Vision – ECCV 2016, volume 9907, 87–102. Cham: Springer International Publishing. doi:10.1007/978-3-319-46487-9_6. URL http://link.springer.com/10.1007/978-3-319-46487-9_6.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. Las Vegas, NV, USA: IEEE. doi:10.1109/CVPR.2016.90. URL http://ieeexplore.ieee.org/document/7780459/.

Hein, M.; Andriushchenko, M.; and Bitterwolf, J. 2018. Why ReLU Networks Yield High-Confidence Predictions Far Away From the Training Data and How to Mitigate the Problem. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 41–50.

Hendrycks, D.; and Dietterich, T. 2019. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. Proceedings of the International Conference on Learning Representations.

Hendrycks, D.; and Gimpel, K. 2017. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. Proceedings of the International Conference on Learning Representations.

Hendrycks, D.; Mazeika, M.; and Dietterich, T. 2019. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations. URL https://openreview.net/forum?id=HyxCxhRcY7.

Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.

Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J. E.; and Weinberger, K. Q. 2017. Snapshot Ensembles: Train 1, Get M for Free. CoRR abs/1704.00109. URL http://arxiv.org/abs/1704.00109.

Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 8571–8580. Curran Associates, Inc. URL http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf.

Kamoi, R.; and Kobayashi, K. 2020. Why is the Mahalanobis Distance Effective for Anomaly Detection? arXiv:2003.00402 [cs, stat]. URL http://arxiv.org/abs/2003.00402.

Kingma, D. P.; and Dhariwal, P. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31, 10215–10224. Curran Associates, Inc. URL http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.

Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Technical report.

Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 6402–6413. Curran Associates, Inc. URL http://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles.pdf.

Lakshminarayanan, B.; Tran, D.; Liu, J.; Padhy, S.; Bedrax-Weiss, T.; and Lin, Z. 2020. Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness. In Advances in Neural Information Processing Systems 33.

LeCun, Y.; and Cortes, C. 2010. MNIST Handwritten Digit Database. URL http://yann.lecun.com/exdb/mnist/.

Lee, K.; Lee, K.; Lee, H.; and Shin, J. 2018. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, 7167–7177. Red Hook, NY, USA: Curran Associates Inc.

Li, M.; Zhao, Z.; and Scheidegger, C. 2020. Visualizing Neural Networks with the Grand Tour. Distill. doi:10.23915/distill.00025. URL https://distill.pub/2020/grand-tour.

Li, Y.; and Liang, Y. 2018. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, 8168–8177. Red Hook, NY, USA: Curran Associates Inc.

Liang, S.; Li, Y.; and Srikant, R. 2018. Enhancing the Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations. URL https://openreview.net/forum?id=H1VGkIxRZ.

Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).

Ma, X.; Li, B.; Wang, Y.; Erfani, S. M.; Wijewickrema, S.; Schoenebeck, G.; Houle, M. E.; Song, D.; and Bailey, J. 2018. Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality. In International Conference on Learning Representations. URL https://openreview.net/forum?id=B1gJ1L2aW.

Maze, B.; Adams, J.; Duncan, J. A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A. K.; Niggel, W. T.; Anderson, J.; Cheney, J.; and Grother, P. 2018. IARPA Janus Benchmark - C: Face Dataset and Protocol. In 2018 International Conference on Biometrics (ICB), 158–165. Gold Coast, QLD: IEEE. doi:10.1109/ICB2018.2018.00033. URL https://ieeexplore.ieee.org/document/8411217/.

Meinke, A.; and Hein, M. 2020. Towards Neural Networks that Provably Know When They Don't Know. In International Conference on Learning Representations. URL https://openreview.net/forum?id=ByxGkySKwH.

Nalisnick, E.; Matsukawa, A.; Teh, Y. W.; Gorur, D.; and Lakshminarayanan, B. 2019a. Do Deep Generative Models Know What They Don't Know? In International Conference on Learning Representations. URL https://openreview.net/forum?id=H1xwNhCcYm.

Nalisnick, E.; Matsukawa, A.; Teh, Y. W.; and Lakshminarayanan, B. 2019b. Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality.

Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.

Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J. V.; Lakshminarayanan, B.; and Snoek, J. 2019. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In NeurIPS.

Rabanser, S.; Günnemann, S.; and Lipton, Z. 2019. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; Alché-Buc, F. d.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 1396–1408. Curran Associates, Inc. URL http://papers.nips.cc/paper/8420-failing-loudly-an-empirical-study-of-methods-for-detecting-dataset-shift.pdf.

Ren, J.; Liu, P. J.; Fertig, E.; Snoek, J.; Poplin, R.; Depristo, M.; Dillon, J.; and Lakshminarayanan, B. 2019. Likelihood Ratios for Out-of-Distribution Detection. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; Alché-Buc, F. d.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems 32, 14707–14718. Curran Associates, Inc. URL http://papers.nips.cc/paper/9611-likelihood-ratios-for-out-of-distribution-detection.pdf.

Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3): 211–252. doi:10.1007/s11263-015-0816-y.

Salimans, T.; Karpathy, A.; Chen, X.; and Kingma, D. P. 2017. PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR.

Schuster, M.; and Paliwal, K. 1997. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45(11): 2673–2681. doi:10.1109/78.650093. URL https://doi.org/10.1109/78.650093.

Serrà, J.; Álvarez, D.; Gómez, V.; Slizovskaia, O.; Núñez, J. F.; and Luque, J. 2020. Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models. In International Conference on Learning Representations. URL https://openreview.net/forum?id=SyxIWpVYvr.

van Amersfoort, J.; Smith, L.; Teh, Y. W.; and Gal, Y. 2020. Simple and Scalable Epistemic Uncertainty Estimation Using a Single Deep Deterministic Neural Network.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms.

Xie, J.; Xu, B.; and Zhang, C. 2013. Horizontal and Vertical Ensemble with Deep Representation for Classification. CoRR abs/1306.2759. URL http://arxiv.org/abs/1306.2759.

Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; and He, K. 2017. Aggregated Residual Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5987–5995. Honolulu, HI: IEEE. doi:10.1109/CVPR.2017.634. URL http://ieeexplore.ieee.org/document/8100117/.

Zeiler, M.; and Fergus, R. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, Lecture Notes in Computer Science, 818–833. Springer Verlag. doi:10.1007/978-3-319-10590-1_53.

Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2015. Object Detectors Emerge in Deep Scene CNNs. In International Conference on Learning Representations (ICLR).