<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Space Singularity for Out-of-Distribution Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haiwen Huang</string-name>
          <email>haiwen.huang2@cs.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhihan Li</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lulu Wang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sishuo Chen</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Dong</string-name>
          <email>dongbin@math.pku.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinyu Zhou</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing International Center for Mathematical Research</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Oxford</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Artificial Intelligence and Center for Data Science</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>MEGVII Technology</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Peking University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Out-of-Distribution (OoD) detection is important for building safe artificial intelligence systems. However, current OoD detection methods still cannot meet the performance requirements for practical deployment. In this paper, we propose a simple yet effective algorithm based on a novel observation: in a trained neural network, OoD samples with bounded norms well concentrate in the feature space. We call the center of OoD features the Feature Space Singularity (FSS), and denote the distance of a sample feature to FSS as FSSD. Then, OoD samples can be identified by taking a threshold on the FSSD. Our analysis of the phenomenon reveals why our algorithm works. We demonstrate that our algorithm achieves state-of-the-art performance on various OoD detection benchmarks. Besides, FSSD also enjoys robustness to slight corruption in test data and can be further enhanced by ensembling. These make FSSD a promising algorithm to be employed in the real world. We release our code at https://github.com/megvii-research/FSSD_OoD_Detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Empirical risk minimization fits a statistical model on a
training set which is independently sampled from the data
distribution. As a result, the trained model is expected to
generalize to in-distribution data drawn from the same
distribution. However, in real applications, it is inevitable for
a model to make predictions on Out-of-Distribution (OoD)
data instead of in-distribution data on which the model is
trained. This can lead to fatal errors such as over-confident
or ridiculous predictions
        <xref ref-type="bibr" rid="ref17 ref25 ref37 ref7 ref9">(Hein, Andriushchenko, and
Bitterwolf 2018; Rabanser, Günnemann, and Lipton 2019)</xref>
        .
Therefore, it is crucial to understand the uncertainty of models
and automatically detect OoD data. In applications like
autonomous driving and medical services, if the model knows
what it does not know, human intervention can be sought
and security can be significantly improved.
      </p>
      <p>
        Consider one particular example of OoD detection: some
high-quality human face images are given as in-distribution
data (training set for OoD detector), and we are interested
in filtering out non-faces and low quality faces from a large
pool of data in the wild (test set) in order to ensure reliable
prediction. One natural solution is to remove test samples
far from the training data in some designated distances
        <xref ref-type="bibr" rid="ref23 ref45">(Lee
et al. 2018; van Amersfoort et al. 2020)</xref>
        . However,
calculating the distance to the whole training set requires a formidable
amount of computation unless the features and architecture
are specially designed, e.g., by training an RBF network
        <xref ref-type="bibr" rid="ref45">(van
Amersfoort et al. 2020)</xref>
        . In this paper, we present a
simple yet effective distance-based solution, which neither
computes the distance to the training data nor requires extra
model training beyond a standard classifier.
      </p>
      <p>Our approach is based on a novel observation about OoD
samples:</p>
      <p>In a trained neural network, OoD samples with
bounded norms well concentrate in the feature space
of the neural network.</p>
      <p>
        In Figure 1, we show an example where OoD features from
ImageNet
        <xref ref-type="bibr" rid="ref40">(Russakovsky et al. 2015)</xref>
        concentrate in a
neural network trained on the facial dataset MS-1M
        <xref ref-type="bibr" rid="ref4">(Guo et al.
2016)</xref>
        . Figures 2 and 3 provide more examples of this
phenomenon. In fact, we find this phenomenon to be universal
across most training configurations and datasets.
      </p>
      <p>To be more precise, for a given feature extractor $F_\theta$
trained on in-distribution data, the observation states that
there exists a point $F^*$ in the output space of $F_\theta$ such that
$\|F_\theta(x) - F^*\|$ is small for $x \in \mathcal{X}_{\mathrm{OoD}}$, where $\mathcal{X}_{\mathrm{OoD}}$ is the set
of OoD samples. We call $F^*$ the Feature Space Singularity
(FSS). Moreover, we discover that the FSS Distance (FSSD)
$$\mathrm{FSSD}(x) := \|F_\theta(x) - F^*\| \quad (1)$$
can reflect the degree of OoD, and thus can be readily used
as a metric for OoD detection.</p>
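      <p>For concreteness, the following is a minimal sketch of this score in PyTorch. It is an illustration, not the released implementation; feature_extractor (any module mapping inputs to flat feature vectors) and fss (a precomputed estimate of $F^*$) are placeholder names.</p>
      <preformat>
import torch

def fssd(x, feature_extractor, fss):
    """FSSD(x) = ||F(x) - F*|| from Equation (1)."""
    with torch.no_grad():
        feats = feature_extractor(x)      # shape: (batch, b)
    return (feats - fss).norm(dim=1)      # one distance per test sample

# Test samples whose FSSD falls below a chosen threshold are flagged as OoD.
      </preformat>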
      <p>Our analysis demonstrates that this phenomenon can be
explained by the training dynamics. The key observation
is that FSSD can be seen as the approximate total movement of
$F_{\theta_t}(x)$ during training, where $F^*$ is the initial concentration
point of the features. The difference in the moving speed
$\frac{dF_{\theta_t}(x)}{dt}$ stems from the different similarity to the training
data measured by the inner product of the gradients.
Moreover, this similarity measure varies according to the
architecture of the feature extractor.</p>
      <p>[Figure 1: Features of in-distribution (MS-1M face) and OoD (ImageNet) samples with the FSS $F^*$; face quality increases with the distance from $F^*$ (sample panels A–D).]</p>
      <p>
        We demonstrate the effectiveness of our proposed method
with multiple neural networks (LeNet (LeCun and Cortes
2010), ResNet
        <xref ref-type="bibr" rid="ref5">(He et al. 2016)</xref>
        , ResNeXt
        <xref ref-type="bibr" rid="ref49">(Xie et al. 2017)</xref>
        ,
LSTM
        <xref ref-type="bibr" rid="ref12 ref42">(Hochreiter and Schmidhuber 1997)</xref>
        ) trained on
various datasets for classification (FashionMNIST
        <xref ref-type="bibr" rid="ref10 ref46">(Xiao,
Rasul, and Vollgraf 2017)</xref>
        , CIFAR10
        <xref ref-type="bibr" rid="ref18">(Krizhevsky 2009)</xref>
        ,
ImageNet
        <xref ref-type="bibr" rid="ref40">(Russakovsky et al. 2015)</xref>
        , CelebA
        <xref ref-type="bibr" rid="ref27">(Liu et al. 2015)</xref>
        ,
MS-1M
        <xref ref-type="bibr" rid="ref4">(Guo et al. 2016)</xref>
        , bacteria genome dataset
        <xref ref-type="bibr" rid="ref38">(Ren
et al. 2019)</xref>
        ) with varying training set sizes. We show
that FSSD achieves state-of-the-art performance on almost
all the considered benchmarks. Moreover, the performance
margin between FSSD and other methods increases as the
size of the training set increases. In particular, on large-scale
benchmarks (CelebA and MS-1M), FSSD advances the
AUROC by about 5%. We also evaluate the robustness of our
algorithm when test images are corrupted. We find that our
algorithm can still reliably detect OoD samples under this
circumstance. Finally, we investigate the effects of
ensembling FSSDs from different layers of a single neural network
and multiple trained networks.
      </p>
      <p>We summarize our contributions as follows.
• We observe that in feature spaces of trained networks
OoD samples concentrate near a point (FSS), and the
distance from a sample feature to FSS (FSSD) measures the
degree of OoD (Section 1).
• We explain the concentration phenomenon by analyzing
the dynamics of in-distribution and OoD features during
training (Section 2).
• We introduce the FSSD algorithm (Section 3) which
achieves state-of-the-art performance on various OoD
detection benchmarks with considerable robustness (Section
4).</p>
    </sec>
    <sec id="sec-2">
      <title>Analyzing and Understanding the Concentration Phenomenon</title>
      <p>In this section, we analyze the concentration phenomenon.
The key observation is that during training, the features of
the training data are supervised to move away from the
initial point, and the moving speeds of features of other data
depend on their similarity to the training data. Specifically,
this similarity is measured by the inner product of the
gradients. Therefore, data that are dissimilar to the training data
will move little and concentrate in the feature space. This is
how FSSD identifies OoD data.</p>
      <p>To see this, we derive the training dynamics of the feature
vectors. Let $F_\theta: \mathbb{R}^a \to \mathbb{R}^b$ be the map from inputs to features
and $G_\phi: \mathbb{R}^b \to \mathbb{R}^c$ be the map from features to outputs. The two mappings are
parameterized by $\theta$ and $\phi$ respectively. The corresponding
loss function can be denoted as $L(F_\theta(x_1), \ldots, F_\theta(x_M))$.
A popular choice is $L(F_\theta(x_1), \ldots, F_\theta(x_M)) = \sum_{m=1}^{M} \ell(G_\phi(F_\theta(x_m)), y_m)/M$, where $\ell$ is the cross
entropy loss or the mean squared error. Then, the gradient
descent dynamics of $\theta$ is
$$\frac{d\theta_t}{dt} = -\sum_{m=1}^{M} \left(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\right)^{T} \partial_m L, \quad (2)$$
where $\partial_m L = \frac{\partial L}{\partial F_\theta(x_m)} \in \mathbb{R}^b$ is the backward propagation
gradient and subscript $t$ is the training time. The dynamics
of the feature extractor $F_\theta$ as a function is therefore
$$\frac{dF_{\theta_t}(x)}{dt} = \frac{\partial F_{\theta_t}(x)}{\partial \theta_t}\,\frac{d\theta_t}{dt} = -\sum_{m=1}^{M} \frac{\partial F_{\theta_t}(x)}{\partial \theta_t} \left(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\right)^{T} \partial_m L. \quad (3)$$</p>
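      <p>To make Equation (3) concrete, the sketch below estimates the gradient inner product between two inputs with PyTorch autograd. It is a toy illustration under simplifying assumptions (a scalar summary of the feature vector is differentiated, and f is any small module), not the paper's implementation.</p>
      <preformat>
import torch

def feature_grad(f, x):
    """Flattened gradient of a scalar feature summary of f(x) w.r.t. the parameters of f."""
    out = f(x).sum()                      # scalar proxy for the feature, for illustration
    grads = torch.autograd.grad(out, list(f.parameters()))
    return torch.cat([g.flatten() for g in grads])

def gradient_kernel(f, x, x_m):
    """Inner product of parameter gradients at x and x_m, as in Equation (3)."""
    return feature_grad(f, x) @ feature_grad(f, x_m)

# OoD inputs yield small inner products against every training sample x_m,
# hence a small moving speed dF(x)/dt and a small accumulated FSSD.
      </preformat>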
      <p>[Figure 2: In-distribution (MNIST) vs. OoD (FMNIST) at training steps 0, 150, 1000, 2000, and 12000. (a) The dynamics of features $F_{\theta_t}(x)$ around the FSS $F^*$. (b) Histograms of the moving speeds $\|dF_\theta(x)/dt\|$.]</p>
      <p>From Equation (3), we can see that the relative moving
speed of the feature $F_{\theta_t}(x)$ depends on the inner product of
the parameter gradients of $x$ and of the training data
$x_m$. Note here $\partial_m L$ is the same for all $x$. Since FSSD as
defined in Equation (1) can be seen as the integration of $\frac{dF_{\theta_t}(x)}{dt}$
when the initial value $F_{\theta_0}(x)$ is $F^*$ for all $x$, $\mathrm{FSSD}(x)$ will
also be small when the derivative, i.e., the moving speed, is
small.</p>
      <p>In Figure 2, we show both the features and their
moving speeds of in-distribution and OoD data at different steps
during training. We can see that although in-distribution and
OoD data are indistinguishable at step 0, they are quickly
separated since the moving speeds of in-distribution data
are larger than those of OoD data (Figure 2(b)) and thus
the accumulated movements of in-distribution data are also
larger than those of OoD data (Figure 2(a)). In Figure 3,
we show examples of the initial concentration of features
in LeNet and ResNet-34 for MNIST vs. FashionMNIST and
CIFAR10 vs. SVHN dataset pairs respectively. Empirically,
we find the concentration of both in-distribution and OoD
features at the initial stage to be the common case for most
popular architectures using random initialization. We show
more examples on our GitHub page.</p>
      <p>
        As we have mentioned, Equation (3) demonstrates that the
difference in the moving speed of $F_{\theta_t}(x)$ stems from the
difference in $\Theta_t(x, x_m) := \frac{\partial F_{\theta_t}(x)}{\partial \theta_t} \big(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\big)^{T}$. We want to
further point out that $\Theta_t(x, x_m)$ is effectively acting as a
kernel that measures the similarity between $x$ and $x_m$. In fact,
when the network width is infinite, $\Theta_t(x, x_m)$ converges
to a time-independent term $\Theta(x, x_m)$, which is called the neural
tangent kernel (NTK)
        <xref ref-type="bibr" rid="ref14 ref15 ref17 ref17 ref25 ref25 ref26 ref28 ref32">(Jacot, Gabriel, and Hongler 2018; Li
and Liang 2018; Cao and Gu 2020)</xref>
        . In this way, FSSD can
be seen as a kernel regression result:
      </p>
      <sec id="sec-2-1">
        <title>FSSD (x)</title>
        <p>F</p>
        <p>F 0 (x)
Equation (1)</p>
        <p>F t (x)</p>
        <p>F 0 (x)
M
X</p>
        <p>Z T
m=1 0
M
X
m=1
(x; xm) m</p>
        <p>;
=
dt
(4)
where m = R0T @mL dt.</p>
      <p>This indicates that the similarity described by the inner
product $\Theta_t(x, x_m) := \frac{\partial F_{\theta_t}(x)}{\partial \theta_t} \big(\frac{\partial F_{\theta_t}(x_m)}{\partial \theta_t}\big)^{T}$ might enjoy
similar properties to commonly used kernels such as the RBF
kernel, which diminishes as the distance between $x$ and $x_m$
increases. Moreover, since the neural tangent kernel depends
on the neural architecture, this kernel interpretation also
suggests that feature extractors of different architectures,
including different layers, can have different properties and
measure different aspects of the similarity between $x$ and
$x_m$. We can see this more clearly later in the investigation of
FSSD in different layers.</p>
    </sec>
    <sec id="sec-3">
      <title>Our Algorithm</title>
      <p>Based on this phenomenon, we can now construct our OoD
detection algorithm. Since uniform noise inputs can be
considered to possess the highest degree of OoD, we use the
center of their features as the FSS $F^*$. The FSSD can then
be calculated using Equation (1). Note that this calculation of
the FSS $F^*$ is independent of the choice of in-distribution and
OoD datasets. When such a natural choice of uniform noise
is unavailable, we can choose the FSS $F^*$ to be the center of
features of OoD validation data instead.</p>
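      <p>A minimal sketch of this estimation step, assuming the hypothetical feature_extractor module from above and inputs normalized to $[0, 1]$:</p>
      <preformat>
import torch

@torch.no_grad()
def estimate_fss(feature_extractor, input_shape, S=1000):
    """Estimate FSS F* as the mean feature of S uniform-noise inputs."""
    noise = torch.rand(S, *input_shape)   # x_noise ~ U[0, 1]
    return feature_extractor(noise).mean(dim=0)
      </preformat>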
      <p>
        Since a single forward-pass computation through the
network can give us features from each layer, it is also
convenient to calculate FSSDs from different layers and
ensemble them as $\mathrm{FSSD\text{-}Ensem}(x) = \sum_{k=1}^{K} \lambda_k \mathrm{FSSD}_{(k)}(x)$. The
ensemble weights $\lambda_k$ can be determined using logistic
regression on some validation data as in
        <xref ref-type="bibr" rid="ref23">(Lee et al. 2018)</xref>
        (see
Evaluation section in Experiments). In later experiments, if
not specified, we use the ensembled FSSD from all layers.
We note that it is also possible to ensemble FSSDs from
different architectures or multiple training snapshots
        <xref ref-type="bibr" rid="ref13 ref47">(Xie, Xu,
and Zhang 2013; Huang et al. 2017)</xref>
        . This may further
enhance the performance of OoD detection. We investigate the
effect of ensembling in the next section.
      </p>
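      <p>A sketch of fitting the ensemble weights with scikit-learn, assuming per-layer FSSD scores have already been computed on validation in-distribution and OoD samples (the array names are hypothetical):</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ensemble_weights(fssd_in, fssd_ood):
    """Fit weights lambda_k over per-layer FSSDs.

    fssd_in, fssd_ood: arrays of shape (n_samples, K) holding FSSD(k)
    scores of validation in-distribution / OoD samples.
    """
    X = np.vstack([fssd_in, fssd_ood])
    y = np.concatenate([np.ones(len(fssd_in)), np.zeros(len(fssd_ood))])
    lr = LogisticRegression().fit(X, y)
    return lr.coef_.ravel()               # one weight per layer
      </preformat>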
      <p>
        Besides, we also adopt input pre-processing as in
        <xref ref-type="bibr" rid="ref17 ref23 ref25 ref26 ref28">(Liang,
Li, and Srikant 2018; Lee et al. 2018)</xref>
        . The idea is to
add small perturbations to the test samples in order to
increase the in-distribution score. It is shown in
        <xref ref-type="bibr" rid="ref15 ref17 ref25 ref26 ref28 ref32">(Liang, Li,
and Srikant 2018; Kamoi and Kobayashi 2020)</xref>
        that
in-distribution data are more sensitive to such perturbation,
which can therefore enlarge the score gap between in-distribution
and OoD samples. In particular, we perturb as $\tilde{x} = x + \epsilon\,\mathrm{sign}(\nabla_x \mathrm{FSSD}(x))$ and take $\mathrm{FSSD}(\tilde{x})$ as the final score.
      </p>
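      <p>A sketch of this pre-processing step via autograd (feature_extractor and fss are the hypothetical names used above; eps corresponds to the searched perturbation magnitude $\epsilon$, and flat (batch, d) features are assumed):</p>
      <preformat>
import torch

def preprocess(x, feature_extractor, fss, eps=0.01):
    """Perturb x along sign(grad_x FSSD(x)) to raise the in-distribution score."""
    x = x.clone().requires_grad_(True)
    (feature_extractor(x) - fss).norm(dim=1).sum().backward()
    return (x + eps * x.grad.sign()).detach()
      </preformat>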
      <sec id="sec-3-1">
        <title>We present the pseudo-code</title>
        <p>FSSD-Ensem (x) in Algorithm 1.
of
computing
4
2
0
2
4
6
1. Estimate FSS F(k) = PS s
s=1 F(k) xnoise =S,
where xnsoise U [0; 1], s = 1;
2. Add perturbation to test sample:
x~ = x + sign(rx F(k) (x)
3. Calculate FSSD(k) (x) =
; S</p>
        <p>F(k) )
F(k) (x~)</p>
        <p>F(k)
end
Return FSSD-Ensem (x) = PK
k=1 k FSSD(k) (x)</p>
      </sec>
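      <p>Putting the steps together, a compact sketch of Algorithm 1 under the same assumptions (extractors is a list of K per-layer feature maps producing flat feature vectors, and weights holds the fitted $\lambda_k$):</p>
      <preformat>
import torch

def fssd_ensem(x, extractors, weights, input_shape, S=1000, eps=0.01):
    """FSSD-Ensem(x) over K layer-wise feature extractors (Algorithm 1)."""
    scores = []
    for F_k in extractors:
        # 1. Estimate FSS as the mean feature of S uniform-noise inputs.
        with torch.no_grad():
            fss = F_k(torch.rand(S, *input_shape)).mean(dim=0)
        # 2. Input pre-processing: push x along the FSSD gradient.
        x_p = x.clone().requires_grad_(True)
        (F_k(x_p) - fss).norm(dim=1).sum().backward()
        x_tilde = (x_p + eps * x_p.grad.sign()).detach()
        # 3. FSSD of the perturbed sample.
        with torch.no_grad():
            scores.append((F_k(x_tilde) - fss).norm(dim=1))
    # Weighted ensemble over the K layers.
    return sum(w * s for w, s in zip(weights, scores))
      </preformat>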
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In this section, we investigate the performance of our FSSD
algorithm on various OoD detection benchmarks.</p>
      <sec id="sec-4-1">
        <title>Setup</title>
        <p>Benchmarks To conduct a thorough test of our method,
we consider a wide variety of OoD detection benchmarks.
In particular, we consider different scales of datasets and
different types of data. We consider different scales of datasets
because large scale datasets tend to have more classes which
can introduce more ambiguous data. Ambiguous data
have high classification uncertainty but are not
out-of-distribution. We list the benchmarks in Table 1.</p>
        <p>
          We first consider two common benchmarks from
previous OoD detection literature
          <xref ref-type="bibr" rid="ref38 ref45">(van Amersfoort et al. 2020;
Ren et al. 2019)</xref>
          : (A) FMNIST
          <xref ref-type="bibr" rid="ref10 ref46">(Xiao, Rasul, and Vollgraf
2017)</xref>
          vs. MNIST
          <xref ref-type="bibr" rid="ref22">(LeCun and Cortes 2010)</xref>
          and (B)
CIFAR10
          <xref ref-type="bibr" rid="ref18">(Krizhevsky 2009)</xref>
          vs. SVHN
          <xref ref-type="bibr" rid="ref35">(Netzer et al. 2011)</xref>
          .
They are known to be challenging for many methods
          <xref ref-type="bibr" rid="ref33 ref34 ref38">(Ren
et al. 2019; Nalisnick et al. 2019a)</xref>
          . (C) We also construct
ImageNet (dogs), a subset of ImageNet
          <xref ref-type="bibr" rid="ref40">(Russakovsky et al.
2015)</xref>
          , as in-distribution data. The OoD data are non-dog
images from ImageNet.
        </p>
        <p>
          For large-scale problems, we consider three benchmarks.
(D) We train models on ImageNet and detect corrupted
images from the ImageNet-C dataset
          <xref ref-type="bibr" rid="ref11 ref9">(Hendrycks and
Dietterich 2019)</xref>
          . We test each method on 80 sets of corruptions
(16 types and 5 levels). (E) We train models on face
images without the “blurry” attribute from CelebA
          <xref ref-type="bibr" rid="ref27">(Liu et al.
2015)</xref>
          and detect face images with the “blurry” attribute. (F)
We train models on web images of celebrities from
MSCeleb-1M (MS-1M)
          <xref ref-type="bibr" rid="ref4">(Guo et al. 2016)</xref>
          and detect video
captures from IJB-C
          <xref ref-type="bibr" rid="ref29">(Maze et al. 2018)</xref>
          which in general have
lower quality due to pose, illumination, and resolution
issues. We also consider (G) the bacteria genome benchmark
introduced by
          <xref ref-type="bibr" rid="ref38">(Ren et al. 2019)</xref>
          , which consists of sequence
data.
        </p>
        <p>
          To train models on in-distribution datasets, we follow
previous works
          <xref ref-type="bibr" rid="ref23">(Lee et al. 2018)</xref>
          to train LeNet on FMNIST
and ResNet with 34 layers on CIFAR10, ImageNet, and
ImageNet (dogs). For two face recognition datasets (CelebA and
MS-1M), we train ResNeXt with 50 layers. For the genome
sequence dataset, we use a character embedding layer and
two bidirectional LSTM layers
          <xref ref-type="bibr" rid="ref12 ref42">(Schuster and Paliwal 1997)</xref>
          .
        </p>
        <p>
          Compared methods We compare our method with the
following six common methods for OoD detection. Base:
the baseline method using the maximum softmax
probability $p(\hat{y} \mid x)$
          <xref ref-type="bibr" rid="ref10">(Hendrycks and Gimpel 2017)</xref>
          . ODIN:
temperature scaling on logits and input pre-processing
          <xref ref-type="bibr" rid="ref17 ref25 ref26 ref28">(Liang, Li,
and Srikant 2018)</xref>
          . Maha: Mahalanobis distance of the
sample feature to the closest class-conditional Gaussian
distribution which is estimated from the training data
          <xref ref-type="bibr" rid="ref23">(Lee et al.
2018)</xref>
          . In our experiments, we follow
          <xref ref-type="bibr" rid="ref23">(Lee et al. 2018)</xref>
          to
use both feature (layer) ensemble and input pre-processing.
DE: Deep Ensemble which averages the softmax
probability predictions from multiple independently trained
classifiers
          <xref ref-type="bibr" rid="ref10 ref19">(Lakshminarayanan, Pritzel, and Blundell 2017)</xref>
          . In our
experiments, we take the average of 5 classifiers by default.
MCD: Monte-Carlo Dropout that uses dropout during both
training and inference
          <xref ref-type="bibr" rid="ref2">(Gal and Ghahramani 2016)</xref>
          . We
follow
          <xref ref-type="bibr" rid="ref36">(Ovadia et al. 2019)</xref>
          to dropout convolutional layers.
For OoD detection, we calculate both the mean and the
variance of 32 independent predictions and choose the
better one to report. OE: Outlier exposure that explicitly
enforces uniform probability prediction on an auxiliary dataset
of outliers
          <xref ref-type="bibr" rid="ref11 ref9">(Hendrycks, Mazeika, and Dietterich 2019)</xref>
          . For
the choice of auxiliary datasets, we use KMNIST
          <xref ref-type="bibr" rid="ref1">(Clanuwat
et al. 2018)</xref>
          for benchmark A, CelebA
          <xref ref-type="bibr" rid="ref27">(Liu et al. 2015)</xref>
          for
benchmark C, and ImageNet-1K
          <xref ref-type="bibr" rid="ref40">(Russakovsky et al. 2015)</xref>
          for benchmarks B, E, and F. We do not evaluate OE on the
sequence benchmark, since we cannot find a reasonable
auxiliary dataset. We remark here that Base, ODIN, and FSSD can
be deployed directly with a trained neural network, MCD
needs a trained neural network with dropout layers, while
DE needs multiple trained classifiers. Besides, Maha needs
to use the training data during OoD detection on test data
and OE trains a neural network either from scratch or by
fine-tuning to utilize the auxiliary dataset.
        </p>
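        <p>For reference, the Base score is a one-line computation; a sketch with a hypothetical model returning class logits:</p>
        <preformat>
import torch

@torch.no_grad()
def base_score(x, model):
    """Baseline OoD score: maximum softmax probability."""
    return model(x).softmax(dim=1).max(dim=1).values
        </preformat>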
        <p>
          Evaluation We follow
          <xref ref-type="bibr" rid="ref11 ref38 ref9">(Ren et al. 2019; Hendrycks,
Mazeika, and Dietterich 2019)</xref>
          to use the following
metrics to assess the performance of OoD detection. AUROC:
Area Under the Receiver Operating Characteristic curve.
AUPRC: Area Under the Precision-Recall Curve. FPR80:
False Positive Rate when the true positive rate is 80%.
        </p>
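        <p>These metrics can be computed with scikit-learn; a sketch, assuming higher scores mean more in-distribution and treating in-distribution as the positive class:</p>
        <preformat>
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def ood_metrics(scores_in, scores_ood):
    """Return (AUROC, AUPRC, FPR80) for in-distribution vs. OoD scores."""
    y = np.concatenate([np.ones(len(scores_in)), np.zeros(len(scores_ood))])
    s = np.concatenate([scores_in, scores_ood])
    fpr, tpr, _ = roc_curve(y, s)
    fpr80 = fpr[np.searchsorted(tpr, 0.8)]   # FPR at 80% true positive rate
    return roc_auc_score(y, s), average_precision_score(y, s), fpr80
        </preformat>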
        <p>
          For hyper-parameter tuning, we follow
          <xref ref-type="bibr" rid="ref17 ref23 ref25 ref26 ref28 ref38">(Lee et al. 2018;
Ren et al. 2019; Liang, Li, and Srikant 2018)</xref>
          to use a
separate validation set, which consists of 1,000 images from
each in-distribution and OoD data pair. Ensemble weights
$\lambda_k$ for FSSD from different layers are extracted from a
logistic regression model, which is trained using nested cross
validation within the validation set as in
          <xref ref-type="bibr" rid="ref23 ref28">(Lee et al. 2018;
Ma et al. 2018)</xref>
          . The same procedure is performed on Maha
for fair comparison. The perturbation magnitude of input
pre-processing for ODIN, Maha, and FSSD is searched from
0 to 0.2 with step size 0.01. The temperature T of ODIN is
chosen from 1, 10, 100, and 1000, and the dropout rate of
MCD is chosen from 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, and
0.5.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Main results</title>
        <p>The main results are presented in Table 2 and Figure 4. In
Table 2, we can see that larger datasets entail greater
difficulty in OoD detection. Notably, the advantage of FSSD
over other methods increases as the dataset size increases.
Other methods like Maha and OE perform well on some
small benchmarks, but have large variance across different
datasets. In comparison, FSSD maintains great performance
on these benchmarks. On the genome sequence dataset, we
also observe that FSSD outperforms other methods. These
results show that FSSD is a promising and effective method for
a wide range of applications.</p>
        <p>
          Inspired by
          <xref ref-type="bibr" rid="ref36">(Ovadia et al. 2019)</xref>
          , we also evaluate the
methods on the ability of detecting distributional dataset
shift like Gaussian noise and JPEG artifacts. Figure 4 shows
the means and quartiles of AUROC of the compared
methods over 16 types of corruptions on 5 corruption levels.
We can observe that for each method, the performance of
OoD detection increases as the level of corruption increases,
while FSSD enjoys the highest AUROC and much less
variation over different types of corruptions. The CelebA
benchmark also evaluates the methods on detecting the dataset
shift of the attribute “blurry”. However, all methods
including FSSD do not perform very well. There are two
possible reasons: (1) the “blurry” attribute of CelebA may
not be annotated precisely enough; (2) blurs in the wild
may be more difficult to detect than the simulated blurs in
ImageNet-C. Overall, we can see that FSSD can more
reliably detect different kinds of distributional shifts.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Robustness</title>
        <p>In practice, it is possible that the test data are slightly
corrupted or shifted due to the change of data source, e.g.,
from lab to real world. We evaluate the ability to
distinguish in-distribution data from OoD data when test data
(both in-distribution and OoD) are slightly corrupted. Note
that we still use non-corrupted data during network training
and hyper-parameter tuning. We apply Gaussian noise and
impulse noise, two typical corruptions, with varying levels.
Test results on CIFAR10 vs. SVHN and ImageNet dogs vs.
non-dogs are shown in Figure 5. We can see that FSSD is
robust to corruptions present in test images, while other
methods may degrade.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Effects of ensemble</title>
        <p>
          During our experiments, we find that the ensemble plays
an important role in enhancing the performance of FSSD.
Previous studies show that an important issue for
ensemble-based algorithms is enforcing diversity
          <xref ref-type="bibr" rid="ref10 ref19">(Lakshminarayanan,
Pritzel, and Blundell 2017)</xref>
          . In our case, we find that FSSD
has high diversity across different layers and benefits from
such diversity to reach higher performance. In Figure 6,
we find that FSSDs in different layers work
differently. This can be explained by previous works on
understanding neural networks by visualizing the different
representations learned by low and deep layers of a neural
network
          <xref ref-type="bibr" rid="ref52 ref54">(Zeiler and Fergus 2014; Zhou et al. 2015)</xref>
          .
Generally, FSSDs from deep layers reflect more high-level features
and FSSDs from early layers reflect more low-level
statistics. ImageNet (dogs) and ImageNet (non-dogs) are from
the same dataset (ImageNet), and are therefore similar in
terms of low-level statistics; while the differences between
CIFAR10 and SVHN are in all different levels. From the
perspective of kernel interpretation, this means that the neural
tangent kernels of different layers diversify well and allow
the ensemble of FSSD to capture different aspects of the
discrepancy between the test data and training data. We show
more examples of FSSDs in different layers on our GitHub
page.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related works</title>
      <sec id="sec-5-1">
        <title>Out-of-distribution detection</title>
        <p>According to different understandings of OoD samples,
previous OoD detection methods can be summarized into four
categories.</p>
        <p>
          (1) Some methods regard OoD samples as those with
uniform probability prediction across classes
          <xref ref-type="bibr" rid="ref10 ref15 ref17 ref17 ref25 ref25 ref26 ref28 ref32 ref7">(Hein,
Andriushchenko, and Bitterwolf 2018; Hendrycks and Gimpel
2017; Liang, Li, and Srikant 2018; Meinke and Hein 2020)</xref>
          and treat the test samples with high entropy or low maximum
prediction probability as OoD data. Since these methods are
based on prediction, they run the risk of misclassifying
ambiguous data as OoD samples, e.g., when there are thousands
of classes in a large-scale dataset.
        </p>
        <p>
          (2) OoD samples can also be characterized as samples
with high epistemic uncertainty which reflects the lack of
information on these samples
          <xref ref-type="bibr" rid="ref10 ref19 ref2">(Lakshminarayanan, Pritzel, and
Blundell 2017; Gal and Ghahramani 2016)</xref>
          . Specifically, we
can propagate the uncertainty of models to the uncertainty
of predictions, which characterizes the level of OoD. MCD
and DE are two popular choices of this type. However, it is
reported that current epistemic uncertainty estimations may
noticeably degrade under dataset distributional shift
          <xref ref-type="bibr" rid="ref36">(Ovadia
et al. 2019)</xref>
          . Our experiments on detecting ImageNet-C from
ImageNet (Figure 4) confirm this.
        </p>
        <p>
          (3) When the density of data can be approximated, e.g.,
using generative models
          <xref ref-type="bibr" rid="ref17 ref25">(Kingma and Dhariwal 2018;
Salimans et al. 2017)</xref>
          , OoD samples can be classified as those
with low density. Recent works provide many inspiring
insights on how to improve this idea
          <xref ref-type="bibr" rid="ref33 ref34 ref38">(Ren et al. 2019;
Nalisnick et al. 2019b; Serrà et al. 2020)</xref>
          . However, these methods
typically incur extra training difficulty from large
generative models.
        </p>
        <p>
          (4) There are also works designing non-Euclidean
metrics to compare test samples to training samples, regarding
those with larger distances to the training samples as OoD
samples
          <xref ref-type="bibr" rid="ref15 ref21 ref23 ref32 ref45">(Lee et al. 2018; van Amersfoort et al. 2020; Kamoi and
Kobayashi 2020; Lakshminarayanan et al. 2020)</xref>
          . Our
approach resembles this type most. Instead of comparing test
samples to training samples, we compare the features of the
test samples to the center of OoD features.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this work, we propose a new OoD detection algorithm
based on a novel observation that OoD samples concentrate
in the feature space of a trained neural network. We
provide analysis and understanding of the concentration
phenomenon by analyzing the training dynamics both
theoretically and empirically, and we further interpret the algorithm
through the neural tangent kernel. We demonstrate that our
algorithm is state-of-the-art in detection performance and is
robust to measurement noise. Our further investigation of
the effect of ensembling reveals diversity in layer ensembles
and shows promising performance of network ensembles. In
summary, we hope that our work can provide new insights
for understanding properties of neural networks and add an
alternative simple and effective OoD detection method to the
safe AI deployment toolkits.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>Bin Dong is supported in part by Beijing Natural Science
Foundation (No. 180001), National Natural Science
Foundation of China (NSFC) grant No. 11831002, and Beijing
Academy of Artificial Intelligence (BAAI).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Clanuwat</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bober-Irizar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kitamoto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lamb</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yamamoto</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ha</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep Learning for Classical Japanese Literature</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Gal</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ghahramani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning</article-title>
          . volume
          <volume>48</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <volume>1050</volume>
          -
          <fpage>1059</fpage>
          . New York, New York, USA: PMLR. URL http://proceedings.mlr.press/v48/gal16.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; Zhang, L.;
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>MSCeleb-1M: A Dataset and Benchmark for Large-Scale Face Recognition</article-title>
          . In Leibe, B.;
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Sebe, N.; and Welling, M., eds.,
          <source>Computer Vision - ECCV</source>
          <year>2016</year>
          , volume
          <volume>9907</volume>
          ,
          <fpage>87</fpage>
          -
          <lpage>102</lpage>
          . Cham: Springer International Publishing.
          <source>ISBN 978-3-319-46486-2 978-3-319-46487-9</source>
          . doi:10.1007/978-3-319-46487-9_6. URL http://link.springer.com/10.1007/978-3-319-46487-9_6. Series Title: Lecture Notes in Computer Science.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . Las Vegas, NV, USA: IEEE.
          <source>ISBN 978-1-4673-8851-1</source>
          . doi:10.1109/CVPR.2016.90. URL http://ieeexplore.ieee.org/document/7780459/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Hein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Andriushchenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bitterwolf</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Why ReLU Networks Yield High-Confidence Predictions Far Away From the Training Data and How to Mitigate the Problem</article-title>
          .
          <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Hendrycks</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and Dietterich,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Benchmarking Neural Network Robustness to Common Corruptions and Perturbations</article-title>
          .
          <source>Proceedings of the International Conference on Learning Representations .</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Hendrycks</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks</article-title>
          .
          <source>Proceedings of International Conference on Learning Representations .</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Hendrycks</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Mazeika,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ; and Dietterich,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Deep Anomaly Detection with Outlier Exposure</article-title>
          . In International Conference on Learning Representations. URL https: //openreview.net/forum?id=
          <fpage>HyxCxhRcY7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . ISSN 0899-7667. doi:10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pleiss</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Liu,
          <string-name>
            <given-names>Z.</given-names>
            ;
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            ; and
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. Q.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Snapshot Ensembles: Train 1, get M for free</article-title>
          .
          <source>CoRR abs/1704</source>
          .00109. URL http://arxiv.org/ abs/1704.00109.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Jacot</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gabriel</surname>
          </string-name>
          , F.; and
          <string-name>
            <surname>Hongler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Neural Tangent Kernel: Convergence and Generalization in Neural Networks</article-title>
          . In Bengio, S.; Wallach,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Larochelle,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , N.; and Garnett, R., eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          ,
          <fpage>8571</fpage>
          -
          <lpage>8580</lpage>
          . Curran Associates, Inc. URL http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Kamoi</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Why is the Mahalanobis Distance Effective for Anomaly Detection?</article-title>
          arXiv:2003.00402 [cs, stat]. URL http://arxiv.org/abs/2003.00402.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and Dhariwal,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Glow: Generative Flow with Invertible 1x1 Convolutions</article-title>
          . In Bengio, S.; Wallach,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Larochelle,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , N.; and Garnett, R., eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          ,
          <fpage>10215</fpage>
          -
          <lpage>10224</lpage>
          . Curran Associates, Inc. URL http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Learning multiple layers of features from tiny images</article-title>
          .
          <source>Technical report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pritzel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Blundell</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles</article-title>
          . In Guyon, I.; Luxburg, U. V.;
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Wallach,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Fergus,
          <string-name>
            <surname>R.</surname>
          </string-name>
          ; Vishwanathan,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; and Garnett, R., eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          ,
          <fpage>6402</fpage>
          -
          <lpage>6413</lpage>
          . Curran Associates, Inc. URL http://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Liu,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; Padhy,
          <string-name>
            <given-names>S.</given-names>
            ; BedraxWeiss, T.; and
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>LeCun</surname>
          </string-name>
          , Y.; and
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>MNIST handwritten digit database</article-title>
          . URL http://yann.lecun.com/exdb/mnist/.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks</article-title>
          .
          <source>In Proceedings of the 32nd International Conference on Neural Information Processing Systems</source>
          , NIPS'
          <volume>18</volume>
          ,
          <fpage>7167</fpage>
          -
          <lpage>7177</lpage>
          . Red Hook,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA: Curran Associates Inc.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Scheidegger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Visualizing Neural Networks with the Grand Tour</article-title>
          . Distill. doi:10.23915/distill.00025. URL https://distill.pub/2020/grand-tour.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and Liang,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data</article-title>
          .
          <source>In Proceedings of the 32nd International Conference on Neural Information Processing Systems</source>
          , NIPS'
          <volume>18</volume>
          ,
          <fpage>8168</fpage>
          -
          <lpage>8177</lpage>
          . Red Hook,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA: Curran Associates Inc.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and Srikant,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          . URL https://openreview.net/forum?id=
          <fpage>H1VGkIxRZ</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Deep Learning Face Attributes in the Wild</article-title>
          .
          <source>In Proceedings of International Conference on Computer Vision</source>
          (ICCV).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Erfani</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wijewickrema</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schoenebeck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Houle</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bailey</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          . URL https://openreview.net/forum?id=B1gJ1L2aW.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Maze</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Duncan</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kalka</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Otto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Niggel</surname>
            ,
            <given-names>W. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cheney</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Grother</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>IARPA Janus Benchmark - C: Face Dataset and Protocol</article-title>
          .
          <source>In 2018 International Conference on Biometrics (ICB)</source>
          ,
          <fpage>158</fpage>
          -
          <lpage>165</lpage>
          . Gold Coast, QLD: IEEE. ISBN 978-1-5386-4285-6. doi:10.1109/ICB2018.2018.00033. URL https://ieeexplore.ieee.org/document/8411217/.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Meinke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Towards neural networks that provably know when they don't know</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          . URL https://openreview.net/forum?id=ByxGkySKwH.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Nalisnick</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Matsukawa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Teh</surname>
            ,
            <given-names>Y. W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gorur</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2019a</year>
          .
          <article-title>Do Deep Generative Models Know What They Don't Know?</article-title>
          <source>In International Conference on Learning Representations</source>
          . URL https://openreview.net/forum?id=H1xwNhCcYm.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Nalisnick</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Matsukawa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Teh</surname>
            ,
            <given-names>Y. W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2019b</year>
          .
          <article-title>Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Netzer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Coates</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bissacco</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Reading Digits in Natural Images with Unsupervised Feature Learning</article-title>
          .
          <source>In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011</source>
          . URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Ovadia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fertig</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nado</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nowozin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dillon</surname>
            ,
            <given-names>J. V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift</article-title>
          .
          <source>In NeurIPS</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Rabanser</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Günnemann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lipton</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift</article-title>
          . In
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Alché-Buc</surname>
            ,
            <given-names>F. d.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          ,
          <fpage>1396</fpage>
          -
          <lpage>1408</lpage>
          . Curran Associates, Inc. URL http://papers.nips.cc/paper/8420-failing-loudly-an-empirical-study-of-methods-for-detecting-dataset-shift.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fertig</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Poplin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Depristo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dillon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lakshminarayanan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Likelihood Ratios for Out-of-Distribution Detection</article-title>
          . In
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Alché-Buc</surname>
            ,
            <given-names>F. d.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Garnett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , eds.,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          ,
          <fpage>14707</fpage>
          -
          <lpage>14718</lpage>
          . Curran Associates, Inc. URL http://papers.nips.cc/paper/9611-likelihood-ratios-for-out-of-distribution-detection.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>115</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          . doi:10.1007/s11263-015-0816-y.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Salimans</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications</article-title>
          .
          <source>In ICLR</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Paliwal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Bidirectional Recurrent Neural Networks</article-title>
          .
          <source>Trans. Sig. Proc</source>
          .
          <volume>45</volume>
          (
          <issue>11</issue>
          ):
          <fpage>2673</fpage>
          -
          <lpage>2681</lpage>
          . ISSN 1053-587X. doi:10.1109/78.650093. URL https://doi.org/10.1109/78.650093.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>Serrà</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Álvarez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gómez</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Slizovskaia</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Núñez</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Luque</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Input Complexity and Out-of-distribution Detection with Likelihood-based Generative Models</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          . URL https://openreview.net/forum?id=SyxIWpVYvr.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>van Amersfoort</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Teh</surname>
            ,
            <given-names>Y. W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gal</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Uncertainty Estimation Using a Single Deep Deterministic Neural Network</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rasul</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Vollgraf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Horizontal and Vertical Ensemble with Deep Representation for Classification.</article-title>
          <source>CoRR abs/1306.2759</source>
          . URL http://arxiv.org/abs/1306.2759.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Aggregated Residual Transformations for Deep Neural Networks</article-title>
          .
          <source>In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>5987</fpage>
          -
          <lpage>5995</lpage>
          . Honolulu,
          <source>HI: IEEE. ISBN 978-1-5386-0457-1</source>
          . doi:10.1109/CVPR.2017.634. URL http://ieeexplore.ieee.org/document/8100117/.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <surname>Zeiler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Visualizing and understanding convolutional networks</article-title>
          .
          <source>In Computer Vision, ECCV 2014 - 13th European Conference, Proceedings, number PART 1 in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          ,
          <fpage>818</fpage>
          -
          <lpage>833</lpage>
          . Springer Verlag. ISBN 9783319105895. doi:10.1007/978-3-319-10590-1_53. 13th European Conference on Computer Vision, ECCV 2014; Conference date: 06-09-2014 through 12-09-2014.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lapedriza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Object Detectors Emerge in Deep Scene CNNs</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR).</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>