=Paper=
{{Paper
|id=Vol-2808/Paper_7
|storemode=property
|title=Feature Space Singularity for Out-of-Distribution Detection
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_7.pdf
|volume=Vol-2808
|authors=Haiwen Huang,Zhihan Li,Lulu Wang,Sishuo Chen,Xinyu Zhou,Bin Dong
|dblpUrl=https://dblp.org/rec/conf/aaai/HuangLWCZD21
}}
==Feature Space Singularity for Out-of-Distribution Detection==
Haiwen Huang (1), Zhihan Li (2), Lulu Wang (3), Sishuo Chen (2), Xinyu Zhou (3), Bin Dong (2, 4, 5)
(1) Department of Computer Science, University of Oxford; (2) Peking University; (3) MEGVII Technology; (4) Beijing International Center for Mathematical Research; (5) Institute for Artificial Intelligence and Center for Data Science
haiwen.huang2@cs.ox.ac.uk, zxy@megvii.com, dongbin@math.pku.edu.cn

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Out-of-Distribution (OoD) detection is important for building safe artificial intelligence systems. However, current OoD detection methods still cannot meet the performance requirements for practical deployment. In this paper, we propose a simple yet effective algorithm based on a novel observation: in a trained neural network, OoD samples with bounded norms concentrate in the feature space. We call the center of the OoD features the Feature Space Singularity (FSS), and denote the distance of a sample feature to the FSS as the FSSD. OoD samples can then be identified by thresholding the FSSD. Our analysis of the phenomenon reveals why the algorithm works. We demonstrate that it achieves state-of-the-art performance on various OoD detection benchmarks. Besides, FSSD is robust to slight corruption of the test data and can be further enhanced by ensembling. These properties make FSSD a promising algorithm for real-world deployment. We release our code at https://github.com/megvii-research/FSSD_OoD_Detection.

===Introduction===
Empirical risk minimization fits a statistical model on a training set that is independently sampled from the data distribution. As a result, the resulting model is expected to generalize to in-distribution data drawn from the same distribution. However, in real applications it is inevitable that a model makes predictions on Out-of-Distribution (OoD) data rather than the in-distribution data on which it was trained. This can lead to fatal errors such as over-confident or nonsensical predictions (Hein, Andriushchenko, and Bitterwolf 2018; Rabanser, Günnemann, and Lipton 2019). It is therefore crucial to understand the uncertainty of models and to automatically detect OoD data. In applications like autonomous driving and medical services, if the model knows what it does not know, human intervention can be sought and security can be significantly improved.

Consider one particular example of OoD detection: high-quality human face images are given as in-distribution data (the training set for the OoD detector), and we are interested in filtering out non-faces and low-quality faces from a large pool of data in the wild (the test set) in order to ensure reliable prediction. One natural solution is to remove test samples that are far from the training data under some designated distance (Lee et al. 2018; van Amersfoort et al. 2020). However, computing the distance to the whole training set requires a formidable amount of computation unless the features and architecture are specially designed, e.g., by training an RBF network (van Amersfoort et al. 2020). In this paper, we present a simple yet effective distance-based solution that neither computes distances to the training data nor requires any training beyond that of a standard classifier.

Our approach is based on a novel observation about OoD samples: in a trained neural network, OoD samples with bounded norms concentrate in the feature space of the neural network.

In Figure 1, we show an example where OoD features from ImageNet (Russakovsky et al. 2015) concentrate in a neural network trained on the facial dataset MS-1M (Guo et al. 2016). Figures 2 and 3 provide more examples of this phenomenon. In fact, we find the phenomenon to be universal across most training configurations and most datasets.

To be more precise, for a given feature extractor F_θ trained on in-distribution data, the observation states that there exists a point F* in the output space of F_θ such that ||F_θ(x) − F*|| is small for x ∈ X_OoD, where X_OoD is the set of OoD samples. We call F* the Feature Space Singularity (FSS). Moreover, we find that the FSS Distance (FSSD),

\mathrm{FSSD}(x) := \| F_\theta(x) - F^* \| \qquad (1)

reflects the degree of OoD-ness and can therefore be readily used as a score for OoD detection.

Our analysis shows that this phenomenon can be explained by the training dynamics. The key observation is that FSSD can be seen as the approximate movement of F_{θ_t}(x) during training, where F* is the initial concentration point of the features. The difference in the moving speed dF_{θ_t}(x)/dt stems from the different similarity to the training data, measured by the inner product of gradients. Moreover, this similarity measure varies with the architecture of the feature extractor.
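As a concrete illustration of Equation (1), the following minimal sketch estimates F* as the mean feature of uniform-noise inputs (anticipating the construction described later in Our Algorithm) and computes FSSD for a batch of test samples. It is only a sketch: `feature_extractor` stands for any trained network or sub-network returning feature tensors, and all names are illustrative rather than the released implementation.

```python
import torch

@torch.no_grad()
def estimate_fss(feature_extractor, input_shape, num_noise=512, device="cpu"):
    """Estimate the FSS F* as the mean feature of uniform-noise inputs."""
    noise = torch.rand(num_noise, *input_shape, device=device)  # x_s^noise ~ U[0, 1]
    return feature_extractor(noise).flatten(start_dim=1).mean(dim=0)

@torch.no_grad()
def fssd(feature_extractor, x, fss):
    """Equation (1): FSSD(x) = ||F_theta(x) - F*||. Small values indicate OoD samples."""
    feats = feature_extractor(x).flatten(start_dim=1)   # (batch, feature_dim)
    return (feats - fss).norm(dim=1)
```

Thresholding this score from below then flags OoD samples, since they lie close to F*.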
Figure 1: Left: histogram of the FSS Distance (FSSD) for MS-1M (in-distribution) and ImageNet (OoD), with exemplar images shown at different FSSDs. FSSD reflects the degree of OoD-ness: as FSSD increases, images change from non-faces and pseudo-faces to low-quality faces and high-quality faces. Right: principal components of features from the penultimate layer, showing the spatial relationship among the FSS, OoD data, and in-distribution data.

We demonstrate the effectiveness of the proposed method with multiple neural networks (LeNet (LeCun and Cortes 2010), ResNet (He et al. 2016), ResNeXt (Xie et al. 2017), LSTM (Hochreiter and Schmidhuber 1997)) trained for classification on various datasets (FashionMNIST (Xiao, Rasul, and Vollgraf 2017), CIFAR10 (Krizhevsky 2009), ImageNet (Russakovsky et al. 2015), CelebA (Liu et al. 2015), MS-1M (Guo et al. 2016), and the bacteria genome dataset (Ren et al. 2019)) with varying training set sizes. FSSD achieves state-of-the-art performance on almost all the considered benchmarks, and its margin over other methods grows as the training set size increases. In particular, on the large-scale benchmarks (CelebA and MS-1M), FSSD improves the AUROC by about 5%. We also evaluate the robustness of the algorithm when test images are corrupted and find that it still reliably detects OoD samples. Finally, we investigate the effect of ensembling FSSDs from different layers of a single neural network and from multiple trained networks.

We summarize our contributions as follows.
* We observe that in the feature space of a trained network OoD samples concentrate near a point (the FSS), and that the distance from a sample feature to the FSS (the FSSD) measures the degree of OoD-ness (Section 1).
* We analyze the concentration phenomenon by studying the dynamics of in-distribution and OoD features during training (Section 2).
* We introduce the FSSD algorithm (Section 3), which achieves state-of-the-art performance on various OoD detection benchmarks with considerable robustness (Section 4).
===Analyzing and Understanding the Concentration Phenomenon===
In this section, we analyze the concentration phenomenon. The key observation is that during training, the features of the training data are supervised to move away from their initial point, while the moving speeds of the features of other data depend on their similarity to the training data. Specifically, this similarity is measured by the inner product of gradients. Data that are dissimilar to the training data therefore move little and remain concentrated in the feature space. This is how FSSD identifies OoD data.

To see this, we derive the training dynamics of the feature vectors. We denote by F_θ : R^a → R^b the feature extractor, which maps inputs to features, and by G_φ : R^b → R^c the map from features to outputs; the two mappings are parameterized by θ and φ, respectively. The loss function can be written as L_φ(F_θ(x_1), ..., F_θ(x_M)). A popular choice is

L_\phi\big(F_\theta(x_1),\cdots,F_\theta(x_M)\big) = \frac{1}{M}\sum_{m=1}^{M} \ell\big(G_\phi(F_\theta(x_m)),\, y_m\big),

where ℓ is the cross-entropy loss or the mean squared error. The gradient descent dynamics of θ is then

\frac{d\theta_t}{dt} = -\frac{\partial L_\phi}{\partial \theta_t}\big(F_{\theta_t}(x_1),\cdots,F_{\theta_t}(x_M)\big) = -\sum_{m=1}^{M}\frac{\partial F_{\theta_t}(x_m)}{\partial\theta_t}^{T}\,\partial_m L_\phi, \qquad (2)

where ∂_m L_φ = ∂L_φ / ∂F_{θ_t}(x_m) ∈ R^b is the back-propagated gradient and the subscript t denotes training time. The dynamics of the feature extractor F_θ as a function is therefore

\frac{dF_{\theta_t}(x)}{dt} = \frac{\partial F_{\theta_t}(x)}{\partial\theta_t}\,\frac{d\theta_t}{dt} = -\sum_{m=1}^{M}\frac{\partial F_{\theta_t}(x)}{\partial\theta_t}\,\frac{\partial F_{\theta_t}(x_m)}{\partial\theta_t}^{T}\,\partial_m L_\phi. \qquad (3)

Figure 2: First two principal components of the feature vector (a) and the L2 norm of the derivative in Equation (3), i.e., the "moving speed" (b), at different training steps. Features and derivatives are computed from the last fully-connected layer of a LeNet trained on MNIST (in-distribution); FashionMNIST data are fed in as OoD samples. At initialization, features of both in-distribution and OoD samples concentrate near the FSS F*. After training, features of in-distribution samples are pulled away from F*, while features of OoD samples remain close to it. Similar dynamics of the softmax layer on in-distribution data were observed by (Li, Zhao, and Scheidegger 2020).

From Equation (3), we can see that the relative moving speed of the feature F_{θ_t}(x) depends on the inner product between the parameter gradients of x and of the training data x_m; note that ∂_m L_φ is the same for all x. Since the FSSD defined in Equation (1) can be seen as the integral of dF_{θ_t}(x)/dt when the initial value F_{θ_0}(x) equals F* for all x, FSSD(x) is small whenever the derivative, i.e., the moving speed, is small.

In Figure 2, we show both the features and their moving speeds for in-distribution and OoD data at different steps of training. Although in-distribution and OoD data are indistinguishable at step 0, they are quickly separated: the moving speeds of in-distribution data are larger than those of OoD data (Figure 2(b)), and thus the accumulated movements of in-distribution data are also larger (Figure 2(a)). In Figure 3, we show examples of the initial concentration of features in LeNet and ResNet-34 for the MNIST vs. FashionMNIST and CIFAR10 vs. SVHN dataset pairs, respectively. Empirically, we find the initial concentration of both in-distribution and OoD features to be the common case for most popular architectures with random initialization. We show more examples on our GitHub page; a toy version of the moving-speed measurement is sketched below.
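The following toy experiment, in the spirit of Figure 2(b), takes a single SGD step on an in-distribution batch and measures how far the features of probe inputs move; comparing an in-distribution probe batch with an OoD (or noise) probe batch should reproduce the qualitative gap in moving speed. It is only a sketch under assumed names: the model, the `feature_fn` that exposes the penultimate-layer features, and the batches are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

def feature_movement(model, feature_fn, train_batch, probe_x, lr=0.1):
    """Measure ||F_{t+1}(x) - F_t(x)|| for probe inputs after one SGD step on in-distribution data.

    feature_fn(net, x) should return the feature vectors of interest (e.g. the penultimate layer).
    By Equation (3), probes that are dissimilar to the training batch are expected to move less.
    """
    model.eval()
    with torch.no_grad():
        before = feature_fn(model, probe_x)

    stepped = copy.deepcopy(model)           # take one gradient step on a copy
    stepped.train()
    opt = torch.optim.SGD(stepped.parameters(), lr=lr)
    x_train, y_train = train_batch
    opt.zero_grad()
    F.cross_entropy(stepped(x_train), y_train).backward()
    opt.step()

    stepped.eval()
    with torch.no_grad():
        after = feature_fn(stepped, probe_x)
    return (after - before).flatten(1).norm(dim=1)   # per-sample "moving speed"
```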
As mentioned above, Equation (3) shows that the difference in the moving speed of F_{θ_t}(x) stems from the difference in Θ_t(x, x_m) := (∂F_{θ_t}(x)/∂θ_t)(∂F_{θ_t}(x_m)/∂θ_t)^T. We further point out that Θ_t(x, x_m) effectively acts as a kernel measuring the similarity between x and x_m. In fact, when the network width is infinite, Θ_t(x, x_m) converges to a time-independent term Θ(x, x_m), known as the neural tangent kernel (NTK) (Jacot, Gabriel, and Hongler 2018; Li and Liang 2018; Cao and Gu 2020). In this view, FSSD can be seen as a kernel regression result:

\mathrm{FSSD}(x) \approx \big\|F_{\theta_T}(x)-F_{\theta_0}(x)\big\| = \Big\|\sum_{m=1}^{M}\int_0^{T}\Theta_t(x,x_m)\,\partial_m L_\phi\,dt\Big\| \approx \Big\|\sum_{m=1}^{M}\Theta(x,x_m)\,\nu_m\Big\|, \qquad (4)

where ν_m = ∫_0^T ∂_m L_φ dt and we use F* ≈ F_{θ_0}(x).

This indicates that the similarity described by the inner product Θ_t(x, x_m) may enjoy properties similar to commonly used kernels such as the RBF kernel, which diminishes as the distance between x and x_m increases. Moreover, since the neural tangent kernel depends on the architecture, this kernel interpretation also suggests that feature extractors of different architectures, including different layers of one network, can have different properties and measure different aspects of the similarity between x and x_m. We will see this more clearly in the investigation of FSSD at different layers.

Figure 3: Both in-distribution and OoD samples are clustered in the feature space of F_{θ_0}(x) at initialization, and F* ≈ F_{θ_0}(x) for x ∈ X_OoD ∪ X_in-dist. (a) MNIST vs. FMNIST; (b) CIFAR10 vs. SVHN.

===Our Algorithm===
Based on this phenomenon, we can now construct our OoD detection algorithm. Since uniform noise inputs can be considered to possess the highest degree of OoD-ness, we use the center of their features as the FSS F*. The FSSD can then be calculated using Equation (1). Note that this estimate of F* is independent of the choice of in-distribution and OoD datasets. When such a natural choice of uniform noise is unavailable, we can instead choose F* to be the center of the features of OoD validation data.

Since a single forward pass through the network yields the features of every layer, it is also convenient to compute FSSDs from different layers and ensemble them as FSSD-Ensem(x) = Σ_{k=1}^K α_k FSSD_(k)(x). The ensemble weights α_k can be determined by logistic regression on some validation data, as in (Lee et al. 2018) (see the Evaluation paragraph in the Experiments section). In later experiments, unless specified otherwise, we use the FSSD ensembled over all layers. We note that it is also possible to ensemble FSSDs from different architectures or from multiple training snapshots (Xie, Xu, and Zhang 2013; Huang et al. 2017), which may further improve OoD detection. We investigate the effect of ensembling in a later section.

Besides, we adopt input pre-processing as in (Liang, Li, and Srikant 2018; Lee et al. 2018). The idea is to add a small perturbation to each test sample in order to increase its in-distribution score. It is shown in (Liang, Li, and Srikant 2018; Kamoi and Kobayashi 2020) that in-distribution data are more sensitive to such perturbations, which therefore enlarge the score gap between in-distribution and OoD samples. In particular, we perturb as x̃ = x + ε · sign(∇_x FSSD(x)) and take FSSD(x̃) as the final score.

We present the pseudo-code for computing FSSD-Ensem(x) in Algorithm 1; a code sketch follows the pseudo-code.

Algorithm 1: Computation of FSSD-Ensem
Input: test samples {x_n^test}_{n=1}^N; noise samples {x_s^noise}_{s=1}^S with x_s^noise ~ U[0, 1]; ensemble weights α_k; perturbation magnitude ε; feature extractors {F_(k)}_{k=1}^K.
for each feature extractor F_(k), k = 1, ..., K:
  1. Estimate the FSS: F*_(k) = (1/S) Σ_{s=1}^S F_(k)(x_s^noise).
  2. Add the perturbation to the test sample: x̃ = x + ε · sign(∇_x ||F_(k)(x) − F*_(k)||).
  3. Calculate FSSD_(k)(x) = ||F_(k)(x̃) − F*_(k)||.
end for
Return FSSD-Ensem(x) = Σ_{k=1}^K α_k FSSD_(k)(x).
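Below is a PyTorch sketch of Algorithm 1, under the assumption that `feature_extractors` is a list of callables (one per layer or per network) mapping an input batch to feature tensors, and that the ensemble weights `alphas` and perturbation magnitude `eps` come from the validation procedure described in the Experiments section; names and default values are illustrative, not the released implementation.

```python
import torch

def estimate_fss_list(feature_extractors, input_shape, num_noise=512, device="cpu"):
    """Step 1 of Algorithm 1: F*_(k) is the mean feature of uniform-noise inputs, per extractor."""
    noise = torch.rand(num_noise, *input_shape, device=device)   # x_s^noise ~ U[0, 1]
    with torch.no_grad():
        return [f(noise).flatten(1).mean(dim=0) for f in feature_extractors]

def fssd_ensem(x, feature_extractors, fss_list, alphas, eps=0.01):
    """Steps 2-3 and the weighted sum of Algorithm 1: FSSD-Ensem(x) = sum_k alpha_k * FSSD_(k)(x)."""
    total = 0.0
    for f, fss, alpha in zip(feature_extractors, fss_list, alphas):
        # Step 2: input pre-processing, x_tilde = x + eps * sign(grad_x ||F_(k)(x) - F*_(k)||).
        x_req = x.clone().detach().requires_grad_(True)
        dist = (f(x_req).flatten(1) - fss).norm(dim=1).sum()
        grad = torch.autograd.grad(dist, x_req)[0]
        x_tilde = (x + eps * grad.sign()).detach()
        # Step 3: FSSD_(k)(x) = ||F_(k)(x_tilde) - F*_(k)||.
        with torch.no_grad():
            total = total + alpha * (f(x_tilde).flatten(1) - fss).norm(dim=1)
    return total   # one FSSD-Ensem score per test sample (larger = more in-distribution)
```

The same code applies whether the extractors are different layers of one network or snapshots of different networks; only the contents of `feature_extractors` change.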
===Experiments===
In this section, we investigate the performance of the FSSD algorithm on various OoD detection benchmarks.

====Setup====
Benchmarks. To test our method thoroughly, we consider a wide variety of OoD detection benchmarks covering different dataset scales and different data types. We consider different dataset scales because large-scale datasets tend to have more classes and thus more ambiguous data; such ambiguous data have high classification uncertainty but are not out-of-distribution. The benchmarks are listed in Table 1.

We first consider two common benchmarks from the previous OoD detection literature (van Amersfoort et al. 2020; Ren et al. 2019): (A) FMNIST (Xiao, Rasul, and Vollgraf 2017) vs. MNIST (LeCun and Cortes 2010) and (B) CIFAR10 (Krizhevsky 2009) vs. SVHN (Netzer et al. 2011), both known to be challenging for many methods (Ren et al. 2019; Nalisnick et al. 2019a). (C) We also construct ImageNet (dogs), a subset of ImageNet (Russakovsky et al. 2015), as in-distribution data; the OoD data are non-dog images from ImageNet.

For large-scale problems, we consider three benchmarks. (D) We train models on ImageNet and detect corrupted images from the ImageNet-C dataset (Hendrycks and Dietterich 2019), testing each method on 80 sets of corruptions (16 types and 5 levels). (E) We train models on face images without the "blurry" attribute from CelebA (Liu et al. 2015) and detect face images with the "blurry" attribute. (F) We train models on web images of celebrities from MS-Celeb-1M (MS-1M) (Guo et al. 2016) and detect video captures from IJB-C (Maze et al. 2018), which generally have lower quality due to pose, illumination, and resolution issues. We also consider (G) the bacteria genome benchmark introduced by (Ren et al. 2019), which consists of sequence data.

Table 1: OoD detection benchmarks used in our experiments. Each row lists the in-distribution dataset with its number of classes (train/test) and number of samples (train/test), the OoD dataset with its number of test samples, and the data type.
* A: FMNIST, 10/10 classes, 60k/10k samples; MNIST, 10k; Image
* B: CIFAR10, 10/10 classes, 50k/10k samples; SVHN, 26k; Image
* C: ImageNet (dogs), 50/50 classes, 50k/10k samples; ImageNet (non-dogs), 10k; Image
* D: ImageNet, 1000/1000 classes, 1281.2k/50k samples; ImageNet-C, 50k; Image
* E: CelebA (not blurry), 10122/10122 classes, 153.8k/38.5k samples; CelebA (blurry), 10.3k; Image
* F: MS-1M, 64736/16184 classes, 2923.6k/50k samples; IJB-C, 50k; Image
* G: Genome (before 2011), 10/10 classes, 1000k/1000k samples; Genome (after 2016), 6000k; Sequence

To train models on the in-distribution datasets, we follow previous work (Lee et al. 2018) and train LeNet on FMNIST and ResNet-34 on CIFAR10, ImageNet, and ImageNet (dogs). For the two face recognition datasets (CelebA and MS-1M), we train ResNeXt-50. For the genome sequence dataset, we use a character embedding layer and two bidirectional LSTM layers (Schuster and Paliwal 1997); a possible instantiation is sketched below.
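The paper specifies only the embedding plus two bidirectional LSTM layers for the genome model, so the following PyTorch sketch is one possible instantiation; the vocabulary size, embedding and hidden dimensions, and the mean-pooling step are chosen arbitrarily for illustration.

```python
import torch.nn as nn

class GenomeClassifier(nn.Module):
    """Character embedding + two bidirectional LSTM layers + linear classifier (dimensions illustrative)."""
    def __init__(self, vocab_size=5, embed_dim=64, hidden_dim=128, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len) integer ids
        h, _ = self.lstm(self.embed(tokens))      # (batch, seq_len, 2 * hidden_dim)
        feature = h.mean(dim=1)                   # pooled sequence feature; FSSD can be computed on it
        return self.fc(feature)
```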
Compared methods. We compare our method with the following six common OoD detection methods. Base: the baseline using the maximum softmax probability p(ŷ|x) (Hendrycks and Gimpel 2017). ODIN: temperature scaling on the logits plus input pre-processing (Liang, Li, and Srikant 2018). Maha: the Mahalanobis distance of the sample feature to the closest class-conditional Gaussian distribution estimated from the training data (Lee et al. 2018); following (Lee et al. 2018), we use both feature (layer) ensembling and input pre-processing. DE: Deep Ensemble, which averages the softmax predictions of multiple independently trained classifiers (Lakshminarayanan, Pritzel, and Blundell 2017); we average 5 classifiers by default. MCD: Monte-Carlo Dropout, which uses dropout during both training and inference (Gal and Ghahramani 2016); we follow (Ovadia et al. 2019) and apply dropout to convolutional layers, compute both the mean and the variance of 32 independent predictions, and report the better of the two. OE: Outlier Exposure, which explicitly enforces uniform probability predictions on an auxiliary dataset of outliers (Hendrycks, Mazeika, and Dietterich 2019); as auxiliary datasets we use KMNIST (Clanuwat et al. 2018) for benchmark A, CelebA (Liu et al. 2015) for benchmark C, and ImageNet-1K (Russakovsky et al. 2015) for benchmarks B, E, and F. We do not evaluate OE on the sequence benchmark since we could not find a reasonable auxiliary dataset. We remark that Base, ODIN, and FSSD can be deployed directly with a trained neural network; MCD needs a trained network with dropout layers; DE needs multiple trained classifiers; Maha needs access to the training data during OoD detection on test data; and OE trains a network from scratch or by fine-tuning in order to utilize the auxiliary dataset.

Evaluation. We follow (Ren et al. 2019; Hendrycks, Mazeika, and Dietterich 2019) and assess OoD detection with the following metrics. AUROC: area under the receiver operating characteristic curve. AUPRC: area under the precision-recall curve. FPR80: false positive rate at a true positive rate of 80%. For hyper-parameter tuning, we follow (Lee et al. 2018; Ren et al. 2019; Liang, Li, and Srikant 2018) and use a separate validation set consisting of 1,000 images from each in-distribution and OoD data pair. The ensemble weights α_k for FSSDs from different layers are extracted from a logistic regression model trained with nested cross-validation within the validation set, as in (Lee et al. 2018; Ma et al. 2018); the same procedure is applied to Maha for a fair comparison. The perturbation magnitude ε of input pre-processing for ODIN, Maha, and FSSD is searched from 0 to 0.2 with step size 0.01. The temperature T of ODIN is chosen from {1, 10, 100, 1000}, and the dropout rate of MCD is chosen from {0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}.
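A sketch of the evaluation metrics and of fitting the layer-ensemble weights α_k with scikit-learn is given below; the variable names, the validation split, and the omission of nested cross-validation are simplifications of the procedure described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def ood_metrics(scores_in, scores_ood):
    """AUROC, AUPRC and FPR80, treating in-distribution as positive (higher score = more in-distribution)."""
    labels = np.concatenate([np.ones(len(scores_in)), np.zeros(len(scores_ood))])
    scores = np.concatenate([scores_in, scores_ood])
    auroc = roc_auc_score(labels, scores)
    auprc = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr80 = fpr[np.searchsorted(tpr, 0.80)]      # false positive rate at 80% true positive rate
    return auroc, auprc, fpr80

def fit_ensemble_weights(layer_fssd_in, layer_fssd_ood):
    """Fit alpha_k by logistic regression on validation data; rows = samples, columns = per-layer FSSDs."""
    X = np.vstack([layer_fssd_in, layer_fssd_ood])
    y = np.concatenate([np.ones(len(layer_fssd_in)), np.zeros(len(layer_fssd_ood))])
    return LogisticRegression().fit(X, y).coef_.ravel()   # one weight per layer
```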
====Main results====
The main results are presented in Table 2 and Figure 4. Table 2 shows that larger datasets entail greater difficulty in OoD detection. Notably, the advantage of FSSD over other methods grows as the dataset size increases. Other methods, such as Maha and OE, perform well on some small benchmarks but vary widely across datasets, whereas FSSD maintains strong performance throughout. On the genome sequence benchmark, FSSD also outperforms the other methods. These results show that FSSD is a promising and effective method for a wide range of applications.

Inspired by (Ovadia et al. 2019), we also evaluate the methods on their ability to detect distributional dataset shift such as Gaussian noise and JPEG artifacts. Figure 4 shows the means and quartiles of AUROC of the compared methods over 16 types of corruptions at 5 corruption levels. For every method, detection performance increases with the corruption level, while FSSD attains the highest AUROC and much less variation over the corruption types. The CelebA benchmark also evaluates the methods on detecting the dataset shift of the "blurry" attribute; however, no method, including FSSD, performs very well there. Two possible reasons are that (1) the "blurry" attribute of CelebA may not be annotated precisely enough, and (2) blurs in the wild may be harder to detect than the simulated blurs in ImageNet-C. Overall, FSSD detects different kinds of distributional shift more reliably than the compared methods.

Table 2: Main results. All values are in %. Each entry lists Base / ODIN / Maha / DE / MCD / OE / FSSD.
Small-scale benchmarks:
* FMNIST vs. MNIST (LeNet): AUROC 77.3 / 96.9 / 99.6 / 83.9 / 81.7 / 99.6 / 99.6; AUPRC 79.2 / 93.0 / 99.7 / 83.3 / 85.3 / 99.6 / 99.7; FPR80 43.5 / 2.5 / 0.0 / 27.5 / 36.8 / 0.0 / 0.0
* CIFAR10 vs. SVHN (ResNet34): AUROC 89.9 / 96.7 / 99.1 / 93.7 / 96.7 / 90.4 / 99.5; AUPRC 85.4 / 92.5 / 98.1 / 90.6 / 93.9 / 89.8 / 99.5; FPR80 10.1 / 4.7 / 0.3 / 3.7 / 2.4 / 12.5 / 0.4
* ImageNet dogs vs. non-dogs (ResNet34): AUROC 88.5 / 90.8 / 83.3 / 89.0 / 67.2 / 92.5 / 93.1; AUPRC 86.1 / 88.6 / 83.0 / 89.0 / 66.9 / 92.6 / 92.5; FPR80 19.5 / 15.2 / 30.1 / 18.8 / 59.2 / 7.9 / 10.2
Large-scale benchmarks:
* CelebA non-blurry vs. blurry (ResNeXt50): AUROC 71.7 / 73.3 / 73.9 / 74.5 / 69.8 / 71.5 / 78.3; AUPRC 89.9 / 91.4 / 90.9 / 91.4 / 88.7 / 90.7 / 92.8; FPR80 52.0 / 50.3 / 46.0 / 47.1 / 53.2 / 54.2 / 39.2
* MS-1M vs. IJB-C (ResNeXt50): AUROC 60.0 / 61.3 / 82.5 / 63.0 / 65.5 / 52.6 / 86.7; AUPRC 53.3 / 55.9 / 80.6 / 56.1 / 59.4 / 46.6 / 86.1; FPR80 61.8 / 59.4 / 29.6 / 56.7 / 58.8 / 64.2 / 22.1
Sequence benchmark:
* Bacteria Genome (LSTM): AUROC 69.6 / 70.6 / 70.4 / 70.0 / 69.3 / NA / 74.8; AUPRC 69.9 / 71.9 / 69.3 / 56.0 / 70.2 / NA / 75.8; FPR80 57.4 / 55.9 / 53.7 / 30.0 / 58.3 / NA / 47.4

Figure 4: Comparison of AUROC on ImageNet vs. ImageNet-C.
FSSD enjoys the highest mean and the least variance across all corruption levels.

====Robustness====
In practice, the test data may be slightly corrupted or shifted because of a change of data source, e.g., from the lab to the real world. We therefore evaluate the ability to distinguish in-distribution from OoD data when the test data (both in-distribution and OoD) are slightly corrupted; non-corrupted data are still used for network training and hyper-parameter tuning. We apply Gaussian noise and impulse noise, two typical corruptions, at varying levels (a sketch of these corruptions is given below). Results on CIFAR10 vs. SVHN and ImageNet dogs vs. non-dogs are shown in Figure 5: FSSD remains robust to the corruptions in the test images, while other methods may degrade.

Figure 5: Comparison of OoD detection robustness among methods on slightly corrupted test data (Gaussian noise and impulse noise at increasing levels, on CIFAR10 vs. SVHN and ImageNet dogs vs. non-dogs).
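The corrupted test sets of Figure 5 can be generated along the following lines, assuming images are float tensors scaled to [0, 1]; the default noise levels are illustrative and not the exact levels used in the figure.

```python
import torch

def gaussian_noise(x, std=0.03):
    """Additive Gaussian noise on images in [0, 1], clipped back to the valid range."""
    return (x + std * torch.randn_like(x)).clamp(0.0, 1.0)

def impulse_noise(x, prob=0.01):
    """Impulse (salt-and-pepper) noise: each pixel is set to 0 or 1 with total probability prob."""
    mask = torch.rand_like(x)
    out = x.clone()
    out[mask < prob / 2] = 0.0
    out[(mask >= prob / 2) & (mask < prob)] = 1.0
    return out
```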
====Effects of ensemble====
During our experiments, we find that ensembling plays an important role in enhancing the performance of FSSD. Previous studies show that an important issue for ensemble-based algorithms is enforcing diversity (Lakshminarayanan, Pritzel, and Blundell 2017). In our case, FSSD has high diversity across layers and benefits from this diversity to reach higher performance. Figure 6 shows that FSSDs in different layers behave differently, which matches previous work on understanding neural networks by visualizing the representations learned by shallow and deep layers (Zeiler and Fergus 2014; Zhou et al. 2015): FSSDs from deep layers reflect more high-level features, while FSSDs from early layers reflect more low-level statistics. ImageNet (dogs) and ImageNet (non-dogs) come from the same dataset (ImageNet) and are therefore similar in terms of low-level statistics, whereas CIFAR10 and SVHN differ at all levels. From the kernel perspective, this means that the neural tangent kernels of different layers diversify well and allow the FSSD ensemble to capture different aspects of the discrepancy between the test data and the training data. We show more examples of FSSDs at different layers on our GitHub page; a sketch of how such per-layer features can be collected is given after the figure caption below.

Figure 6: FSSDs from different layers behave differently. Each row contains FSSD histograms extracted from different layers of a trained neural network: (a) ImageNet (dogs) vs. ImageNet (non-dogs); (b) CIFAR10 vs. SVHN. FSSDs of ImageNet (dogs) and ImageNet (non-dogs) are similar in early layers, while FSSDs of CIFAR10 and SVHN differ in all layers. This is explained by the fact that ImageNet (dogs) and ImageNet (non-dogs) are similar in low-level statistics, since they are sampled from the same dataset, and that FSSDs in early layers capture mostly differences in low-level statistics.
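One way to obtain per-layer feature extractors of the kind behind Figure 6 is to register forward hooks on a trained PyTorch network and flatten each hooked activation; the sketch below assumes an arbitrary, user-chosen set of sub-modules, so the layer names are only examples.

```python
import torch

def layerwise_features(model, layers, x):
    """Run one forward pass and return the flattened activation of each requested sub-module.

    `layers` maps a name to a sub-module, e.g. {"layer1": net.layer1, "avgpool": net.avgpool}
    for a torchvision ResNet; one FSSD score can then be computed per returned entry.
    """
    feats, handles = {}, []
    for name, module in layers.items():
        handles.append(module.register_forward_hook(
            lambda mod, inp, out, name=name: feats.__setitem__(name, out.flatten(1))))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return feats
```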
===Related Works===
====Out-of-distribution detection====
According to different understandings of OoD samples, previous OoD detection methods can be grouped into four categories.

(1) Some methods regard OoD samples as those with uniform probability predictions across classes (Hein, Andriushchenko, and Bitterwolf 2018; Hendrycks and Gimpel 2017; Liang, Li, and Srikant 2018; Meinke and Hein 2020) and treat test samples with high entropy or low maximum prediction probability as OoD. Because these methods are based on the prediction, they risk misclassifying ambiguous data as OoD, e.g., when there are thousands of classes in a large-scale dataset.

(2) OoD samples can also be characterized as samples with high epistemic uncertainty, reflecting the lack of information about them (Lakshminarayanan, Pritzel, and Blundell 2017; Gal and Ghahramani 2016). Specifically, the uncertainty of the model can be propagated to the uncertainty of the predictions, which characterizes the degree of OoD-ness; MCD and DE are two popular choices of this type. However, it has been reported that current epistemic uncertainty estimates may degrade noticeably under distributional shift (Ovadia et al. 2019), and our experiments on detecting ImageNet-C from ImageNet (Figure 4) confirm this.

(3) When the density of the data can be approximated, e.g., with generative models (Kingma and Dhariwal 2018; Salimans et al. 2017), OoD samples can be identified as those with low density. Recent works provide many inspiring insights on how to improve this idea (Ren et al. 2019; Nalisnick et al. 2019b; Serrà et al. 2020). However, these methods typically incur extra training difficulty from the large generative models.

(4) Finally, some works design non-Euclidean metrics to compare test samples to training samples and regard those farther from the training samples as OoD (Lee et al. 2018; van Amersfoort et al. 2020; Kamoi and Kobayashi 2020; Lakshminarayanan et al. 2020). Our approach resembles this category most closely; however, instead of comparing test samples to training samples, we compare the features of the test samples to the center of OoD features.

===Conclusion===
In this work, we propose a new OoD detection algorithm based on the novel observation that OoD samples concentrate in the feature space of a trained neural network. We analyze the concentration phenomenon through the training dynamics, both theoretically and empirically, and further interpret the algorithm with the neural tangent kernel. We demonstrate that the algorithm achieves state-of-the-art detection performance and is robust to measurement noise. Our investigation of ensembling reveals diversity across layer ensembles and shows promising performance for network ensembles. In summary, we hope that this work provides new insights into the properties of neural networks and adds a simple and effective OoD detection method to the safe-AI deployment toolkit.

===Acknowledgements===
Bin Dong is supported in part by Beijing Natural Science Foundation (No. 180001), National Natural Science Foundation of China (NSFC) grant No. 11831002, and Beijing Academy of Artificial Intelligence (BAAI).
===References===
* Cao, Y.; and Gu, Q. 2020. Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 3349–3356.
* Clanuwat, T.; Bober-Irizar, M.; Kitamoto, A.; Lamb, A.; Yamamoto, K.; and Ha, D. 2018. Deep Learning for Classical Japanese Literature.
* Gal, Y.; and Ghahramani, Z. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of Machine Learning Research, volume 48, 1050–1059.
* Guo, Y.; Zhang, L.; Hu, Y.; He, X.; and Gao, J. 2016. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. In Computer Vision – ECCV 2016, 87–102.
* He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
* Hein, M.; Andriushchenko, M.; and Bitterwolf, J. 2018. Why ReLU Networks Yield High-Confidence Predictions Far Away From the Training Data and How to Mitigate the Problem. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 41–50.
* Hendrycks, D.; and Dietterich, T. 2019. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In Proceedings of the International Conference on Learning Representations.
* Hendrycks, D.; and Gimpel, K. 2017. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proceedings of the International Conference on Learning Representations.
* Hendrycks, D.; Mazeika, M.; and Dietterich, T. 2019. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations.
* Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735–1780.
* Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J. E.; and Weinberger, K. Q. 2017. Snapshot Ensembles: Train 1, Get M for Free. arXiv:1704.00109.
* Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems 31, 8571–8580.
* Kamoi, R.; and Kobayashi, K. 2020. Why is the Mahalanobis Distance Effective for Anomaly Detection? arXiv:2003.00402.
* Kingma, D. P.; and Dhariwal, P. 2018. Glow: Generative Flow with Invertible 1x1 Convolutions. In Advances in Neural Information Processing Systems 31, 10215–10224.
* Krizhevsky, A. 2009. Learning Multiple Layers of Features from Tiny Images. Technical report.
* Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Advances in Neural Information Processing Systems 30, 6402–6413.
* Lakshminarayanan, B.; Tran, D.; Liu, J.; Padhy, S.; Bedrax-Weiss, T.; and Lin, Z. 2020. Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness. In Advances in Neural Information Processing Systems 33.
* LeCun, Y.; and Cortes, C. 2010. MNIST Handwritten Digit Database. http://yann.lecun.com/exdb/mnist/.
* Lee, K.; Lee, K.; Lee, H.; and Shin, J. 2018. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), 7167–7177.
* Li, M.; Zhao, Z.; and Scheidegger, C. 2020. Visualizing Neural Networks with the Grand Tour. Distill. https://distill.pub/2020/grand-tour.
* Li, Y.; and Liang, Y. 2018. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), 8168–8177.
* Liang, S.; Li, Y.; and Srikant, R. 2018. Enhancing the Reliability of Out-of-Distribution Image Detection in Neural Networks. In International Conference on Learning Representations.
* Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV).
* Ma, X.; Li, B.; Wang, Y.; Erfani, S. M.; Wijewickrema, S.; Schoenebeck, G.; Houle, M. E.; Song, D.; and Bailey, J. 2018. Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality. In International Conference on Learning Representations.
* Maze, B.; Adams, J.; Duncan, J. A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A. K.; Niggel, W. T.; Anderson, J.; Cheney, J.; and Grother, P. 2018. IARPA Janus Benchmark – C: Face Dataset and Protocol. In 2018 International Conference on Biometrics (ICB), 158–165.
* Meinke, A.; and Hein, M. 2020. Towards Neural Networks That Provably Know When They Don't Know. In International Conference on Learning Representations.
* Nalisnick, E.; Matsukawa, A.; Teh, Y. W.; Gorur, D.; and Lakshminarayanan, B. 2019a. Do Deep Generative Models Know What They Don't Know? In International Conference on Learning Representations.
* Nalisnick, E.; Matsukawa, A.; Teh, Y. W.; and Lakshminarayanan, B. 2019b. Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality.
* Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
* Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J. V.; Lakshminarayanan, B.; and Snoek, J. 2019. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In NeurIPS.
* Rabanser, S.; Günnemann, S.; and Lipton, Z. 2019. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In Advances in Neural Information Processing Systems 32, 1396–1408.
* Ren, J.; Liu, P. J.; Fertig, E.; Snoek, J.; Poplin, R.; Depristo, M.; Dillon, J.; and Lakshminarayanan, B. 2019. Likelihood Ratios for Out-of-Distribution Detection. In Advances in Neural Information Processing Systems 32, 14707–14718.
* Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.
* Salimans, T.; Karpathy, A.; Chen, X.; and Kingma, D. P. 2017. PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In ICLR.
* Schuster, M.; and Paliwal, K. 1997. Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45(11): 2673–2681.
* Serrà, J.; Álvarez, D.; Gómez, V.; Slizovskaia, O.; Núñez, J. F.; and Luque, J. 2020. Input Complexity and Out-of-Distribution Detection with Likelihood-Based Generative Models. In International Conference on Learning Representations.
* van Amersfoort, J.; Smith, L.; Teh, Y. W.; and Gal, Y. 2020. Simple and Scalable Epistemic Uncertainty Estimation Using a Single Deep Deterministic Neural Network.
* Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms.
* Xie, J.; Xu, B.; and Zhang, C. 2013. Horizontal and Vertical Ensemble with Deep Representation for Classification. arXiv:1306.2759.
* Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; and He, K. 2017. Aggregated Residual Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5987–5995.
* Zeiler, M.; and Fergus, R. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, 818–833.
* Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2015. Object Detectors Emerge in Deep Scene CNNs. In International Conference on Learning Representations (ICLR).