Out-of-Distribution Detection Using Deep Neural Network Latent Space Uncertainty

Fabio Arnez¹*, Ansgar Radermacher¹ and François Terrier¹
¹ Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France

Abstract
As automated systems increasingly incorporate deep neural networks (DNNs) to perform safety-critical tasks, confidence representation and uncertainty estimation in DNN predictions have become useful and essential to represent DNN ignorance. Predictive uncertainty has often been used to identify samples that can lead to wrong predictions with high confidence, i.e., for Out-of-Distribution (OoD) detection. However, predictive uncertainty estimation at the output of a DNN might fail for OoD detection in computer vision tasks such as semantic segmentation due to the lack of information about semantic structures and contexts. We propose using the DNN uncertainty from intermediate latent representations to overcome this problem. Our experiments show promising results in OoD detection for the semantic segmentation task.

Keywords
Uncertainty Estimation, Latent Space, Out-of-Distribution Detection, Semantic Segmentation, Automated Vehicle

The 37th AAAI Conference on Artificial Intelligence: SafeAI 2023 workshop, February 07–14, 2023, Washington, DC, USA
* Corresponding author: fabio.arnez@cea.fr (F. Arnez); ansgar.radermacher@cea.fr (A. Radermacher); francois.terrier@cea.fr (F. Terrier). ORCID: 0000-0003-0367-3035 (F. Arnez)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In the last decade, Deep Neural Networks (DNNs) have witnessed great advances in real-world applications like Autonomous Vehicles (AVs), where they perform complex tasks such as object detection and tracking or vehicle control. Despite this progress, DNNs still have significant safety shortcomings due to their complexity, opacity, and lack of interpretability. Moreover, it is well known that DNN models behave unpredictably under dataset shift [1]. Deep Learning (DL) models carry training and data biases that directly impact model predictions and performance. This impedes ensuring the reliability of DNN models, which is a precondition for safety-critical systems to comply with industry safety standards and avoid jeopardizing human lives [2].

As highly automated systems (e.g., autonomous vehicles or autonomous mobile robots) increasingly rely on DNNs to perform safety-critical tasks, different methods have been proposed to represent confidence in DNN predictions. One way to represent DNN confidence is to capture the uncertainty associated with a prediction for a given input sample. Capturing information about "what the model does not know" is not only useful but essential in safety-critical tasks.

Bayesian Neural Networks (BNNs) and existing Bayesian approximate inference methods (Deep Ensembles, Monte-Carlo Dropout, etc.) offer a principled approach to model and quantify uncertainties in DNNs. However, quantifying uncertainty is challenging, since we do not have access to ground-truth uncertainty estimates, i.e., we do not have a clear definition of what a good uncertainty estimate is. Moreover, computer vision tasks can add an extra level of complexity, since tasks such as semantic segmentation require a pixel-level understanding of an image. In this case, a Bayesian Deep Learning model for semantic segmentation classifies each pixel in the input image and generates an uncertainty estimate for each classified pixel.

In semantic segmentation, uncertainty estimation has been used for Out-of-Distribution (OoD) detection under the assumption that samples far away from the training distribution (anomalous or OoD samples) yield higher predictive uncertainty than samples observed in the training data [3]. Approaches that use BNNs are able to capture aleatoric and epistemic uncertainties in the form of uncertainty maps (Figure 1, top) but still fail to detect anomalies accurately. BNN methods for semantic segmentation are prone to yield false-positive predictions, as well as mismatches between anomaly instances and uncertain areas, caused by the lack of information on semantic structures and contexts [4, 5], as presented in Figure 1, middle.

Recently, embedding density estimation methods have been proposed that draw a connection to uncertainties from Bayesian methods [6, 3]. In this direction, methods that leverage metrics or statistics from the non-parametric embedding space density have been proposed recently [7, 8], in contrast to distance-based methods that often assume a parametric embedding density [9, 10, 11].

The present work combines the benefits of Bayesian methods for uncertainty estimation with latent representation density estimation in the OoD detection task. We propose to capture the entropy of intermediate (latent) representations and to estimate the entropy densities for In-Distribution (InD) and OoD samples (see Figure 1, bottom). Once the entropy densities are estimated, we use them to classify new input samples as InD or OoD, i.e., we build a data-driven monitoring function that utilizes the input sample entropy for the OoD detection task.

Figure 1: Semantic segmentation uncertainty estimation comparison for in-distribution and out-of-distribution data.

2. Semantic Segmentation with Probabilistic U-Net Architecture

Probabilistic U-Net [12] is a DNN architecture for semantic segmentation that combines the U-Net architecture [13] with the conditional variational autoencoder (CVAE) framework [14]. The goal of Probabilistic U-Net is to handle input image ambiguities by leveraging the stochastic nature of the CVAE latent space. Figure 2 shows the Probabilistic U-Net architecture.

During training, depicted in Figure 2a, Probabilistic U-Net finds a useful embedding of the segmentation variants in the latent space by introducing a Posterior Net. This network learns to recognize a segmentation variant and to map it to a noisy position in the latent space (μ_post, σ²_post). In addition, a KL divergence term penalizes differences between the distributions at the output of the prior and posterior nets. The idea is to bring both distributions as close as possible, so that the Prior Net distribution covers the space of all presented segmentation variants.

In general, the central component of this architecture is its latent space: each value in the latent space encodes a segmentation variant. During inference, the Prior Net encodes each input image xᵢ and estimates the probability of these segmentation variants (μ_prior, σ²_prior). To predict a set of segmentation outputs, a set of samples is drawn from the Prior Net probability distribution.

Interestingly, we can draw a connection from this approach to other related work that aims to model complex aleatoric uncertainty (ambiguity, multi-modality) by handling stochastic input variables [15, 16, 17].

3. Methods

3.1. Capturing Uncertainty from Intermediate Latent Representations

Despite the benefits introduced by injecting random samples from the latent space into U-Net, aleatoric uncertainty alone is not enough: for the Out-of-Distribution detection task, epistemic uncertainty is needed [18, 19]. Although the Prior Net encoder q_prior employs Bayesian inference to obtain latent vectors z, it does not capture epistemic uncertainty, since the encoder lacks a distribution over its parameters φ. To overcome this problem, we take inspiration from Daxberger and Hernández-Lobato [20] and Jesson et al. [21], and propose to capture uncertainty in the Probabilistic U-Net Prior Net encoder using M Monte Carlo Dropout (MCD) samples [22], i.e., q_prior(z | x, φₘ):

    q_Φ(z | x, D_p) = ∫_φ q(z | x, φ) p(φ | D_p) dφ    (1)

In Eq. 1, we adapt the Prior Net encoder to capture the posterior q(z | x, D_p) using a set Φ = {φₘ}ᴹₘ₌₁ of encoder parameter samples φₘ ∼ p(φ | D_p), obtained by applying MCD at test time. During execution, we forward-pass an input image xᵢ multiple times through the q_prior net. Each forward pass generates a new dropout mask and, in consequence, a new (μ_prior, σ²_prior) prediction. From each predicted (μ_prior, σ²_prior) for the same image, we sample a new latent vector z, as presented in Figure 3.

MCD has been applied extensively for simple epistemic uncertainty estimation. However, standard dropout was found to be ineffective on convolutional neural networks (CNNs): it fails to remove semantic information from CNN feature maps, because nearby activations contain closely related information. On the other hand, dropping contiguous regions of 2D feature maps can remove semantic information and force the remaining units to learn features for the assigned task [23]. This effect is also desired for capturing uncertainties; otherwise, we could get overconfident uncertainty estimates in the presence of samples that contain anomalies. To overcome the standard dropout limitation, we followed the approach of Deepshikha et al. [24] and used DropBlock2D to capture uncertainty from the Probabilistic U-Net. We applied MC DropBlock2D to the last feature map of the Prior Net, as shown in Figure 2 and Figure 3 (in red).

Figure 2: Probabilistic U-Net [12] with Bayesian Prior Net for semantic segmentation: a. during training; b. during inference, with the monitoring function ℳ_OOD at the output of the Prior Net.
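The test-time procedure above (M stochastic forward passes with DropBlock active, each yielding a fresh (μ_prior, σ²_prior) from which one latent vector z is drawn, approximating Eq. 1 by Monte Carlo) can be sketched in a few lines. This is only a toy numpy illustration: `prior_net_with_dropblock` is an invented stand-in that zeroes a random 2×2 block of a fake feature map, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(1)

def prior_net_with_dropblock(x):
    """Toy stand-in for the Prior Net with MC DropBlock2D active at
    test time: each call drops a fresh contiguous 2x2 block of the
    (here 4x4) feature map, so repeated calls on the same image
    yield different (mu, sigma2) predictions."""
    fmap = np.ones((4, 4)) * x.mean()
    r, c = rng.integers(0, 3, size=2)   # top-left corner of the block
    fmap[r:r + 2, c:c + 2] = 0.0        # drop a contiguous region
    mu = np.full(6, fmap.mean())        # 6-dimensional latent space
    sigma2 = np.full(6, 0.5)
    return mu, sigma2

def mc_latent_samples(x, m=32):
    """M stochastic forward passes; from each predicted Gaussian,
    draw one latent vector z (Monte Carlo approximation of Eq. 1)."""
    zs = []
    for _ in range(m):
        mu, sigma2 = prior_net_with_dropblock(x)
        zs.append(mu + np.sqrt(sigma2) * rng.standard_normal(6))
    return np.stack(zs)                 # shape (M, latent_dim)

z_mc = mc_latent_samples(np.ones((8, 8)))
print(z_mc.shape)                       # (32, 6)
```

The resulting (M, latent_dim) array of samples is what the entropy estimators of the next paragraphs operate on.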
Figure 3: Prior Net latent vector z predictions with Monte Carlo DropBlock2D. The latent space at the output of the Prior Net is shown in 2D for illustration purposes.

The average surprise, or uncertainty, of a random variable z is defined by its probability distribution p(z) and is called the entropy of z, H(z). For continuous random variables, we use the differential entropy, as presented in Eq. 2:

    H(z) = ∫_z p(z) log (1 / p(z)) dz    (2)

To quantify uncertainty from the Prior Net MCD samples, we used standard entropy estimators [25] on 32 Monte Carlo samples (32 forward passes of the image through the Prior Net with MC DropBlock2D turned on). In Eq. 3, the entropy Ĥ_Φ(z | x) measures the average surprise of observing latent vector z at the output of the Prior Net, given an input image x:

    H(z | x) = ∫_z p(z | x) log (1 / p(z | x)) dz    (3)

3.2. Bayesian Generative Classifier for OoD Detection

For OoD detection, we assume access to a dataset of normal (InD) and anomaly (OoD) samples, Y = {normal, anomaly}, with which we can train a Bayesian generative classifier (a "not so naive" Bayes classifier) using the empirical density of a metric or statistic T of the latent representations z, i.e., T(z). To this end, we follow the approach of Morningstar et al. [7] and use Kernel Density Estimation (KDE) to obtain the T(z) densities. Since we aim to leverage the uncertainty from intermediate latent representations, the statistic T is the entropy at the output of the Prior Net (described in the previous section), with which we build the monitoring function ℳ_OOD, as presented in Figure 2b.

For each class label, we fit a KDE to obtain a generative model of the data, i.e., we use KDE to compute the likelihood p(T(z) | y). Then, we compute the class label prior probability p(Y), i.e., the marginal categorical distribution obtained by counting frequencies (the number of samples of each class in the complete training set). For an unknown latent vector, we can compute the posterior probability of each class, p(y | T(z)), using Bayes' rule in Eq. 4; for the OoD task, we use Eq. 5:

    p(y | T(z)) = p(T(z) | y) p(y) / p(T(z))    (4)

    p(y | T(z)) = p(T(z) | y) p(y) / Σ_{y′∈Y} p(T(z) | y′) p(y′)    (5)

For a more detailed description of the Bayesian generative classification approach, we refer the reader to the works of VanderPlas [26] and Postels et al. [3].

4. Early Experiments and Results

Figure 4: Dataset for training the OoD monitoring function.

Dataset Building. For training the DNN model for semantic segmentation, we used the Valeo Woodscape dataset¹ [27] with its semantic segmentation labels. For training the monitoring function (i.e., the Bayesian generative classifier), our first choice was the Soiling Woodscape sub-dataset. However, after inspecting the dataset, we noticed that its samples were taken in small sequences. To improve dataset diversity and implement our approach, we decided to create a new, smaller sub-dataset by taking just one or two samples from the sampling sequences for each anomaly in Soiling Woodscape.

Figure 5: Illustration of empirical densities with KDE: Mahalanobis distance d_M (top-left), the multivariate Gaussian entropy Ĥ_φ(z | x) (top-right), and the entropy of each latent vector variable Ĥ_φ(zᵢ | x) (bottom).
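The monitoring pipeline of Sections 3.1–3.2 — compute an entropy statistic T(z) from the Monte Carlo latent samples, fit one KDE per class for p(T(z) | y), and classify with Bayes' rule (Eq. 5) — can be sketched as follows. All numbers here are synthetic stand-ins (the two class distributions are invented for illustration), and the entropy function assumes the multivariate-Gaussian case; nothing below reproduces the paper's actual experiments.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_entropy(z_mc):
    """T(z): differential entropy of MC latent samples under a
    multivariate Gaussian fit, H = 0.5 * log((2*pi*e)^d * det(Sigma))."""
    d = z_mc.shape[1]
    cov = np.cov(z_mc, rowvar=False) + 1e-6 * np.eye(d)  # regularized
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def kde(samples, bandwidth=0.25):
    """1-D Gaussian kernel density estimator fitted to `samples`."""
    def density(t):
        u = (t - samples) / bandwidth
        return np.mean(np.exp(-0.5 * u**2)) / (bandwidth * np.sqrt(2.0 * np.pi))
    return density

# Toy entropy statistics standing in for T(z) values computed on real
# InD ("normal") and OoD ("anomaly") images, 140 samples per class.
t_train = {"normal": rng.normal(2.0, 0.3, 140),
           "anomaly": rng.normal(3.5, 0.6, 140)}
lik = {y: kde(t) for y, t in t_train.items()}   # p(T(z) | y) via KDE
prior = {y: 0.5 for y in t_train}               # balanced training set

def posterior_anomaly(t):
    """p(anomaly | T(z)) by Bayes' rule (Eq. 5)."""
    joint = {y: lik[y](t) * prior[y] for y in lik}
    return joint["anomaly"] / sum(joint.values())

print(posterior_anomaly(3.6))  # high entropy -> flagged as anomaly
print(posterior_anomaly(2.0))  # low entropy -> classified as normal
```

In the paper's setting, `gaussian_entropy` (or its per-variable counterpart) would be applied to the MC latent samples of each image to produce the T(z) values that the KDEs are fitted on.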
We called this new dataset OoD Woodscape; it combines samples from the Woodscape training set (normal class) and samples from the Soiling Woodscape validation set (anomaly class). The OoD-Woodscape training set has 280 samples, 140 for each class; the validation set has 120 samples in total, 60 for each class. The dataset-building procedure is depicted in Figure 4.

Experiments. We quantify the entropy from intermediate latent vectors. Using the entropy values, we estimate the entropy density for each sub-dataset, i.e., for samples from the normal and anomaly sub-datasets. First, we quantify the entropy assuming a multivariate Gaussian distribution, Ĥ_φ(z | x), as presented in Figure 5, top-right. Next, we compute the entropy estimate for each variable in the latent vector, Ĥ_φ(zᵢ | x), as shown in Figure 5, bottom. Finally, for comparison, we also use the Mahalanobis distance, a multivariate measure of the distance between a point and a distribution. In this last case, we built the reference distribution from the intermediate representations zᵢ of each input image xᵢ in the Woodscape validation set (see Figure 5, top-left). Then, we measure the distance to this reference distribution using

    d_M = √((z* − μ_z_val)ᵀ Σ⁻¹_z_val (z* − μ_z_val))

for a new input image x* and its predicted latent vector z*.

For entropy, in both cases we observe that the densities for InD and OoD samples differ. In the first case, the estimated latent vector entropy density shows clear multimodality for OoD samples, with peaks in entropy intervals that denote under-confident (high uncertainty) and overconfident (very low uncertainty) predictions. In the latter case, the entropy of the individual latent vector variables, we observe that some variables exhibit multimodal densities for OoD samples, with density peaks in entropy value intervals different from those obtained with InD samples. Finally, the d_M density shows slight peaks or modes for OoD samples; however, the InD and OoD densities overlap to a high degree.

Metrics. To evaluate our monitoring function, we used the validation set of OoD-Woodscape (the dataset we designed and built). We report results using the following metrics, as suggested by Ferreira et al. [28] and Blum et al. [6]: the Matthews correlation coefficient (MCC), the F1-score, the area under the Receiver Operating Characteristic curve (AUROC), and the False-Positive Rate at 90% True-Positive Rate (FPR90). Table 1 summarizes the results for each statistic or feature employed in our classifier (monitoring function), and Figure 6 shows the ROC curve.

Results & Discussion. We present the results of our monitoring function (classifier) in Table 1 and in Figure 6. The latent vector entropy-based methods outperform the Mahalanobis distance-based d_M method in almost all performance metrics. We believe the reason behind the poor performance of the d_M method is the strong assumption that the embedding space is class-conditionally Gaussian when building the reference distributions to compute the distance. On the other hand, the per-variable latent vector entropy achieves the best results. The reason is that the classifier benefits from more expressive (entropy) information at the latent variable level.

Table 1: Evaluation of OoD detection methods using DNN latent representations

    Method           MCC    F1     AUROC  FPR90
    d_M              0.473  0.763  0.769  0.5
    Ĥ_φ(z | x)       0.572  0.797  0.855  0.4
    Ĥ_φ(zᵢ | x)      0.685  0.849  0.946  0.16

¹ https://woodscape.valeo.com/download
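The Mahalanobis baseline d_M follows directly from its formula: fit (μ_z_val, Σ_z_val) on InD latent vectors, then score new vectors by their distance to that reference. The sketch below uses random toy vectors as stand-ins for Prior Net outputs on the Woodscape validation set; it is an illustration of the formula, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(3)

# Reference distribution built from InD latent vectors (toy stand-ins
# for Prior Net outputs on the Woodscape validation set).
z_val = rng.normal(0.0, 1.0, size=(500, 6))
mu_val = z_val.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(z_val, rowvar=False))

def mahalanobis(z_star):
    """d_M between a new latent vector z* and the InD reference
    distribution (mu_val, Sigma_val)."""
    diff = z_star - mu_val
    return float(np.sqrt(diff @ cov_inv @ diff))

d_ind = mahalanobis(rng.normal(0.0, 1.0, size=6))  # typical InD vector
d_ood = mahalanobis(np.full(6, 5.0))               # far from reference
print(d_ood > d_ind)                               # True
```

As the results above indicate, such a score only separates InD from OoD well when the embedding really is close to Gaussian; the d_M densities in Figure 5 overlap precisely because that assumption is strong.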
Figure 6: OoD detector ROC curve analysis.

5. Conclusion

In this work, we presented a method that uses the uncertainty from intermediate latent representations for Out-of-Distribution detection in a semantic segmentation task. Our early results show that the entropy of latent features can be useful for building data-driven monitoring functions. In future work, we aim to explore the impact of the structure of the latent space by relaxing the Gaussian assumption [29] and its effect on the metrics and statistics used for the OoD detection task. Moreover, it is important to analyze the applicability of our approach to other semantic segmentation architectures that do not contain generative neural network blocks.

Acknowledgement

This work has been supported by the French government under the "France 2030" program as part of the SystemX Technological Research Institute within the Confiance.ai Program (www.confiance.ai).

References

[1] R. McAllister, G. Kahn, J. Clune, S. Levine, Robustness to out-of-distribution inputs via task-aware generative uncertainty, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 2083–2089.
[2] F. Arnez, H. Espinoza, A. Radermacher, F. Terrier, A comparison of uncertainty estimation approaches in deep learning components for autonomous vehicle applications, Proceedings of the Workshop on Artificial Intelligence Safety 2020 (2020).
[3] J. Postels, H. Blum, Y. Strümpler, C. Cadena, R. Siegwart, L. Van Gool, F. Tombari, The hidden uncertainty in a neural network's activations, arXiv preprint arXiv:2012.03082 (2020).
[4] G. Di Biase, H. Blum, R. Siegwart, C. Cadena, Pixel-wise anomaly detection in complex driving scenes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16918–16927.
[5] Y. Xia, Y. Zhang, F. Liu, W. Shen, A. L. Yuille, Synthesize then compare: Detecting failures and anomalies for semantic segmentation, in: European Conference on Computer Vision, Springer, 2020, pp. 145–161.
[6] H. Blum, P.-E. Sarlin, J. Nieto, R. Siegwart, C. Cadena, The fishyscapes benchmark: measuring blind spots in semantic segmentation, International Journal of Computer Vision 129 (2021) 3119–3135.
[7] W. Morningstar, C. Ham, A. Gallagher, B. Lakshminarayanan, A. Alemi, J. Dillon, Density of states estimation for out of distribution detection, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3232–3240.
[8] Y. Sun, Y. Ming, X. Zhu, Y. Li, Out-of-distribution detection with deep nearest neighbors, arXiv preprint arXiv:2204.06507 (2022).
[9] K. Lee, K. Lee, H. Lee, J. Shin, A simple unified framework for detecting out-of-distribution samples and adversarial attacks, Advances in Neural Information Processing Systems 31 (2018).
[10] J. Nitsch, M. Itkina, R. Senanayake, J. Nieto, M. Schmidt, R. Siegwart, M. J. Kochenderfer, C. Cadena, Out-of-distribution detection for automotive perception, in: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), IEEE, 2021, pp. 2938–2943.
[11] C.-L. Li, K. Sohn, J. Yoon, T. Pfister, Cutpaste: Self-supervised learning for anomaly detection and localization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9664–9674.
[12] S. A. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. H. Maier-Hein, S. Eslami, D. J. Rezende, O. Ronneberger, A probabilistic u-net for segmentation of ambiguous images, arXiv preprint arXiv:1806.05034 (2018).
[13] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[14] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, Advances in Neural Information Processing Systems 28 (2015) 3483–3491.
[15] S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, S. Udluft, Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 1184–1193.
[16] M. Henaff, Y. LeCun, A. Canziani, Model-predictive policy learning with uncertainty regularization for driving in dense traffic, in: 7th International Conference on Learning Representations, ICLR 2019, 2019.
[17] F. Arnez, H. Espinoza, A. Radermacher, F. Terrier, Improving robustness of deep neural networks for aerial navigation by incorporating input uncertainty, in: Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops, Springer International Publishing, Cham, 2021, pp. 219–225.
[18] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[19] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek, Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift, Advances in Neural Information Processing Systems 32 (2019) 13991–14002.
[20] E. Daxberger, J. M. Hernández-Lobato, Bayesian variational autoencoders for unsupervised out-of-distribution detection, arXiv preprint arXiv:1912.05651 (2019).
[21] A. Jesson, S. Mindermann, U. Shalit, Y. Gal, Identifying causal-effect inference failure with uncertainty-aware models, Advances in Neural Information Processing Systems 33 (2020) 11637–11649.
[22] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, 2016, pp. 1050–1059.
[23] G. Ghiasi, T.-Y. Lin, Q. V. Le, Dropblock: A regularization method for convolutional networks, Advances in Neural Information Processing Systems 31 (2018) 10727–10737.
[24] K. Deepshikha, S. H. Yelleni, P. Srijith, C. K. Mohan, Monte carlo dropblock for modelling uncertainty in object detection, arXiv preprint arXiv:2108.03614 (2021).
[25] L. Kozachenko, N. N. Leonenko, Sample estimate of the entropy of a random vector, Problemy Peredachi Informatsii 23 (1987) 9–16.
[26] J. VanderPlas, Python data science handbook: Essential tools for working with data, O'Reilly Media, Inc., 2016.
[27] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O'Dea, M. Uricar, S. Milz, M. Simon, K. Amende, C. Witt, H. Rashed, S. Chennupati, S. Nayak, S. Mansoor, X. Perrotton, P. Perez, Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[28] R. S. Ferreira, J. Arlat, J. Guiochet, H. Waeselynck, Benchmarking safety monitors for image classifiers with machine learning, in: 2021 IEEE 26th Pacific Rim International Symposium on Dependable Computing (PRDC), IEEE, 2021, pp. 7–16.
[29] P. Ghosh, M. S. Sajjadi, A. Vergari, M. Black, B. Scholkopf, From variational to deterministic autoencoders, in: International Conference on Learning Representations, 2019.