<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Out-of-Distribution Detection Using Deep Neural Network Latent Space Uncertainty</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Arnez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ansgar Radermacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>François Terrier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université Paris-Saclay, CEA, List</institution>
          ,
          <addr-line>F-91120, Palaiseau</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As automated systems increasingly incorporate deep neural networks (DNNs) to perform safety-critical tasks, confidence representation and uncertainty estimation in DNN predictions have become useful and essential to represent DNN ignorance. Predictive uncertainty has often been used to identify samples that can lead to wrong predictions with high confidence, i.e., Out-of-Distribution (OoD) detection. However, predictive uncertainty estimation at the output of a DNN might fail for OoD detection in computer vision tasks such as semantic segmentation due to the lack of information about semantic structures and contexts. We propose using the DNN uncertainty from intermediate latent representations to overcome this problem. Our experiments show promising results in OoD detection for the semantic segmentation task.</p>
      </abstract>
      <kwd-group>
        <kwd>Uncertainty Estimation</kwd>
        <kwd>Latent Space</kwd>
        <kwd>Out-of-Distribution Detection</kwd>
        <kwd>Semantic Segmentation</kwd>
        <kwd>Automated Vehicle</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last decade, Deep Neural Networks (DNNs) have witnessed great advances in real-world applications like Autonomous Vehicles (AVs), where they perform complex tasks such as object detection and tracking or vehicle control. Despite this progress, DNNs still have significant safety shortcomings due to their complexity, opacity, and lack of interpretability. Moreover, it is well known that DNN models behave unpredictably under dataset shift [1]. Deep Learning (DL) models also carry training and data biases that directly impact model predictions and performance. This impedes ensuring the reliability of DNN models, which is a precondition for safety-critical systems to comply with industry safety standards and avoid jeopardizing human lives [2].</p>
      <p>As highly automated systems (e.g., autonomous vehicles or autonomous mobile robots) increasingly rely on DNNs to perform safety-critical tasks, different methods have been proposed to represent confidence in DNN predictions. One way to represent DNN confidence is to capture the uncertainty associated with a prediction for a given input sample. Capturing information about “what the model does not know” is not only useful but essential in safety-critical tasks.</p>
      <p>Bayesian Neural Networks (BNNs) and existing Bayesian approximate inference methods (Deep Ensembles, Monte-Carlo Dropout, etc.) offer a principled approach to model and quantify uncertainties in DNNs. However, quantifying uncertainty is challenging since we do not have access to ground-truth uncertainty estimates, i.e., we do not have a clear definition of what a good uncertainty estimate is. Moreover, computer vision tasks can add an extra level of complexity: tasks such as semantic segmentation require a pixel-level understanding of an image. In this case, a Bayesian Deep Learning model for semantic segmentation will classify each pixel in the input image and generate an uncertainty estimate for each classified pixel.</p>
      <p>
        In semantic segmentation, uncertainty estimation has been used for Out-of-Distribution (OoD) detection under the assumption that samples that are far away from the training distribution (anomalous or OoD samples) provide higher predictive uncertainty than samples observed in the training data [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ]. Approaches that use BNNs are able to capture aleatoric and epistemic uncertainties in the form of uncertainty maps (Figure 1-top) but still fail to detect anomalies accurately. BNN methods for semantic segmentation are prone to yield false-positive predictions, as well as mismatches between anomaly instances and uncertain areas, caused by the lack of information on semantic structures and contexts [
        <xref ref-type="bibr" rid="ref6 ref8">4, 5</xref>
        ], as presented in Figure 1-middle.
      </p>
      <p>
        Recently, embedding density estimation methods have been proposed to establish a connection to the uncertainties from Bayesian methods [
        <xref ref-type="bibr" rid="ref10 ref4">6, 3</xref>
        ]. In this direction, methods that leverage metrics or statistics from the non-parametric embedding space density have been proposed recently [
        <xref ref-type="bibr" rid="ref12 ref14">7, 8</xref>
        ], in contrast to distance-based methods that often assume a parametric embedding density [
        <xref ref-type="bibr" rid="ref16 ref18 ref21">9, 10, 11</xref>
        ]. The present work combines the benefits of Bayesian methods for uncertainty estimation with methods for latent representation density estimation in the OoD detection task.
      </p>
      <p>The 37th AAAI Conference on Artificial Intelligence: SafeAI 2023 workshop, February 07–14, 2023, Washington, DC, USA. * Corresponding author. fabio.arnez@cea.fr (F. Arnez); ansgar.radermacher@cea.fr (A. Radermacher); francois.terrier@cea.fr (F. Terrier). ORCID: 0000-0003-0367-3035 (F. Arnez). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-3">
      <title>2. Semantic Segmentation with Probabilistic U-Net Architecture</title>
      <p>
        Probabilistic U-Net [
        <xref ref-type="bibr" rid="ref23">12</xref>
        ] is a DNN architecture for semantic segmentation that combines the U-Net architecture [13] with the conditional variational autoencoder (CVAE) framework [14]. The goal of Probabilistic U-Net is to handle input image ambiguities by leveraging the stochastic nature of the CVAE latent space. Figure 2 shows the Probabilistic U-Net architecture.
      </p>
      <p>During training, depicted in Figure 2a, Probabilistic U-Net finds a useful embedding of the segmentation variants in the latent space by introducing a Posterior Net. This network learns to recognize a segmentation variant and to map it into a noisy position N(μ, σ²) in the latent space. In addition, the KL divergence is used to penalize differences between the distributions at the output of the prior and posterior nets. The idea here is to bring both distributions as close as possible so that the Prior Net distribution covers the space of all presented segmentation variants.</p>
      <p>In general, the central component of this architecture is its latent space. Each value from the latent space encodes a segmentation variant. During inference, the Prior Net encodes each input image x and estimates the probability of these segmentation variants N(μ, σ²). To predict a set of segmentation outputs, a set of samples is drawn from the Prior Net probability distribution. Interestingly, we can draw a connection from this approach to other related work that aims to model complex aleatoric uncertainty (ambiguity, multi-modality) by handling stochastic input variables [15, 16, 17].</p>
      <p>In Eq. 1, we adapt the Prior Net encoder to capture the posterior p(z | x, ω) using a set Φ = {ω_t} of encoder parameter samples ω_t ∼ q(ω | D) that are obtained by applying MCD at test time:</p>
      <p>p(z | x) ≈ ∫ p(z | x, ω) q(ω | D) dω ≈ (1/|Φ|) ∑_{ω_t ∈ Φ} p(z | x, ω_t)    (1)</p>
      <p>During execution time, we forward-pass an input image x multiple times through the Prior Net. Each forward pass generates a new dropout mask and, in consequence, a new N(μ, σ²) prediction. From each predicted N(μ, σ²) for the same image, we sample a new latent vector z, as presented in Figure 3.</p>
      <p>Figure 3: Prior Net latent vector predictions with MC DropBlock2D. The latent space at the output of the Prior Net is presented in 2D for illustration purposes.</p>
      <p>MCD has been applied extensively for simple epistemic uncertainty estimation. However, dropout was found to be ineffective on convolutional neural networks (CNNs): standard dropout is ineffective in removing semantic information from CNN feature maps because nearby activations contain closely related information. On the other hand, dropping continuous regions in 2D feature maps can help remove semantic information and enforce the remaining units to learn features for the assigned task [23]. This effect is also desired for capturing uncertainties; otherwise, we could get overconfident uncertainty estimates in the presence of samples that contain anomalies. To overcome the standard dropout limitation, we followed the approach from Deepshikha et al. [24] and used DropBlock2D to capture uncertainty from the Probabilistic U-Net. We applied MC DropBlock2D in the last feature map of the Prior Net, as shown in Figure 2 and Figure 3 (in red).</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <sec id="sec-2-1">
        <title>3.1. Capturing Uncertainty from Intermediate Latent Representations</title>
        <p>We propose to capture the entropy of intermediate (latent) representations and estimate the entropy densities for In-Distribution (InD) and OoD samples (see Figure 1-bottom). Once the entropy densities are estimated, we use them to classify new input samples as InD or OoD, i.e., we build a data-driven monitoring function that utilizes the input-sample entropy for the OoD detection task.</p>
        <p>The average surprise or uncertainty of a random variable X is defined by its probability distribution p(x), and it is called the entropy of X, i.e., H(X). For continuous random variables, we use the differential entropy, as presented in Eq. 2:</p>
        <p>H(X) = ∫ p(x) log (1/p(x)) dx    (2)</p>
        <p>To quantify uncertainty from Prior Net MCD samples, we used standard entropy estimators [25] on 32 Monte Carlo samples (32 image forward passes through the Prior Net with MC DropBlock2D turned on). In Eq. 3, the entropy Ĥ_Φ(z | x) measures the average surprise of observing latent vector z at the output of the Prior Net, given an input image x:</p>
        <p>H(z | x) = ∫ p(z | x) log (1/p(z | x)) dz    (3)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. OoD Detection with a Bayesian Generative Classifier</title>
        <p>
          For OoD detection, we assume that we have access to a dataset of normal (InD) and anomaly (OoD) samples D = {D_normal, D_anomaly}, with which we can train a Bayesian generative classifier (a “not-so-naive” Bayes classifier) using the empirical density of a metric or statistic T from latent representations z, i.e., T(z). To this end, we follow the approach of Morningstar et al. [
          <xref ref-type="bibr" rid="ref12">7</xref>
          ] and use a Kernel Density Estimation (KDE) method to obtain the T(z) densities. Since we aim at leveraging the uncertainty from intermediate latent representations, the T statistic is the entropy at the output of the Prior Net (described in the previous section), with which we build the monitoring function ℳ, as presented in Figure 2b.
        </p>
        <p>For each label set, we fit a KDE to obtain a generative model of the data, i.e., we use KDE to compute the likelihood p(T(z) | c). Then, we compute the class label prior probability p(c), i.e., the marginal categorical distribution obtained by counting frequencies (from the number of samples of each class in the complete training set). For an unknown latent vector, we can compute the posterior probability of each class p(c | T(z)) using Bayes' rule in Eq. 4. For the OoD task, we use Eq. 5.</p>
      </sec>
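<p>As a concrete illustration of Eqs. 1–3, the sampling-and-entropy procedure can be sketched as follows. This is a minimal sketch, not the authors' implementation: the toy prior_net function stands in for the Probabilistic U-Net Prior Net with MC DropBlock2D active, and the latent dimension, noise levels, and the closed-form Gaussian entropy (used here in place of the estimator of [25]) are illustrative assumptions; only the T = 32 forward passes follow the text.</p>

```python
# Sketch of the MC-sampling and latent-entropy statistic (Eqs. 1-3).
# "prior_net" is a toy stand-in for the Prior Net with MC DropBlock2D;
# latent dimension and noise levels are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 6   # illustrative latent size
T = 32           # Monte-Carlo forward passes, as in the paper

def prior_net(image, dropout_rng):
    """Toy stochastic encoder: each call simulates one forward pass with
    a fresh dropout mask, returning (mu, sigma2) of N(mu, diag(sigma2))."""
    noise = dropout_rng.normal(0.0, 0.1, LATENT_DIM)  # dropout-induced jitter
    mu = np.tanh(image.mean()) + noise
    sigma2 = np.full(LATENT_DIM, 0.05)
    return mu, sigma2

def mc_latent_samples(image, sample_rng, t=T):
    """Eq. 1 in practice: one latent vector z per stochastic forward pass."""
    zs = []
    for _ in range(t):
        mu, sigma2 = prior_net(image, sample_rng)
        zs.append(sample_rng.normal(mu, np.sqrt(sigma2)))
    return np.stack(zs)  # shape (T, LATENT_DIM)

def entropy_statistic(zs):
    """Eqs. 2-3 under a Gaussian fit to the MC samples:
    H(z_i | x) = 0.5 * log(2*pi*e*var_i), summed over latent variables."""
    var = zs.var(axis=0) + 1e-8
    per_var = 0.5 * np.log(2.0 * np.pi * np.e * var)
    return per_var.sum(), per_var

image = rng.normal(size=(8, 8))         # toy input "image"
zs = mc_latent_samples(image, rng)
h_total, h_per_var = entropy_statistic(zs)
```

<p>Replacing prior_net with a real encoder whose DropBlock layers stay active at test time yields both the joint statistic Ĥ(z | x) and the per-variable entropies Ĥ(z_i | x) used later in the experiments.</p>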
    </sec>
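<p>The monitoring function ℳ of Section 3.2 can be sketched in the same spirit. The class name EntropyMonitor, the Gaussian-kernel bandwidth, and the toy entropy values below are assumptions for illustration only; the classifier itself follows the described recipe: class-conditional KDE likelihoods, frequency-based priors, and Bayes' rule for the posterior.</p>

```python
# Sketch of the "not-so-naive" Bayes monitoring function (Section 3.2).
# Bandwidth, sample counts, and the toy entropy distributions are
# illustrative assumptions, not the paper's data.
import numpy as np

rng = np.random.default_rng(0)

def kde_logpdf(x, train_pts, bw=0.25):
    """Log of a 1-D Gaussian kernel density estimate evaluated at x."""
    d = (x - train_pts) / bw
    k = np.exp(-0.5 * d * d) / (bw * np.sqrt(2.0 * np.pi))
    return np.log(k.mean() + 1e-12)

class EntropyMonitor:
    """Bayesian generative classifier over the entropy statistic T(z)."""
    def fit(self, h_ind, h_ood):
        self.ind, self.ood = np.asarray(h_ind), np.asarray(h_ood)
        n = len(self.ind) + len(self.ood)
        # class priors p(c) from class frequencies in the training set
        self.log_prior = np.log(np.array([len(self.ind) / n,
                                          len(self.ood) / n]))
        return self

    def posterior_ood(self, h):
        # KDE likelihoods p(T(z) | c), combined with p(c) via Bayes' rule
        log_lik = np.array([kde_logpdf(h, self.ind),
                            kde_logpdf(h, self.ood)])
        log_post = log_lik + self.log_prior
        log_post -= np.logaddexp(log_post[0], log_post[1])
        return np.exp(log_post[1])  # probability the sample is OoD

# Toy usage: InD entropies cluster low, OoD entropies sit higher.
h_ind = rng.normal(1.0, 0.2, 140)   # 140 samples per class, as in Sec. 4
h_ood = rng.normal(2.0, 0.4, 140)
mon = EntropyMonitor().fit(h_ind, h_ood)
```

<p>A new input is flagged as OoD when its posterior exceeds a chosen decision threshold (0.5 for the equal-prior case above).</p>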
    <sec id="sec-4">
      <title>4. Early Experiments and Results</title>
      <p>Dataset Building. For training the DNN model for semantic segmentation we used the Valeo Woodscape dataset¹ [27] with the semantic segmentation labels. For training the monitoring function (i.e., the Bayesian generative classifier), our first choice was to use the Soiling Woodscape sub-dataset. However, after inspecting the dataset, we noticed that samples were taken in small sequences. To improve dataset diversity and implement our approach, we decided to create a new, smaller sub-dataset by taking just one or two samples from the sampling sequences for each anomaly in Soiling Woodscape. We called this new dataset OoD Woodscape; it combines samples from the Woodscape training set (normal class) and samples from the Soiling Woodscape validation set (anomaly class). The OoD-Woodscape training set has 280 samples, 140 for each class; the validation set has 120 samples in total, 60 for each class. The dataset-building procedure is depicted in Figure 4.</p>
      <p>Experiments. We quantify the entropy from intermediate latent vectors. Using the entropy values, we estimate the entropy density for each sub-dataset, i.e., for samples from the normal and anomaly sub-datasets. First, we quantify the entropy assuming a multivariate Gaussian distribution Ĥ(z | x), as presented in Figure 5 top-right. Next, we compute the entropy estimate for each variable in the latent vector Ĥ(z_i | x), as shown in Figure 5-bottom. Finally, for comparison, we also use the Mahalanobis distance, a multivariate measure of the distance between a point and a distribution. In this last case, we built the reference distribution taking intermediate representations z_i for each input image x from the Woodscape validation set (see Figure 5 top-left). Then, we measure the distance to this reference distribution using D_M = √((z* − μ_zval)ᵀ Σ_zval⁻¹ (z* − μ_zval)) for a new input image x* and its predicted latent vector z*.</p>
      <p>Figure 5: Illustration of empirical densities with KDE: Mahalanobis distance D_M (top-left), the multivariate Gaussian entropy Ĥ(z | x) (top-right), and the entropy from each latent vector variable Ĥ(z_i | x) (bottom).</p>
      <p>For entropy, in both cases, we observe that the densities for InD and OoD samples are different. In the first case, the estimated latent vector entropy density shows clear multimodality for OoD samples, with peaks in entropy intervals that denote under-confident (high uncertainty) and overconfident (very low uncertainty) predictions. In the latter case, the entropy from latent vector variables, we observe that some variables exhibit multimodal density predictions for OoD samples, with density peaks in entropy value intervals different from those obtained with InD samples. Finally, the D_M density shows slight peaks or modes for OoD samples; however, the densities for InD and OoD have a high degree of overlap.</p>
      <p>
        Metrics. To evaluate our monitoring function, we used the validation set from OoD-Woodscape (the dataset we designed and built). We report the results using the following metrics, as suggested by Ferreira et al. [28] and Blum et al. [
        <xref ref-type="bibr" rid="ref10">6</xref>
        ]: the Matthews correlation coefficient (MCC), the F1-score, the area under the Receiver Operating Characteristic (AUROC), and the False-Positive Rate at 90% True Positive Rate (FPR90). Table 1 summarizes the results for each statistic or feature employed in our classifier (monitoring function), and Figure 6 shows the ROC curve.
      </p>
      <p>Results &amp; Discussion. We present the results of our monitoring function (classifier) in Table 1 and in Figure 6. In the results, we can see that the latent vector entropy-based methods outperform the Mahalanobis distance-based D_M method in almost all the performance metrics. We believe that the reason behind the poor performance of the D_M method is the strong assumption that the embedding space is class-conditional Gaussian when building the reference distributions to compute the distance. On the other hand, we can see that the latent vector variable entropy has the best results. The reason behind this performance is that the classifier benefits from getting more expressive (entropy) information at the latent variable level.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Method</th><th>MCC</th><th>F1</th><th>AUROC</th><th>FPR90</th></tr>
          </thead>
          <tbody>
            <tr><td>D_M</td><td>0.473</td><td>0.763</td><td>0.769</td><td>0.5</td></tr>
            <tr><td>Ĥ(z | x)</td><td>0.572</td><td>0.797</td><td>0.855</td><td>0.4</td></tr>
            <tr><td>Ĥ(z_i | x)</td><td>0.685</td><td>0.849</td><td>0.946</td><td>0.16</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>1. https://woodscape.valeo.com/download</p>
    </sec>
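<p>The reported metrics can be computed directly from monitor scores. The sketch below uses synthetic score arrays rather than the paper's data, and the helper names auroc, fpr_at_tpr, and confusion_metrics are our own illustrative choices.</p>

```python
# Sketch of the evaluation metrics (MCC, F1, AUROC, FPR at 90% TPR),
# computed from raw OoD/InD scores. Score arrays are synthetic.
import numpy as np

def auroc(scores_ood, scores_ind):
    """Rank-based AUROC: probability a random OoD sample scores higher."""
    s = np.concatenate([scores_ind, scores_ood])
    y = np.concatenate([np.zeros(len(scores_ind)), np.ones(len(scores_ood))])
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n1, n0 = len(scores_ood), len(scores_ind)
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def fpr_at_tpr(scores_ood, scores_ind, tpr=0.90):
    """False-positive rate at the threshold that detects `tpr` of OoD."""
    thr = np.quantile(scores_ood, 1.0 - tpr)
    return float(np.mean(np.greater_equal(scores_ind, thr)))

def confusion_metrics(scores_ood, scores_ind, thr):
    """MCC and F1 for a fixed decision threshold on the scores."""
    tp = np.sum(np.greater_equal(scores_ood, thr))
    fn = len(scores_ood) - tp
    fp = np.sum(np.greater_equal(scores_ind, thr))
    tn = len(scores_ind) - fp
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom
    return mcc, f1

# Toy, perfectly separable scores for illustration.
scores_ood = np.array([2.0, 3.0, 4.0])
scores_ind = np.array([0.0, 1.0])
auc = auroc(scores_ood, scores_ind)
fpr = fpr_at_tpr(scores_ood, scores_ind)
mcc, f1 = confusion_metrics(scores_ood, scores_ind, thr=1.5)
```

<p>With real monitor outputs, these reduce to the Table 1 columns: threshold-free AUROC and FPR90, plus MCC and F1 at the operating threshold.</p>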
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we presented a method to use the uncertainty from intermediate latent representations for out-of-distribution detection in a semantic segmentation task. Our early results show that using the entropy from latent features can be useful in building data-driven monitoring functions. In future work, we aim to explore the impact of the structure in the latent space by relaxing the Gaussian assumption [29] and its effect on the metrics and statistics used for the OoD detection task. Moreover, it is important to analyze the applicability of our approach to other semantic segmentation architectures that do not include generative neural network blocks.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This work has been supported by the French government under the “France 2030” program as part of the SystemX Technological Research Institute within the Confiance.ai Program (www.confiance.ai).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. McAllister, G. Kahn, J. Clune, S. Levine, Robustness to out-of-distribution inputs via task-aware generative uncertainty, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 2083–2089.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] F. Arnez, H. Espinoza, A. Radermacher, F. Terrier, A comparison of uncertainty estimation approaches in deep learning components for autonomous vehicle applications, Artificial Intelligence Safety 2020 (2020).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[3] J. Postels, H. Blum, Y. Strümpler, C. Cadena, R. Siegwart, et al., The hidden uncertainty in a neural network's activations, arXiv preprint arXiv:2012.03082 (2020).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[4] G. Di Biase, H. Blum, R. Siegwart, C. Cadena, Pixel-wise anomaly detection in complex driving scenes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[5] Y. Xia, Y. Zhang, F. Liu, W. Shen, A. L. Yuille, Synthesize then compare: Detecting failures and anomalies for semantic segmentation, in: European Conference on Computer Vision, Springer, 2020.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[6] H. Blum, P.-E. Sarlin, J. Nieto, R. Siegwart, C. Cadena, The Fishyscapes benchmark: Measuring blind spots in semantic segmentation, International Journal of Computer Vision 129 (2021) 3119–3135.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[7] W. Morningstar, C. Ham, A. Gallagher, B. Lakshminarayanan, et al., Density of states estimation for out-of-distribution detection, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3232–3240.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[8] Y. Sun, Y. Ming, X. Zhu, Y. Li, Out-of-distribution detection with deep nearest neighbors, arXiv preprint arXiv:2204.06507 (2022).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[9] K. Lee, K. Lee, H. Lee, J. Shin, A simple unified framework for detecting out-of-distribution samples and adversarial attacks, Advances in Neural Information Processing Systems 31 (2018).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[10] J. Nitsch, M. Itkina, R. Senanayake, J. Nieto, et al., Out-of-distribution detection for automotive perception, in: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, pp. 2938–2943.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[11] C.-L. Li, K. Sohn, J. Yoon, T. Pfister, Cutpaste: Self-supervised learning for anomaly detection and localization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9664–9674.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[12] S. A. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, D. J. Rezende, O. Ronneberger, A probabilistic U-net for segmentation of ambiguous images, arXiv preprint arXiv:1806.05034 (2018).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[13] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[14] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, Advances in Neural Information Processing Systems 28 (2015) 3483–3491.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[15] S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, S. Udluft, Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 1184–1193.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[16] M. Henaff, Y. LeCun, A. Canziani, Model-predictive policy learning with uncertainty regularization for driving in dense traffic, in: 7th International Conference on Learning Representations, ICLR 2019, 2019.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[17] F. Arnez, H. Espinoza, A. Radermacher, F. Terrier, Improving robustness of deep neural networks for aerial navigation by incorporating input uncertainty, in: Computer Safety, Reliability, and Security. SAFECOMP 2021 Workshops, Springer International Publishing, Cham, 2021, pp. 219–225.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[18] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[19] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, J. Snoek, Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift, Advances in Neural Information Processing Systems 32 (2019) 13991–14002.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[20] E. Daxberger, J. M. Hernández-Lobato, Bayesian variational autoencoders for unsupervised out-of-distribution detection, arXiv preprint arXiv:1912.05651 (2019).</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[21] A. Jesson, S. Mindermann, U. Shalit, Y. Gal, Identifying causal-effect inference failure with uncertainty-aware models, Advances in Neural Information Processing Systems 33 (2020) 11637–11649.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[22] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, 2016, pp. 1050–1059.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[23] G. Ghiasi, T.-Y. Lin, Q. V. Le, Dropblock: A regularization method for convolutional networks, Advances in Neural Information Processing Systems 31 (2018) 10727–10737.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[24] K. Deepshikha, S. H. Yelleni, P. Srijith, C. K. Mohan, Monte carlo dropblock for modelling uncertainty in object detection, arXiv preprint arXiv:2108.03614 (2021).</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[25] L. Kozachenko, N. N. Leonenko, Sample estimate of the entropy of a random vector, Problemy Peredachi Informatsii 23 (1987) 9–16.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[26] J. VanderPlas, Python data science handbook: Essential tools for working with data, O'Reilly Media, Inc., 2016.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[27] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O'Dea, M. Uricar, S. Milz, M. Simon, K. Amende, C. Witt, H. Rashed, S. Chennupati, S. Nayak, S. Mansoor, X. Perrotton, P. Perez, Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[28] R. S. Ferreira, J. Arlat, J. Guiochet, H. Waeselynck, Benchmarking safety monitors for image classifiers with machine learning, in: 2021 IEEE 26th Pacific Rim International Symposium on Dependable Computing (PRDC), IEEE, 2021, pp. 7–16.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[29] P. Ghosh, M. S. Sajjadi, A. Vergari, M. Black, B. Scholkopf, From variational to deterministic autoencoders, in: International Conference on Learning Representations, 2019.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>