<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Safety Assurance of Uncertainty-Aware Reinforcement Learning Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felippe Schmoeller Roza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Hadwiger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingo Thorn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karsten Roscher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IKS</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siemens AG</institution>
          ,
          <addr-line>Nuremberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Wuppertal</institution>
          ,
          <addr-line>Wuppertal</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The necessity of demonstrating that Machine Learning (ML) systems can be safe escalates with the ever-increasing expectation of deploying such systems to solve real-world tasks. While recent advancements in Deep Learning reignited the conviction that ML can perform at the human level of reasoning, the dimensionality and complexity added by Deep Neural Networks pose a challenge to using classical safety verification methods. While some progress has been made towards making verification and validation possible in the supervised learning landscape, works focusing on sequential decision-making tasks are still sparse. A particularly popular approach consists of building uncertainty-aware models, able to identify situations where their predictions might be unreliable. In this paper, we provide evidence obtained in simulation to support that uncertainty estimation can also help to identify scenarios where Reinforcement Learning (RL) agents can cause accidents when facing obstacles semantically different from the ones experienced while learning, focusing on industrial-grade applications. We also discuss the aspects we consider necessary for building a safety assurance case for uncertainty-aware RL models.</p>
      </abstract>
      <kwd-group>
<kwd>Uncertainty estimation</kwd>
        <kwd>Distributional shifts</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Functional Safety</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This position paper is presented to serve as motivation for the long-term objective of using the uncertainty estimation capabilities of a Reinforcement Learning (RL) agent to improve its functional safety and enable RL as a viable framework to be deployed in industrial-grade applications. Although not a new concept, recent accomplishments have reignited the interest in using RL as a viable method to obtain agents able to interact with a wide range of environments (see [1, 2, 3]). These results were only possible due to the integration of Deep Neural Networks (DNNs) as function approximators for RL agents.</p>
      <p>According to some authors (e.g., [4, 5, 6]), the industry is eager to apply Machine Learning (ML) and DNNs more broadly in their processes, with the possibility to increase the safety level by aiding humans in processes that are potentially harmful or even to automate complex tasks beyond human capabilities. According to [7], possible applications include aircraft control, power systems, medical systems, and the automotive domain. However, despite the expected gains, industrial players are historically very conservative and, most of the time, only adopt new technologies when there is enough evidence supporting their reliability and cost-effectiveness, which is still not possible for some ML paradigms.</p>
      <p>DNNs excel at learning complex representations from a bulk of data, allowing them to reach state-of-the-art performance in tasks such as computer vision, natural language processing, and control of autonomous systems. However, DNNs are too complex and have too many parameters to be verified using standard verification and validation methods. On top of that, DNN models are often overconfident and incapable of recognizing that their predictions might be wrong [8]. The combination of these factors has put DNNs at the center of safe AI research in the past few years. The main goal is to guarantee that DNNs can be safe, reliable, secure, robust, explainable, and fair [7].</p>
      <p>Another difficulty with DNNs, which also extends to Deep RL, is formalizing how capable they are of generalizing over novel instances. Despite the excellent results obtained with known benchmarks, different findings show that DNNs are susceptible to distributional shifts (e.g., [9, 10]). That means that the model output is not reliable when fed with data drawn from a distribution that differs from its training data distribution, i.e., out-of-distribution (OOD) instances. When considering autonomous systems controlled by RL agents, there is the risk of accidents when facing OOD scenarios. This issue can be solved by making sure the model is trained with data that covers every aspect it might encounter after deployment, which is intractable for complex open-world tasks. Alternatively, some methods have been suggested to make DNNs robust to distributional shifts, such as in [11]. However, making DNNs able to handle distributional shifts is a challenging task and the existing methods are limited. We follow a different direction, which consists in using a monitor to identify OOD instances. Once OOD is detected, the system can switch to a safe control policy to avoid accidents caused by the agent’s inabilities (a fallback that could be as simple as "stop and wait for help"). We follow the hypothesis that uncertainty should grow higher when facing the unknown (as given in [12]) and use uncertainty estimation as a proxy metric to classify OOD inputs.</p>
      <sec id="sec-1-1">
        <title>AI for safety-critical applications: Diferent authors</title>
        <p>defend that to enable ML models to solve safety-critical
tasks, the models must be assured by evidence that the
ML components will behave in accordance with existing
safety specifications. [ 13] argue that the evidence must
1.1. Scope and structure of the paper cover all aspects necessary to show why these
components can be trusted. The authors also present a survey
This paper aims at showing how uncertainty-based OOD with diferent methods that help in collecting the
evidetection can help in the long-term goal of building a dence for the whole ML lifecycle. In [7], an extensive
solid safety case for RL agents, which must be backed by study in neural networks applied to high assurance
sysconvincing safety arguments. That is not the only factor tems is presented. In [14], the authors identify problems
necessary to make certification of RL models possible, that arise when using ML following ISO 26262, a standard
but one of the most important aspects. The paper will that regulates the functional safety of road vehicles. They
focus on industrial applications of automated guided ve- claim that the use of ML can result in hazards not
experihicles (AGVs). Industrial environments are mostly guided enced with conventional software. [15] also discuss the
by specific regulations that are helpful when outlining shortcomings of fitting ML systems to ISO 26262 and how
the system requirements and specifications in terms of the Safety of the Intended Functionality (SOTIF),
pubsafety. We believe this can also be used as a starting point lished in the ISO PAS 21448, ofers a better alternative for
when expanding the framework to a more general case, safety assurance. The authors also present an extensive
covering a larger range of open-world applications. list of safety concerns related to DNN models, including</p>
        <p>To validate the potential of this approach to help with the risk of the data distribution not being a good
approxderiving strong safety arguments, experiments with an imation of the real world and the possibility of
distribuenvironment that simulates the application of transport- tional shifts to happen over time. [16] also argue that the
ing goods with a vision-based AGV in warehouses were analysis of ML systems is fundamentally incompatible
conducted. The obtained results indicate that uncertainty with traditional safety verification since safety
engineerestimation and OOD detection can help to identify un- ing approaches focus on faults at the component level and
known situations which, in some cases, lead to accidents. their interactions with other system components while
At the end of the document, systemic failures experienced in complex systems are not</p>
        <p>The document is structured as follows: section 2 shows necessarily consequence of faults from individual parts
publications available in the literature to serve as back- of the system. Therefore, the safety arguments should
ground and motivation for this paper. In section 3 the also reflect the inherent complexity and unpredictability
uncertainty-aware RL algorithm is shown. Section 4 con- of ever-changing environments where ML systems are
tains the experiments and preliminary results, and sec- designed to operate.
tion 5 presents a short discussion and the future steps we
believe are necessary for building the safety assurance
case for uncertainty-aware RL systems.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Publications investigating safety assurance cases for RL systems are limited. Therefore, we will start with relevant works that cover the application of general AI methods in safety-critical applications. That will be followed by works that deal with uncertainty estimation and OOD detection for ML systems, mainly focusing on computer vision problems, and finally, publications that combine uncertainty and RL will be shown. Our work is an intersection of those three topics, with the proposed method being inspired by existing uncertainty quantification approaches and the future outline borrowing ideas from authors that intend to conform AI systems to safety certification processes that are, to the best of our knowledge, very limited when it comes to RL.</p>
      <sec id="sec-2-0">
        <title>AI for safety-critical applications</title>
        <p>Different authors defend that, to enable ML models to solve safety-critical tasks, the models must be assured by evidence that the ML components will behave in accordance with existing safety specifications. [13] argue that the evidence must cover all aspects necessary to show why these components can be trusted. The authors also present a survey with different methods that help in collecting the evidence for the whole ML lifecycle. In [7], an extensive study on neural networks applied to high-assurance systems is presented. In [14], the authors identify problems that arise when using ML following ISO 26262, a standard that regulates the functional safety of road vehicles. They claim that the use of ML can result in hazards not experienced with conventional software. [15] also discuss the shortcomings of fitting ML systems to ISO 26262 and how the Safety of the Intended Functionality (SOTIF), published in ISO PAS 21448, offers a better alternative for safety assurance. The authors also present an extensive list of safety concerns related to DNN models, including the risk of the data distribution not being a good approximation of the real world and the possibility of distributional shifts happening over time. [16] further argue that the analysis of ML systems is fundamentally incompatible with traditional safety verification, since safety engineering approaches focus on faults at the component level and their interactions with other system components, while systemic failures experienced in complex systems are not necessarily a consequence of faults from individual parts of the system. Therefore, the safety arguments should also reflect the inherent complexity and unpredictability of the ever-changing environments where ML systems are designed to operate.</p>
      </sec>
      <sec id="sec-2-1">
        <title>Machine Learning and Uncertainty</title>
        <p>The impact of uncertainty in Machine Learning is a recurrent topic of research, with plenty of publications discussing how ML systems should manage uncertainty and presenting methods to quantify it. In [17], the authors present a general discussion on the properties of Bayesian Deep Learning models used for computer vision tasks that are affected by aleatoric and epistemic uncertainties (the former is inherent to the stochastic properties of the system, while the latter is related to a lack of knowledge). In [18], an introduction to the topic of uncertainty in ML models is provided, as well as an overview of the main methods for capturing and handling uncertainty. In [19], the authors show how autonomous systems are affected by uncertainty and how correctly assessing uncertainty can help towards improving the supervision of inherently unsafe AI systems. Furthermore, a conceptual framework for dynamic dependability management based on uncertainty quantification is presented. In [20], uncertainty quantification as a proxy for the detection of OOD samples is discussed, with different methods compared on image classification datasets, namely CIFAR-10, GTSRB, and NWPU-RESISC45. Some popular uncertainty quantification methods for DNN models worth mentioning are Monte Carlo Dropout [21], Deep Ensembles [22], and Evidential Deep Learning [23].</p>
      </sec>
        <sec id="sec-2-1-1">
          <title>3.1. Reinforcement Learning</title>
        <p>Most of the work combining uncertainty quantification and ML covers Supervised Learning, with a strong focus on computer vision tasks. However, some literature also shows how uncertainty-aware RL agents can be obtained. A popular application is to use uncertainty to improve exploration. This class of algorithms is motivated by the principle of Optimism in the Face of Uncertainty (OFU) and describes the tradeoff between using high-confidence decisions, that come from the already established knowledge, and the agent’s need to explore state-action pairs with high epistemic uncertainty [24].</p>
        <p>However, this paper will rather focus on uncertainty as a proxy for detecting domain shifts in decision-making agents. In [25] it is proposed to define the data distributions in terms of the elements that compose a Markov Decision Process (MDP), where minor disturbances should fall under the generalization umbrella and large deviations represent OOD samples. However, determining which semantic properties represent such changes and how to measure them is left as an open question. In [26], the authors present an uncertainty-aware model-based learning algorithm that adds statistical uncertainty estimates, combining bootstrapped neural networks and Monte Carlo Dropout, to its collision predictor. Mobile robot environments are used to show that the agent acts more cautiously when facing unfamiliar scenarios and increases the robot’s velocity when it has high confidence. In [27] this method is extended to environments with moving obstacles. The authors also combine Monte Carlo Dropout and deep ensembles with LSTM models to obtain uncertainty estimates. A Model Predictive Controller (MPC) is responsible for finding the optimal action that minimizes the mean and variance of the collision predictions.</p>
        </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>In this section, we present the background for each component of the proposed uncertainty-aware RL algorithm. Different uncertainty quantification methods could be used, but Variational Auto Encoders (VAEs) are an interesting choice for vision-based systems. They are considered robust models, are trained in an unsupervised manner (i.e., labeling samples is not necessary), are fast to train, and their generalization capabilities can be visually inspected by comparing the input and reconstructed images. However, the safety argumentation would benefit from a comparison between different alternatives, with the strengths and deficiencies of each approach addressed, which will remain as a future work suggestion.</p>
      <sec id="sec-3-1">
        <title>3.3. Uncertainty estimation based on</title>
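        <p>For illustration, this interaction loop can be sketched as follows. It is a minimal sketch assuming a Gym-style interface; the env and policy objects are hypothetical placeholders rather than the setup used in our experiments.</p>
        <preformat>
# Minimal sketch of the MDP interaction loop described above.
# `env` and `policy` are hypothetical placeholders with a Gym-like API.
def run_episode(env, policy, max_steps=500):
    s_t = env.reset()   # s_0 drawn from the starting state distribution rho_0
    total_reward = 0.0
    for _ in range(max_steps):
        a_t = policy(s_t)                   # a_t chosen from A given s_t
        s_next, r_t, done = env.step(a_t)   # s_t+1 ~ P(. | s_t, a_t)
        total_reward += r_t                 # r_t = R(s_t, a_t, s_t+1)
        s_t = s_next
        if done:
            break
    return total_reward
        </preformat>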
      </sec>
      <sec id="sec-3-2">
        <title>Variational Auto Encoders</title>
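        <p>A minimal sketch of a training objective derived from equation 1 is shown below, assuming a PyTorch implementation with a Gaussian prior; the encoder and decoder modules are hypothetical placeholders, and the reconstruction term stands in for the expected log-likelihood.</p>
        <preformat>
import torch
import torch.nn.functional as F

# Sketch of the negative ELBO (equation 1) for a Gaussian-prior VAE.
# `encoder` returns the mean and log-variance of q_phi(z|x); `decoder`
# reconstructs x from z. Both are assumed user-defined nn.Modules.
def negative_elbo(encoder, decoder, x):
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)   # reparameterization trick
    x_hat = decoder(z)
    # E_q[log p_theta(x|z)] approximated by a reconstruction term
    recon = F.mse_loss(x_hat, x, reduction='sum')
    # KL[q_phi(z|x) || p(z)] in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # minimizing this maximizes the ELBO
        </preformat>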
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Uncertainty estimation based on Variational Auto Encoders</title>
        <p>OOD detection using VAEs assumes that the model assigns higher likelihoods to the samples drawn from the in-distribution (ID) pool than to the OOD samples, which is valid for different benchmarks as shown in [12]. Metrics derived from the model likelihood are then used as uncertainty estimates. We follow the Evidence Lower Bound (ELBO) Ratio method proposed in the same paper, which represents the ratio between the lower bound of the log-likelihood of a given sample and the maximum ELBO obtained with the ID samples [12]. For notation simplification, considering a fixed VAE model parametrized by θ and φ, the ELBO value ℒ(x; θ, φ) will be represented as ELBO(x), with ELBO_ID(x) representing the ELBO for a VAE model only trained with ID samples. Following this notation, the ELBO Ratio uncertainty U(x_0) for an arbitrary input x_0 is shown in equation 2:</p>
        <p>U(x_0) = ELBO_ID(x_0) / max_x ELBO_ID(x), (2)</p>
        <p>where max_x ELBO_ID(x) is the maximum ELBO_ID value calculated over all ID samples (a sort of calibration based on the training data).</p>
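        <p>In code, the calibration and the ratio of equation 2 could be sketched as follows; elbo is a hypothetical helper evaluating ℒ(x; θ, φ) for the ID-trained VAE.</p>
        <preformat>
# Sketch of the ELBO Ratio of equation 2. `elbo` is a hypothetical helper
# returning ELBO_ID(x) for the VAE trained only on ID samples.
def calibrate_max_elbo(elbo, id_samples):
    # Calibration: maximum ELBO over all ID (training) samples
    return max(elbo(x) for x in id_samples)

def elbo_ratio_uncertainty(elbo, x0, max_id_elbo):
    # U(x0) = ELBO_ID(x0) / max_x ELBO_ID(x)
    return elbo(x0) / max_id_elbo
        </preformat>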
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Preliminary Results</title>
      <p>Environment: To better support the proposed idea, experiments were conducted, and the preliminary results will be presented as further evidence. For the experiments, a custom environment was created using PyBullet [29]. It was designed to represent a warehouse with a configurable layout limited by walls, goods to be transported by an automated guided vehicle (AGV), and a set of obstacles that might be in the way. The goal is to reach a certain location that contains a good to be transported, represented by a wooden pallet, while avoiding obstacles or hitting the walls.</p>
      <p>An RGB camera is attached to the AGV and its control decisions are made based on the state s_t encoded by the input images and the coordinates of the AGV and the goal. The image resolution can be configured, but for the results shown below, RGB images with 84 × 84 pixels were used. The observation encoding also includes the positions of the AGV and the goal. The AGV action is a 2-dimensional vector, a_t, representing the linear and angular velocities. A reward of 100 is given if the agent reaches the goal position, -100 if it hits an obstacle, and -10 if it times out (i.e., it reaches the maximum number of steps).</p>
      <p>To attest to the capacity of the uncertainty estimator to spot critical failures that might be related to OOD instances, an ID and an OOD environment were designed. The differences consist of the type of static obstacles present in each environment, with obstacles that differ in color and shape, as shown in figure 2.</p>
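      <p>This extrinsic reward scheme can be summarized by the following sketch; the boolean outcome flags are hypothetical names for the episode termination conditions, while the reward constants follow the description above.</p>
      <preformat>
# Sketch of the extrinsic reward described in the text. The boolean flags
# are hypothetical names for the episode outcomes.
def extrinsic_reward(reached_goal, collided, timed_out):
    if reached_goal:
        return 100.0
    if collided:
        return -100.0
    if timed_out:
        return -10.0
    return 0.0   # non-terminal steps carry no extrinsic reward in this sketch
      </preformat>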
      <p>AGV controller framework: The controller used to solve the motion planning task described above is shown in figure 3. The first module is a path planner, responsible for determining the optimal path to reach the goal position based on the agent’s location. The planner takes the AGV kinematic model and solves the planning with the G1 Hermite Interpolation Problem with clothoids. Interpolating a sequence of waypoints using clothoid splines results in a smooth trajectory, suitable for the motion planning of mobile robots, as shown in [30, 31]. The planner takes a simplified observation s̃_t, consisting of the AGV and goal coordinates, as input. Its output is a position in the polar coordinate system, w_t = (r_t, α_t), where r_t and α_t are the radial and angular coordinates at time t, respectively. Note that the planner does not account for obstacles, since it is assumed that obstacles are not known a priori and the RL agent should be responsible for reacting and adjusting if an unexpected obstacle is in the way. The second module is a non-linear controller used to calculate the control action a_t necessary to reach the coordinate w_t. The last module is the RL agent. Its goal is to follow the proposed trajectory, i.e., keeping a*_t ≈ a_t as much as possible, proposing a different control action a*_t ≠ a_t only to avoid a collision. To fulfil this task, an intrinsic reward r_int was added, with r_int = 0.0 if a*_t = a_t (a small difference is tolerated) and r_int = -0.1 otherwise. The optimal policy becomes a tradeoff between avoiding the risk of collision (with the expressive -100 reward as punishment) and following the path planner to avoid the small punishments. The RL agent was trained in the ID environment using the Soft Actor-Critic algorithm [32].</p>
      <p>(Figure 3: AGV controller framework: the path planning module, the low-level controller, and the RL agent interacting with the external environment.)</p>
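      <p>One control cycle of this framework can be sketched as below; path_planner, low_level_controller, and rl_agent are hypothetical stand-ins for the three modules, and the deviation tolerance tol is an assumed parameter.</p>
      <preformat>
import numpy as np

# Sketch of one control cycle of the framework in figure 3. The three
# callables are hypothetical stand-ins for the clothoid path planner, the
# non-linear controller, and the SAC policy.
def control_step(path_planner, low_level_controller, rl_agent,
                 s_t, s_tilde, tol=1e-2):
    w_t = path_planner(s_tilde)       # next waypoint in polar coordinates
    a_t = low_level_controller(w_t)   # action tracking the planned path
    a_star = rl_agent(s_t)            # action proposed by the RL agent
    # Intrinsic reward: -0.1 when the agent overrides the planner beyond a
    # small tolerated deviation, 0.0 while it follows the planned action.
    r_int = -0.1 if np.linalg.norm(a_star - a_t) > tol else 0.0
    return a_star, r_int
      </preformat>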
      <p>Uncertainty estimator: The VAE uncertainty estimation model was trained to fit instances randomly sampled from the ID environment in a Supervised Learning manner. To that end, 20,000 images were collected from the ID environment and 2,000 from the OOD environment, the latter used for validation purposes during the model training. The model was trained for 10 epochs.</p>
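      <p>Schematically, this training and calibration setup could look as follows; collect_images, train_vae, and elbo are hypothetical helpers, and id_env and ood_env stand for the two environments, while the sample counts and epoch budget follow the text.</p>
      <preformat>
# Schematic setup for the VAE uncertainty estimator. All helper functions
# are hypothetical; the dataset sizes and epochs follow the text.
id_images = collect_images(id_env, n=20_000)    # training data (ID environment)
ood_images = collect_images(ood_env, n=2_000)   # held out for validation only
vae = train_vae(id_images, val_data=ood_images, epochs=10)
# Calibration constant for equation 2: maximum ELBO over the ID samples.
max_id_elbo = max(elbo(vae, x) for x in id_images)
      </preformat>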
      <p>After training the RL agent and the VAE uncertainty estimator, rollouts are performed in the OOD environment with this agent, and (state, action, reward) tuples are saved for post-analysis. The episode termination states are then passed through the uncertainty estimator to verify if crashes present a significant correlation to high uncertainty levels. The hypothesis is that if a crash happens due to the agent not being able to avoid an obstacle semantically different from the ones experienced during training, the OOD detector could flag this instance before the crash occurs. ID inputs, on the other hand, should signal low uncertainty, indicating that the RL agent is able to handle such situations. It is worth mentioning that these experiments only consider a very limited number of distinguishing features for the OOD obstacles. Since in reality the number of unknown obstacles can be extremely high, these experiments should be extended to a set of obstacles that is statistically significant to the problem dimension.</p>
      <p>Figure 4 shows how the VAE learns to reconstruct the images observed in the environment populated with ID obstacles, with the input and reconstructed images. After 10 epochs of training, the obstacles are recovered with a good definition. However, the model is not able to reconstruct the floor textures completely, which is of minor relevance in this scenario but should be investigated if such features would represent safety-critical aspects (e.g., oil on the floor, large cracks or holes).</p>
      <p>(Figure 4: VAE compression-decompression capabilities with ID images after 10 epochs of training: (a) ID input images; (b) ID reconstructed images.)</p>
      <p>Figure 5, on the other hand, represents the same model trained in the ID environment trying to reconstruct images with OOD obstacles in them. It is visible that, even after 10 epochs of training, the model is not able to recover the obstacle color or shape correctly, with blurred obstacles rendered in the output. That inability to correctly compress and decompress the images with OOD obstacles is responsible for increasing the calculated uncertainty.</p>
      <p>(Figure 5: VAE model compression-decompression capabilities with OOD images after 10 epochs of training: (a) OOD input images; (b) OOD reconstructed images.)</p>
      <p>Figure 6 shows the obtained results for the RL agent running in the OOD environment. The agent ran for 10,000 steps, which was equivalent to around 70 episodes. The y-axis represents the ELBO Ratio, which was normalized to get the values in the interval [0, 1]. Episodes that ended with a crash are represented by the red bars, while the blue bars picture the remaining episodes. The results show that some crash episodes presented high uncertainty, while very few non-crash episodes presented significant uncertainty levels. On the other hand, some failures did not trigger a high uncertainty level. These states could represent residual insufficiencies of the trained RL agent (e.g., caused by lack of training), that the OOD detector is not accurate for these inputs, or that the collision was not caused by an OOD element (e.g., the AGV crashed into a wall). To attest to the calibration of the uncertainty quantification, the same experiment was repeated in the ID environment, with the results shown in figure 7.</p>
        <p>The ELBO Ratio values are much lower for the entirety
of the episodes and more consistent. That is expected,
since in this case all the states should be considered ID,
showing that the VAE is not outputting false positives
for these data samples.</p>
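      <p>The post-analysis described above can be sketched as follows, assuming episodes stores (termination_state, crashed) pairs saved during the rollouts and reusing the hypothetical elbo helper; the flagging threshold is an assumed parameter.</p>
      <preformat>
# Sketch of the post-analysis: normalize per-episode ELBO Ratios to [0, 1]
# and flag high-uncertainty episodes. `episodes` holds hypothetical
# (termination_state, crashed) pairs from the OOD rollouts.
def analyse_rollouts(episodes, vae, elbo, max_id_elbo, threshold=0.5):
    ratios = [elbo(vae, s) / max_id_elbo for s, _ in episodes]
    lo, hi = min(ratios), max(ratios)
    normalized = [(r - lo) / (hi - lo) for r in ratios]   # rescale to [0, 1]
    flagged = [(u, crashed) for u, (_, crashed) in zip(normalized, episodes)
               if u > threshold]
    return normalized, flagged
      </preformat>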
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Future Perspective</title>
      <p>This paper focuses on motivating the promising perspective of using uncertainty quantification to improve the safety case of RL systems deployed in industrial applications, concentrating on camera-based systems. To that end, an environment modeling a typical warehouse was created. The preliminary results obtained with a VAE-based uncertainty estimator suggest this monitor can distinguish some of the states that result in accidents related to environmental distributional shifts. However, it is important to notice that not all accidents are caused by OOD obstacles; they can rather be influenced by the reward function definition, the observation encoding, the model generalization capabilities, among other aspects. Identifying and separating accidents caused by the inability of the agent to handle novel obstacles from accidents caused by other unrelated limitations is necessary before assessing the effectiveness of the OOD detection monitor.</p>
      <p>Many published works already discuss the importance of uncertainty estimation and OOD detection in the whole Safe AI spectrum, but we believe a more structured way to integrate these systems, together with empirical results, is needed to create a compelling safety assurance case, especially for RL systems. To reach this long-term goal, we suggest the following future steps:</p>
      <p>• Operational Design Domain (ODD) [33]: In real-world applications, the number of contextual combination possibilities makes any attempt at extensive testing intractable. Therefore, precise system specification is paramount before starting to build the assurance case. The ODD should include all contextual information that covers the intended operation of the system.</p>
      <p>• Extensive experimentation: Once an appropriate ODD is derived, the experiments described in this document can be extended to a much broader scope. Varying parameters, changing the scenario configuration, considering more obstacles, and adding sensor noise are just a few aspects that should be extensively considered. Strong safety arguments will depend on the experiments achieving a high statistical confidence level for the contexts described in the ODD. This should also include multiple uncertainty estimation methods, not covered in this paper.</p>
      <p>• Qualitative analysis: Understanding the system at a higher level of abstraction is also important to build a strong safety case. For that, it is important to visualize the scenarios that lead to high or low uncertainty and try to understand patterns that lead to wrong predictions, outliers, false positives and negatives, etc.</p>
      <p>• Residual error: The uncertainty monitor is not intended to cover every safety aspect, but rather covers failures caused by the inability of the system to handle domain shifts. Therefore, risks that fall outside this scope, as well as the residual error of the monitor itself, must be quantified and addressed by complementary measures.</p>
      <p>Not all of those items were touched on in this paper, but this list serves as a roadmap to guide our research efforts in the near future, as we believe that covering these points in deeper detail will result in incremental progress towards achieving a sound argumentation to enable uncertainty-aware RL agents to be deployed in safety-critical applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref13"><mixed-citation>[13] R. Ashmore, R. Calinescu, C. Paterson, Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges (2019).</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] R. Salay, R. Queiroz, K. Czarnecki, An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software (2017).</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] O. Willers, S. Sudholt, S. Raafatnia, S. Abrecht, Safety Concerns and Mitigation Approaches Regarding the Use of Deep Learning in Safety-Critical Perception Tasks (2020).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] S. Burton, J. A. McDermid, P. Garnett, R. Weaver, Safety, Complexity, and Automated Driving: Holistic Perspectives on Safety Assurance, Computer 54 (2021).</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] E. Hüllermeier, W. Waegeman, Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods, Machine Learning (2021).</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] M. Henne, A. Schwaiger, G. Weiss, Managing uncertainty of AI-based perception for autonomous systems, in: AISafety@IJCAI, 2019, pp. 11–12.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] A. Schwaiger, P. Sinhamahapatra, J. Gansloser, K. Roscher, Is uncertainty quantification in deep learning sufficient for out-of-distribution detection?, in: AISafety@IJCAI, 2020.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, Advances in Neural Information Processing Systems 31 (2018).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, P. Liu, Exploration in deep reinforcement learning: a comprehensive survey, arXiv preprint arXiv:2109.06668 (2021).</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] T. Haider, F. S. Roza, D. Eilers, K. Roscher, S. Günnemann, Domain shifts in reinforcement learning: Identifying disturbances in environments, in: AISafety@IJCAI, 2021.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-aware reinforcement learning for collision avoidance, arXiv preprint arXiv:1702.01182 (2017).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] B. Lütjens, M. Everett, J. P. How, Safe reinforcement learning with model uncertainty estimates, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8662–8668.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] E. Coumans, Y. Bai, PyBullet, a Python module for physics simulation for games, robotics and machine learning (2016).</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] E. Bertolazzi, M. Frego, G1 fitting with clothoids, Mathematical Methods in the Applied Sciences 38 (2015) 881–897.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] P. Bevilacqua, M. Frego, E. Bertolazzi, D. Fontanelli, L. Palopoli, F. Biral, Path planning maximising human comfort for assistive robots, in: 2016 IEEE Conference on Control Applications (CCA), IEEE, 2016, pp. 1421–1427.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International Conference on Machine Learning, PMLR, 2018, pp. 1861–1870.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] K. Czarnecki, Operational design domain for automated driving systems, Waterloo Intelligent Systems Engineering (WISE) (2018).</mixed-citation></ref>
    </ref-list>
  </back>
</article>