
Safety Assurance with Ensemble-based Uncertainty Estimation and overlapping alternative Predictions in Reinforcement Learning

Dirk Eilers

Simon Burton

Felippe Schmoeller Roza

Karsten Roscher

Fraunhofer Institute for Cognitive Systems IKS, Fraunhofer-Gesellschaft, Munich, Germany

2023

A number of challenges are associated with the use of machine learning technologies in safety-related applications. These include the difficulty of specifying adequately safe behaviour in complex environments (specification uncertainty), ensuring a predictably safe behaviour under all operating conditions (technical uncertainty) and arguing that the safety goals of the system have been met with sufficient confidence (assurance uncertainty). An assurance argument is therefore required that demonstrates that the effects of these uncertainties do not lead to an unacceptable level of risk during operation. A reinforcement learning model will predict an action in whatever state it is in, even in previously unseen states for which a valid (safe) outcome cannot be determined due to lack of training. Uncertainty estimation is a well understood approach in machine learning to identify states with a high probability of an invalid action due to a lack of training experience, thus addressing technical uncertainty. However, the impact of alternative possible predictions which may be equally valid (and represent a safe state) in estimating uncertainty in reinforcement learning is not so clear and, to our knowledge, not so well documented in current literature. In this paper we build on work where we investigated uncertainty estimation on simplified scenarios in a gridworld environment. Using model ensemble-based uncertainty estimation we proposed an algorithm based on action count variance to deal with discrete action spaces whilst considering in-distribution action variance calculation to handle the overlap with alternative predictions. The method indicates potentially unsafe states when the agent is near out-of-distribution elements and can distinguish them from overlapping alternative, but equally valid, predictions.
Here, we present these results within the context of a safety assurance framework and highlight the activities and evidence required to build a convincing safety argument. We show that our previous approach is able to act as an external observer and can fulfil the requirements of an assurance argumentation for systems based on machine learning with ontological uncertainty.

Keywords: Safe Reinforcement Learning (Safe RL), Safety Assurance Argumentation, Distributional Shift, Ensemble-based Uncertainty Estimation, Out-of-Distribution (OOD) detection

1. Introduction

The application of Machine Learning (ML) to safety-critical cyber-physical systems such as industrial robots and automated vehicles has the potential for greatly increasing the level of automation in complex environments. However, the use of ML is met with many practical challenges, in particular regarding resource, timing and performance constraints. The most dominant obstacle to the deployment of such systems is the difficulty in demonstrating the absence of unreasonable risk of unsafe actions due to erroneous outputs of the ML model. These errors are caused by a combination of insufficiencies due to epistemic uncertainty in the model and the occurrence of inputs and states that uncover these insufficiencies, themselves subject to aleatoric uncertainty. [1] argued that a causal understanding of insufficiencies can be used to reduce uncertainties in the performance of ML in an iterative manner. Based on a specification of safety acceptance criteria, a measurement of the error rate of the ML function is used to evaluate the impact and potential causes of ML insufficiencies. This analysis is used to derive design-time and operation-time measures to reduce residual safety risk (Figure 1). Design-time measures reduce the occurrence of insufficiencies in the model, e.g. by restricting the scope of the operating environment, optimizing the ML technique and architecture or redefining training conditions. Operation-time measures reduce the impact of residual insufficiencies in the model, e.g. through plausibility analysis or heterogeneously redundant calculations of the target function.

Figure 1: Safety Assurance Framework for machine learning systems (adapted from [1])

Reinforcement learning (RL) is well suited to systems operating in complex environments with high demands on flexibility, such as in route planning or motion control of mobile robots. As an RL agent will predict an action in whatever state it finds itself, the application can benefit from an awareness of the certainty and confidence in its own decisions. This includes situations that fall both within as well as outside of the distribution of previously seen training data. This paper focuses on uncertainty estimation as an operation-time measure to detect states which could lead to errors in the ML model. Specifically, we evaluate the detection of out-of-distribution (OOD) inputs to address the impact of distributional shift.

Distributional shift in data science is widely understood as the distributional difference between training and test data (or, respectively, data used during the inference or deployment phase) [2][3][4][5]. Distributional shift can have different causes, such as natural perturbations to the data set due to aleatoric uncertainty as well as evolving conditions in the environment. In machine learning, a shift in the probability distribution over state-action pairs often leads to degraded performance in the inference phase, leading the agent to propose wrong or sub-optimal actions. When the testing distribution differs from the training distribution, machine learning systems may not only demonstrate poor performance, but also have false confidence in the validity of their actions.

To overcome this limitation, safe reinforcement learning (safe RL) solutions must be capable of detecting and handling the uncertainty in the decision-making process. For instance, uncertainty estimation can detect a lack of generalization due to insufficient training and unseen states during training (OOD, epistemic uncertainty) as well as uncertainty resulting from randomness in the environment (aleatoric uncertainty). For epistemic uncertainty, a set of alternatively trained agents (ensemble) can be used. In states with high uncertainty due to a lack of training, the different agents will likely predict different actions, due to a lack of substance of the prediction. This variance can be utilized to indicate uncertainty. However, there may be states with various, equally valid actions that would also result in a variance in the outputs of the ensemble. Therefore, it is necessary to differentiate between the two effects.

2. Related work

Recent work has addressed OOD detection in the classical image classification domain as well as some work in the RL domain. [6] define novelty detection as the classification of test data that differ in some respect from data available during training. A model with insufficient training data is less able to generalize based on unseen test data even if this data does not contain “novel” concepts. Similarly, [2] characterize OOD as test data which are from a different distribution than the training data and describe OOD detection as a threshold-based process. The closer the data are to the training data distribution, the more likely it is that this is caused by a lack of training data only. OOD data further away from the training data distribution are more likely to represent conceptual or semantic differences, such as samples which are completely outside of the given classifications. Deep neural networks (DNNs) tend to be overconfident in predictions on unseen data and can give unpredictable results for far-from-distribution test data [3].

Prior work focused on OOD as a concept of samples that fall outside the defined set of classes. If samples are from outside this set, correct classifications for these samples, by definition, cannot be learned, even with unlimited training. To address this issue, it is common to specify a separate OOD class to train the model on [6]. Similarly, [7] and [8] define OOD samples as examples for classes different from those in the in-distribution (ID) dataset. [4] describe ID as a distribution trained by a classifier and OOD as sufficiently different from it. Also, [9] follow the approach of considering a strong difference between training and test data to be OOD. They describe ID data as conceptually similar to training data and OOD data as differing strongly from training data. [10] go as far as to define OOD by the distributional gap in between classified ID data sets. They propose to maximize the discrepancy between the decision boundaries of e.g. two classifiers to push OOD samples outside. They also follow the concept of near and far from the distribution.

Furthermore, it is important to understand the difference between epistemic and aleatoric uncertainty for uncertainty estimation as a proxy for OOD detectors. Epistemic uncertainty arises out of a lack of sufficient data to exactly infer the underlying system [11]. It can indicate samples that reside far away as well as close to the data distribution [5]. In contrast, aleatoric uncertainty arises from stochastic environments and must be accounted for in risk-sensitive applications [12], [13]. Aleatoric uncertainty cannot be solved just by more training. The impact of aleatoric uncertainty is therefore a significant factor in arguing the safety of RL-based safety-critical applications. In [14] the authors propose ensemble quantile networks (EQN) based on the work of [15], where they combine implicit quantile networks (IQN) for aleatoric uncertainty detection and utilize random prior functions (RPF) [16] as an ensemble-based method for epistemic uncertainty estimation. As epistemic uncertainty originates from model insufficiencies, a model ensemble will output a distribution over different estimates in an uncertain state (distribution over outputs). Aleatoric uncertainty arises from the randomness in the environment and causes a distribution over returns from the environment to the model's input (distribution over returns/inputs). However, in this paper we focus on the detection of epistemic uncertainty, as we focus on distributional shift and OOD detection.

[17] investigate a distributional RL algorithm, D3PG, which models the uncertainty in the form of a return distribution in which the expected value is the Q-value. Different actions might be used when the distribution is bimodal or multimodal, depending on the application scenario. [18] present an uncertainty-aware model-based learning algorithm that estimates the probability of collision together with a statistical estimate of uncertainty. The predictive model is based on bootstrapped neural networks using dropout. In regions of high uncertainty, their risk-averse cost function causes the robot to revert to a cautious low-speed strategy. In [19] the authors propose an action-advising framework where the agent asks for advice when its epistemic uncertainty is high for a certain state, in order to accelerate reinforcement learning. They add as a last layer multiple heads that separately estimate expected values for each action, as done in Bootstrapped deep Q-learning (DQN). As the learning algorithm updates the network, their predictions get closer to the real function, and to one another. [11] use uncertainty-based OOD detection, using Q-value uncertainty in the DQN algorithm. They compare MC-Dropout, Bootstrapped and Bootstrapped with prior functions. They also address the problem of overlapping alternative (equally valid) predictions of the model agent. Unfortunately, they do not go into detail when it comes to the uncertainty estimation in those cases but rather calculate an overall estimate for the epoch.

[20] estimate uncertainty for RL based on ensembles with randomized prior functions (RPF). They build on [16] and propose a criterion function. They choose safe actions in unknown situations far from the training distribution. In [14] they also utilize an ensemble of DQN agents to estimate Q-value uncertainty to switch back to a fallback policy in uncertain situations given a certain threshold. An ensemble is trained on bootstrapped data, which provides a distribution over the estimated Q-values to provide a Bayesian estimation of the epistemic uncertainty. The epistemic uncertainty estimate is then used to choose less risky actions in unknown situations. However, they do not take into account a potential overlapping uncertainty due to possible alternative actions.

One of the key questions to be answered from a safety assurance perspective is how OOD detection as an operation-time measure really helps to prevent hazardous conditions. This requires OOD detection to be able to detect novel data that relates to an increase of risk, and to do so with sufficient accuracy and timeliness so that the system can be brought into a safe state before the risk becomes unacceptable.

The interaction between the development of ML-specific methods for optimizing performance and safety assurance was not observed in much previous work. In [21] the authors describe a collaborative and iterative process where ML method developers are supported by safety engineers to ensure the method contributes to the overall system safety assurance argument. This includes the systematic argumentation of the effectiveness of design and operation-time measures, an evaluation of the performance of the ML function against quantitative safety acceptance criteria and an analysis of the causes of insufficiencies in the model in order to derive more effective design and operation-time methods. Nevertheless, uncertainties in the assurance of the safety of ML functions will remain.

The closed-box nature of ML algorithms and the consequent reliance on observational evidence, coupled with the inherent epistemic uncertainty of the models (compared to traditional software) whilst operating within an environment with high aleatoric uncertainty, lead to a lack of confidence in our statements about the safety of the resulting system (assurance uncertainty). [1] refers to this challenge as the need to infer certain safety claims based on incomplete observations and defines a set of conditions to formalise this statement. This requires the use of rigorous argumentation to justify why an acceptable level of safety can be asserted despite the inherent limitations in the available evidence. In [22], based on [23], Baconian probability is proposed as a concept for estimating confidence in assurance arguments [24] based on how many possible assurance deficits (known as “defeaters”) of an argument can be eliminated. Confidence claim patterns were introduced which aim to identify all possible defeaters and demonstrate that they are either unlikely or not of significance. In section 4 we return to the challenge of arguing the safety contribution of an uncertainty estimation-based OOD detector by highlighting how some of the possible defeaters to the safety argument can be identified and addressed as part of an iterative process.

3. Ensemble uncertainty estimation based on action count variance and delta to ID

In [25] we proposed uncertainty estimation with action count variance (ACV) and delta to ID (IDD). In the following we give a condensed summary of that work.

3.1. Background

3.1.1. Reinforcement learning and MDP

In RL, the goal is to find the best policy for an agent that makes sequential decisions while interacting with an environment modeled as a Markov decision process (MDP). An MDP is defined as a tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, r, P, \rho_0)$, composed of the set of states $\mathcal{S}$, the set of actions $\mathcal{A}$, the reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \mathbb{R}$, the transition probability function $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$, and the starting state distribution $\rho_0$. The transition probability function $P(s_{t+1} \mid s_t, a_t)$ models the system dynamics by giving the probability of transitioning from a previous state $s_t$ to the state $s_{t+1}$ when taking the action $a_t$.

The return $R_t$ is the sum of the discounted rewards, with $\gamma$ being the discount factor and $t$ the time step, given by

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}. \tag{1}$$

In the MDP framework, at each timestep, the agent observes the current state, takes an action, transitions to the next state drawn from the distribution, and receives a reward. The action-value function, also known as the Q-value function, $Q^{\pi}$ represents the expected return when following a policy $\pi$ (which basically maps states into actions), as shown below.

$$Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]. \tag{2}$$

Q-learning, in which a policy is learned using Q-values, is a popular model-free method. Deep Q-networks (DQNs) extend Q-learning with the usage of neural networks as function approximators. To do so, the temporal-difference error can be derived from the Q-value function using the Bellman operator, resulting in the equation below.

$$\delta_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}) - Q(s_t, a_t; \theta), \tag{3}$$

where $\theta^{-}$ and $\theta$ are the DQN parameters from the target and the prediction network as defined in [26], respectively.

3.1.2. Distributional shift and OOD

Distributional shift and OOD are two concepts that are closely related, but it is important to distinguish distributions that are closer to or further away from the training distribution. It is expected that an RL agent would be able to perform well in scenarios that are slightly different from those used in training, as it should be able to generalize. However, when the situation is too dissimilar (perhaps at a semantic level) the agent might have its ability to make proper decisions severely affected. Epistemic uncertainty can be used as a proxy for detecting distributional shifts and is usually associated with a lack of sufficient data to better infer the underlying system.

Defining distributional shift within the RL domain is not trivial [27]. In this paper we assume that distributional shift can be characterized by changes in the system dynamics. More specifically, we consider the shift of the distribution over the state transitions given state-action pairs between training and test in MDPs, as shown below:

$$P_{\text{train}}[s_{t+1} \mid s_t, a_t] \neq P_{\text{test}}[s_{t+1} \mid s_t, a_t]. \tag{4}$$

Additionally, when considering partially observable MDPs (POMDPs) where the system's state cannot be assessed but rather an observation is available to the agent, the shift of the distribution over observations given states has to be taken into account:

$$P_{\text{train}}[o_t \mid s_t] \neq P_{\text{test}}[o_t \mid s_t]. \tag{5}$$

3.1.3. Ensemble-based uncertainty estimation

We focus on ensemble-based epistemic uncertainty estimation to detect distributional shift and OOD data during test time, that is, during the inference or deployment phase. An ensemble of agents trained on a subset of the available data will produce estimates with low variance in well trained states. When the ensemble members face insufficiently trained states, the estimates vary naturally across the members and give a distribution over the estimated Q-values. The variance of the estimated Q-values can be used to quantify the epistemic uncertainty of a decision.

An ensemble on bootstrapped data over DQNs provides a distribution over the estimated Q-values to provide a Bayesian estimation of the epistemic uncertainty. The Q-values will converge to the real values in situations the agent has sufficiently learned. In untrained situations, the Q-value estimates will still diverge and the variance will therefore give an estimate of the epistemic uncertainty. Random prior functions can be used to introduce diversity in an ensemble of agents trained on bootstrapped data [16]. The expected return is then given by

$$Q_k(s, a) = f(s, a; \theta_k) + \beta\, p(s, a; \hat{\theta}_k), \tag{6}$$

where $Q_k$ is the Q-function of the kth ensemble member, $\hat{\theta}_k$ are the parameters of the prior function and $\beta$ is a factor to weight the impact of the prior function.

The variance of the Q-values of the ensemble estimates can be used to derive an uncertainty estimation threshold to invoke a backup policy [20][14] that ensures a safe state. With the variance $\mathrm{Var}_k[Q_k(s, a)] < \sigma^2$, the policy with threshold $\sigma$ can be calculated by

$$\pi(s) = \begin{cases} \arg\max_{a} \mathbb{E}_k[Q_k(s, a)] & \text{if } \mathrm{Var}_k[Q_k(s, a)] < \sigma^2, \\ \pi_{\text{backup}}(s) & \text{otherwise.} \end{cases} \tag{7}$$

3.2. Action count variance uncertainty estimation

The Q-value is a continuous variable where high variance in the predictions means high uncertainty of the ensemble. However, when given encapsulated agents or when the Q-values are not accessible for other reasons, it is possible to take the deviation over the proposed actions of the ensemble members to indicate uncertainty. In cases where the action space is continuous, the variance can be directly calculated as action variance, as with the Q-values. However, with discrete action spaces, this will lead to false results, as the actions themselves are orthogonal and a mean action cannot be calculated. Therefore, in cases where the action space is discrete, we proposed in [25] to calculate an action count on each action over the ensemble given a certain state and then calculate the variance of that action count (ACV, action count variance). When the ACV is low, there is a balance in the different proposed actions over the ensemble and the uncertainty is therefore high. In contrast, when the action count variance is high, there is a concentration on one or more actions in the ensemble and the uncertainty is low. The higher the ACV gets, the lower the uncertainty. A backup policy can then be chosen based on the ACV calculation as given in equation 8.

$$\pi(s) = \begin{cases} \arg\max_{a} \mathbb{E}_k[Q_k(s, a)] & \text{if } \mathrm{ACV}[Q_k(s, a)] > \mathit{threshold}, \\ \pi_{\text{backup}}(s) & \text{otherwise.} \end{cases} \tag{8}$$

3.3. Delta to ID uncertainty estimation

One problem with uncertainty estimation within reinforcement learning is that often multiple decisions are equally valid in a given state. These can be called alternative possible actions, or more generally alternative predictions. When an agent is in a state with alternative possible actions, the ensemble may already deviate in its prediction, although it might be trained sufficiently in this state. This means that alternative possible actions will produce high uncertainty and might falsely flag an OOD instance. Traditional methods fail to distinguish these two cases and, therefore, in [25] an alternative solution was proposed. This method, called Delta to ID (IDD), consists in comparing the given (and potentially OOD) situation to its nearest ID counterpart to differentiate high uncertainty resulting from these "ambiguous" states from distributional shifts. To get a comparison, it was proposed to subtract the ID uncertainty (represented by the ACV) from the given OOD uncertainty and use the result as a cleaned (delta) version of the ACV for uncertainty indication. Because of the (1-x) characteristic of the ACV with respect to the uncertainty, we actually subtract (ID-ACV) − (OOD-ACV), which inverts the ACV characteristic to match that of the uncertainty and results in:

$$\mathrm{IDD}[Q_k(s, a)] = \mathrm{ACV}_{\text{ID}}[Q_k(s, a)] - \mathrm{ACV}_{\text{OOD}}[Q_k(s, a)]. \tag{9}$$

There can be different approaches to get a nearest ID from a given OOD scenario. To simplify here, we stick to an OOD scenario with one dedicated OOD obstacle. The OOD obstacle in the given OOD scenario will then be exchanged with a corresponding ID obstacle. The observation function changes as given in equation 10.

_ℎ_ ( ) = . (10)

A high delta of the ACV will indicate high uncertainty and a low delta low uncertainty in both cases, respectively. To decide on an uncertain situation in a given state, we proposed to use a threshold to mask out insignificant variance-delta to ID. This threshold can be used in future work to switch to a backup policy as an operation-time measure for safety assurance methods, e.g. in an iterative causal model as proposed in [1], to which we return in section 4.

4. Safety assurance

4.1. Background

The use of ML for highly automated safety-critical applications leads to a number of safety assurance challenges. These challenges are related to the complexity and unpredictability of the operating environment (aleatoric uncertainty), as well as the complexity of the technical system and task itself. A complex system can be defined as a system that exhibits behaviours that are emergent properties of the interactions between the parts of the system, where the behaviours would not be predicted based on knowledge of the parts and their interactions alone. This definition is closely related to the general concept of uncertainty, defined as any deviation from the unachievable ideal of completely deterministic knowledge of the relevant system [28].

For safety-critical autonomous systems, uncertainty manifests itself in various forms not restricted to the narrow definitions used in ML. Specification uncertainty is the uncertainty in the appropriateness and completeness of safety acceptance criteria and the definition of acceptably safe behavior in all situations that can reasonably be anticipated to occur within the target environment. Incomplete or otherwise insufficient training data can be seen as a consequence of specification uncertainty. Technical uncertainty stems from a lack of predictability in the performance of the technical components of a system. An example of this is the unpredictable reaction of the system to previously unseen events, or differences in the system behavior despite similar input conditions (epistemic uncertainty in the trained model). Assurance uncertainty relates to a lack of confidence in claims regarding safety properties of the ML system. This can include an insufficient integrity of evidence supporting the assurance arguments as well as of the chain of reasoning itself. Safety assurance for ML-based systems must therefore minimise these uncertainties and thus maximise the confidence that the system fulfils its safety expectations. The approaches described in [1, 21] and summarised in Figure 1 are designed to iteratively minimise these uncertainties and thereby safety risk as part of a continuous assurance process based on an understanding of the environment, insufficiencies in the ML system and potential deficits in the safety assurance argumentation. To support this approach, an assurance argument is proposed to support a systematic evaluation that well defined safety claims are supported by evidence and that all assumptions are explicitly stated and validated.

Complexity and unpredictability of the operational domain and of the system itself lead to semantic gaps, which indicate discrepancies between the intended and specified functionality, also known as specification insufficiencies. In safety-critical systems this can lead to hazardous systemic failures. In our view, specification uncertainty is also a problem in RL, for example when inappropriate reward functions are used. This might manifest itself in a manner that appears to be epistemic uncertainty; the root cause is, however, subtly different to, for example, a lack of training data.

To better understand the characteristics and impact of uncertainty, one can differentiate between statistical, scenario and ontological uncertainty. Statistical uncertainty can be expressed in quantitative statistical terms, such as confidence intervals expressed over probability distributions. Scenario uncertainty can only be described using qualitative scenarios, which are potentially multiple plausible states of the system and its environment. Ontological uncertainty [29] defines a lack of awareness that the knowledge about the system itself is incomplete; this requires an external perspective to resolve. Ontological uncertainty is a specific cause of specification insufficiencies, which in turn will lead to epistemic uncertainty in the trained model. In this paper, we describe an operation-time measure to mitigate the effects of this uncertainty by introducing an observer external to the ML component to detect the conditions where previously unseen inputs might impact the safety requirements.

4.2. Safety assurance argumentation using ensemble-based uncertainty estimation and IDD

In this section we discuss the impact of the ensemble-based uncertainty estimation from the following perspectives. First, we discuss the role of the uncertainty estimation as an operation-time measure for mitigating the impact of residual errors in the ML component and how this supports a safety assurance argument for the function. Second, we examine issues of uncertainty in the assurance argument itself and how confidence in the argument can be increased.

Figure 2 shows a simplified and incomplete excerpt (inspired by [30]) of a safety assurance argument described using the Goal Structuring Notation (GSN) [31][32] for the claim that the residual risk of the system colliding with obstacles is sufficiently low. GSN is a graphical notation that represents the elements of an assurance argument and the relationships between them. It shows how goals (claims) can be broken into sub-goals until they can be supported by direct references to evidence. It documents argumentation strategies as well as the context information, including assumptions and justifications. The assurance strategy illustrated here is based on an identification of potential causes of insufficiencies in the function and measures for reducing their impact during development and operation.

Uncertainty estimation is one of a number of complementary measures used to form a broad argument for safety. However, as mentioned above, the complexity of the system can undermine the confidence in the argument. [23] describes confidence in assurance arguments in terms of trust in assertions related to the evidence, context (including assumptions) and inference (or structure of the argument itself). For each of these aspects a number of defeaters could potentially be identified that undermine the argument [22]. For the example argumentation in Figure 2 these can include an incomplete definition of the operating environment or incorrect assumptions regarding the performance of the perception components (asserted context), as well as the validity of test results demonstrating the generalisation performance of the trained function due to the difficulty in covering previously unknown corner cases (asserted evidence). Furthermore, the assertion that all possible causes of insufficiencies have been addressed could also be incorrect (asserted inference).

As proposed in [1], it is advantageous to iterate through the assurance process when dealing with ontological uncertainties, and we show in the following that this is beneficial even in our simplified case study. The addition of the uncertainty estimator was the initial step to mitigate the residual uncertainties in the assurance argument, with the extension by the IDD a further step to increase confidence in the effectiveness of the uncertainty estimation itself.

5. Experimental results

In [25] we presented the results from more extended experiments. Here, we summarise the results and conduct additional experiments to argue the safety assurance.

Setup and Training: We trained complete agents in parallel with a randomly placed set of 10 obstacles, each placed singly, in a gridworld of 10x10 positions, with training runs of 1 million steps per agent. For testing we set up different scenarios with previously seen obstacles as ID and added a single dedicated obstacle not seen during training as an OOD condition. For the paper we focused on an ID scenario with a line of known obstacles in the middle of the grid and the goal at the end of the line.

For the visualization of the uncertainty estimation we calculated heatmaps over the grid showing each resulting uncertainty estimation for each position of the agent in the grid given the overall scenario.

Uncertainty heatmaps: For the depicted results, the uncertainty calculation based on the action count variance of the ensemble members is used. Figure 3 shows the base scenario with the known ID obstacle line in blue and the goal in green. As we use the variance in the action count, a brighter colour means more concentration on fewer actions (and therefore more certainty) and a darker colour means less concentration, i.e. more equally distributed actions (and therefore higher uncertainty). As one can see in the base scenario, possible alternative action predictions cause some “uncertainties” along the diagonals to the goal: these coordinates have equal probabilities of approaching the goal vertically or horizontally, since the action space only allows up/down and left/right movement and cannot realize a diagonal path directly. This also shows the limits of the uncertainty metric: the actions along the diagonal are no more or less dangerous, yet they show a high “uncertainty”. The same consideration applies, for example, to the point on the left of the obstacle line, where the probabilities for up and down are equally distributed.
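The heatmap values described above can be illustrated with a minimal sketch. It assumes that each of the K ensemble members proposes one discrete action per grid cell and that the ACV is the population variance of the per-action counts; the function name `acv_heatmap`, the array layout and the toy ensemble are hypothetical and not taken from [25].

```python
import numpy as np


def acv_heatmap(ensemble_actions: np.ndarray, n_actions: int) -> np.ndarray:
    """Action count variance (ACV) per grid cell.

    ensemble_actions: integer array of shape (K, H, W) holding the action
    proposed by each of the K ensemble members for every cell (assumed layout).
    """
    # One-hot encode each proposed action: shape (K, H, W, n_actions).
    one_hot = np.eye(n_actions, dtype=int)[ensemble_actions]
    # Count how often each action is proposed per cell: shape (H, W, n_actions).
    counts = one_hot.sum(axis=0)
    # Population variance of the counts over the action axis (assumption).
    return counts.var(axis=-1)


# Hypothetical ensemble of K=8 agents on a 1x3 grid with 4 actions.
# Cell 0: all members agree; cell 1: two actions tie; cell 2: fully balanced.
actions = np.array([
    [[0, 0, 0]],
    [[0, 0, 0]],
    [[0, 0, 1]],
    [[0, 0, 1]],
    [[0, 1, 2]],
    [[0, 1, 2]],
    [[0, 1, 3]],
    [[0, 1, 3]],
])
heat = acv_heatmap(actions, n_actions=4)
print(heat)  # high ACV = concentrated actions, low ACV = balanced actions
```

In this toy case the cell where two actions tie has a lower ACV than the cell where all members agree, which mirrors the “uncertainty” along the diagonals even though both tied actions may be equally valid.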

Figure 4a shows the predictions with one unknown obstacle, shown in purple, inserted in the middle directly on top of the line. There is increased uncertainty, especially in the area surrounding the unknown obstacle. Nevertheless, the uncertainty indication is overlaid with the already present "uncertainty" of the possible alternative predictions from the base ID scenario. In contrast, Figure 4b shows the predictions with a known obstacle, shown in blue, inserted in the middle directly on top of the line instead of the OOD obstacle. Now, the uncertainty indication is much closer to the base ID scenario.

(a) OOD obstacle in the middle (b) ID obstacle in the middle

Our approach proposes subtracting the base variance from the OOD variance in order to eliminate the base variance that results from the possible alternative predictions. Figure 5 depicts the results for the delta to ID with the known obstacle, without and with a threshold (5a and 5b). The approach appears to indicate a given OOD hotspot feasibly when a dedicated threshold is applied, although the indication is not totally sharp.
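The subtraction described above can be sketched as follows. This is only an illustration under stated assumptions: both heatmaps are given as arrays of action count variance per agent position, the indication uses the absolute amount of the delta, and the function name, grid values and threshold are hypothetical. The signed variant with a cutoff is not shown.

```python
import numpy as np


def idd_indication(acv_ood, acv_id, threshold):
    """Delta to ID: subtract the ID baseline heatmap from the OOD heatmap
    and flag positions whose absolute delta exceeds the threshold."""
    delta = np.asarray(acv_ood, dtype=float) - np.asarray(acv_id, dtype=float)
    return delta, np.abs(delta) > threshold


# Hypothetical 3x3 baseline: two cells already show low variance because
# of equally valid alternative actions (value 6.25).
acv_id = np.array([[18.75, 6.25, 18.75],
                   [18.75, 18.75, 18.75],
                   [6.25, 18.75, 18.75]])

# Same scene with an unknown obstacle: only the centre cell changes.
acv_ood = acv_id.copy()
acv_ood[1, 1] = 0.25

delta, flagged = idd_indication(acv_ood, acv_id, threshold=5.0)
```

In this toy example only the centre cell is flagged. The cells with a low baseline variance cancel out in the delta, so the overlap from alternative predictions does not trigger an indication.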

In the given scenario, the OOD hotspot lies directly in an area of low uncertainty (the yellow area on top of the blue line). To validate that the approach generalizes to different scenarios, we also ran setups where the hotspot lies in an area of previously known uncertainty from possible alternative predictions, such as the upper middle section (see Figure 6).

(a) OOD obstacle at the top (b) ID obstacle at the top

Figure 7 shows the resulting indication for the delta to ID with the known obstacle (7a and 7b).

False-Positive and False-Negative rates: To support the safety assurance argument, we set up an additional experiment and measured the probability that an agent without an observer hits the unknown obstacle. We compared this to the probabilities for an agent with only the baseline uncertainty estimation (UE-BL) and for an agent with the IDD uncertainty estimator, each acting as an external observer. The agent without an observer receives no uncertainty estimation (UE) indications, which is equivalent to false negatives (FN). For the two agents with external observers, we assume that the agent follows an alternative route and does not hit the unknown obstacle when the uncertainty estimator indicates uncertainty above a given threshold.

We iterate over all possible positions of the unknown obstacle and all possible positions of the agent, without introducing randomness into the setup, in order to focus on demonstrating the effects. We calculate the mean probabilities of false-positive (FP) indications (which slow the agent down) and of false negatives (FN) (which result in hazards). We vary the following hyper-parameters: the ACV thresholds for IDD and UE-BL; the size of a bounding box around the OOD position within which each indication counts as a true positive (TP); for FN, the percentage threshold of consent among the ensemble members to hit the obstacle in the next state; the absolute amount of the delta to ID vs. a cutoff under zero; and, for IDD, substitution with a known obstacle vs. an empty space. The approaches are compared in Table 1, where the hyper-parameters are tuned for equal FP probabilities to obtain directly comparable FN probabilities, and as a ROC (Receiver Operating Characteristic) curve in Figure 8.
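The counting scheme described above could look roughly like the following sketch for a single obstacle position. How the original experiments normalise the rates is not stated here, so the sketch assumes that both rates are taken over all agent positions except the obstacle cell; the function name, the square bounding box and the toy grids are hypothetical.

```python
import numpy as np


def fp_fn_rates(indicated, would_hit, ood_pos, box_radius):
    """Return (FP rate, FN rate) over all agent positions for one obstacle.

    indicated : bool grid, True where the observer raises an indication
                when the agent stands on that cell
    would_hit : bool grid, True where the ensemble's consent to step onto
                the obstacle in the next state reaches the chosen percentage
    """
    indicated = np.asarray(indicated, dtype=bool)
    would_hit = np.asarray(would_hit, dtype=bool)
    rows, cols = np.indices(indicated.shape)
    # Indications inside the bounding box around the obstacle count as TP.
    in_box = (np.abs(rows - ood_pos[0]) <= box_radius) & (
        np.abs(cols - ood_pos[1]) <= box_radius
    )
    # The agent cannot stand on the obstacle cell itself.
    valid = np.ones(indicated.shape, dtype=bool)
    valid[tuple(ood_pos)] = False
    n = valid.sum()
    fp = (indicated & ~in_box & valid).sum() / n
    fn = (would_hit & ~indicated & valid).sum() / n
    return float(fp), float(fn)


# Hypothetical 5x5 example with the unknown obstacle at (2, 2).
shape, ood = (5, 5), (2, 2)
would_hit = np.zeros(shape, dtype=bool)
would_hit[2, 1] = would_hit[1, 2] = True   # two cells heading into the obstacle
indicated = np.zeros(shape, dtype=bool)
indicated[2, 1] = True                     # indication next to the obstacle
indicated[0, 0] = True                     # spurious indication far away

no_observer = fp_fn_rates(np.zeros(shape, dtype=bool), would_hit, ood, 1)
with_observer = fp_fn_rates(indicated, would_hit, ood, 1)
```

In this toy case the agent without an observer has no FPs and two FNs out of 24 positions, while the observer removes one FN at the price of one FP. The mean probabilities over all obstacle positions would then be a plain average of these per-scenario rates.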

Table 1

            mean P(FN) (false-negative)    mean P(FP) (false-positive)
w/o UE      41.61e-3                       -
UE-BL       3.86e-3                        11.49e-2
UE-IDD      2.45e-3                        11.22e-2

IDD significantly reduces FNs compared to the agent with the baseline uncertainty estimator (by a factor of about 1.6, from 3.86e-3 to 2.45e-3) and compared to the agent without UE (by a factor of about 17, from 41.61e-3 to 2.45e-3). This comes at the cost of a higher FP rate for the UE agents, whereas the agent without UE naturally has no FPs.

When mapping the experimental results to the safety assurance argumentation from Section 4.2 and the iterative causal analysis model, it becomes clear that the safety claim may not be met without an uncertainty estimator and is improved in the 1st iteration by introducing the UE-BL. The results show a significant improvement in FN, but assuming an even stricter safety claim, e.g. FN less than 3‰, a 2nd iteration is needed to identify additional measures that further reduce the FN. The 2nd iteration, with IDD as an additional measure, then meets the required claim. Considering the high remaining FP rate and a potential additional claim regarding it, a 3rd iteration could address this aspect. However, this will be the target of future work.

Finally, whether all possible scenarios have been considered, and whether the safety assurance achieved is sufficient as rigorous evidence for certification purposes, needs further investigation on more realistic applications in future work.
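The iteration logic can be illustrated with a small check of the reported mean FN probabilities against the example claim of 3‰. The values are taken from Table 1; the dictionary layout and the function name are hypothetical and only serve the illustration.

```python
FN_CLAIM = 3e-3  # example safety claim from the text: FN below 3 per mille

# Mean P(FN) per configuration, as reported in Table 1.
mean_fn = {
    "w/o UE": 41.61e-3,  # no observer
    "UE-BL": 3.86e-3,    # 1st iteration: baseline uncertainty estimator
    "UE-IDD": 2.45e-3,   # 2nd iteration: delta to ID as additional measure
}


def meets_claim(fn, claim=FN_CLAIM):
    """True if the mean false-negative probability is below the claim."""
    return fn < claim


status = {name: meets_claim(fn) for name, fn in mean_fn.items()}
```

With these numbers only the UE-IDD configuration falls below the example claim, which corresponds to the 2nd iteration described above.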

6. Conclusion

This paper investigated the safety assurance argumentation for an ensemble-based epistemic uncertainty estimation on gridworld scenarios with discrete action spaces and overlapping alternative predictions.

We build on previous work with discrete action spaces, a variance calculation based on action count variance (ACV), and a delta to ID (IDD) approach for dealing with overlapping alternative predictions, where we showed that ACV with IDD can indicate uncertain states with high probability based on a threshold calculation. Since utilizing a backup policy based on that indication can be a feasible solution, we established a safety assurance argumentation in this paper. With the definition of the assurance case and an iterative assurance approach, we demonstrated that the IDD-enhanced uncertainty estimator can be utilized as an operation-time measure, in the form of an external observer, to indicate ontological uncertainty.

Future work will address reducing the FP rate of the observer, investigate methods to determine a sufficiently near ID scenario for a given OOD scenario, and extend the approach to more general and realistic environments and applications. Further, it will focus on rigorous argumentation and elaboration of the experimental results for the safety assurance, and on strategies for reacting to the uncertainty estimation during operation to reduce situational risk.

Acknowledgments

This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.

References

[1] Burton, A causal model of safety assurance for machine learning, arXiv:2201.05451 (2022).

[2] Hendrycks, Gimpel, A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks, arXiv:1610.02136 [cs] (2018).

[3] Lütjens, Everett, J. P. How, Safe reinforcement learning with model uncertainty estimates, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8662–8668.

[4] Lee, Shin, A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks, arXiv:1807.03888 [cs, stat] (2018).

[5] Postels, Blum, Strümpler, Cadena, Siegwart, L. Van Gool, Tombari, The Hidden Uncertainty in a Neural Networks Activations, arXiv:2012.03082 (2020).

[6] Pimentel, Clifton, Tarassenko, A review of novelty detection, Signal Process. (2014). doi:10.1016/j.sigpro.2013.12.026.

[7] DeVries, G. W. Taylor, Learning confidence for out-of-distribution detection in neural networks, arXiv preprint arXiv:1802.04865 (2018).

[8] Mohseni, Pitale, Yadawa, Wang, Self-supervised learning for generalizable out-of-distribution detection, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020). doi:10.1609/AAAI.V34I04.5966.

[9] Schwaiger, Sinhamahapatra, Gansloser, Roscher, Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection?, in: Proc. AISafety@IJCAI2020, volume 2640 of CEUR Workshop Proceedings, 2020, p. 8.

[10] Q. Yu, K. Aizawa, Unsupervised Out-of-Distribution Detection by Maximum Classifier Discrepancy, arXiv:1908.04951 [cs] (2019).

[11] A. Sedlmeier, T. Gabor, T. Phan, L. Belzner, C. Linnhoff-Popien, Uncertainty-based out-of-distribution classification in deep reinforcement learning, arXiv preprint arXiv:2001.00496 (2019).

[12] W. R. Clements, B. Van Delft, B.-M. Robaglia, R. B. Slaoui, S. Toth, Estimating Risk and Uncertainty in Deep Reinforcement Learning, arXiv:1905.09638 [cs, stat] (2020).

[13] K. Chua, R. Calandra, R. McAllister, S. Levine, Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models, arXiv:1805.12114 (2018).

[14] C.-J. Hoel, K. Wolff, L. Laine, Ensemble quantile networks: Uncertainty-aware reinforcement learning with applications in autonomous driving, arXiv:2105.10266 (2021).

[15] W. Dabney, G. Ostrovski, D. Silver, R. Munos, Implicit quantile networks for distributional reinforcement learning, arXiv:1806.06923 (2018).

[16] I. Osband, J. Aslanides, A. Cassirer, Randomized Prior Functions for Deep Reinforcement Learning, arXiv:1806.03335 (2018).

[17] P. Wang, Y. Li, S. Shekhar, W. F. Northrop, Uncertainty Estimation with Distributional Reinforcement Learning for Applications in Intelligent Transportation Systems: A Case Study, in: 2019 IEEE Intelligent Transportation Systems Conference (ITSC), 2019, pp. 3822–3827. doi:10.1109/ITSC.2019.8917429.

[18] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-Aware Reinforcement Learning for Collision Avoidance, arXiv:1702.01182 (2017).

[19] F. L. Da Silva, P. Hernandez-Leal, B. Kartal, M. E. Taylor, Uncertainty-aware action advising for deep reinforcement learning agents, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 5792–5799.

[20] C.-J. Hoel, K. Wolff, L. Laine, Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation, arXiv:2004.10439 (2020).

[21] S. Burton, C. Hellert, F. Hüger, M. Mock, A. Rohatschek, Safety assurance of machine learning for perception functions, in: Deep Neural Networks and Data for Automated Driving, Springer, Cham, 2022, pp. 335–358.

[22] P. J. Graydon, Defining baconian probability for use in assurance argumentation, NASA/TM–2016–219341 (2016).

[23] R. Hawkins, T. Kelly, J. Knight, P. Graydon, A new approach to creating clear safety arguments, Advances in systems safety, pp. 3–23. Springer, London (2011).

[24] J. Goodenough, C. Weinstock, A. Klein, Toward a Theory of Assurance Case Confidence, Technical Report CMU/SEI-2012-TR-002, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 2012. URL: http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=28067.

[25] D. Eilers, F. S. Roza, K. Roscher, Ensemble-based uncertainty estimation with overlapping alternative predictions, Deep RL Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS) (2022).

[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing atari with deep reinforcement learning, arXiv:1312.5602 (2013).

[27] T. Haider, F. S. Roza, D. Eilers, K. Roscher, S. Günnemann, Domain shifts in reinforcement learning: Identifying disturbances in environments, AISafety@IJCAI (2021).

[28] W. E. Walker, P. Harremoës, J. Rotmans, J. P. Van Der Sluijs, M. B. Van Asselt, P. Janssen, M. P. Krayer von Krauss, Defining uncertainty: a conceptual basis for uncertainty management in model-based decision support, Integrated assessment 4 (2003) 5–17.

[29] R. Gansch, A. Adee, System theoretic view on uncertainties, in: 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), IEEE, 2020, pp. 1345–1350.

[30] S. Burton, I. Kurzidem, A. Schwaiger, P. Schleiss, M. Unterreiner, T. Graeber, P. Becker, Safety assurance of machine learning for chassis control functions, in: International Conference on Computer Safety, Reliability, and Security, Springer, 2021, pp. 149–162.

[31] Goal structuring notation community standard version 2, Technical Report, Assurance Case Working Group (ACWG), https://scsc.uk/r141B:1?t=1, accessed on 04/05/2019, 2018.

[32] J. Spriggs, GSN - The Goal Structuring Notation: A Structured Approach to Presenting, 2012.