<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Safety Assurance with Ensemble-based Uncertainty Estimation and overlapping alternative Predictions in Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dirk Eilers</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Burton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felippe Schmoeller Roza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karsten Roscher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Cognitive Systems IKS, Fraunhofer Gesellschaft</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>A number of challenges are associated with the use of machine learning technologies in safety-related applications. These include the difficulty of specifying adequately safe behaviour in complex environments (specification uncertainty), ensuring a predictably safe behaviour under all operating conditions (technical uncertainty) and arguing that the safety goals of the system have been met with sufficient confidence (assurance uncertainty). An assurance argument is therefore required that demonstrates that the effects of these uncertainties do not lead to an unacceptable level of risk during operation. A reinforcement learning model will predict an action in whatever state it is in, even in previously unseen states for which a valid (safe) outcome cannot be determined due to lack of training. Uncertainty estimation is a well understood approach in machine learning to identify states with a high probability of an invalid action due to a lack of training experience, thus addressing technical uncertainty. However, the impact of alternative possible predictions, which may be equally valid (and represent a safe state), on uncertainty estimation in reinforcement learning is less clear and, to our knowledge, not well documented in the current literature. In this paper we build on work where we investigated uncertainty estimation on simplified scenarios in a gridworld environment. Using model ensemble-based uncertainty estimation, we proposed an algorithm based on action count variance to deal with discrete action spaces whilst considering in-distribution action variance calculation to handle the overlap with alternative predictions. The method indicates potentially unsafe states when the agent is near out-of-distribution elements and can distinguish them from overlapping alternative, but equally valid, predictions.
Here, we present these results within the context of a safety assurance framework and highlight the activities and evidence required to build a convincing safety argument. We show that our previous approach is able to act as an external observer and can fulfil the requirements of an assurance argumentation for systems based on machine learning with ontological uncertainty.</p>
      </abstract>
      <kwd-group>
        <kwd>Safe Reinforcement Learning (Safe RL)</kwd>
        <kwd>Safety Assurance Argumentation</kwd>
        <kwd>Distributional Shift</kwd>
        <kwd>Ensemble-based Uncertainty Estimation</kwd>
        <kwd>Out-of-Distribution (OOD) detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The application of Machine Learning (ML) to safety-critical cyber-physical systems such as industrial robots and automated vehicles has the potential for greatly increasing the level of automation in complex environments. However, the use of ML is met with many practical challenges, in particular regarding resource, timing and performance constraints. The most dominant obstacle to the deployment of such systems is the difficulty in demonstrating the absence of unreasonable risk of unsafe actions due to erroneous outputs of the ML model. These errors are caused by a combination of insufficiencies due to epistemic uncertainty in the model and the occurrence of inputs and states that uncover these insufficiencies, themselves subject to aleatoric uncertainty. [<xref ref-type="bibr" rid="ref1">1</xref>] argued that a causal understanding of insufficiencies can be used to reduce uncertainties in the performance of ML in an iterative manner. Based on a specification of safety acceptance criteria, a measurement of the error rate of the ML function is used to evaluate the impact and potential causes of ML insufficiencies. This analysis is used to derive design-time and operation-time measures to reduce residual safety risk (Figure 1). Design-time measures reduce the occurrence of insufficiencies in the model, e.g. by restricting the scope of the operating environment, optimizing the ML technique and architecture or redefining training conditions. Operation-time measures reduce the impact of residual insufficiencies in the model, e.g. through plausibility analysis or heterogeneously redundant calculations of the target function.
      </p>
      <p>Figure 1: Safety Assurance Framework for machine learning systems (adapted from [<xref ref-type="bibr" rid="ref1">1</xref>]).</p>
      <p>Reinforcement learning (RL) is well suited to systems operating in complex environments with high demands on flexibility, such as route planning or motion control of mobile robots. As an RL agent will predict an action in whatever state it finds itself, the application can benefit from an awareness of the certainty and confidence in its own decisions. This includes situations that fall both within and outside of the distribution of previously seen training data. This paper focuses on uncertainty estimation as an operation-time measure to detect states which could lead to errors in the ML model. Specifically, we evaluate the detection of out-of-distribution (OOD) inputs to address the impact of distributional shift.</p>
      <p>Distributional shift in data science is widely understood as the distributional difference between training and test data (respectively data used during the inference or deployment phase) [<xref ref-type="bibr" rid="ref2">2</xref>][<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref4">4</xref>][<xref ref-type="bibr" rid="ref5">5</xref>]. Distributional shift can have different causes, such as natural perturbations to the data-set due to aleatoric uncertainty as well as evolving conditions in the environment. In machine learning, a shift in the probability distribution over state-action pairs often leads to degraded performance in the inference phase, leading the agent to propose wrong or sub-optimal actions. When the testing distribution differs from the training distribution, machine learning systems may not only demonstrate poor performance, but also have false confidence in the validity of their actions.</p>
      <p>To overcome this limitation, safe reinforcement learning (safe RL) solutions must be capable of detecting and handling the uncertainty in the decision-making process. For instance, uncertainty estimation can detect a lack of generalization due to insufficient training and unseen states during training (OOD, epistemic uncertainty) as well as uncertainty resulting from randomness in the environment (aleatoric uncertainty). For epistemic uncertainty, a set of alternatively trained agents (ensemble) can be used. In states with high uncertainty due to a lack of training, the different agents will likely predict different actions, due to a lack of substance of the prediction. This variance can be utilized to indicate uncertainty. However, there may be states with various, equally valid actions that would also result in a variance in the outputs of the ensemble. Therefore, it is necessary to differentiate between the two effects.</p>
    </sec>
    <sec>
      <title>2. Related work</title>
      <p>Recent work has addressed OOD detection in the classical image classification domain and, to a lesser extent, in the RL domain. [<xref ref-type="bibr" rid="ref6">6</xref>] define novelty detection as the classification of test data that differ in some respect from data available during training. A model with insufficient training data is less able to generalize based on unseen test data even if this data does not contain “novel” concepts. Similarly, [<xref ref-type="bibr" rid="ref2">2</xref>] characterize OOD as test data which are from a different distribution than the training data and describe OOD detection as a threshold-based process. The closer the data are to the training data distribution, the more likely it is that this is caused by a lack of training data only. OOD data further away from the training data distribution are more likely to represent conceptual or semantic differences, such as samples which are completely outside of the given classifications. Deep neural networks (DNNs) tend to be overconfident in predictions on unseen data and can give unpredictable results for far-from-distribution test data [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
      <p>Prior work focused on OOD as a concept of samples that fall outside the defined set of classes. If samples are from outside this set, correct classifications for these samples, by definition, cannot be learned, even with unlimited training. To address this issue, it is common to specify a separate OOD class to train the model on [<xref ref-type="bibr" rid="ref6">6</xref>]. Similarly, [<xref ref-type="bibr" rid="ref7">7</xref>] and [<xref ref-type="bibr" rid="ref8">8</xref>] define OOD samples as examples for classes different from those in the in-distribution (ID) dataset. [<xref ref-type="bibr" rid="ref4">4</xref>] describe ID as a distribution trained by a classifier and OOD as sufficiently different from it. Also, [<xref ref-type="bibr" rid="ref9">9</xref>] follow the approach of considering a strong difference between training and test data to be OOD. They describe ID data as conceptually similar to training data and OOD data as differing strongly from training data. [10] go as far as to define OOD by the distributional gap in between classified ID data sets. They propose to maximize the discrepancy between the decision boundaries of e.g. two classifiers to push OOD samples outside. They also follow the concept of near and far from the distribution.</p>
      <p>Furthermore, it is important to understand the difference between epistemic and aleatoric uncertainty when uncertainty estimation is used as a proxy for OOD detectors. Epistemic uncertainty arises out of a lack of sufficient data to exactly infer the underlying system [11]. It can indicate samples that reside far away from as well as close to the data distribution [<xref ref-type="bibr" rid="ref5">5</xref>]. In contrast, aleatoric uncertainty arises from stochastic environments and must be accounted for in risk-sensitive applications [12], [13]. Aleatoric uncertainty cannot be solved just by more training. The impact of aleatoric uncertainty is therefore a significant factor in arguing the safety of RL-based safety-critical applications. In [14] the authors propose ensemble quantile networks (EQN) based on the work of [15], where they combine implicit quantile networks (IQN) for aleatoric uncertainty detection and utilize random prior functions (RPF) [16] as an ensemble-based method for epistemic uncertainty estimation. As epistemic uncertainty originates from model insufficiencies, a model ensemble will output a distribution over different estimates in an uncertain state (distribution over outputs). Aleatoric uncertainty arises from the randomness in the environment and causes a distribution over returns from the environment to the model's input (distribution over returns/inputs). However, in this paper we focus on the detection of epistemic uncertainty, as we focus on distributional shift and OOD detection.</p>
      <p>[17] investigate a distributional RL algorithm D3PG, which models the uncertainty in the form of a return distribution in which the expected value is the Q-value. Different actions might be used when the distribution is bimodal or multimodal depending on the application scenario. [18] present an uncertainty-aware model-based learning algorithm that estimates the probability of collision together with a statistical estimate of uncertainty. The predictive model is based on bootstrapped neural networks using dropout. In regions of high uncertainty, their risk-averse cost function causes the robot to revert to a cautious low-speed strategy. In [19] the authors propose an action-advising framework where the agent asks for advice when its epistemic uncertainty is high for a certain state to accelerate reinforcement learning. They add as a last layer multiple heads estimating separately expected values for each action, as done in Bootstrapped deep Q-learning (DQN). As the learning algorithm updates the network, their predictions get closer to the real function, and one close to the others. [11] use uncertainty-based OOD detection, using Q-value uncertainty in the DQN algorithm. They compare MC-Dropout, Bootstrapped and Bootstrapped with prior functions. They also address the problem of overlapping alternative (equally valid) predictions of the model agent. Unfortunately, they do not go into detail on the uncertainty estimation in those cases but rather calculate an overall estimate for the epoch.</p>
      <p>[20] estimate uncertainty for RL based on ensembles with randomized prior functions (RPF). They build on [16] and propose a criterion function. They choose safe actions in unknown situations far from the training distribution. In [14] they also utilize an ensemble of DQN agents to estimate Q-value uncertainty to switch back to a fallback policy in uncertain situations given a certain threshold. An ensemble is trained on bootstrapped data, which provides a distribution over the estimated Q-values to provide a Bayesian estimation of the epistemic uncertainty. The epistemic uncertainty estimate is then used to choose less risky actions in unknown situations. However, they do not take into account a potential overlapping uncertainty due to possible alternative actions.</p>
      <p>One of the key questions to be answered from a safety assurance perspective is how OOD detection as an operation-time measure really helps to prevent hazardous conditions. This requires OOD detection to be able to detect novel data that relates to an increase of risk, and to do so with sufficient accuracy and timeliness so that the system can be brought into a safe state before the risk becomes unacceptable.</p>
      <p>The interaction between the development of ML-specific methods for optimizing performance and safety assurance was not observed in much previous work. In [21] the authors describe a collaborative and iterative process where ML method developers are supported by safety engineers to ensure the method contributes to the overall system safety assurance argument. This includes the systematic argumentation of the effectiveness of design and operation-time measures, an evaluation of the performance of the ML function against quantitative safety acceptance criteria and an analysis of the causes of insufficiencies in the model in order to derive more effective design and operation-time methods. Nevertheless, uncertainties in the assurance of the safety of ML functions will remain.</p>
      <p>The closed-box nature of ML algorithms and the consequent reliance on observational evidence, coupled with the inherent epistemic uncertainty of the models (compared to traditional software) whilst operating within an environment with high aleatoric uncertainty, lead to a lack of confidence in our statements about the safety of the resulting system (assurance uncertainty). [<xref ref-type="bibr" rid="ref1">1</xref>] refers to this challenge as the need to infer certain safety claims based on incomplete observations and defines a set of conditions to formalise this statement. This requires the use of rigorous argumentation to justify why an acceptable level of safety can be asserted despite the inherent limitations in the available evidence. In [22], based on [23], Baconian probability is proposed as a concept for estimating confidence in assurance arguments [24] based on how many possible assurance deficits (known as “defeaters”) of an argument can be eliminated. Confidence claim patterns were introduced which aim to identify all possible defeaters and demonstrate that they are either unlikely or not of significance. In section 4 we return to the challenge of arguing the safety contribution of an uncertainty estimation-based OOD detector by highlighting how some of the possible defeaters to the safety argument can be identified and addressed as part of an iterative process.</p>
    </sec>
    <sec>
      <title>3. Ensemble uncertainty estimation based on action count variance and delta to ID</title>
      <p>In [25] we proposed uncertainty estimation with action count variance (ACV) and delta to ID (IDD). In the next section we will give a slightly reduced repetition.</p>
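      <p>As a point of reference, the ensemble approach attributed above to [14] and [20], in which the variance of the Q-value estimates across ensemble members triggers a fallback policy, might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of those works: the function name <monospace>thresholded_policy</monospace>, the nested-list layout of the Q-values, and the choice to evaluate the variance only for the greedy action are all hypothetical.</p>
      <preformat>
```python
from statistics import mean, pvariance


def thresholded_policy(q_values, variance_threshold, backup_action):
    """Pick the greedy action unless the ensemble disagrees too much.

    q_values[k][a] is the Q-estimate of ensemble member k for action a
    (hypothetical layout). The variance is taken over the members for the
    greedy action only, which is an assumption of this sketch.
    """
    n_actions = len(q_values[0])
    # Ensemble mean of the Q-value for every action.
    mean_q = [mean(member[a] for member in q_values) for a in range(n_actions)]
    # Greedy action with respect to the ensemble mean.
    best = max(range(n_actions), key=mean_q.__getitem__)
    # Spread of the members' estimates for that action.
    spread = pvariance([member[best] for member in q_values])
    if variance_threshold > spread:
        return best
    return backup_action


# Members agree on action 0, so it is kept.
agree = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
# Members disagree strongly on action 0, so the backup action (-1) is used.
disagree = [[3.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
print(thresholded_policy(agree, 0.5, -1), thresholded_policy(disagree, 0.5, -1))
```
      </preformat>
      <p>The threshold value 0.5 and the backup action -1 are placeholders chosen for the example only; in practice the threshold would have to be derived from data and justified in the assurance argument.</p>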
      <sec id="sec-1-1">
        <title>3.1. Background</title>
        <sec id="sec-1-1-1">
          <title>3.1.1. Reinforcement learning and MDP</title>
          <p>In RL, the goal is to find the best policy for an agent that makes sequential decisions while interacting with an environment modeled as a Markov decision process (MDP). An MDP is defined as a tuple <inline-formula><tex-math>\mathcal{M} := (\mathcal{S}, \mathcal{A}, r, P, \rho_0)</tex-math></inline-formula>, composed of the set of states <inline-formula><tex-math>\mathcal{S}</tex-math></inline-formula>, the set of actions <inline-formula><tex-math>\mathcal{A}</tex-math></inline-formula>, the reward function <inline-formula><tex-math>r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}</tex-math></inline-formula>, the transition probability function <inline-formula><tex-math>P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]</tex-math></inline-formula>, and the starting state distribution <inline-formula><tex-math>\rho_0</tex-math></inline-formula>. The transition probability function <inline-formula><tex-math>P(s_{t+1} \mid s_t, a_t)</tex-math></inline-formula> models the system dynamics by giving the probability of transitioning from a previous state <inline-formula><tex-math>s_t</tex-math></inline-formula> to the state <inline-formula><tex-math>s_{t+1}</tex-math></inline-formula> when taking the action <inline-formula><tex-math>a_t</tex-math></inline-formula>.</p>
          <p>The return is the sum of the discounted rewards, with <inline-formula><tex-math>\gamma</tex-math></inline-formula> being the discount factor and <inline-formula><tex-math>t</tex-math></inline-formula> the time step, given by
            <disp-formula><label>(1)</label><tex-math>R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}.</tex-math></disp-formula>
          </p>
          <p>In the MDP framework, at each timestep, the agent observes the current state, takes an action, transitions to the next state drawn from the transition distribution, and receives a reward. The action-value function, also known as the Q-value function, gives the expected return when following a policy <inline-formula><tex-math>\pi</tex-math></inline-formula> (which maps states to actions), as shown below.
            <disp-formula><label>(2)</label><tex-math>Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi].</tex-math></disp-formula>
          </p>
          <p>Q-learning, in which a policy is learned using Q-values, is a popular model-free method. Deep Q-networks (DQNs) extend Q-learning with the usage of neural networks as function approximators. To do so, the temporal-difference error <inline-formula><tex-math>\delta_t</tex-math></inline-formula> can be derived from the Q-value function using the Bellman operator, resulting in the equation below,
            <disp-formula><label>(3)</label><tex-math>\delta_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}) - Q(s_t, a_t; \theta),</tex-math></disp-formula>
            where <inline-formula><tex-math>\theta^{-}</tex-math></inline-formula> and <inline-formula><tex-math>\theta</tex-math></inline-formula> are the DQN parameters of the target and the prediction network, respectively, as defined in [26].
          </p>
        </sec>
        <sec id="sec-1-1-3">
          <title>3.1.2. Distributional shift and OOD</title>
          <p>Distributional shift and OOD are two concepts that are closely related, but it is important to distinguish distributions that are closer to or further away from the training distribution. It is expected that an RL agent would be able to perform well in scenarios that are slightly different from those used in training, as it should be able to generalize. However, when the situation is too dissimilar (perhaps at a semantic level) the agent might have its ability to make proper decisions severely affected. Epistemic uncertainty can be used as a proxy for detecting distributional shifts and is usually associated with a lack of sufficient data to better infer the underlying system.</p>
          <p>Defining distributional shift within the RL domain is not trivial [27]. In this paper we assume that distributional shift can be characterized by changes in the system dynamics. More specifically, we consider the shift of the distribution over the state transitions given state-action pairs between training and test in MDPs, as shown below:
            <disp-formula><label>(4)</label><tex-math>P_{\mathrm{train}}[s_{t+1} \mid s_t, a_t] \neq P_{\mathrm{test}}[s_{t+1} \mid s_t, a_t].</tex-math></disp-formula>
          </p>
          <p>Additionally, when considering partially observable MDPs (POMDPs), where the system’s state cannot be assessed but rather an observation <inline-formula><tex-math>o</tex-math></inline-formula> is available to the agent, the shift of the distribution over observations given states has to be taken into account:
            <disp-formula><label>(5)</label><tex-math>P_{\mathrm{train}}[o \mid s] \neq P_{\mathrm{test}}[o \mid s].</tex-math></disp-formula>
          </p>
        </sec>
        <sec id="sec-1-1-2">
          <title>3.1.3. Ensemble-based uncertainty estimation</title>
          <p>We focus on ensemble-based epistemic uncertainty estimation to detect distributional shift and OOD data at test time, that is, during the inference or deployment phase. An ensemble of agents, each trained on a subset of the available data, will produce estimates with low variance in well trained states. When the ensemble members face states they were rarely trained on, the estimates vary naturally across the members and give a distribution over the estimated Q-values. The variance of the estimated Q-values can be used to quantify the epistemic uncertainty of a decision.</p>
          <p>An ensemble of DQNs trained on bootstrapped data provides a distribution over the estimated Q-values and thereby a Bayesian estimation of the epistemic uncertainty. The Q-values will converge to the real values in situations the agent has sufficiently learned. In untrained situations, the Q-value estimates will still diverge and the variance will therefore give an estimate of the epistemic uncertainty. Random prior functions can be used to introduce diversity in an ensemble of agents trained on bootstrapped data [16]. The expected return is then given by
            <disp-formula><label>(6)</label><tex-math>Q_k(s, a) = f(s, a; \theta_k) + \beta\, p(s, a; \hat{\theta}_k),</tex-math></disp-formula>
            where <inline-formula><tex-math>Q_k</tex-math></inline-formula> is the Q-function of the kth ensemble member, <inline-formula><tex-math>\hat{\theta}_k</tex-math></inline-formula> are the parameters of the prior function and <inline-formula><tex-math>\beta</tex-math></inline-formula> is a factor to weight the impact of the prior function.
          </p>
          <p>The variance of the Q-values of the ensemble estimates can be used to derive an uncertainty estimation threshold to invoke a backup policy [20] [14] that ensures a safe state. With the variance <inline-formula><tex-math>\mathrm{Var}[Q_k(s, a)] &lt; \sigma^2</tex-math></inline-formula>, the policy with threshold can be calculated by
            <disp-formula><label>(7)</label><tex-math><![CDATA[\pi(s) = \begin{cases} \arg\max_{a} \mathbb{E}[Q_k(s, a)] & \text{if } \mathrm{Var}[Q_k(s, a)] < \sigma^2, \\ \pi_{\mathrm{backup}}(s) & \text{otherwise.} \end{cases}]]></tex-math></disp-formula>
          </p>
        </sec>
      </sec>
      <sec>
        <title>3.2. Action count variance uncertainty estimation</title>
        <p>The Q-value is a continuous variable where high variance in the predictions means high uncertainty of the ensemble. However, when given encapsulated agents or when the Q-values are not accessible for other reasons, it is possible to take the deviation over the proposed actions of the ensemble members to indicate uncertainty. In cases where the action space is continuous, the variance can be directly calculated as action variance, as with the Q-values. However, with discrete action spaces, this will lead to false results, as the actions themselves are orthogonal and a mean action cannot be calculated. Therefore, in cases where the action space is discrete, we proposed in [25] to calculate an action count for each action over the ensemble given a certain state and then calculate the variance of that action count (ACV, action count variance). When the ACV is low, the proposed actions are balanced over the ensemble and the uncertainty is therefore high. In contrast, when the ACV is high, there is a concentration on one or more actions in the ensemble and the uncertainty is low. The higher the ACV gets, the lower the uncertainty. A backup policy can then be chosen based on the ACV calculation as given in equation 8.
          <disp-formula><label>(8)</label><tex-math><![CDATA[\pi(s) = \begin{cases} \arg\max_{a} \mathbb{E}[Q_k(s, a)] & \text{if } \mathrm{ACV}(s) > \mathrm{ACV}_{\mathrm{threshold}}, \\ \pi_{\mathrm{backup}}(s) & \text{otherwise.} \end{cases}]]></tex-math></disp-formula>
        </p>
      </sec>
      <sec id="sec-2-1">
        <title>3.3. Delta to ID uncertainty estimation</title>
        <p>One problem with uncertainty estimation within reinforcement learning is that often multiple decisions are equally valid in a given state. These can be called alternative possible actions or, more generally, alternative predictions. When an agent is in a state with alternative possible actions, the ensemble may already deviate in its prediction, although it might be trained sufficiently in this state. This means that alternative possible actions will produce high uncertainty and might falsely flag an OOD instance. Traditional methods fail to distinguish these two cases and, therefore, in [25] an alternative solution was proposed. This method, called Delta to ID (IDD), consists in comparing the given (and potentially OOD) situation to its nearest ID counterpart to differentiate high uncertainty resulting from these "ambiguous" states from distributional shifts. To get a comparison, it was proposed to subtract the ID uncertainty (represented by the ACV) from the given OOD uncertainty and use the result as a cleaned (delta) version of the ACV for uncertainty indication. Because the ACV is inversely related to the uncertainty (a (1-x) characteristic), we actually subtract <inline-formula><tex-math>(\mathrm{ACV}_{\max} - \mathrm{ACV}_{\mathrm{OOD}}) - (\mathrm{ACV}_{\max} - \mathrm{ACV}_{\mathrm{ID}})</tex-math></inline-formula>, which inverts the ACV characteristic to match that of the uncertainty and results in:
          <disp-formula><label>(9)</label><tex-math>\mathrm{IDD}(s) = \mathrm{ACV}_{\mathrm{ID}}(s) - \mathrm{ACV}_{\mathrm{OOD}}(s).</tex-math></disp-formula>
        </p>
        <p>There can be different approaches to get a nearest ID scenario from a given OOD scenario. To simplify here, we stick to an OOD scenario with one dedicated OOD obstacle. The OOD obstacle in the given OOD scenario will then be exchanged with a corresponding ID obstacle. The observation function <inline-formula><tex-math>o(s)</tex-math></inline-formula> changes as given in equation 10.
          <disp-formula><label>(10)</label><tex-math>o_{\text{OOD exchanged with ID}}(s) = o_{\mathrm{ID}}(s).</tex-math></disp-formula>
        </p>
        <p>A high delta of the ACV will indicate high uncertainty and a low delta low uncertainty in both cases, respectively. To decide on an uncertain situation in a given state, we proposed to use a threshold to mask out insignificant variance-delta to ID. This threshold can be used in future work to switch to a backup policy as an operation-time measure for safety assurance methods, e.g. in an iterative causal model as proposed in [<xref ref-type="bibr" rid="ref1">1</xref>], to which we return in section 4.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Safety assurance</title>
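      <p>The action count variance (ACV) and delta to ID (IDD) quantities of Section 3 are what the external observer discussed in this section would compute at operation time. The following is a minimal sketch of such a monitor under stated assumptions, not the implementation from [25]: the function names, the <monospace>BACKUP_ACTION</monospace> placeholder, the use of the population variance, and the majority vote used to pick the nominal action are hypothetical choices made for illustration. It assumes that the ensemble's proposed actions are available both for the given scenario and for its nearest ID counterpart.</p>
      <preformat>
```python
from collections import Counter
from statistics import pvariance

BACKUP_ACTION = "BACKUP"  # hypothetical placeholder for a safe fallback


def action_count_variance(actions, n_actions):
    """Variance of the per-action vote counts of the ensemble.

    High ACV: votes concentrate on few actions (low uncertainty).
    Low ACV: votes are spread evenly (high uncertainty).
    """
    counts = Counter(actions)
    return pvariance([counts.get(a, 0) for a in range(n_actions)])


def delta_to_id(acv_id, acv_given):
    """Equation 9: ACV of the nearest ID scenario minus ACV of the given one."""
    return acv_id - acv_given


def monitor(actions_given, actions_nearest_id, n_actions, idd_threshold):
    """Return (action, used_backup) for one state.

    actions_given: actions proposed by the ensemble in the observed scenario.
    actions_nearest_id: actions proposed after the suspected OOD element was
    exchanged for its ID counterpart (assumed to be provided by the caller).
    """
    idd = delta_to_id(
        action_count_variance(actions_nearest_id, n_actions),
        action_count_variance(actions_given, n_actions),
    )
    if idd > idd_threshold:
        return BACKUP_ACTION, True
    # Majority vote as nominal action (an assumption of this sketch).
    return Counter(actions_given).most_common(1)[0][0], False


# Five members, four discrete actions.
# OOD-like case: votes scatter, but the nearest ID scenario is unanimous.
print(monitor([0, 1, 2, 3, 0], [0, 0, 0, 0, 0], 4, 1.0))
# Ambiguous but well-trained case: the same split appears in the ID scenario.
print(monitor([0, 0, 1, 1, 0], [0, 0, 1, 1, 0], 4, 1.0))
```
      </preformat>
      <p>In the first call the delta is large and the monitor falls back to the backup action; in the second call the ensemble is split between two actions in both scenarios, so the delta is zero and the split is treated as alternative valid predictions rather than as an OOD indication. The threshold of 1.0 is an arbitrary example value.</p>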
      <sec id="sec-2-2">
        <title>4.1. Background</title>
        <p>The use of ML for highly automated safety-critical applications leads to a number of safety assurance challenges. These challenges are related to the complexity and unpredictability of the operating environment (aleatoric uncertainty), as well as the complexity of the technical system and task itself. A complex system can be defined as a system that exhibits behaviours that are emergent properties of the interactions between the parts of the system, where the behaviours would not be predicted based on knowledge of the parts and their interactions alone. This definition is closely related to the general concept of uncertainty, defined as any deviation from the unachievable ideal of completely deterministic knowledge of the relevant system [28].</p>
        <p>For safety-critical autonomous systems, uncertainty manifests itself in various forms not restricted to the narrow definitions used in ML. Specification uncertainty is the uncertainty in the appropriateness and completeness of safety acceptance criteria and the definition of acceptably safe behavior in all situations that can reasonably be anticipated to occur within the target environment. Incomplete or otherwise insufficient training data can be seen as a consequence of specification uncertainty. Technical uncertainty stems from a lack of predictability in the performance of the technical components of a system. An example is the unpredictable reaction of the system to previously unseen events, or differences in the system behavior despite similar input conditions (epistemic uncertainty in the trained model). Assurance uncertainty relates to a lack of confidence in claims regarding safety properties of the ML system. This can include an insufficient integrity of evidence supporting the assurance arguments as well as of the chain of reasoning itself. Safety assurance for ML-based systems must therefore minimise these uncertainties and thus maximise the confidence that the system fulfils its safety expectations. The approaches described in [<xref ref-type="bibr" rid="ref1">1, 21</xref>] and summarised in Figure 1 are designed to iteratively minimise these uncertainties, and thereby safety risk, as part of a continuous assurance process based on an understanding of the environment, insufficiencies in the ML system and potential deficits in the safety assurance argumentation. To support this approach, an assurance argument is proposed that enables a systematic evaluation of whether well defined safety claims are supported by evidence and whether all assumptions are explicitly stated and validated.</p>
        <p>Complexity and unpredictability of the operational domain and of the system itself lead to semantic gaps, which indicate discrepancies between the intended and specified functionality, also known as specification insufficiencies. In safety-critical systems this can lead to hazardous systemic failures. In our view, specification uncertainty is also a problem in RL, for example when inappropriate reward functions are used. This might manifest itself in a manner that appears to be epistemic uncertainty; the root cause is, however, subtly different to, for example, a lack of training data.</p>
        <p>To better understand the characteristics and impact of uncertainty, one can differentiate between statistical, scenario and ontological uncertainty. Statistical uncertainty can be expressed in quantitative statistical terms, such as confidence intervals expressed over probability distributions. Scenario uncertainty can only be described using qualitative scenarios, which are potentially multiple plausible states of the system and its environment. Ontological uncertainty [29] defines a lack of awareness that the knowledge about the system itself is incomplete; this requires an external perspective to resolve. Ontological uncertainty is a specific cause of specification insufficiencies, which in turn will lead to epistemic uncertainty in the trained model. In this paper, we describe an operation-time measure to mitigate the effects of this uncertainty by introducing an observer external to the ML component to detect the conditions where previously unseen inputs might impact the safety requirements.</p>
      </sec>
      <sec>
        <title>4.2. Safety assurance argumentation using ensemble-based uncertainty estimation and IDD</title>
        <p>In this section we discuss the impact of the ensemble-based uncertainty estimation from the following perspectives. First, we discuss the role of the uncertainty estimation as an operation-time measure for mitigating the impact of residual errors in the ML component and how this supports a safety assurance argument for the function. Second, we examine issues of uncertainty in the assurance argument itself and how confidence in the argument can be increased.</p>
        <p>Figure 2 shows a simplified and incomplete excerpt (inspired by [30]) of a safety assurance argument described using the Goal Structuring Notation (GSN) [31][32] for the claim that the residual risk of the system colliding with obstacles is sufficiently low. GSN is a graphical notation that represents the elements of an assurance argument and the relationships between them. It shows how goals (claims) can be broken into sub-goals until they can be supported by direct references to evidence. It documents argumentation strategies as well as the context information, including assumptions and justifications. The assurance strategy illustrated here is based on an identification of potential causes of insufficiencies in the function and measures for reducing their impact during development and operation.</p>
        <p>Uncertainty estimation is one of a number of complementary measures used to form a broad argument for safety. However, as mentioned above, the complexity of the system can undermine the confidence in the argument. [23] describes confidence in assurance arguments in terms of trust in assertions related to the evidence, context (including assumptions) and inference (or structure of the argument itself). For each of these aspects a number of defeaters could potentially be identified that undermine the argument [22]. For the example argumentation in Figure 2 these can include an incomplete definition of the operating environment or incorrect assumptions regarding the performance of the perception components (asserted context), as well as the validity of test results demonstrating the generalisation performance of the trained function due to the difficulty in covering previously unknown corner cases (asserted evidence). Furthermore, the assertion that all possible causes of insufficiencies have been addressed could also be incorrect (asserted inference).</p>
        <p>
          As proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], it is advantageous to iterate
through the assurance process when dealing with
ontological uncertainties, and we show in the following
that this is beneficial even in our simplified case
study. Adding the uncertainty estimator was the
initial step to mitigate the residual uncertainties
in the assurance argument; the extension with the delta to ID (IDD) was
a further step to increase confidence in the effectiveness
of the uncertainty estimation itself.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experimental results</title>
      <p>In [25] we presented the results of more extensive
experiments. Here, we summarise those results and conduct
additional experiments to support the safety assurance argument.</p>
      <p>Setup and Training: We trained complete agents in
parallel, each in a gridworld of 10x10 positions containing a set of
10 individually and randomly placed obstacles, with training
runs of 1 million steps per agent. For testing we set up
different scenarios with previously seen obstacles as ID
and added a single dedicated obstacle not seen during
training as an OOD condition. For this paper we focus
on an ID scenario with a line of known obstacles in the
middle of the grid and the goal at the end of the line.</p>
      <p>To visualize the uncertainty estimation, we
calculated heatmaps over the grid that show the resulting
uncertainty estimate for each position of the agent in
the grid, given the overall scenario.</p>
      <p>Uncertainty heatmaps: For the depicted results, the
uncertainty is calculated from the action count
variance of the ensemble members. Figure 3 shows the
base scenario with the known ID obstacle line in blue
and the goal in green. As we use the variance of the action
counts, a brighter colour means that the ensemble is more
concentrated on fewer actions (and therefore more certain), while a
darker colour means that the actions are less concentrated,
or more equally distributed (and therefore that uncertainty is
higher). As one can see in the base scenario, the
possible alternative action predictions lead to some
“uncertainties” along the diagonals to the goal: from these
coordinates, vertical and horizontal moves are equally likely
to approach the goal, since the action space only
allows for up/down and left/right movement and cannot
realize a diagonal path directly. This also shows the limits of
the uncertainty metric: the actions along the
diagonal are no more or less dangerous than others, yet they are
indicated with a high “uncertainty”. The same consideration applies,
for example, to the point to the left of the obstacle line, where
the probabilities for up and down are equally distributed.</p>
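      <p>The action count variance described above can be illustrated with a minimal Python sketch. It assumes that
each ensemble member casts one vote for a discrete action and that the measure is the population variance of the
per-action vote counts. The names ACTIONS, action_count_variance and acv_heatmap, as well as the callable
interface of the ensemble members, are illustrative assumptions and are not taken from the original
implementation.</p>
      <preformat>
```python
from collections import Counter

ACTIONS = ("up", "down", "left", "right")  # assumed discrete action space


def action_count_variance(votes, actions=ACTIONS):
    """Population variance of the per-action vote counts of an ensemble."""
    counts = Counter(votes)
    per_action = [counts[a] for a in actions]
    mean = sum(per_action) / len(per_action)
    return sum((c - mean) ** 2 for c in per_action) / len(per_action)


def acv_heatmap(ensemble, size=10):
    """ACV for every agent position in a size x size grid.

    `ensemble` is a list of callables mapping a (row, col) position to an
    action name (hypothetical interface, for illustration only).
    """
    return [
        [action_count_variance([member((r, c)) for member in ensemble])
         for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    print(action_count_variance(["up"] * 8))                  # full agreement
    print(action_count_variance(["up"] * 4 + ["right"] * 4))  # two alternatives
    print(action_count_variance(ACTIONS * 2))                 # evenly spread
```
      </preformat>
      <p>With eight ensemble members, full agreement yields a variance of 12.0, an even split between two alternative
actions yields 4.0, and votes spread evenly over all four actions yield 0.0. This matches the colour coding of the
heatmaps, where a higher variance (brighter colour) indicates more certainty.</p>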
      <p>Figure 4a shows the predictions with one unknown
obstacle (shown in purple) inserted in the middle, directly on top of the line.
There is increased uncertainty,
especially in the area surrounding the unknown obstacle.
Nevertheless, the uncertainty indication is superimposed
on the "uncertainty" already present from the possible
alternative predictions in the base ID scenario. In contrast,
Figure 4b shows the predictions with a known obstacle
(shown in blue) inserted in the middle, directly on top of the line,
instead of the OOD obstacle. Now, the uncertainty
indication is much closer to the base ID scenario.</p>
      <p>Our approach subtracts the base variance
from the OOD variance in order to eliminate
the base variance resulting from the possible alternative
predictions. Figure 5 depicts the results for the
delta to ID with the known obstacle, without and with a
threshold (5a and 5b). With a dedicated threshold, the approach
appears to indicate the given OOD hotspot,
although the indication is not perfectly sharp.</p>
      <p>Figure 4: (a) OOD obstacle in the middle; (b) ID obstacle in the middle.</p>
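      <p>The subtraction and thresholding step can be sketched as follows. This is a minimal illustration under the
assumption that both scenarios are given as equally sized grids of action count variance values; the function
names delta_to_id and idd_indication and the flag absolute are hypothetical. The flag switches between using the
absolute value of the delta and flagging only drops below zero.</p>
      <preformat>
```python
def delta_to_id(ood_acv, id_acv):
    """Element-wise difference: OOD-scenario ACV minus ID-scenario ACV."""
    return [
        [o - b for o, b in zip(ood_row, id_row)]
        for ood_row, id_row in zip(ood_acv, id_acv)
    ]


def idd_indication(ood_acv, id_acv, threshold, absolute=True):
    """Boolean grid: True where the delta to ID exceeds the threshold.

    absolute=True  -> flag cells whose abs(delta) is above the threshold
    absolute=False -> flag only drops in ACV (delta below -threshold)
    """
    delta = delta_to_id(ood_acv, id_acv)
    if absolute:
        return [[abs(d) > threshold for d in row] for row in delta]
    return [[-d > threshold for d in row] for row in delta]


if __name__ == "__main__":
    id_acv = [[12.0, 4.0], [12.0, 12.0]]   # 4.0: known alternative actions
    ood_acv = [[12.0, 4.0], [3.0, 12.0]]   # 3.0: ensemble disagrees near OOD
    print(idd_indication(ood_acv, id_acv, threshold=5.0))
```
      </preformat>
      <p>In this toy grid the cell with a variance of 4.0 in both scenarios is not flagged, because the low variance
is already present in the ID scenario, whereas the cell whose variance drops from 12.0 to 3.0 is flagged.</p>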
      <p>In the given scenario, the OOD hotspot lies directly
in an area of low uncertainty (the yellow area on top
of the blue line). To validate that the approach
generalizes to different scenarios, we also ran setups
in which the hotspot lies in an area of previously known
uncertainty from possible alternative predictions, such
as the upper middle section (see Figure 6).</p>
      <p>Figure 6: (a) OOD obstacle at the top; (b) ID obstacle at the top.</p>
      <p>Figure 7 shows the resulting indication for the delta to
ID with known obstacle in 7a and 7b.</p>
      <p>False-Positive and False-Negative rates: To support the
safety assurance argument, we set up an additional
experiment in which we measured the probability that an agent without
an observer hits the unknown obstacle and compared
it to the probability for an agent with only the baseline
uncertainty estimation (UE-BL) and for an agent with
the IDD uncertainty estimator (UE-IDD), each acting as an external observer. The
agent without an observer receives no uncertainty
estimation (UE) indications, which is equivalent to
false-negatives (FN). For the two agents with external observers, we
assume that the agent follows an alternative route and does not
hit the unknown obstacle whenever the uncertainty estimator
indicates uncertainty above a given threshold.</p>
      <p>We iterate over all possible positions of the unknown
obstacle and all possible positions of the agent without
introducing randomness in the setup, in order to focus on
demonstrating the effects. We calculate the mean
probabilities of false-positive (FP) indications (which
slow the agent down) and of false-negatives (FN) (which
result in hazards). As hyper-parameters we vary
the ACV thresholds for IDD and UE-BL; the size of a
bounding box around the OOD position within which
each indication counts as a true-positive (TP); for FN, the
percentage of ensemble members that must agree on hitting the obstacle in the
next state; the use of the absolute value of the delta to ID versus a
cutoff below zero; and, for IDD, the substitution of the OOD obstacle with a known
obstacle versus an empty space. The approaches are
compared in Table 1, where the hyper-parameters are tuned
for equal FP probabilities to obtain directly
comparable FN probabilities, and as a ROC (Receiver Operating
Characteristic) curve in Figure 8.</p>
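      <p>A possible way to count FP and FN indications for a single OOD position is sketched below. The sketch
assumes that indications and potential hits are available as sets of agent positions and that the bounding box is
a square given by a half-width around the OOD position. The helper names in_box, fp_fn_rates and mean_rates and
the normalisation by the number of agent positions are assumptions made for illustration, not the evaluation code
used for the reported results.</p>
      <preformat>
```python
def in_box(pos, center, half_width):
    """True if pos lies in the square box around center (Chebyshev distance)."""
    return half_width >= max(abs(pos[0] - center[0]), abs(pos[1] - center[1]))


def fp_fn_rates(indicated, would_hit, ood_pos, half_width, n_positions):
    """Return (FP rate, FN rate) over all agent positions for one OOD position.

    indicated: set of agent positions where the observer raises an indication
    would_hit: set of agent positions from which the ensemble would step onto
               the OOD obstacle in the next state
    """
    # Indications outside the bounding box count as false-positives.
    false_pos = {p for p in indicated if not in_box(p, ood_pos, half_width)}
    # Potential hits without any indication count as false-negatives.
    false_neg = would_hit - indicated
    return len(false_pos) / n_positions, len(false_neg) / n_positions


def mean_rates(per_ood_rates):
    """Average (FP, FN) pairs over all evaluated OOD positions."""
    n = len(per_ood_rates)
    return (sum(fp for fp, _ in per_ood_rates) / n,
            sum(fn for _, fn in per_ood_rates) / n)


if __name__ == "__main__":
    rates = fp_fn_rates(
        indicated={(5, 4), (4, 5), (0, 0)},
        would_hit={(5, 4), (6, 5)},
        ood_pos=(5, 5), half_width=1, n_positions=100,
    )
    print(rates)
```
      </preformat>
      <p>In the usage example, the indication at (0, 0) lies outside the bounding box and is counted as FP, while the
potential hit from (6, 5) is not indicated and is counted as FN.</p>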
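      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th />
              <th>w/o UE</th>
              <th>UE-BL</th>
              <th>UE-IDD</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>mean P(FN) (false-negative)</td>
              <td>41.61e-3</td>
              <td>3.86e-3</td>
              <td>2.45e-3</td>
            </tr>
            <tr>
              <td>mean P(FP) (false-positive)</td>
              <td />
              <td>11.49e-2</td>
              <td>11.22e-2</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>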
      <p>IDD significantly reduces FNs compared to the agent
with the baseline uncertainty estimator (about factor 1.75)
and the agent without UE (about factor 20). This comes
with the cost of an increasing FP rate for the UE agents,
whereas the agent without UE naturally has no FPs.</p>
      <p>When mapping the experimental results to the safety
assurance argumentation from Section 4.2 and the
iterative causal analysis model, it becomes clear that the safety
claim may not be met without an uncertainty estimator
and that it is improved within the 1st iteration by
introducing UE-BL. The results show a significant
improvement for the FN, but assuming an even stricter
safety claim of, e.g., FN less than 3‰, a 2nd iteration
has to identify additional measures to further reduce the FN. The
2nd iteration with IDD as an additional measure then
reached the required claim. Considering the
high remaining FP and a potential additional claim with
respect to it, a 3rd iteration could address this aspect.
However, this will be the target of future work.</p>
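      <p>The iteration logic can be made concrete with a small sketch that checks the mean FN probabilities reported
in Table 1 against an example claim. The dictionary MEAN_FN and the helper first_sufficient_measure are
hypothetical names, and the claim of 3‰ is the example value used above.</p>
      <preformat>
```python
# Mean FN probabilities as reported in Table 1, in iteration order.
MEAN_FN = {"w/o UE": 41.61e-3, "UE-BL": 3.86e-3, "UE-IDD": 2.45e-3}


def first_sufficient_measure(fn_by_measure, claim):
    """Return the first measure (in iteration order) whose FN meets the claim."""
    for measure, fn in fn_by_measure.items():
        if claim > fn:
            return measure
    return None  # claim not met: a further iteration is needed


if __name__ == "__main__":
    print(first_sufficient_measure(MEAN_FN, claim=3e-3))
```
      </preformat>
      <p>For the example claim of 3‰ the sketch returns UE-IDD, since the FN probability of UE-BL (3.86e-3) is still
above the claim.</p>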
      <p>Finally, whether all possible scenarios have been
considered and whether the safety assurance achieved is
sufficient as rigorous evidence for certification purposes
needs further investigation on more realistic applications
in future work.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>This paper investigated the safety assurance
argumentation for an ensemble-based epistemic uncertainty
estimation on gridworld scenarios with discrete action spaces
and overlapping alternative predictions.</p>
      <p>We build on previous work with discrete action spaces
and variance calculation based on action count variance
(ACV) and a delta to ID (IDD) approach to deal with
overlapping alternative predictions, where we showed
that action count variance with IDD is able to indicate
uncertain states with high probability based on a threshold
calculation. As a backup policy based on
that indication can be a feasible solution, we established
a safety assurance argumentation in this paper. With the
definition of the assurance case and an iterative
assurance approach, we demonstrated that the IDD-enhanced
uncertainty estimator can be utilized as an
operation-time measure, acting as an external observer, to indicate ontological
uncertainty.</p>
      <p>Future work will address reducing the FP rate of the
observer, investigate methods to determine a sufficiently
near ID scenario for a given OOD scenario and extend
the approach to more general and realistic environments
and applications. Further, it will focus on rigorous
argumentation and elaboration of the experimental results
for the safety assurance and on strategies to react to
the uncertainty estimation during operation to reduce
situational risk.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was funded by the Bavarian Ministry for
Economic Affairs, Regional Development and Energy as part
of a project to support the thematic development of the
Institute for Cognitive Systems.</p>
      <p>[10] Q. Yu, K. Aizawa, Unsupervised Out-of-Distribution Detection by Maximum Classifier Discrepancy, arXiv:1908.04951 [cs] (2019).</p>
      <p>[11] A. Sedlmeier, T. Gabor, T. Phan, L. Belzner, C. Linnhoff-Popien, Uncertainty-based out-of-distribution classification in deep reinforcement learning, arXiv preprint arXiv:2001.00496 (2019).</p>
      <p>[12] W. R. Clements, B. Van Delft, B.-M. Robaglia, R. B. Slaoui, S. Toth, Estimating Risk and Uncertainty in Deep Reinforcement Learning, arXiv:1905.09638 [cs, stat] (2020).</p>
      <p>[13] K. Chua, R. Calandra, R. McAllister, S. Levine, Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models, arXiv:1805.12114 (2018).</p>
      <p>[14] C.-J. Hoel, K. Wolf, L. Laine, Ensemble quantile networks: Uncertainty-aware reinforcement learning with applications in autonomous driving, arXiv:2105.10266 (2021).</p>
      <p>[15] W. Dabney, G. Ostrovski, D. Silver, R. Munos, Implicit quantile networks for distributional reinforcement learning, arXiv:1806.06923 (2018).</p>
      <p>[16] I. Osband, J. Aslanides, A. Cassirer, Randomized Prior Functions for Deep Reinforcement Learning, arXiv:1806.03335 (2018).</p>
      <p>[17] P. Wang, Y. Li, S. Shekhar, W. F. Northrop, Uncertainty Estimation with Distributional Reinforcement Learning for Applications in Intelligent Transportation Systems: A Case Study, in: 2019 IEEE Intelligent Transportation Systems Conference (ITSC), 2019, pp. 3822–3827. doi:10.1109/ITSC.2019.8917429.</p>
      <p>[18] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-Aware Reinforcement Learning for Collision Avoidance, arXiv:1702.01182 (2017).</p>
      <p>[19] F. L. Da Silva, P. Hernandez-Leal, B. Kartal, M. E. Taylor, Uncertainty-aware action advising for deep reinforcement learning agents, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 5792–5799.</p>
      <p>[20] C.-J. Hoel, K. Wolf, L. Laine, Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation, arXiv:2004.10439 (2020).</p>
      <p>[21] S. Burton, C. Hellert, F. Hüger, M. Mock, A. Rohatschek, Safety assurance of machine learning for perception functions, in: Deep Neural Networks and Data for Automated Driving, Springer, Cham, 2022, pp. 335–358.</p>
      <p>[22] P. J. Graydon, Defining baconian probability for use in assurance argumentation, NASA/TM–2016–219341 (2016).</p>
      <p>[23] R. Hawkins, T. Kelly, J. Knight, P. Graydon, A new approach to creating clear safety arguments, Advances in systems safety, pp. 3-23. Springer, London (2011).</p>
      <p>[24] J. Goodenough, C. Weinstock, A. Klein, Toward a Theory of Assurance Case Confidence, Technical Report CMU/SEI-2012-TR-002, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 2012. URL: http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=28067.</p>
      <p>[25] D. Eilers, F. S. Roza, K. Roscher, Ensemble-based uncertainty estimation with overlapping alternative predictions, Deep RL Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS) (2022).</p>
      <p>[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing atari with deep reinforcement learning, arXiv:1312.5602 (2013).</p>
      <p>[27] T. Haider, F. S. Roza, D. Eilers, K. Roscher, S. Günnemann, Domain shifts in reinforcement learning: Identifying disturbances in environments, AISafety@IJCAI (2021).</p>
      <p>[28] W. E. Walker, P. Harremoës, J. Rotmans, J. P. Van Der Sluijs, M. B. Van Asselt, P. Janssen, M. P. Krayer von Krauss, Defining uncertainty: a conceptual basis for uncertainty management in model-based decision support, Integrated assessment 4 (2003) 5–17.</p>
      <p>[29] R. Gansch, A. Adee, System theoretic view on uncertainties, in: 2020 Design, Automation &amp; Test in Europe Conference &amp; Exhibition (DATE), IEEE, 2020, pp. 1345–1350.</p>
      <p>[30] S. Burton, I. Kurzidem, A. Schwaiger, P. Schleiss, M. Unterreiner, T. Graeber, P. Becker, Safety assurance of machine learning for chassis control functions, in: International Conference on Computer Safety, Reliability, and Security, Springer, 2021, pp. 149–162.</p>
      <p>[31] Goal structuring notation community standard version 2, Technical Report, Assurance Case Working Group (ACWG), https://scsc.uk/r141B:1?t=1, accessed on 04/05/2019, 2018.</p>
      <p>[32] J. Spriggs, GSN - The Goal Structuring Notation: A Structured Approach to Presenting, 2012.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Burton</surname>
          </string-name>
          ,
          <article-title>A causal model of safety assurance for machine learning</article-title>
          ,
          <source>arXiv:2201.05451</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <article-title>A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks</article-title>
          ,
          <source>arXiv:1610.02136 [cs]</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lütjens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Everett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>How</surname>
          </string-name>
          ,
          <article-title>Safe reinforcement learning with model uncertainty estimates</article-title>
          ,
          <source>in: 2019 International Conference on Robotics and Automation (ICRA)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>8662</fpage>
          -
          <lpage>8668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks</article-title>
          ,
          <source>arXiv:1807.03888 [cs, stat]</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Postels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Blum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Strümpler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cadena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siegwart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tombari</surname>
          </string-name>
          ,
          <article-title>The Hidden Uncertainty in a Neural Networks Activations</article-title>
          ,
          <source>arXiv:2012.03082</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pimentel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Clifton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Clifton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tarassenko</surname>
          </string-name>
          ,
          <article-title>A review of novelty detection</article-title>
          ,
          <source>Signal Process.</source>
          (
          <year>2014</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1016/j.sigpro.2013.12.026</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>DeVries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Learning confidence for out-of-distribution detection in neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1802.04865</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohseni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pitale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yadawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Self-supervised learning for generalizable out-of-distribution detection</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1609/AAAI.V34I04.5966</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sinhamahapatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gansloser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roscher</surname>
          </string-name>
          ,
          <article-title>Is Uncertainty Quantification in Deep Learning Sufficient for Out-of-Distribution Detection?</article-title>
          ,
          <source>in: Proc. AISafety@IJCAI2020</source>
          , volume
          <volume>2640</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          , p.
          <fpage>8</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>