CEUR-WS Vol-3381, paper 32 — https://ceur-ws.org/Vol-3381/32.pdf — DBLP: https://dblp.org/rec/conf/aaai/RozaHTR23
Towards Safety Assurance of Uncertainty-Aware
Reinforcement Learning Agents
Felippe Schmoeller Roza1, Simon Hadwiger2,3, Ingo Thon2 and Karsten Roscher1
1 Fraunhofer IKS, Munich, Germany
2 Siemens AG, Nuremberg, Germany
3 University of Wuppertal, Wuppertal, Germany


Abstract
The necessity of demonstrating that Machine Learning (ML) systems can be safe escalates with the ever-increasing expectation of deploying such systems to solve real-world tasks. While recent advancements in Deep Learning reignited the conviction that ML can perform at the human level of reasoning, the dimensionality and complexity added by Deep Neural Networks pose a challenge to using classical safety verification methods. While some progress has been made towards making verification and validation possible in the supervised learning landscape, works focusing on sequential decision-making tasks are still sparse. A particularly popular approach consists of building uncertainty-aware models, able to identify situations where their predictions might be unreliable. In this paper, we provide evidence obtained in simulation to support that uncertainty estimation can also help to identify scenarios where Reinforcement Learning (RL) agents can cause accidents when facing obstacles semantically different from the ones experienced while learning, focusing on industrial-grade applications. We also discuss the aspects we consider necessary for building a safety assurance case for uncertainty-aware RL models.

Keywords
Uncertainty estimation, Distributional shifts, Reinforcement Learning, Functional Safety



SafeAI 2023: The AAAI's Workshop on Artificial Intelligence Safety, Feb 13-14, 2023 | Washington, D.C., US
Email: felippe.schmoeller.da.roza@iks.fraunhofer.de (F. S. Roza)
Copyright © 2023 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

This position paper is presented to serve as motivation for the long-term objective of using the uncertainty estimation capabilities of a Reinforcement Learning (RL) agent to improve its functional safety and enable RL as a viable framework to be deployed in industrial-grade applications. Although not a new concept, recent accomplishments have reignited the interest in using RL as a viable method to obtain agents able to interact with a wide range of environments (see [1, 2, 3]). These results were only possible due to the integration of Deep Neural Networks (DNNs) as function approximators for RL agents.

According to some authors (e.g., [4, 5, 6]), the industry is eager to apply Machine Learning (ML) and DNNs more broadly in their processes, with the possibility of increasing the safety level by aiding humans in processes that are potentially harmful, or even automating complex tasks beyond human capabilities. According to [7], possible applications include aircraft control, power systems, medical systems, and the automotive domain. However, despite the expected gains, industrial players are historically very conservative and, most of the time, only adopt new technologies when there is enough evidence supporting their reliability and cost-effectiveness, which is still not possible for some ML paradigms.

DNNs excel at learning complex representations from bulk data, allowing them to reach state-of-the-art performance in tasks such as computer vision, natural language processing, and control of autonomous systems. However, DNNs are too complex and have too many parameters to be verified using standard verification and validation methods. On top of that, DNN models are often overconfident and incapable of recognizing that their predictions might be wrong [8]. The combination of these factors has put DNNs at the center of safe AI research in the past few years. The main goal is to guarantee that DNNs can be safe, reliable, secure, robust, explainable, and fair [7].

Another difficulty with DNNs, which also extends to Deep RL, is formalizing how capable they are of generalizing over novel instances. Despite the excellent results obtained with known benchmarks, different findings show that DNNs are susceptible to distributional shifts (e.g., [9, 10]). That means that the model output is not reliable when fed with data drawn from a distribution that differs from its training data distribution, i.e., out-of-distribution (OOD) instances. When considering autonomous systems controlled by RL agents, there is the risk of accidents when facing OOD scenarios. This issue could be solved by making sure the model is trained with data that covers every aspect it might encounter after deployment, which is intractable for open-world complex tasks. Alternatively, some methods have been suggested to make DNNs robust to distributional shifts, such as in [11]. However, making DNNs able to handle
distributional shifts is a challenging task and the existing methods are limited. We follow a different direction, which consists in using a monitor to identify the OOD instances. Once OOD is detected, the system can switch to a safe control policy to avoid accidents caused by the agent's inabilities (which could be as simple as "stop and wait for help"). We follow the hypothesis that uncertainty should grow higher when facing the unknown (as in [12]) and use uncertainty estimation as a proxy metric to classify OOD inputs.

1.1. Scope and structure of the paper

This paper aims at showing how uncertainty-based OOD detection can help in the long-term goal of building a solid safety case for RL agents, which must be backed by convincing safety arguments. That is not the only factor necessary to make certification of RL models possible, but it is one of the most important aspects. The paper focuses on industrial applications of automated guided vehicles (AGVs). Industrial environments are mostly governed by specific regulations that are helpful when outlining the system requirements and specifications in terms of safety. We believe this can also be used as a starting point when expanding the framework to a more general case, covering a larger range of open-world applications.

To validate the potential of this approach to help with deriving strong safety arguments, experiments with an environment that simulates the application of transporting goods with a vision-based AGV in warehouses were conducted. The obtained results indicate that uncertainty estimation and OOD detection can help to identify unknown situations which, in some cases, lead to accidents.

The document is structured as follows: section 2 reviews publications available in the literature to serve as background and motivation for this paper. In section 3 the uncertainty-aware RL algorithm is presented. Section 4 contains the experiments and preliminary results, and section 5 presents a short discussion and the future steps we believe are necessary for building the safety assurance case for uncertainty-aware RL systems.

2. Related Work

Publications investigating safety assurance cases for RL systems are limited. Therefore, we will start with relevant works that cover the application of general AI methods in safety-critical applications. That will be followed by works that deal with uncertainty estimation and OOD detection for ML systems, mainly focusing on computer vision problems, and finally, publications that combine uncertainty and RL will be discussed. Our work is an intersection of those three topics, with the proposed method being inspired by existing uncertainty quantification approaches and the future outline borrowing ideas from authors who intend to conform AI systems to safety certification processes that are, to the best of our knowledge, very limited when it comes to RL.

AI for safety-critical applications: Different authors argue that to enable ML models to solve safety-critical tasks, the models must be assured by evidence that the ML components will behave in accordance with existing safety specifications. [13] argue that the evidence must cover all aspects necessary to show why these components can be trusted. The authors also present a survey of different methods that help in collecting the evidence for the whole ML lifecycle. In [7], an extensive study of neural networks applied to high-assurance systems is presented. In [14], the authors identify problems that arise when using ML following ISO 26262, a standard that regulates the functional safety of road vehicles. They claim that the use of ML can result in hazards not experienced with conventional software. [15] also discuss the shortcomings of fitting ML systems to ISO 26262 and how the Safety of the Intended Functionality (SOTIF), published in ISO PAS 21448, offers a better alternative for safety assurance. The authors also present an extensive list of safety concerns related to DNN models, including the risk of the data distribution not being a good approximation of the real world and the possibility of distributional shifts happening over time. [16] also argue that the analysis of ML systems is fundamentally incompatible with traditional safety verification, since safety engineering approaches focus on faults at the component level and their interactions with other system components, while systemic failures experienced in complex systems are not necessarily a consequence of faults from individual parts of the system. Therefore, the safety arguments should also reflect the inherent complexity and unpredictability of the ever-changing environments where ML systems are designed to operate.

Machine Learning and Uncertainty: The impact of uncertainty in Machine Learning is a recurrent topic of research, with plenty of publications discussing how ML systems should manage uncertainty and presenting methods to quantify it. In [17], the authors present a more general discussion on the properties of Bayesian Deep Learning models used for computer vision tasks that are affected by aleatoric and epistemic uncertainties (the former is inherent to the system's stochastic properties while the latter is related to a lack of knowledge). In [18], an introduction to the topic of uncertainty in ML models is provided, as well as an overview of the main methods for capturing and handling uncertainty. In [19], the authors show how autonomous systems are affected
by uncertainty and how correctly assessing uncertainty can help towards improving the supervision of inherently unsafe AI systems. Furthermore, a conceptual framework for dynamic dependability management based on uncertainty quantification is presented. In [20], uncertainty quantification as a proxy for the detection of OOD samples is discussed, with different methods compared on image classification datasets, namely CIFAR-10, GTSRB, and NWPU-RESISC45. Some popular uncertainty quantification methods for DNN models worth mentioning are Monte Carlo Dropout [21], Deep Ensembles [22], and Evidential Deep Learning [23].

Reinforcement Learning and Uncertainty: Most of the work combining uncertainty quantification and ML covers Supervised Learning, with a strong focus on computer vision tasks. However, some literature also shows how uncertainty-aware RL agents can be obtained. A popular application is to use uncertainty to improve exploration. This class of algorithms is motivated by the principle of Optimism in the Face of Uncertainty (OFU) and describes the tradeoff between using high-confidence decisions, which come from the already established knowledge, and the agent's need to explore state-action pairs with high epistemic uncertainty [24].

However, this paper will rather focus on uncertainty as a proxy for detecting domain shifts in decision-making agents. In [25] it is proposed to define the data distributions in terms of the elements that compose a Markov Decision Process (MDP), where minor disturbances should fall under the generalization umbrella and large deviations represent OOD samples. However, determining which semantic properties represent such changes and how to measure them is left as an open question. In [26], the authors present an uncertainty-aware model-based learning algorithm that adds statistical uncertainty estimates, combining bootstrapped neural networks and Monte Carlo Dropout, to its collision predictor. Mobile robot environments are used to show that the agent acts more cautiously when facing unfamiliar scenarios and increases the robot's velocity when it has high confidence. In [27] this method is extended to environments with moving obstacles. The authors also combine Monte Carlo dropout and deep ensembles with LSTM models to obtain uncertainty estimates. A Model Predictive Controller (MPC) is responsible for finding the optimal action that minimizes the mean and variance of the collision predictions.

3. Background

In this section, we present the background for each component of the proposed uncertainty-aware RL algorithm. Different uncertainty quantification methods could be used, but Variational Auto Encoders (VAEs) are an interesting choice for vision-based systems. They are considered robust models, are trained in an unsupervised manner (i.e., labeling samples is not necessary), are fast to train, and their generalization capabilities can be visually inspected by comparing the input and reconstructed images. However, the safety argumentation would benefit from a comparison between different alternatives, with the strengths and deficiencies of each approach addressed, which will remain as a future work suggestion.

3.1. Reinforcement Learning

In RL, we consider an agent that sequentially interacts with an environment modeled as an MDP. An MDP is a tuple $\mathcal{M} := (S, A, R, P, \mu_0)$, where $S$ is the set of states, $A$ is the set of actions, $R : S \times A \times S \mapsto \mathbb{R}$ is the reward function, $P : S \times A \times S \mapsto [0, 1]$ is the transition probability function which describes the system dynamics, where $P(s_{t+1} | s_t, a_t)$ is the probability of transitioning to state $s_{t+1}$ given that the previous state was $s_t$ and the agent took action $a_t$, and $\mu_0 : S \mapsto [0, 1]$ is the starting state distribution. At each time step, the agent observes the current state $s_t \in S$, takes an action $a_t \in A$, transitions to the next state $s_{t+1}$ drawn from the distribution $P(s_t, a_t)$, and receives a reward $R(s_t, a_t, s_{t+1})$.

3.2. Variational Auto Encoders

VAEs are a popular class of deep probabilistic generative models [28]. Autoencoders follow a simple encoder-decoder structure, where the model parameters are optimized to minimize the difference between the input sample and the decoded data, as shown in Figure 1. The trained model is able to compress the inputs into a latent representation with a smaller dimension. VAEs extend regular autoencoders by substituting the exact inference of the likelihood with the lower bound of the log-likelihood, given by the evidence lower bound (ELBO):

$$\log p_\theta(x) \geq \mathbb{E}_{q_\varphi(z|x)}[\log p_\theta(x|z)] - D_{KL}[q_\varphi(z|x) \,\|\, p(z)] \triangleq \mathcal{L}(x; \theta, \varphi), \qquad (1)$$

where $x$ is the observed variable, $z$ is the latent variable with prior $p(z)$ and a conditional distribution $p_\theta(x|z)$, and $q_\varphi(z|x)$ is an approximation to the true posterior distribution $p_\theta(z|x)$. $q_\varphi(z|x)$ and $p_\theta(x|z)$ are neural networks parametrized by $\varphi$ and $\theta$ (the encoder and decoder, respectively). $D_{KL}$ is the Kullback–Leibler divergence.
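To make the two terms of equation (1) concrete, the following sketch (our illustration, not code from the paper) evaluates a single-input ELBO assuming a diagonal-Gaussian encoder, a standard-normal prior, and a unit-variance Gaussian decoder likelihood. The KL term is available in closed form, while the reconstruction term is estimated with the reparameterization trick; the linear `decode` and its weight matrix `W` are hypothetical placeholders for a trained decoder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gaussian(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # i.e., the D_KL term of equation (1) for a Gaussian encoder.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo_estimate(x, mu, log_var, decode, n_samples=64):
    # E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z)) for a single input x.
    # The expectation is a Monte Carlo average over reparameterized
    # samples z = mu + sigma * eps, with eps ~ N(0, I).
    d = x.shape[0]
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps
        x_hat = decode(z)
        # log-density of x under N(x_hat, I) (unit-variance decoder)
        recon += -0.5 * np.sum((x - x_hat) ** 2) - 0.5 * d * np.log(2 * np.pi)
    return recon / n_samples - kl_diag_gaussian(mu, log_var)

# Hypothetical linear "decoder" standing in for a trained network.
W = rng.standard_normal((4, 2))
decode = lambda z: W @ z

x = np.zeros(4)
mu, log_var = np.zeros(2), np.zeros(2)  # q(z|x) = N(0, I), so KL = 0
elbo = elbo_estimate(x, mu, log_var, decode)
```

In a real VAE, `mu` and `log_var` are produced by the encoder network and the ELBO is maximized over $\theta$ and $\varphi$ during training; the point here is only how the reconstruction and KL terms of equation (1) combine.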
Figure 1: Example of an autoencoder network (input layer $x_1, \ldots, x_8$, lower-dimensional latent representation, output layer $\hat{x}_1, \ldots, \hat{x}_8$).
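As a minimal, runnable stand-in for the autoencoder of Figure 1 (our sketch, not the model trained in this paper), a linear autoencoder with a fixed latent size can be obtained in closed form from an SVD of the training data. It already shows the behavior exploited for OOD detection: inputs close to the training distribution reconstruct well, while inputs off the training subspace do not.

```python
import numpy as np

rng = np.random.default_rng(1)

# "In-distribution" training data living near a 2-D subspace of R^8.
basis = rng.standard_normal((8, 2))
X_id = rng.standard_normal((500, 2)) @ basis.T \
       + 0.01 * rng.standard_normal((500, 8))

# The optimal rank-k linear autoencoder is given by the top-k right
# singular vectors of the centered training data.
mean = X_id.mean(axis=0)
_, _, Vt = np.linalg.svd(X_id - mean, full_matrices=False)
V = Vt[:2].T                          # encoder/decoder weights, 8 -> 2

def reconstruct(x):
    z = (x - mean) @ V                # encode: compress to the latent space
    return mean + z @ V.T             # decode: expand back to input space

def recon_error(x):
    return float(np.sum((x - reconstruct(x)) ** 2))

x_id = X_id[0]                        # sample from the training distribution
x_ood = 3.0 * rng.standard_normal(8)  # input far from the training subspace
```

With this setup, `recon_error(x_id)` is much smaller than `recon_error(x_ood)`; a VAE adds a probabilistic latent space on top of this same compress-decompress structure.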
                                                                     Figure 2: Examples of ID and OOD obstacles (top images and
                                                                     bottom images respectively). In the ID scenario, the obstacles
                                                                     are blue and dark red, while the OOD obstacles are green.
3.3. Uncertainty estimation based on
     Variational Auto Encoders
OOD detection using VAEs assumes that the model as-                  represented by a wooden pallet, while avoiding obstacles
signs higher likelihoods to the samples drawn from the               or hitting the walls.
in-distribution (ID) pool than the OOD samples, which                   An RGB camera is attached to the AGV and its control
is valid for different benchmarks as shown in [12]. Met-             decisions are made based on the state 𝑠𝑑 encoded by the
rics derived from the model likelihood are then used as              input images and the coordinates of the AGV and the
uncertainty estimates. We follow the Evidence Lower                  goal. The image resolution can be configured, but for
Bound (ELBO) Ratio method proposed in the same pa-                   the results shown below, RGB images with 84 x 84 pixels
per, which represents the ratio of lower bounds of the               were used. The observation encoding also includes the
log-likelihood of a given sample and the maximum ELBO                positions of the AGV and the goal. The AGV action is
obtained with the ID samples [12]. For notation simplifi-            a 2-dimensional vector, 𝑒𝑑 , representing the linear and
cation, considering a fixed VAE model parametrized by πœ‘              angular velocities. A reward of 100 is given if the agent
and πœƒ, the ELBO value β„’(π‘₯; πœƒ, πœ‘) will be represented as              reaches the goal position, -100 if it hits an obstacle, and
𝐸𝐿𝐡𝑂(π‘₯), with 𝐸𝐿𝐡𝑂𝐼 (π‘₯) representing the ELBO for                    -10 if it times out (i.e., it reaches the maximum number
a VAE model only trained with ID samples. Following                  of steps).
this notation, the ELBO Ratio uncertainty 𝒰(π‘₯0 ) for an                 To attest to the capacity of the uncertainty estimator
arbitrary input π‘₯0 is shown in equation 2.                           to spot critical failures that might be related to OOD
                                                                     instances, an ID and an OOD environment were designed.
                               𝐸𝐿𝐡𝑂(π‘₯0 )                             The differences consist of the type of static obstacles
                   𝒰(π‘₯0 ) =                 ,                  (2)
                              𝐸𝐿𝐡𝑂𝐼 (π‘₯π‘šπ‘Žπ‘₯ )                          present in each environment, with obstacles that differ
where 𝐸𝐿𝐡𝑂𝐼 (π‘₯π‘šπ‘Žπ‘₯ ) is the maximum 𝐸𝐿𝐡𝑂 value                        in color and shape, as shown in figure 2.
calculated for all ID samples (a sort of calibration based              AGV controller framework: The controller used to
on the training data).                                               solve the motion planning described above is shown in
                                                                     figure 3. The first module is a path planner, responsible
                                                                     to determine the optimal path to reach the goal position
4. Experiments and Preliminary                                       based on the agent’s location. The planner takes the AGV
   Results                                                           kinematic model and solves the planning with the 𝐺1
                                                                     Hermite Interpolation Problem with clothoids. Interpo-
Environment: To better support the proposed idea, ex-                lating a sequence of waypoints using clothoid splines
periments were conducted, and the preliminary results                will result in a smooth trajectory, suitable for the motion
will be presented as further evidence. For the experi-               planning of mobile robots, as shown in [30, 31]. The
ments, a custom environment was created using PyBullet               planner takes a simplified observation Λœπ‘ π‘‘ , consisting of
[29]. It was designed to represent a warehouse with a                the AGV and goal coordinates, as input. Its output is a
configurable layout limited by walls, goods to be trans-             position in the polar coordinate system 𝑝𝑑 = (πœŒπ‘‘ , πœƒπ‘‘ ),
ported by an automated guided vehicle (AGV), and a set               where πœŒπ‘‘ and πœƒπ‘‘ are the radial and angular coordinates
of obstacles that might be in the way. The goal is to reach          at time 𝑑, respectively. Note that the planner does not
a certain location that contains a good to be transported,           account for obstacles, since it is assumed that obstacles
                                                                     are not known a priori and the RL agent should be re-
                         External
                       Environment

                             ˜
                             𝑠𝑑

                           Path
                         Planning

                             𝑝𝑑
                                                                   (a) ID input images.      (b) ID reconstructed images.
                        Low-level
                                                              Figure 4: VAE model compression-decompression capabilities
                        Controller
                                                              with ID images after 10 epochs of training.
                             𝑒𝑑

                 𝑒*𝑑                 𝑠𝑑 , π‘Ÿ 𝑑
                         RL Agent


Figure 3: RL-based controller framework.



sponsible to react and adjust if an unexpected obstacle
is in the way. The second module is a non-linear con-             (a) OOD input images.     (b) OOD reconstructed im-
                                                                                                ages.
troller used to calculate the control action 𝑒𝑑 necessary
to reach the coordinate 𝑝𝑑 . The last module is the RL        Figure 5: VAE model compression-decompression capabilities
agent. Its goal is to follow the proposed trajectory, i.e.,   with OOD images after 10 epochs of training.
keeping 𝑒𝑑 β‰ˆ 𝑒*𝑑 as much as possible, proposing a differ-
ent control action 𝑒*𝑑 ΜΈ= 𝑒𝑑 only to avoid a collision. To
fulfil this task, an intrinsic reward π‘Ÿπ‘–π‘‘ was added, with     in reality the number of unknown obstacles can be ex-
π‘Ÿπ‘–π‘‘ = 0.0 if 𝑒*𝑑 = 𝑒𝑑 (a small difference is tolerated) and   tremely high, these experiments should be extended to
π‘Ÿπ‘–π‘‘ = βˆ’0.1 otherwise. The optimal policy becomes a            a set of obstacles that is statistically significant to the
tradeoff between avoiding the risk of collision (with the     problem dimension.
expressive -100 reward as punishment) and following the          Figure 4 shows how the VAE learns to reconstruct the
path planner to avoid the small punishments. The RL           images observed in the environment populated with ID
agent was trained in the ID environment using the Soft        obstacles, with the input and reconstructed images. After
Actor-Critic algorithm [32].                                  10 epochs of training, the obstacles are recovered with a
   Uncertainty estimator: The VAE uncertainty estima-         good definition. However, the model is not able to recon-
tion model was trained to fit instances randomly sampled      struct the floor textures completely, which is of minor
from the ID environment in a Supervised Learning man-         relevance in this scenario but should be investigated if
ner. To that end, 20.000 images were collected from the       such features would represent safety-critical aspects (e.g.,
ID environment and 2.000 from the OOD, which are used         oil in the floor, large cracks or holes).
for validation purposes during the model training. The           Figure 5 on the other hand, represents the same model
model was trained for 10 epochs.                              trained in the ID environment trying to reconstruct im-
   After training the RL agent and the VAE uncertainty es-    ages with OOD obstacles in it. It is visible that, even after
timator, rollouts are performed in the OOD environment        10 epochs of training, the model is not able to recover the
with this agent, and (state, action, reward) tuples are       obstacle color or shape correctly, with blurred obstacles
saved for post-analysis. The episode termination states       rendered in the output. That inability to correctly com-
are then passed through the uncertainty estimator to          press and decompress the images with OOD obstacles is
verify if crashes present a significant correlation to high   responsible for increasing the calculated uncertainty.
uncertainty levels. The hypothesis is that if a crash hap-       Figure 6 shows the obtained results for the RL agent
pens due to the agent not being able to avoid an obstacle     running in the OOD environment. The agent ran for
semantically different from the ones experienced during       10.000 steps, which was equivalent to around 70 episodes.
training, the OOD detector could flag this instance before    The y-axis represents the ELBO Ratio, which was normal-
the crash occurs. ID inputs on the other hand should sig-     ized to get the values in the interval [0,1]. Episodes that
nal low uncertainty, indicating that the RL agent is able     ended with a crash are represented by the red bars while
to handle such situations. It is worth mentioning that        the blue bars picture the remaining episodes. The results
these experiments only consider a very limited number         show that some crash episodes presented high uncer-
of distinguishing features for the OOD obstacles. Since
Figure 6: Uncertainty estimates on terminating states of                Figure 7: Uncertainty estimates on terminating
episodes for the OOD environment.                                       states of episodes for the ID environment.



tainty, while very few non-crash episodes presented significant uncertainty levels. On the other hand, some failures did not trigger a high uncertainty level. These states could represent residual insufficiencies of the trained RL agent (e.g., caused by a lack of training), cases where the OOD detector is not accurate for these inputs, or collisions that were not caused by an OOD element (e.g., the AGV crashed into a wall). To attest to the calibration of the uncertainty quantification, the same experiment was repeated in the ID environment, with the results shown in Figure 7. The ELBO Ratio values are much lower for the entirety of the episodes and more consistent. That is expected, since in this case all the states should be considered ID, showing that the VAE is not outputting false positives for these data samples.


5. Discussion and Future Perspective

This paper focuses on motivating the promising perspective of using uncertainty quantification to improve the safety case of RL systems deployed in industrial applications, concentrating on camera-based systems. To that end, an environment modeling a typical warehouse was created. The preliminary results obtained with a VAE-based uncertainty estimator suggest this monitor can distinguish some of the states that result in accidents related to environmental distributional shifts. However, it is important to note that not all accidents are caused by OOD obstacles; they can also be influenced by the reward function definition, the observation encoding, the model's generalization capabilities, and other aspects. Identifying and separating accidents caused by the inability of the agent to handle novel obstacles from accidents caused by other, unrelated limitations is necessary before assessing the effectiveness of the OOD detection monitor.
    Many published works already discuss the importance of uncertainty estimation and OOD detection in the whole Safe AI spectrum, but we believe a more structured way to integrate these systems, together with empirical results, is needed to create a compelling safety assurance case, especially for RL systems. To reach this long-term goal, we suggest the following future steps:

    β€’ Operational Design Domain (ODD) [33]: In real-world applications, the number of possible contextual combinations makes any attempt at exhaustive testing intractable. Therefore, a precise system specification is paramount before starting to build the assurance case. The ODD should include all contextual information that covers the intended operation of the system.
    β€’ Extensive experimentation: Once an appropriate ODD is derived, the experiments described in this document can be extended to a much broader scope. Varying parameters, changing the scenario configuration, considering more obstacles, and adding sensor noise are just a few aspects that should be extensively considered. Strong safety arguments will depend on the experiments achieving a high statistical confidence level for the contexts described in the ODD. This should also include multiple uncertainty estimation methods not covered in this paper.
    β€’ Qualitative analysis: Understanding the system at a higher level of abstraction is also important to build a strong safety case. For that, it is important to visualize the scenarios that lead to high or low uncertainty and to look for patterns that lead to wrong predictions, outliers, false positives and negatives, etc.
    β€’ Residual error: The uncertainty monitor is not intended to cover every safety aspect, but rather covers failures caused by the inability of the system to handle domain shifts. Therefore, risks
associated with other aspects will still be present and should be addressed by other methods.
    β€’ Integration of the uncertainty monitor and the RL agent: This paper focuses on how OOD scenarios might lead to system failures and how OOD detection can help detect such states before the failure happens. However, an important question is not addressed here and should be a high-priority next step: what to do when an OOD input is detected? In other words, how to integrate OOD detection and a safe fallback policy into the decision-making system.
    β€’ Failure rate calibration: The uncertainty values are not sufficient to estimate a failure probability, because an OOD instance does not necessarily imply that a failure will happen. However, upper-bound probabilities could be derived from the uncertainty estimates, i.e., if the model predicts that there is a 30% probability of 𝑠𝑑 being OOD, the risk of failures caused by distributional shifts should be below 30%.
    β€’ SOTIF: As shown in Section 2, traditional functional safety standards fail to properly address ML systems. In contrast, SOTIF is a much more appropriate framework for building a safety argumentation in such cases. However, to the best of our knowledge, an assurance case based on an uncertainty-aware RL agent has not yet been built. In SOTIF it is necessary to attest to the absence of unreasonable risk due to hazards resulting from functional insufficiencies of the intended functionality, which is challenging due to the nature of model-free RL and of sequential decision-making systems in general.

    Not all of these items were addressed in this paper, but the list serves as a roadmap to guide our research efforts in the near future, as we believe that covering these points in greater detail will result in incremental progress towards a sound argumentation enabling uncertainty-aware RL agents to be deployed in safety-critical applications.


Acknowledgments

This work was funded by the Bavarian Ministry for Economic Affairs, Regional Development and Energy as part of a project to support the thematic development of the Institute for Cognitive Systems.


References

 [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602 (2013).
 [2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of go without human knowledge, Nature 550 (2017) 354–359.
 [3] C. Berner, G. Brockman, B. Chan, V. Cheung, P. DΔ™biak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al., Dota 2 with large scale deep reinforcement learning, arXiv preprint arXiv:1912.06680 (2019).
 [4] C. Esposito, X. Su, S. A. Aljawarneh, C. Choi, Securing collaborative deep learning in industrial applications within adversarial scenarios, IEEE Transactions on Industrial Informatics 14 (2018) 4972–4981.
 [5] R. A. Khalil, N. Saeed, M. Masood, Y. M. Fard, M.-S. Alouini, T. Y. Al-Naffouri, Deep learning in the industrial internet of things: Potentials, challenges, and emerging applications, IEEE Internet of Things Journal 8 (2021) 11016–11040.
 [6] M. Maqsood, I. Mehmood, R. Kharel, K. Muhammad, J. Lee, W. Alnumay, Exploring the role of deep learning in industrial applications: a case study on coastal crane casting recognition, Hum. Cent. Comput. Inf. Sci 11 (2021) 1–14.
 [7] J. M. P. Schumann, Y. Liu, Applications of neural networks in high assurance systems, volume 268, Springer, 2010.
 [8] F. Schwaiger, M. Henne, F. KΓΌppers, F. S. Roza, K. Roscher, A. Haselhoff, From black-box to white-box: Examining confidence calibration under different conditions, arXiv preprint arXiv:2101.02971 (2021).
 [9] A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, Y. Gal, Can autonomous vehicles identify, recover from, and adapt to distribution shifts?, in: International Conference on Machine Learning, PMLR, 2020, pp. 3145–3153.
[10] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, M. Hardt, Test-time training with self-supervision for generalization under distribution shifts, in: International Conference on Machine Learning, PMLR, 2020, pp. 9229–9248.
[11] S. Thulasidasan, S. Thapa, S. Dhaubhadel, G. Chennupati, T. Bhattacharya, J. Bilmes, An effective baseline for robustness to distributional shift, in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2021, pp. 278–285.
[12] X. Ran, M. Xu, L. Mei, Q. Xu, Q. Liu, Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation, Neural Networks (2022).
[13] R. Ashmore, R. Calinescu, C. Paterson, Assuring the
Machine Learning Lifecycle: Desiderata, Methods, and Challenges (2019).
[14] R. Salay, R. Queiroz, K. Czarnecki, An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software (2017).
[15] O. Willers, S. Sudholt, S. Raafatnia, S. Abrecht, Safety Concerns and Mitigation Approaches Regarding the Use of Deep Learning in Safety-Critical Perception Tasks (2020).
[16] S. Burton, J. A. McDermid, P. Garnett, R. Weaver, Safety, Complexity, and Automated Driving: Holistic Perspectives on Safety Assurance, Computer 54 (2021).
[17] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, Advances in Neural Information Processing Systems 30 (2017).
[18] E. HΓΌllermeier, W. Waegeman, Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods, Machine Learning (2021).
[19] M. Henne, A. Schwaiger, G. Weiss, Managing uncertainty of ai-based perception for autonomous systems, in: AISafety@IJCAI, 2019, pp. 11–12.
[20] A. Schwaiger, P. Sinhamahapatra, J. Gansloser, K. Roscher, Is uncertainty quantification in deep learning sufficient for out-of-distribution detection?, in: AISafety@IJCAI, 2020.
[21] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1050–1059.
[22] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems 30 (2017).
[23] M. Sensoy, L. Kaplan, M. Kandemir, Evidential deep learning to quantify classification uncertainty, Advances in Neural Information Processing Systems 31 (2018).
[24] T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, P. Liu, Exploration in deep reinforcement learning: a comprehensive survey, arXiv preprint arXiv:2109.06668 (2021).
[25] T. Haider, F. S. Roza, D. Eilers, K. Roscher, S. GΓΌnnemann, Domain shifts in reinforcement learning: Identifying disturbances in environments, in: AISafety@IJCAI, 2021.
[26] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, S. Levine, Uncertainty-aware reinforcement learning for collision avoidance, arXiv preprint arXiv:1702.01182 (2017).
[27] B. LΓΌtjens, M. Everett, J. P. How, Safe reinforcement learning with model uncertainty estimates, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8662–8668.
[28] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[29] E. Coumans, Y. Bai, Pybullet, a python module for physics simulation for games, robotics and machine learning (2016).
[30] E. Bertolazzi, M. Frego, G1 fitting with clothoids, Mathematical Methods in the Applied Sciences 38 (2015) 881–897.
[31] P. Bevilacqua, M. Frego, E. Bertolazzi, D. Fontanelli, L. Palopoli, F. Biral, Path planning maximising human comfort for assistive robots, in: 2016 IEEE Conference on Control Applications (CCA), IEEE, 2016, pp. 1421–1427.
[32] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International Conference on Machine Learning, PMLR, 2018, pp. 1861–1870.
[33] K. Czarnecki, Operational design domain for automated driving systems, Waterloo Intelligent Systems Engineering (WISE) (2018).