<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reinforcement Learning With Imperfect Safety Constraints</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jin Woo Ro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerald Lüttgen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diedrich Wolter</string-name>
          <email>diedrich.wolterg@uni-bamberg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bamberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An obvious approach for ensuring safety when applying deep reinforcement learning is to use it with safety monitors. An ideal monitor captures all unsafe system states, but this is hardly possible in practical systems due to uncertainties in the environment, complicated system dynamics, and limited environment knowledge. Even with a perfect monitor, dynamic adjustments may still be required to account for system changes such as ageing and damage. What, then, should be done if the monitors are imperfect? This paper proposes an approach for estimating the undetected states of imperfect monitors in conjunction with deep Q-learning. A new Q-value function is developed to capture the target states effectively, and a system is designed to manage control and near-critical safety separately. Our experiments show that the approach proposed in this paper considers unsafe states better than a classic learning-based approach and achieves a faster convergence to the maximum control reward.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Deep reinforcement learning (DRL) excels at controlling continuous systems in complex and uncertain environments. Thus, cyber-physical systems (CPSs) such as robotics and intelligent transportation systems are among the most appropriate applications for DRL. However, the deployment of DRL in real systems currently faces the difficulty of ensuring the safety requirements of CPSs, where a set of constraints must always be satisfied to prevent catastrophic system failures. In particular, some control actions selected by DRL can violate the safety constraints. To overcome this problem, recent studies utilise a safety monitor in conjunction with DRL to detect safety violations and block the causing control action (Junges, Torfah, and Seshia 2021). If the safety monitor perfectly captures all unsafe system states and the actions leading to them, DRL can be forced to select only safe control actions (Mirchevska et al. 2018).</p>
      <p>However, a perfect safety monitor is generally not realistic in many CPSs due to the complexities and uncertainties in the system environment. Furthermore, there may be only limited information known about the environment (e.g., robots in an unknown environment (Kantaros et al. 2020)). Conservative approaches assume the worst-case scenario and over-approximate the unsafe states at the price of degraded system performance. However, finding the worst-case scenario itself is generally non-trivial and may rest on unrealistic assumptions. Even if there is a perfect monitor, it may become ineffective after some time due to external factors such as system ageing and damage. To fill the gap between ideal and realistic monitors, there is a need for learning components dedicated to capturing the undetected unsafe states.</p>
      <p>
        This paper proposes an approach for capturing the undetected states of imperfect monitors using deep Q-learning. The idea of detecting unsafe states using a learning model is not new; however, existing studies
        <xref ref-type="bibr" rid="ref3 ref4">(Fisac et al. 2018; Berkenkamp et al. 2018)</xref>
        assume that the perfect safety constraints are known a priori. They trust that the safety interpretation given by the constraints is always correct and use the constraints directly in the learning process. However, an imperfect monitor can produce false-negative outputs (i.e., the system is unsafe but interpreted as safe). Thus, we differentiate the cases where one can or cannot trust the monitor's interpretation. For this, we define a new Q-value function that is suitable for finding the undetected unsafe states. We mathematically prove that the proposed Q-value function can eventually find all undetected unsafe states. Lastly, through benchmarking based on an inverse pendulum example, we compare the deep Q-learning performance with and without our safety estimation. The results show that safety violations can be reduced noticeably and that the learning converges faster to the optimal control policy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Motivating Example</title>
      <p>Consider the inverse pendulum example shown in Figure 1. The bar can rotate either clockwise or anticlockwise about a fixed anchor. The gravity g and the force F are the two factors affecting the rotation. Two variables represent the instantaneous state of the bar: the angle θ and the angular velocity ω (i.e., a two-dimensional state space).</p>
      <p>We assume that the safety constraints cannot be specified completely, and a safety monitor is constructed as a finite state machine as shown in Figure 2a. The constraint |θ| ≥ 0.35 is essentially our guess for identifying two states: the Upright state (i.e., safe) and the Fallen state (i.e., unsafe). The monitor outputs 1 or -1 to indicate safe and unsafe, respectively.</p>
      <p>[Figure 1: the inverse pendulum, showing the bar mass, the bar length, gravity, and the fixed anchor.]</p>
      <p>There is an equilibrium point where the gravity force and F cancel out: F·d·cos θ + (m·g·d/2)·sin θ = 0. Based on our parameter values, the equilibrium is at θ = 0.201. If θ exceeds this point, the angular acceleration due to gravity becomes larger than that of F, and the bar will always accelerate towards the ground. Also, we need to consider the effect of the instantaneous angular velocity ω. A large ω cannot be instantaneously reduced to zero (i.e., deceleration takes time). Therefore, the bar can still exceed the equilibrium point even if the force F is applied in the opposite direction.</p>
      <p>Figure 2b shows the state space of the inverse pendulum system. First, the areas with labels 1 and 2 represent the set of states that satisfy |θ| ≥ 0.35 (i.e., the Fallen state). The areas 3, 4, and 5 represent the possible Upright states. However, the states in 3 and 4 always converge to 1 and 2, respectively, due to either exceeding the equilibrium angle or a large angular velocity. Although 3 and 4 also need to be considered unsafe states, the imperfect safety monitor actually recognises them as safe (i.e., false-negative detections). In this paper, we call such incorrectly recognised states hidden unsafe states, and our objective is not only to find them but also to identify the actions leading to them.</p>
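      <p>To make the fall dynamics concrete, the following minimal Python sketch (our illustration, not code from the study) simulates the bar with Euler integration; the parameter values m = 1 kg, d = 1 m, g = 9.81 m/s², and F = 1 N are assumptions, chosen because they reproduce the equilibrium angle arctan(2F/(m·g)) ≈ 0.201 quoted above:</p>
      <preformat>
import math

# Illustrative parameters (assumed); they reproduce the paper's
# equilibrium angle of 0.201 rad via tan(theta*) = 2F / (m*g).
m, d, g, F = 1.0, 1.0, 9.81, 1.0   # bar mass, bar length, gravity, force
I = m * d**2 / 3                   # moment of inertia of a bar about its end

def step(theta, omega, u, dt=0.02):
    """One Euler step; u is the applied force (+F or -F)."""
    torque = u * d * math.cos(theta) + m * g * (d / 2) * math.sin(theta)
    return theta + dt * omega, omega + dt * torque / I

# Start just past the equilibrium angle, pushing fully against gravity:
theta, omega = 0.25, 0.0
for _ in range(100):
    theta, omega = step(theta, omega, -F)
print(round(theta, 2))  # exceeds 0.35: the bar falls despite the control
      </preformat>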
    </sec>
    <sec id="sec-3">
      <title>Problem Formulation</title>
      <p>In this section, we formalise the hidden unsafe states detection problem, given that there is a safety monitor that can provide ground truth about the safe and unsafe states. We start with a recap of Markov decision processes (MDPs) and formally define the safety monitor. Finally, we define our problem.</p>
      <sec id="sec-3-1">
        <title>Markov Decision Processes</title>
        <p>In reinforcement learning, a control agent interacts with the environment by reading the environment state and performing an action that updates that state. Such a system is typically modelled as an MDP (Sutton and Barto 2018).</p>
        <p>[Figure 2: (a) the safety monitor with locations Upright (output 1) and Fallen (output -1); (b) the state space of the inverse pendulum with the Upright-state areas.]</p>
        <p>In every discrete time step t, the system control takes an action a_t ∈ A from a state s_t ∈ S based on a policy π : S → A. Then, the system reaches the next state s_{t+1} ∈ S, and the agent receives the reward R(s_t, a_t). The action-value function Q^π(s, a), or simply the Q-value, defines the value of taking action a from state s:</p>
        <p>Q^π(s, a) = E^π_s [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_t = s, a_t = a ]   (1)</p>
        <p>where E^π_s is the expected value of following policy π from state s. Intuitively, the value returned by Equation 1 is the expected discounted sum of the rewards over all t. The Bellman optimality equation for the optimal action-value function Q*(s, a) that yields the maximum cumulative reward is</p>
        <p>Q*(s, a) = E[ R(s, a) + γ max_{a'} Q*(s', a') ]   (2)</p>
        <p>where s' is the successor state and a' is an action that can be taken from state s'. In this paper, we consider an MDP with a continuous state space and a discrete action space, as is the case with the inverse pendulum system example.</p>
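        <p>As a minimal illustration of Equations 1 and 2 (a toy example of ours, not the pendulum system), the following Python sketch applies the Bellman optimality operator repeatedly to a small deterministic MDP until the Q-values converge:</p>
        <preformat>
import numpy as np

# A toy MDP with 3 states and 2 actions (assumed values for illustration):
# T[s][a] is the successor state, R[s][a] the reward, gamma the discount.
T = np.array([[1, 2], [0, 2], [2, 2]])
R = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

Q = np.zeros((3, 2))
for _ in range(200):
    # Equation 2: Q*(s, a) = R(s, a) + gamma * max_a' Q*(s', a')
    Q = R + gamma * Q[T].max(axis=2)
print(Q.round(2))
        </preformat>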
      </sec>
      <sec id="sec-3-2">
        <title>Safety Monitors</title>
        <p>Generally, a safety monitor is a finite state machine. In our study, the monitor takes the inputs s and a from the MDP and outputs the value 1 to indicate safe or -1 to indicate unsafe.</p>
        <p>Definition 2 (Safety Monitor). A safety monitor is a deterministic Moore machine defined as a tuple A = ⟨L, l_0, I, δ, G⟩, where:
• L is the finite set of discrete locations.
• l_0 is the singleton initial location.
• I is the set of input variables.
• δ : L × 2^{AP(I)} → L is the transition function, where AP(I) is a set of boolean expressions over the input variables.
• G : L → {-1, 1} maps each location l ∈ L to either -1 or 1. We denote the set of safe locations by L_safe = {l ∈ L | G(l) = 1}. Similarly, the set of unsafe locations is L_unsafe = {l ∈ L | G(l) = -1}.</p>
        <p>If there are multiple monitors in the system, we can represent them as a single monitor whose output is 1 if the outputs of all monitors are 1, and -1 otherwise. Therefore, in this paper, we can simply assume that there is exactly one safety monitor in the system.</p>
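        <p>A minimal Python sketch of the Definition 2 monitor for the pendulum example may look as follows (our illustration; the class and method names are hypothetical, and the guard |θ| ≥ 0.35 is the constraint from the motivating example):</p>
        <preformat>
class SafetyMonitor:
    """Deterministic Moore machine with locations Upright and Fallen."""

    def __init__(self):
        self.location = "Upright"        # l0, the initial location

    def step(self, state, action):
        """Transition function: guards are boolean expressions over inputs."""
        theta, omega = state
        self.location = "Fallen" if abs(theta) &gt;= 0.35 else "Upright"
        return self.location

    def output(self):
        """G : L -&gt; {-1, 1}: 1 for safe locations, -1 for unsafe ones."""
        return 1 if self.location == "Upright" else -1
        </preformat>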
        <p>At any time instant, the state of the entire system including the monitor can be expressed as a state pair (s, l), where s ∈ S is the state of the MDP and l ∈ L is the location of the monitor. Also, an infinite path can be generated by following the policy π starting from (s_0, l_0):</p>
        <p>p^π_{(s_0, l_0)} = (s_0, l_0) →^{a_0} (s_1, l_1) →^{a_1} (s_2, l_2) →^{a_2} ···   (3)</p>
        <sec id="sec-3-2-1">
          <title>Lunsafe.</title>
          <p>In the path, a state pair (s; l) is identified as unsafe if l 2
Lunsafe. Also, an action a is identified as an unsafe action
in (s; l) if there exist a transition (s; l) !a (s0; l0) where l0 2</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Hidden Unsafe States Detection Problem</title>
        <p>We denote by S_unsafe the set of all state pairs (s, l) satisfying l ∈ L_unsafe. Similarly, S_safe denotes the set of all state pairs (s, l) satisfying l ∈ L_safe.</p>
        <p>Definition 3 (Hidden Unsafe States Detection Problem). Let S_hidden ⊆ S_safe represent the set of hidden unsafe states, where all possible paths starting from a state (s, l) ∈ S_hidden always eventually reach a state (s', l') ∈ S_unsafe. Then, the hidden unsafe states detection problem is to obtain the actual unsafe set</p>
        <p>S̄_unsafe = S_hidden ∪ S_unsafe   (4)</p>
        <p>by finding S_hidden based on S_unsafe, which is known a priori.</p>
        <p>Intuitively, S_unsafe is what the imperfect monitor recognises as unsafe, and S̄_unsafe is what a perfect monitor should recognise as unsafe. Thus, S_hidden is the gap between the perfect and imperfect monitors, and it is what we aim to find with our approach.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>System Model</title>
      <p>This section presents our approach to estimating S̄_unsafe and blocking the actions that immediately lead to such states. We first give an overview of our system model.</p>
      <sec id="sec-4-1">
        <title>Model Overview</title>
        <p>The system overview is shown in Figure 3. First, the environment passes the state s and the reward R(s, a) to the controller based on action a. Such a feedback loop between environment and controller is the standard design of reinforcement learning. State s and action a are also passed to the safety monitor for updating the monitor location and generating the output G(l). The safety estimator is an additional learning component in the system. It aims to learn S̄_unsafe and produces an output indicating which actions need to be blocked. It requires the information s, a, and l to learn the system path (Equation 3), and also G(l) to know what is safe and unsafe. The action mask output M(s, l) is an array of binary values whose size equals the number of control actions:</p>
        <p>M(s, l) = [m_0, m_1, ..., m_n]   (5)</p>
        <p>[Figure 3: the environment exchanges state, reward, and action with the controller (deep Q-learning); the safety monitor passes its location and safety output to the safety estimator (deep Q-learning), whose action mask filters the controller's actions.]</p>
        <p>Intuitively, each binary value is associated with a certain action. For example, m_i = 0 means that action a_i ∈ A needs to be blocked. In this way, we can selectively block unsafe actions.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Controller</title>
        <p>The controller is based on the deep Q-learning method proposed in (Mnih et al. 2015). Generally, the optimal Q-value Q*(s, a) in Equation 2 is not known to the controller. Hence, we use a neural network to estimate Q*(s, a) with Q(s, a; θ), where θ denotes the parameters of the neural network (i.e., weights and biases). Note that θ is the standard notation for neural network parameters, and it should not be confused with the angle θ in Figure 1.</p>
        <p>For each discrete time step, the experience replay buffer U stores the controller's experience (s, a, r, s'), where s is the current state, a is the action, r is the reward, and s' is the next state. Note that the use of a replay buffer is based on the work in (Mnih et al. 2015). Using the data stored in buffer U, the neural network is trained with the following loss function:</p>
        <p>Loss(θ) = E_{(s,a,r,s') ~ U} [ (Q_target - Q_pred)² ]   (6)</p>
        <p>where</p>
        <p>Q_target = r + γ max_{a'} Q(s', a'; θ⁻)   (7)</p>
        <p>Q_pred = Q(s, a; θ)   (8)</p>
        <p>and γ is the discount factor. Here, (s, a, r, s') ~ U means that the experience data are sampled from buffer U. The term (Q_target - Q_pred)² in Equation 6 represents the squared error between Q_pred (the prediction of the Q-value) and Q_target (the target Q-value). In the equation for Q_target, the symbol θ⁻ represents the previous value of θ. More precisely, we update θ⁻ = θ every k steps, but during the k steps, θ⁻ is kept fixed. Note that k is a hyperparameter used to control the learning process, following (Mnih et al. 2015). Finally, the neural network learns to minimise the loss function.</p>
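        <p>The following PyTorch sketch illustrates Equations 6 to 8 (our illustration, not the authors' implementation; the network sizes, learning rate, γ, and k are assumed values):</p>
        <preformat>
import torch
import torch.nn as nn

gamma, k = 0.99, 1000                              # assumed hyperparameters
q_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
target_net.load_state_dict(q_net.state_dict())     # theta^- := theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(s, a, r, s_next, step_count):
    """One gradient step on the loss of Equation 6 for a sampled batch."""
    with torch.no_grad():                          # theta^- stays fixed
        q_target = r + gamma * target_net(s_next).max(dim=1).values  # Eq. 7
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Eq. 8
    loss = ((q_target - q_pred) ** 2).mean()                         # Eq. 6
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step_count % k == 0:                        # refresh theta^- every k steps
        target_net.load_state_dict(q_net.state_dict())
        </preformat>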
          <p>The ε-greedy action selection method can be performed with the action mask M(s, l). Through the neural network, we can obtain the distribution of Q-values over the action space:</p>
          <p>Q^s = [Q(s, a_0), Q(s, a_1), ..., Q(s, a_n)]   (9)</p>
          <p>The array Q^s contains the Q-values of all actions. Next, we apply the action mask to obtain the array of final Q-values Q^s_final:</p>
          <p>Q^s_final = Q^s ⊙ M(s, l) = [Q(s, a_0), ..., Q(s, a_n)] ⊙ [m_0, ..., m_n] = [Q(s, a_0)·m_0, ..., Q(s, a_n)·m_n]   (10)</p>
          <p>where ⊙ is the element-wise multiplication. For example, the multiplication of the first elements of Q^s and M(s, l) is the first element of Q^s_final. Because the action mask M(s, l) contains binary values, the element-wise multiplication enforces the Q-values to either keep their values or become zero. During the action selection process, the actions with zero Q-values are ignored, so that unsafe actions are blocked from being selected. If Q^s_final only contains zeros, all actions are unsafe. In this case, a predefined safe backup plan can be instantiated, or the maximum Q-value in Q^s can be selected directly.</p>
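          <p>A minimal Python sketch of this masked ε-greedy selection may look as follows (our illustration; the all-zero fallback of taking the maximum of the unmasked Q-values follows the text above):</p>
          <preformat>
import numpy as np

def select_action(q_values, mask, epsilon, rng=np.random.default_rng()):
    """Pick an action from Q^s_final = Q^s * M(s, l), skipping masked ones."""
    allowed = np.flatnonzero(mask)
    if allowed.size == 0:                 # every action is unsafe:
        return int(np.argmax(q_values))   # fall back to the raw maximum
    if rng.random() &lt; epsilon:            # explore among allowed actions
        return int(rng.choice(allowed))
    q_final = q_values * mask             # element-wise product (Equation 10)
    return int(allowed[np.argmax(q_final[allowed])])

a = select_action(np.array([0.4, -0.2, 0.9]), np.array([1, 0, 1]), 0.1)
          </preformat>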
      </sec>
      <sec id="sec-4-3">
        <title>Safety Estimator</title>
        <p>Similar to the controller that learns from the environment, the safety estimator learns from the safety monitor. Basically, G(l) is the reward of the safety monitor. However, unlike R(s, a) from the environment, G(l) cannot be fully trusted, because G(l) = 1 cannot differentiate between safe and hidden unsafe states. We can, however, trust G(l) = -1.</p>
        <p>Another important aspect is that the safety estimator aims to solve a classification problem rather than an optimisation problem. Therefore, the safety estimator does not need to learn to maximise the reward but to identify hidden unsafe states. Here, the cumulative reward value is not meaningful, because we cannot know which rewards were added; for example, a few bad rewards may be overwhelmed by good ones. As such, cumulative reward maximisation using the Bellman equation (Equation 2) is inappropriate for the goal of safety estimation. Therefore, we define a different Q-value function for our safety estimator.</p>
        <p>The Q-value Q(s, l, a) is defined in Equation 11. We write [x]^{u=k_1}_{l=k_2} to indicate that the value of x in the brackets is capped by the upper bound k_1 and the lower bound k_2. For example, [5]^{u=1}_{l=-1} = 1 and [-2]^{u=1}_{l=-1} = -1.</p>
        <p>Q(s, l, a) = [ G(l') + 2·λ_s · ( Σ_{∀a'∈A} Q(s', l', a') ) / |A| ]^{u=1}_{l=-1}   (11)</p>
        <p>where |A| is the number of actions. Basically, the sigma term calculates the average Q-value of all actions in (s', l'). The parameter λ_s can have two possible values, as shown in Equation 12:</p>
        <p>λ_s = { 0 if G(l') = -1;  1 if G(l') = 1 }   (12)</p>
        <p>Lemma 1. The Q-value of an action that directly leads to an unsafe state is always -1.</p>
        <p>Proof. We directly prove Lemma 1. Consider an arbitrary transition (s, l) →^a (s', l'), where (s', l') ∈ S_unsafe. In this case, we have G(l') = -1 from the monitor, which is then used to get λ_s = 0 using Equation 12. By substituting λ_s = 0 and G(l') = -1 into Equation 11, we always have Q(s, l, a) = [-1]^{u=1}_{l=-1} = -1.</p>
        <p>If G(l') = -1, the average-value term in Equation 11 is eliminated by λ_s = 0, and the Q-value is just the value of G(l'). This reflects that we trust the case G(l') = -1. In contrast, the case G(l') = 1 requires additional information to differentiate between safe and hidden unsafe states. Therefore, the average term in Equation 11 is the key to identifying hidden unsafe states. Intuitively, λ_s in our approach operates as a switch between the two cases.</p>
        <p>Lemma 2. The Q-value of an action that directly leads to a hidden unsafe state is always -1.</p>
        <p>Proof. Consider a transition (s, l) →^a (s', l'), where (s, l) and (s', l') are both members of S_safe so that G(l) = G(l') = 1. To identify (s', l') as a hidden unsafe state, we further assume that all actions a' taken in (s', l') directly lead to some unsafe states. In this case, we know that Q(s', l', a') = -1 based on Lemma 1. Furthermore, λ_s = 1 because G(l') = 1. Hence, Equation 11 simplifies to:</p>
        <p>Q(s, l, a) = [ G(l') + 2·λ_s · ( Σ_{∀a'∈A} Q(s', l', a') ) / |A| ]^{u=1}_{l=-1} = [ 1 + 2 · ( |A| · (-1) ) / |A| ]^{u=1}_{l=-1} = [-1]^{u=1}_{l=-1} = -1</p>
        <p>Here, the sigma term Σ_{∀a'∈A} Q(s', l', a') is substituted with |A| · (-1), because all the Q-values from (s', l') are -1. Therefore, the Q-value of an action leading to a hidden unsafe state is -1.</p>
        <p>If there is at least one safe action, we always have Σ_{∀a'∈A} Q(s', l', a') &gt; -|A|, which makes the Q-value not equal to -1 but a larger value.</p>
        <p>An important question in reinforcement learning is whether the system converges to the optimal solution. In our case, the optimal solution is the actual set of unsafe states S̄_unsafe (Definition 3), and convergence means that our safety estimator can eventually detect all states in S̄_unsafe by finding the set S_hidden of hidden unsafe states.</p>
        <p>Theorem 1. All hidden unsafe states and the actions leading to them are eventually detected by the safety estimator.</p>
        <p>Proof. Consider an arbitrary path from (s_0, l_0) to (s_n, l_n) of length n:</p>
        <p>(s_0, l_0) →^{a_0} (s_1, l_1) →^{a_1} ··· →^{a_{n-1}} (s_n, l_n)   (14)</p>
        <p>Based on the safety monitor, the unsafe states that produce G(·) = -1 are known a priori. Without loss of generality, we assume that (s_n, l_n) is an unsafe state so that G(l_n) = -1. By Lemma 1, the Q-value of the previous action, Q(s_{n-1}, l_{n-1}, a_{n-1}), is -1 because the action leads to an unsafe state. If the state (s_{n-1}, l_{n-1}) only has unsafe actions, the Q-value of its previous action, Q(s_{n-2}, l_{n-2}, a_{n-2}), is -1 based on Lemma 2, because the action leads to a hidden unsafe state. Similarly, if (s_{n-2}, l_{n-2}) only has unsafe actions, Q(s_{n-3}, l_{n-3}, a_{n-3}) = -1, and so on. This chain of Q-value relations uses an unsafe state as a reference point, and by inductively backtracking the path, we can find all hidden unsafe states. Because every hidden unsafe state is connected to unsafe states by definition, the safety estimator can eventually obtain S̄_unsafe.</p>
        <p>Finally, the safety estimator's Q-value outputs are mapped into the action mask M(s, l). If the Q-value is -1, the mask value is set to 0; otherwise, it is set to 1:</p>
        <p>∀m_a ∈ M(s, l) : m_a = { 0 if Q(s, l, a) = -1;  1 otherwise }   (13)</p>
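        <p>The following Python sketch illustrates the estimator's target value (Equations 11 and 12) and the mask mapping above (our illustration; the small tolerance when testing for a Q-value of -1 is an assumption to accommodate approximate neural network outputs):</p>
        <preformat>
import numpy as np

def estimator_target(g_next, q_next):
    """Equations 11-12: q_next holds Q(s', l', a') for all a' in A."""
    lam = 0.0 if g_next == -1 else 1.0          # Equation 12: trust G(l') = -1
    value = g_next + 2.0 * lam * np.mean(q_next)
    return float(np.clip(value, -1.0, 1.0))     # cap to the bounds [-1, 1]

def action_mask(q_values, tol=1e-3):
    """Block every action whose estimator Q-value is (approximately) -1."""
    return (q_values &gt; -1.0 + tol).astype(int)

# Lemma 2 case: G(l') = 1, but every successor action has Q-value -1:
print(estimator_target(1, np.array([-1.0, -1.0, -1.0])))   # prints -1.0
        </preformat>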
      </sec>
      <sec id="sec-4-4">
        <title>System Execution and Learning Process</title>
        <p>A neural network is used to estimate the Q-value of the safety estimator. The estimated Q-value is denoted by Q(s, l, a; θ_s), where θ_s are the neural network parameters. The experience data (s, l, a, G(l'), s', l') are stored in two buffers, U_s and U_v, depending on the value of G(l'). If the value is 1, the experience data are stored in U_s; otherwise, in U_v.</p>
        <p>If only a single buffer is used, keeping the unsafe experience data in the buffer is difficult, as they are frequently overwritten by new experience data. In particular, as the system control becomes better, the new data are most likely to be safe-state experiences. Since the neural network training relies on the quality of the training data, biased data lead to biased learning (i.e., learning only the safe states), which degrades the safety estimation accuracy. Hence, to consistently provide unsafe-state data, we propose using double buffers, so that one buffer (U_v) is dedicated to storing unsafe-state data. In the experimental results section below, the safety estimator's performance with a single buffer and with double buffers is compared.</p>
        <p>The overall system execution and the learning process are shown in Algorithm 1. We first initialise the buffers and the neural network with random parameters θ_s. An episode consists of T discrete time steps (Line 4). In each discrete time step, the current states s_t and l_t are used to compute Q^{s_t} and M(s_t, l_t). The action a_t is chosen based on the ε-greedy method after masking. We execute the action and get the next states s_{t+1} and l_{t+1}. Based on G(l_{t+1}), the experience data (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) are stored in either U_s or U_v. A minibatch is randomly sampled from both of the buffers; in our case, we sample from U_s and U_v with a 1:1 ratio. Next, we iterate over each experience in the minibatch. The target Q-value is decided by G(l'). The loss function computes the squared error between the target Q-value and Q(s, l, a; θ_s), and a gradient descent method is used to minimise the loss. Lastly, the controller is trained. One important note is that the safety estimator's Q-value needs to be restricted between -1 and 1. This can be done naturally by setting the output layer activation function; in our case, we use the hyperbolic tangent (tanh) function.</p>
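        <p>Since Algorithm 1 itself is not reproduced here, the following Python sketch reconstructs its double-buffer bookkeeping from the description above (our illustration; the buffer capacities and batch size are assumed values):</p>
        <preformat>
import random
from collections import deque

U_s = deque(maxlen=50_000)   # safe experiences,   G(l') = 1
U_v = deque(maxlen=50_000)   # unsafe experiences, G(l') = -1

def store(experience, g_next):
    """Route (s, l, a, G(l'), s', l') to a buffer based on G(l')."""
    (U_s if g_next == 1 else U_v).append(experience)

def sample_minibatch(batch_size=64):
    """Sample safe and unsafe experiences with a 1:1 ratio."""
    half = batch_size // 2
    return (random.sample(U_s, min(half, len(U_s)))
            + random.sample(U_v, min(half, len(U_v))))
        </preformat>
        <p>Even when later episodes are dominated by safe states, U_v preserves the rare unsafe experiences, which counters the biased-learning problem described above.</p>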
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experimental Results</title>
      <p>This section demonstrates the efficacy of our approach for
the inverse pendulum example. We first explain the
experiment setup, followed by the experiment results.</p>
      <sec id="sec-5-1">
        <title>Experimental Setup</title>
        <p>The benefit of our approach can be shown effectively by directly comparing the performance of deep Q-learning with and without the safety estimator. Furthermore, we consider the safety estimator with a single buffer and with double buffers as separate models to be evaluated. Therefore, three models are evaluated and compared:
1. Conventional DQN (without the safety estimator)
2. DQN with our safety estimator (single buffer)
3. DQN with our safety estimator (double buffer)</p>
        <p>We are interested in the following observation criteria:
• How frequently safety is violated.
• How fast the control reward reaches the optimal solution (i.e., convergence).
• The evolution of the estimated set of hidden unsafe states over episodes.</p>
        <p>Lastly, we describe the inverse pendulum experiment setup. Each episode terminates when 200 discrete steps have passed or the bar is considered to have fallen based on the safety monitor in Figure 2a. The reward is 1 for each discrete time step in which the bar is upright; otherwise, it is 0. Thus, the maximum cumulative reward of a single episode (the episode reward) is 200. To maximise the reward, the bar must keep the upright position as long as possible. We run 500 episodes for each model.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Inverse Pendulum Results</title>
        <p>Figure 4 plots the episode reward (vertical axis) over the number of episodes (horizontal axis). In the first 100 episodes, we observe that the DQN with safety estimator learns faster than the conventional DQN. This is because avoiding safety violations quickly increases the cumulative reward. In our inverse pendulum system, the maximum reward is only possible if the bar does not fall; thus, a reward of less than 200 means that a safety violation occurred. In this sense, the conventional DQN and the single buffer model experience frequent safety violations. However, there is a noticeable improvement in the double buffer case, as the reward is maintained at the maximum value more consistently. In 500 episodes, the number of episodes violating safety is 186 for the conventional DQN, 149 for the single buffer model, and 121 for the double buffer model. Thus, with the single buffer model, violations are reduced by 19.9%, and with the double buffer model by 34.9%.</p>
        <p>As shown in the proof of Lemma 2, the hidden unsafe states can be recognised by checking the average Q-values: if the value is -1, the state is a hidden unsafe state. Figure 5 shows the heatmap of the average Q-value for the single buffer case at different episodes. The blue colour marks an average Q-value of 1, whereas red marks -1. Intuitively, the red areas indicate the hidden unsafe states. In Episode 0, the estimator is initialised. As it experiences more episodes, the average Q-value is updated. However, compared to the analytical solution in Figure 2b, the estimation accuracy is poor and unstable. Especially after Episode 460, every state is considered safe. This problem is caused by the biased data in the buffer: as the controller gets better and better, it becomes more difficult to collect information about unsafe states. Consequently, the buffer becomes full of safe-state information, and the neural network can only learn from safe states. Overall, the poor and unstable estimation accuracy also explains why the single buffer case in Figure 4 exhibits severe fluctuations.</p>
        <p>The heatmaps for the double buffer case are shown in Figure 6. It is clear that the use of double buffers significantly improves the estimation accuracy compared to the single buffer case. Furthermore, the estimated regions do not change significantly over episodes, and the biased learning problem no longer appears in Episode 460. Despite the improved accuracy, the estimation is still an under-approximation of the analytic solution in Figure 2b, because the neural network learning introduces uncertainties of its own. Hence, violations can still occur, but they are reduced by 34.9%.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related work</title>
      <p>
        Safe policy extraction
        <xref ref-type="bibr" rid="ref5">(Watkins and Dayan 1992)</xref>
        is a technique that restricts the action space during policy extraction to ensure constraint satisfaction; it is widely used in applications such as autonomous vehicle control (Mirchevska et al. 2018) and robot systems
        <xref ref-type="bibr" rid="ref1">(Alshiekh et al. 2018)</xref>
        . However, action masking often results in non-optimal solutions due to the lack of long-term future consideration. To address this, the study (Kalweit et al. 2020) specifies multi-step constraints using the formulation of constrained MDPs
        <xref ref-type="bibr" rid="ref2">(Altman 1999)</xref>
        . However, these constraints only look a fixed number of steps ahead. In contrast, the proposed approach does not have such a limitation. Another approach, called Reward Constrained Policy Optimisation (RCPO) (Tessler, Mankowitz, and Mannor 2019), specifies penalties that are added to the reward and essentially represent the constraints. Like the others, RCPO requires the complete constraints to be given a priori.
      </p>
      <p>Some studies combine the efforts of formal methods and machine learning. For example, in (Junges, Torfah, and Seshia 2021), a runtime monitor is syntactically composed with the discrete MDP system. In the resulting automaton, all unsafe actions are known; hence, actions can be blocked appropriately. However, this approach cannot be used for continuous state spaces due to the state explosion problem, and the given set of constraints needs to be perfect.</p>
      <p>To the best of our knowledge, no work has been done to
deal with partially known (i.e., some unsafe states are not
detected) and partially correct (i.e., the monitor’s output can
be incorrect) safety monitors.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>This paper addressed the problem of dealing with imperfect safety monitors, which can only detect a subset of the actual unsafe states. Our approach used deep Q-learning to estimate the missing states, which we called hidden unsafe states. The Q-value function was adjusted to find such states, and an action masking mechanism was employed to block unsafe actions. The experimental results on an inverse pendulum example showed that learning is accelerated and the occurrence of safety violations is noticeably reduced.</p>
      <p>In future work, the proposed approach can be analysed extensively using more complex benchmarking examples with larger action spaces, complicated reward functions, and dynamic changes in the deployment environment such as damage. Also, further research can investigate how to reuse the learned knowledge about safety to improve or adjust safety monitors.</p>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>Junges, S.; Torfah, H.; and Seshia, S. A. 2021.
Runtime monitors for Markov decision processes. In
International Conference on Computer Aided Verification, 553–
576. Springer.</p>
      <p>Kalweit, G.; Huegle, M.; Werling, M.; and Boedecker, J.
2020. Deep Inverse Q-learning with Constraints. In 34th
Conference on Neural Information Processing Systems,
volume 33, 14291–14302. Curran Associates, Inc.</p>
      <p>Kantaros, Y.; Malencia, M.; Kumar, V.; and Pappas, G. J.
2020. Reactive temporal logic planning for multiple robots
in unknown environments. In 2020 IEEE International
Conference on Robotics and Automation (ICRA), 11479–11485.
IEEE.</p>
      <p>Mirchevska, B.; Pek, C.; Werling, M.; Althoff, M.; and Boedecker, J. 2018. High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning. In 21st International Conference on Intelligent Transportation Systems (ITSC), 2156-2162. IEEE.</p>
      <p>Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529-533.</p>
      <p>Sutton, R. S.; and Barto, A. G. 2018. Reinforcement
learning: An introduction. MIT Press.</p>
      <p>Tessler, C.; Mankowitz, D. J.; and Mannor, S. 2019. Reward
Constrained Policy Optimization. In 7th International
Conference on Learning Representations, ICLR, 1–15.
OpenReview.net.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is carried out in the Dependable Intelligent
Software Lab (DISL), funded by the German Federal Ministry of
Education and Research (BMBF – Bundesministerium für
Bildung und Forschung).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Alshiekh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bloem</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ehlers</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Könighofer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Niekum</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Topcu</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Safe reinforcement learning via shielding</article-title>
          .
          <source>In 32nd AAAI Conference on Artificial Intelligence</source>
          ,
          <fpage>2669</fpage>
          -
          <lpage>2678</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Altman</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>1999</year>
          .
          <source>Constrained Markov decision processes</source>
          , volume
          <volume>7</volume>
          . CRC Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Berkenkamp</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Turchetta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schoellig</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Safe model-based reinforcement learning with stability guarantees</article-title>
          .
          <source>In 31st Conference on Neural Information Processing Systems</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>909</fpage>
          -
          <lpage>919</lpage>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Fisac</surname>
            ,
            <given-names>J. F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Akametalu</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zeilinger</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaynama</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gillula</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tomlin</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A general safety framework for learning-based control in uncertain robotic systems</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          ,
          <volume>64</volume>
          (
          <issue>7</issue>
          ):
          <fpage>2737</fpage>
          -
          <lpage>2752</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Watkins</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dayan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>Q-learning</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          -4):
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>