=Paper=
{{Paper
|id=Vol-3087/paper_38
|storemode=property
|title=Reinforcement Learning With Imperfect Safety Constraints
|pdfUrl=https://ceur-ws.org/Vol-3087/paper_38.pdf
|volume=Vol-3087
|authors=Jin Woo Ro,Gerald Lüttgen,Diedrich Wolter
|dblpUrl=https://dblp.org/rec/conf/aaai/RoLW22
}}
==Reinforcement Learning With Imperfect Safety Constraints==
Jin Woo Ro, Gerald Lüttgen, Diedrich Wolter
University of Bamberg, Germany
{jin.ro, gerald.luettgen, diedrich.wolter}@uni-bamberg.de
Abstract

An obvious approach for ensuring safety when applying deep reinforcement learning is to use it with safety monitors. An ideal monitor captures all unsafe system states, but this is hardly possible in practical systems due to uncertainties in the environment, complicated system dynamics, and limited environment knowledge. Even with a perfect monitor, dynamic adjustments may still be required to account for system changes such as ageing and damage. What, then, should be done if the monitors are imperfect? This paper proposes an approach for estimating the undetected states of imperfect monitors in conjunction with deep Q-learning. A new Q-value function is developed to capture the target states effectively, and a system is designed to manage control and near-critical safety separately. Our experiments show that the approach proposed in this paper considers unsafe states better than a classic learning-based approach and achieves a faster convergence to the maximum control reward.

Introduction

Deep reinforcement learning (DRL) excels when it is used to control a continuous system in complex and uncertain environments. Thus, cyber-physical systems (CPSs) such as robotics and intelligent transportation systems can be the most appropriate applications for DRL. However, the deployment of DRL in real systems currently faces the difficulty of ensuring the safety requirements of CPSs, where a set of constraints needs to be satisfied at all times to prevent catastrophic system failures. In particular, some control actions selected by DRL can violate the safety constraints. To overcome this problem, recent studies utilise a safety monitor in conjunction with DRL to detect safety violations and block the causing control action (Junges, Torfah, and Seshia 2021). If the safety monitor perfectly captures all unsafe system states and the actions leading to them, DRL can be forced to select only safe control actions (Mirchevska et al. 2018). However, a perfect safety monitor is generally not realistic in many CPSs due to the complexities and uncertainties in the system environment. Furthermore, there may be only limited known information about the environment (e.g., robots in an unknown environment (Kantaros et al. 2020)). Conservative approaches assume the worst-case scenario and over-approximate the unsafe states at the price of degraded system performance. However, finding the worst-case scenario itself is generally non-trivial, and it may rest on unrealistic assumptions. Even if there is a perfect monitor, it may become ineffective after some time due to external factors, such as system ageing and damage. To fill the gap between ideal and realistic monitors, there is a need for learning components dedicated to capturing the undetected unsafe states.

This paper proposes an approach for capturing the undetected states of imperfect monitors using deep Q-learning. The idea of detecting unsafe states using a learning model is not new; however, existing studies (Fisac et al. 2018; Berkenkamp et al. 2018) assume that the perfect safety constraints are known a priori. They trust that the safety interpretation given by the constraints is always correct and use it directly in the learning process. However, an imperfect monitor can produce false-negative outputs (i.e., the system is unsafe but interpreted as safe). Thus, we differentiate the cases where one can or cannot trust the monitor's interpretation. For this, we define a new Q-value function that is suitable for finding the undetected unsafe states. We mathematically prove that the proposed Q-value function can eventually find all undetected unsafe states. Lastly, through benchmarking based on an inverse pendulum example, we compare the deep Q-learning performance with and without our safety estimation. The results show that safety violations can be reduced noticeably, and also that the learning converges faster to the optimal control policy.

Motivating Example

Consider the inverse pendulum example shown in Figure 1. The bar can rotate either clockwise or anticlockwise about a fixed anchor. The gravity g and the force F are the two factors affecting the rotation. Two variables can represent the instantaneous state of the bar: the angle θ and the angular velocity ω (i.e., a two-dimensional state space).

We assume that the safety constraints cannot be specified completely, and a safety monitor is constructed as a finite state machine as shown in Figure 2a. The constraint |θ| ≥ 0.35 is essentially our guess for identifying two states: the Upright state (i.e., safe) and the Fallen state (i.e., unsafe). The monitor outputs 1 or -1 to indicate safe and unsafe, respectively.
Figure 1: Inverse pendulum system, where a bar rotates about the fixed anchor. We assume that m = 2 kg, d = 1 m, g = −9.81 m/s², and θ is given in radians. The force |F| = 2 N is a constant-magnitude horizontal force applied at the tip of the bar. The direction of F can be controlled (left: F = −2 N or right: F = 2 N) to keep the bar upright.

Figure 2: The safety monitor for the inverse pendulum example ((a) Monitor) and the visualisation of safe states, unsafe states, and hidden unsafe states as regions in the two-dimensional state space ((b) State space).
There is an equilibrium point where the gravity force and F cancel out: $Fd\cos\theta + \frac{mgd}{2}\sin\theta = 0$. Based on our parameter values, the equilibrium is at θ = 0.201. If θ exceeds this point, the angular acceleration due to gravity becomes larger than that due to F, and the bar will always accelerate towards the ground. Also, we need to consider the effect of the instantaneous angular velocity ω. A large ω cannot be instantaneously reduced to zero (i.e., deceleration takes time). Therefore, the bar can still exceed the equilibrium point even if the force F is applied in the opposite direction.
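As a quick sanity check of the stated equilibrium angle, the torque balance can be solved for θ (a worked derivation using the parameter values from Figure 1):

$$\tan\theta = -\frac{2F}{mg} = -\frac{2 \cdot 2}{2 \cdot (-9.81)} \approx 0.2039, \qquad \theta = \arctan(0.2039) \approx 0.201 \text{ rad}$$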
Figure 2b shows the state space of the inverse pendulum system. First, the areas with labels 1 and 2 represent the set of states that satisfy |θ| ≥ 0.35 (i.e., the Fallen state). The areas 3, 4, and 5 represent the possible Upright states. However, the states in 3 and 4 always converge to 1 and 2, respectively, due to either exceeding the equilibrium angle or a large angular velocity. Although 3 and 4 thus also need to be considered as unsafe states, the imperfect safety monitor actually recognises them as safe (i.e., false-negative detections). In this paper, we call such incorrectly recognised states hidden unsafe states, and our objective is not only to find them but also to identify the actions leading to them.

Problem Formulation

In this section, we formalise the hidden unsafe states detection problem, given that there is a safety monitor that can provide the ground truth about the safe and unsafe states. We start with a recap of Markov decision processes (MDPs) and formally define the safety monitor. Finally, we define our problem.

Markov Decision Processes

In reinforcement learning, a control agent interacts with the environment by reading the environment state and performing an action that updates that state. Such a system is typically modelled as an MDP (Sutton and Barto 2018).

Definition 1 (Markov Decision Process). An MDP is a tuple M = ⟨S, A, P, R, γ⟩, where:
• S is the set of states, i.e., the state space.
• A is the set of actions, i.e., the action space.
• P(s′|s, a) is the probability distribution over state s′ given state s and action a.
• R(s, a) is the immediate reward when taking action a from state s.
• γ ∈ [0, 1] is the discount factor.

In every discrete time step t, the system control takes an action a_t ∈ A from a state s_t ∈ S based on policy π : S → A. Then, the system reaches the next state s_{t+1} ∈ S and the agent receives the reward R(s_t, a_t). The action-value function Q(s, a) – or simply the Q-value – defines the value of taking action a from state s:

$$Q(s, a) = \mathbb{E}_\pi^s\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_t = s, a_t = a\right] \tag{1}$$

where $\mathbb{E}_\pi^s$ is the expected value of following policy π from state s. Intuitively, the value returned by Equation 1 is the expected sum of the rewards over all t. The Bellman optimality equation for the optimal action-value function Q*(s, a) that yields the maximum cumulative reward is

$$Q^*(s, a) = \mathbb{E}\left[R(s, a) + \gamma \max_{a'} Q^*(s', a')\right] \tag{2}$$

where s′ is the successor state and a′ is an action that can be taken from state s′. In this paper, we consider an MDP with a continuous state space and a discrete action space, as is the case with the inverse pendulum system example.
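To make the recap concrete, here is a minimal sketch of solving Equation 2 by value iteration on a toy discrete MDP. The transition and reward numbers are invented purely for illustration; the paper itself works with a continuous state space approximated by a neural network:

```python
import numpy as np

# Toy MDP: 3 states x 2 actions (hypothetical numbers for illustration).
# P[s, a, s'] is the transition probability, R[s, a] the immediate reward.
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]],
              [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 0.0]])
gamma = 0.9

# Iterate the Bellman optimality equation (Equation 2):
# Q*(s, a) = E[R(s, a) + gamma * max_a' Q*(s', a')]
Q = np.zeros((3, 2))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)
```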
Safety Monitors

Generally, a safety monitor is a finite state machine. In our study, the monitor takes the inputs s and a from the MDP and outputs the value 1 to indicate safe or -1 to indicate unsafe.

Definition 2 (Safety Monitor). A safety monitor is a deterministic Moore machine defined as a tuple A = ⟨L, l_0, I, δ, G⟩, where:
• L is the set of finite discrete locations.
• l_0 is the singleton initial location.
• I is the set of input variables.
• δ : L × 2^{AP(I)} → L is the transition function, where AP(I) is a set of boolean expressions over the input variables.
• G : L → {−1, 1} maps each location l ∈ L to either -1 or 1. We denote the set of safe locations by L_safe = {l ∈ L | G(l) = 1}. Similarly, the set of unsafe locations is L_unsafe = {l ∈ L | G(l) = −1}.

If there are multiple monitors in the system, we can represent them as a single monitor whose output is 1 if the outputs of all monitors are 1, and -1 otherwise. Therefore, in this paper, we can simply assume that there is exactly one safety monitor in the system.
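As an illustration of Definition 2, the following sketch encodes the imperfect monitor of Figure 2a as a Moore machine. The class layout, method names, and the reduction of several monitors to one are our own rendering of the text, not code from the paper (in particular, only θ is read here, although the definition admits arbitrary inputs from the MDP):

```python
# Sketch of the imperfect pendulum monitor (Figure 2a) as a Moore machine.
# Locations: 'Upright' (output 1) and 'Fallen' (output -1); the guessed
# guard |theta| >= 0.35 drives the transition function delta.

class PendulumMonitor:
    OUTPUT = {"Upright": 1, "Fallen": -1}  # G : L -> {-1, 1}

    def __init__(self):
        self.location = "Upright"  # l_0, the initial location

    def step(self, theta: float) -> int:
        """Apply delta on the input valuation, then emit G(l)."""
        self.location = "Fallen" if abs(theta) >= 0.35 else "Upright"
        return self.OUTPUT[self.location]

def combine(outputs):
    """Several monitors reduce to one: emit 1 only if all emit 1."""
    return 1 if all(o == 1 for o in outputs) else -1
```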
At any time instant, the state of the entire system, including the monitor, can be expressed as a state pair (s, l), where s ∈ S is the state of the MDP and l ∈ L is the location of the monitor. Also, an infinite path can be generated by following the policy π starting from (s_0, l_0):

$$p^\pi_{(s_0, l_0)} = (s_0, l_0) \xrightarrow{a_0} (s_1, l_1) \xrightarrow{a_1} (s_2, l_2) \xrightarrow{a_2} \ldots \tag{3}$$

In the path, a state pair (s, l) is identified as unsafe if l ∈ L_unsafe. Also, an action a is identified as an unsafe action in (s, l) if there exists a transition $(s, l) \xrightarrow{a} (s', l')$ where l′ ∈ L_unsafe.

Hidden Unsafe States Detection Problem

We denote by S_unsafe the set of all state pairs (s, l) satisfying l ∈ L_unsafe. Similarly, S_safe denotes the set of all state pairs (s, l) satisfying l ∈ L_safe.

Definition 3 (Hidden Unsafe States Detection Problem). Let S_hidden ⊆ S_safe represent the set of hidden unsafe states, where all possible paths starting from a state (s, l) ∈ S_hidden always eventually reach a state (s′, l′) ∈ S_unsafe. Then, the hidden unsafe states detection problem is to obtain the actual unsafe set

$$S^*_{unsafe} = S_{hidden} \cup S_{unsafe} \tag{4}$$

by finding S_hidden based on S_unsafe, which is known a priori.

Intuitively, S_unsafe is what the imperfect monitor recognises as unsafe, and S*_unsafe is what the perfect monitor should recognise as unsafe. Thus, S_hidden is the gap between the perfect and imperfect monitors, and this is what we aim to find in our approach.
System Model

This section presents our approach to estimate S*_unsafe and block the actions that immediately lead to such states. We first give an overview of our system model.

Model Overview

The system overview is shown in Figure 3. First, the environment passes the state s and the reward R(s, a) to the controller based on action a. Such a feedback loop between environment and controller is the standard design of reinforcement learning. State s and action a are also passed to the safety monitor for updating the monitor location and generating the output G(l). The safety estimator is an additional learning component in the system. It aims to learn S*_unsafe and produces the output indicating which actions need to be blocked. It requires the information s, a, and l to learn the system path (Equation 3), and also G(l) to know what is safe and unsafe. The action mask output M(s, l) is an array of binary values whose size equals the number of control actions:

$$M(s, l) = [m_0, m_1, \ldots, m_n] \tag{5}$$

Intuitively, each binary value is associated with a certain action. For example, m_i = 0 means that action a_i ∈ A needs to be blocked. In this way, we can selectively block unsafe actions.

Figure 3: System Model Overview. (The figure shows the environment, the safety monitor, and the two deep Q-learning components – controller and safety estimator – exchanging state, location, reward, safety output, action, and action mask.)

Controller

The controller is based on the deep Q-learning method proposed in (Mnih et al. 2015). Generally, the optimal Q-value Q*(s, a) in Equation 2 is not known to the controller. Hence, we use a neural network to estimate Q*(s, a) with Q(s, a; θ), where θ denotes the parameters of the neural network (i.e., weights and biases). Note that θ is the standard notation for neural network parameters, and it should not be confused with the angle θ in Figure 1.

For each discrete time step, the experience replay buffer U stores the controller's experience (s, a, r, s′), where s is the current state, a is the action, r is the reward, and s′ is the next state. Note that the use of the replay buffer is based on the work in (Mnih et al. 2015). Using the data stored in buffer U, the neural network is trained with the following loss function:

$$Loss(\theta) = \mathbb{E}_{(s,a,r,s') \sim U}\left[\left(Q_{target} - Q_{pred}\right)^2\right] \tag{6}$$

where

$$Q_{target} = r + \gamma \max_{a'} Q(s', a'; \theta^-) \tag{7}$$

$$Q_{pred} = Q(s, a; \theta) \tag{8}$$

and γ is the discount factor. Here, (s, a, r, s′) ∼ U means that the experience data are sampled from buffer U. The term (Q_target − Q_pred)² in Equation 6 represents the squared error between Q_pred (the prediction of the Q-value) and Q_target (the target Q-value).
In the equation for Q_target, the symbol θ⁻ represents the previous value of θ. More precisely, we update θ⁻ = θ every k steps, but during those k steps, θ⁻ is kept fixed. Note that k is a hyperparameter used to control the learning process, following (Mnih et al. 2015). Finally, the neural network learns to minimise the loss function.
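A compact sketch of the target computation of Equations 6–8 with the frozen parameters θ⁻; the stand-in linear "network" and the variable names are our assumptions, not the paper's implementation:

```python
import copy
import numpy as np

class TinyQNet:
    """Stand-in Q-network: one linear layer mapping states to Q-values."""
    def __init__(self, n_state, n_action, rng):
        self.W = rng.normal(size=(n_state, n_action)) * 0.1
    def __call__(self, s):
        return s @ self.W  # Q(s, a; theta) for all actions a

q_net = TinyQNet(2, 2, np.random.default_rng(0))
target_net = copy.deepcopy(q_net)   # theta^-, kept fixed between syncs
gamma, k = 0.99, 100

def td_target(r, s_next):
    # Equation 7: Q_target = r + gamma * max_a' Q(s', a'; theta^-)
    return r + gamma * target_net(s_next).max()

def maybe_sync(step):
    # Every k steps: theta^- <- theta, then frozen for the next k steps
    if step % k == 0:
        target_net.W = q_net.W.copy()
```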
The ε-greedy action selection method can be performed with the action mask M(s, l). Through the neural network, we can obtain the distribution of Q-values over the action space:

$$Q^s = [Q(s, a_0), Q(s, a_1), \ldots, Q(s, a_n)] \tag{9}$$

The array Q^s contains the Q-values of all actions. Next, we apply the action mask to obtain the array of final Q-values Q^s_final:

$$Q^s_{final} = Q^s \otimes M(s, l) = [Q(s, a_0)m_0, \ldots, Q(s, a_n)m_n] \tag{10}$$

where ⊗ is element-wise multiplication. For example, the product of the first element of Q^s and the first element of M(s, l) gives the first element of Q^s_final. Because the action mask M(s, l) contains binary values, the element-wise multiplication forces each Q-value to either keep its value or become zero. During the action selection process, the actions with zero Q-values are ignored so that unsafe actions are blocked from being selected. If Q^s_final only contains zeros, all actions are unsafe. In this case, a predefined safe backup plan can be instantiated, or the maximum Q-value in Q^s can be selected directly.
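The masked ε-greedy selection of Equations 9 and 10 might look as follows; choosing the raw maximum as the all-masked fallback is just one of the two options the text mentions, and the function name is ours:

```python
import numpy as np

def select_action(q_values, mask, epsilon, rng):
    """Masked epsilon-greedy selection (Equations 9 and 10).

    q_values : Q(s, a) for every action (Equation 9)
    mask     : binary array M(s, l); 0 blocks an action (Equation 5)
    """
    q_final = q_values * mask            # element-wise product (Equation 10)
    allowed = np.flatnonzero(mask)
    if allowed.size == 0:
        # All actions masked: fall back to the raw maximum Q-value
        # (alternatively, trigger a predefined safe backup plan).
        return int(np.argmax(q_values))
    if rng.random() < epsilon:
        return int(rng.choice(allowed))  # explore among allowed actions
    return int(allowed[np.argmax(q_final[allowed])])
```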
Safety Estimator

Similar to the controller, which learns from the environment, the safety estimator learns from the safety monitor. Basically, G(l) is the reward of the safety monitor. However, unlike R(s, a) from the environment, G(l) cannot be fully trusted, because G(l) = 1 cannot differentiate between safe and hidden unsafe states. We do, however, trust G(l) = −1.

Another important aspect is that the safety estimator aims to solve a classification problem rather than an optimisation problem. Therefore, the safety estimator does not need to learn to maximise the reward but to identify hidden unsafe states. Here, the cumulative reward value is not meaningful because we cannot know which rewards are added; for example, a few bad rewards may be overwhelmed by good ones. As such, cumulative reward maximisation using the Bellman equation (Equation 2) is inappropriate for the goal of safety estimation. Therefore, we define a different Q-value function for our safety estimator.

The Q-value Q(s, l, a) is defined in Equation 11. We write [x]^{u=k_1}_{l=k_2} to indicate that the value of x in the brackets is capped by the upper bound k_1 and the lower bound k_2. For example, [5]^{u=1}_{l=−1} = 1 and [−2]^{u=1}_{l=−1} = −1.

$$Q(s, l, a) = \left[G(l') + 2\gamma_s \frac{\sum_{a' \in A} Q(s', l', a')}{|A|}\right]^{u=1}_{l=-1} \tag{11}$$

where |A| is the number of actions. Basically, the sigma term calculates the average Q-value over all actions in (s′, l′). The parameter γ_s can take two possible values, as shown in Equation 12:

$$\gamma_s = \begin{cases} 0 & \text{if } G(l') = -1 \\ 1 & \text{if } G(l') = 1 \end{cases} \tag{12}$$
to solve a classification problem rather than an optimisation u=1
problem. Therefore, the safety estimator does not need to = 1−2
learn to maximise the reward but to identify hidden unsafe l=−1
states. Here, the cumulative reward value is not meaning- = −1 (13)
ful because we cannot know what rewards are added. For P 0 0 0
Here, the sigma term ∀a0 ∈A Q(s , l , a ) is substituted with
example, a few bad rewards may be overwhelmed by good |A|×−1 because all the Q-values from (s0 , l0 ) are -1. There-
ones. As such, the cumulative reward maximisation using fore, the Q-value of an action leading to a hidden unsafe state
the Bellman equation (Equation 2) is inappropriate for the is -1.
goal of the safety estimation. Therefore, we define a differ-
ent Q-value function for our safety estimator. PIf there is0 at0 least one safe action, we always have
0
The Q-value Q(s, l, a) is defined in Equation 11. We ex- ∀a0 ∈A Q(s , l , a ) > |A| × −1, which makes the Q-value
u=k1 not equal to -1 but a larger value.
press [x]l=k 2
to indicate that the value of x in the parenthe-
sis is capped by the upper bound k1 and the lower bound k2 . It is an important factor in reinforcement learning,
For example, [5]u=1 u=1
l=−1 = 1 and [−2]l=−1 = −1.
whether the system reward converges to the optimal solu-
tion. In our case, the optimal solution is the actual set of un-
Q(s0 , l0 , a0 ) #u=1 ∗
P
" safe states Sunsaf e (Definition 3), and convergence means
∀a 0 ∈A
Q(s, l, a) = G(l0 ) + 2γs (11) that our safety estimator can eventually detect all states in
|A| ∗
Sunsaf e by finding the set Shidden of hidden unsafe states.
l=−1
Theorem 1. All hidden unsafe states and the actions leading to them are eventually detected by the safety estimator.

Proof. Consider an arbitrary path from (s_0, l_0) to (s_n, l_n) of length n:

$$(s_0, l_0) \xrightarrow{a_0} (s_1, l_1) \xrightarrow{a_1} \ldots \xrightarrow{a_{n-1}} (s_n, l_n) \tag{14}$$

Based on the safety monitor, the unsafe states that produce G(·) = −1 are known a priori. Without loss of generality, we assume that (s_n, l_n) is an unsafe state, so that G(l_n) = −1. By Lemma 1, the Q-value of the previous action, Q(s_{n−1}, l_{n−1}, a_{n−1}), is -1 because the action leads to an unsafe state. If the state (s_{n−1}, l_{n−1}) only has unsafe actions, the Q-value of its previous action, Q(s_{n−2}, l_{n−2}, a_{n−2}), is -1 by Lemma 2, because that action leads to a hidden unsafe state. Similarly, if (s_{n−2}, l_{n−2}) only has unsafe actions, Q(s_{n−3}, l_{n−3}, a_{n−3}) = −1, and so on. This chain of Q-value relations uses an unsafe state as a reference point, and by inductively backtracking the path, we can find all hidden unsafe states. Because every hidden unsafe state is connected to unsafe states by definition, the safety estimator can eventually obtain S*_unsafe.
Finally, the safety estimator's Q-value outputs are mapped into the action mask M(s, l). If the Q-value is -1, the mask value is set to 0; otherwise, it is set to 1:

$$\forall m_a \in M(s, l),\; m_a = \begin{cases} 0 & \text{if } Q(s, l, a) = -1 \\ 1 & \text{otherwise} \end{cases} \tag{15}$$
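Equation 15 as code; the tolerance parameter is a hypothetical addition of ours, since a tanh-bounded network output only approaches -1 and rarely equals it exactly:

```python
import numpy as np

def action_mask(q_safety, tol=0.05):
    """Equation 15: mask value 0 where the safety Q-value is -1, else 1.

    tol is a hypothetical tolerance: exact equality with -1 would
    almost never trigger on a learned, continuous-valued output.
    """
    return (q_safety > -1.0 + tol).astype(int)

print(action_mask(np.array([-1.0, -0.99, 0.7, 1.0])))  # -> [0 0 1 1]
```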
System Execution and Learning Process

A neural network is used to estimate the Q-value of the safety estimator. The estimated Q-value is denoted by Q(s, l, a; θ_s), where θ_s are the neural network parameters. The experience data (s, l, a, G(l′), s′, l′) are stored in two buffers, U_s and U_v, depending on the value of G(l′). If the value is 1, the experience data are stored in U_s; otherwise, in U_v.

If only a single buffer is used, keeping the unsafe experience data in the buffer is difficult, as they are frequently overwritten by new experience data. In particular, as the system control becomes better, new data are most likely to be safe-state experiences. Since the neural network training relies on the quality of the training data, biased data lead to biased learning (i.e., learning only the safe states), which degrades the safety estimation accuracy. Hence, to consistently provide unsafe-state data, we propose using double buffers, so that one buffer (U_v) is dedicated to storing unsafe-state data. In the experimental results section below, the safety estimator's performance with a single buffer and with double buffers is compared.
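A sketch of the double-buffer scheme with the 1:1 sampling ratio used below; the deque-based capacity handling and class name are our choices:

```python
import random
from collections import deque

class DoubleReplayBuffer:
    """Safe experiences go to U_s, unsafe (G = -1) ones to U_v, so the
    rare unsafe data is never overwritten by the dominant safe data."""

    def __init__(self, capacity):
        self.safe = deque(maxlen=capacity)    # U_s
        self.unsafe = deque(maxlen=capacity)  # U_v

    def store(self, experience, g_next):
        (self.safe if g_next == 1 else self.unsafe).append(experience)

    def sample(self, batch_size):
        # 1:1 ratio between safe and unsafe experiences, as in the paper.
        half = batch_size // 2
        batch = random.sample(self.safe, min(half, len(self.safe)))
        batch += random.sample(self.unsafe, min(half, len(self.unsafe)))
        random.shuffle(batch)
        return batch
```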
The overall system execution and learning process are shown in Algorithm 1. We first initialise the buffers and the neural network with random parameters θ_s. An episode consists of T discrete time steps (Line 4). In each discrete time step, the current states s_t and l_t are used to compute Q^{s_t} and M(s_t, l_t). The action a_t is chosen by the ε-greedy method after masking. We execute the action and get the next states s_{t+1} and l_{t+1}. Based on G(l_{t+1}), the experience data (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) are stored in either U_s or U_v. The minibatch is randomly sampled from both buffers; in our case, we sample from U_s and U_v with a 1:1 ratio. Next, we iterate over each experience datum in the minibatch. The target Q-value is decided by G(l′). The loss function computes the squared error between the target Q-value and Q(s, l, a; θ_s), and a gradient descent method is used to minimise the loss. Lastly, the controller is trained.

One important note is that the safety estimator's Q-value needs to be restricted to between -1 and 1. This can be done naturally by setting the output layer activation function; in our case, we use the hyperbolic tangent (tanh) function.

Algorithm 1: Safety Estimator Learning
1: Initialise U_s and U_v to capacity N
2: Initialise a neural network Q with random weights θ_s
3: for episode = 1, N do
4:   for t = 1, T do
5:     Get the current states s_t and l_t
6:     Get Q^{s_t} from the controller (Equation 9)
7:     Get M(s_t, l_t) based on Q(s_t, l_t, ∀a ∈ A; θ_s)
8:     Compute Q^{s_t}_final (Equation 10)
9:     if a random value between 0 and 1 ≥ ε then
10:      a_t ← arg max_a Q^{s_t}_final
11:    else
12:      a_t ← random non-zero element in Q^{s_t}_final
13:    end if
14:    Execute a_t and get the next states s_{t+1} and l_{t+1}
15:    Get G(l_{t+1}) from the safety monitor
16:    if G(l_{t+1}) = 1 then
17:      Store (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) in U_s
18:    else if G(l_{t+1}) = −1 then
19:      Store (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) in U_v
20:    end if
21:    Minibatch ← Sample(U_s ∪ U_v)
22:    for (s, l, a, G(l′), s′, l′) ∈ Minibatch do
23:      if G(l′) = 1 then
24:        q_target ← [1 + 2 (Σ_{a′∈A} Q(s′, l′, a′; θ_s)) / |A|]^{u=1}_{l=−1}
25:      else if G(l′) = −1 then
26:        q_target ← −1
27:      end if
28:      loss ← (q_target − Q(s, l, a; θ_s))²
29:      Minimise loss via a gradient descent method
30:      Train the controller using Equation 6
31:    end for
32:  end for
33: end for

Experimental Results

This section demonstrates the efficacy of our approach on the inverse pendulum example. We first explain the experimental setup, then present the results.
Experimental Setup

The benefit of our approach can be shown effectively by directly comparing the performance of deep Q-learning with and without the safety estimator. Furthermore, we consider the safety estimator with a single buffer and with double buffers as separate models to be evaluated. Therefore, three models are evaluated and compared:
1. Conventional deep Q-learning (DQN) (Mnih et al. 2015).
2. DQN with our safety estimator (single buffer).
3. DQN with our safety estimator (double buffer).

We are interested in the following observation criteria:
• How frequently safety is violated.
• How fast the control reward reaches the optimal solution (i.e., convergence).
• The evolution of the estimated set of hidden unsafe states over episodes.

Lastly, we describe the inverse pendulum experiment setup. Each episode terminates when 200 discrete steps have passed or the bar is considered fallen based on the safety monitor in Figure 2a. The reward is 1 for each discrete time step in which the bar is upright; otherwise, it is 0. Thus, the maximum cumulative reward of a single episode (or episode reward) is 200. To maximise the reward, the bar must keep the upright position as long as possible. We run 500 episodes for each model. A sketch of this episode logic is given below.
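As referenced above, a minimal sketch of the episode logic just described; the env and agent interfaces are hypothetical placeholders:

```python
def run_episode(env, agent, max_steps=200):
    """One episode: reward 1 per upright step, terminate on a fall
    (monitor of Figure 2a) or after 200 steps; maximum return is 200."""
    theta, omega = env.reset()
    episode_reward = 0
    for _ in range(max_steps):
        action = agent.act((theta, omega))
        theta, omega = env.step(action)
        if abs(theta) >= 0.35:   # monitor reports Fallen -> episode ends
            break
        episode_reward += 1      # reward 1 for each upright time step
    return episode_reward
```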
Figure 4: The cumulative reward of each episode, plotted over the number of episodes. The horizontal axis is the number of episodes, and the vertical axis is the cumulative reward of each episode.

Inverse Pendulum Results

Figure 4 shows the plot of the episode reward (vertical axis) over the number of episodes (horizontal axis). In the first 100 episodes, we observe that DQN with the safety estimator learns faster than conventional DQN. This is because avoiding safety violations can quickly increase the cumulative reward. In our inverse pendulum system, the maximum reward is only possible if the bar does not fall. Thus, a reward of less than 200 means that a safety violation occurred. In this sense, the conventional DQN and the single buffer model experience frequent safety violations. However, there is a noticeable improvement in the double buffer case, as the reward is maintained at the maximum value more consistently. Over 500 episodes, the number of episodes violating safety is 186 for the conventional DQN, 149 for the single buffer model, and 121 for the double buffer model. Thus, with the single buffer model, violations are reduced by 19.9%, and with the double buffer model by 34.9%.

As shown in the proof of Lemma 2, the hidden unsafe states can be recognised by checking the average Q-values: if the value is -1, the state is a hidden unsafe state. Figure 5 shows the heatmaps of the average Q-value for the single buffer case at different episodes. The blue colour marks an average Q-value of 1, whereas red marks -1. Intuitively, the red areas indicate the hidden unsafe states. In Episode 0, the estimator is initialised. As it experiences more episodes, the average Q-value is updated. However, compared to the analytical solution in Figure 2b, the estimation accuracy is poor and unstable. Especially after Episode 460, every state is considered safe. This problem is caused by the biased data in the buffer. As the controller gets better and better, it becomes more difficult to collect information about unsafe states. Consequently, the buffer becomes full of safe-state information, and the neural network can only learn from safe states. Overall, the poor and unstable estimation accuracy also explains why the single buffer case in Figure 4 exhibits severe fluctuations.

The heatmaps for the double buffer case are shown in Figure 6. It is clear that the use of double buffers has significantly improved the estimation accuracy compared to the single buffer case. Furthermore, the estimated regions do not change significantly over episodes, and there is no longer a biased learning problem in Episode 460. Despite the improved accuracy, the estimation is still an under-approximation of the analytic solution in Figure 2b. This is because the neural network learning itself also introduces uncertainties. Hence, violations can still occur, but they are reduced by 34.9%.
Figure 5: The average Q-value heatmaps of the single buffer case at different episodes: (a) Episode 0, (b) Episode 20, (c) Episode 40, (d) Episode 140, (e) Episode 180, (f) Episode 220, (g) Episode 260, (h) Episode 300, (i) Episode 400, (j) Episode 420, (k) Episode 440, (l) Episode 460. In each heatmap, the x-axis is the angle of the bar, and the y-axis is the angular velocity.
Related Work

Safe policy extraction (Watkins and Dayan 1992) is a technique that restricts the action space during policy extraction to ensure constraint satisfaction; it is widely used in various applications such as autonomous vehicle control (Mirchevska et al. 2018) and robot systems (Alshiekh et al. 2018). However, action masking often results in non-optimal solutions due to the lack of long-term future consideration. To address this, the study (Kalweit et al. 2020) specifies multi-step constraints using the formulation of constrained MDPs (Altman 1999). However, these approaches only look a fixed number of steps ahead. In contrast, the approach proposed here does not have such a limitation. Another approach, called Reward Constrained Policy Optimisation (RCPO) (Tessler, Mankowitz, and Mannor 2019), specifies penalties that are added to the reward and essentially represent the constraints. Similar to the others, RCPO requires the complete constraints to be given a priori.

Some studies combine the efforts of formal methods and machine learning. For example, in (Junges, Torfah, and Seshia 2021), a runtime monitor is syntactically composed with the discrete MDP system. In the resulting automaton, all unsafe actions are known; hence, actions can be blocked appropriately. However, this approach cannot be used for continuous state spaces due to the state explosion problem, and the given set of constraints needs to be perfect.

To the best of our knowledge, no work has yet dealt with partially known (i.e., some unsafe states are not detected) and partially correct (i.e., the monitor's output can be incorrect) safety monitors.

Conclusions

This paper addressed the problem of dealing with imperfect safety monitors. Such monitors can only detect a subset of the actual unsafe states. Our approach used deep Q-learning to estimate the missing states, which we called hidden unsafe states. The Q-value function was adjusted to find such states, and an action masking mechanism was employed to block unsafe actions. The experimental results for an inverse pendulum example showed that the learning speed is accelerated and the occurrence of safety violations is noticeably reduced.

In future work, the proposed approach can be analysed extensively using more complex benchmarking examples with larger action spaces, complicated reward functions, and dynamic changes in the deployment environment, such as damage. Further research can also investigate how to reuse the learned knowledge about safety to improve or adjust safety monitors.
Figure 6: The average Q-value heatmaps of the double buffer case at different episodes: (a) Episode 0, (b) Episode 20, (c) Episode 40, (d) Episode 140, (e) Episode 180, (f) Episode 220, (g) Episode 260, (h) Episode 300, (i) Episode 400, (j) Episode 420, (k) Episode 440, (l) Episode 460. For each heatmap, the x-axis is the angle of the bar, and the y-axis is the angular velocity.
References

Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2018. Safe reinforcement learning via shielding. In 32nd AAAI Conference on Artificial Intelligence, 2669–2678.

Altman, E. 1999. Constrained Markov Decision Processes, volume 7. CRC Press.

Berkenkamp, F.; Turchetta, M.; Schoellig, A. P.; and Krause, A. 2018. Safe model-based reinforcement learning with stability guarantees. In 31st Conference on Neural Information Processing Systems, volume 2, 909–919. Curran Associates, Inc.

Fisac, J. F.; Akametalu, A. K.; Zeilinger, M. N.; Kaynama, S.; Gillula, J.; and Tomlin, C. J. 2018. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7): 2737–2752.

Junges, S.; Torfah, H.; and Seshia, S. A. 2021. Runtime monitors for Markov decision processes. In International Conference on Computer Aided Verification, 553–576. Springer.

Kalweit, G.; Huegle, M.; Werling, M.; and Boedecker, J. 2020. Deep Inverse Q-learning with Constraints. In 34th Conference on Neural Information Processing Systems, volume 33, 14291–14302. Curran Associates, Inc.

Kantaros, Y.; Malencia, M.; Kumar, V.; and Pappas, G. J. 2020. Reactive temporal logic planning for multiple robots in unknown environments. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 11479–11485. IEEE.

Mirchevska, B.; Pek, C.; Werling, M.; Althoff, M.; and Boedecker, J. 2018. High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning. In 21st International Conference on Intelligent Transportation Systems (ITSC), 2156–2162. IEEE.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529–533.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Tessler, C.; Mankowitz, D. J.; and Mannor, S. 2019. Reward Constrained Policy Optimization. In 7th International Conference on Learning Representations, ICLR, 1–15. OpenReview.net.

Watkins, C. J.; and Dayan, P. 1992. Q-learning. Machine Learning, 8(3-4): 279–292.

Acknowledgments

This work was carried out in the Dependable Intelligent Software Lab (DISL), funded by the German Federal Ministry of Education and Research (BMBF – Bundesministerium für Bildung und Forschung).