<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Reinforcement Learning With Imperfect Safety Constraints</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jin</forename><surname>Woo</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Bamberg</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gerald</forename><surname>Lüttgen</surname></persName>
							<email>gerald.luettgen@uni-bamberg.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bamberg</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Diedrich</forename><surname>Wolter</surname></persName>
							<email>diedrich.wolter@uni-bamberg.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bamberg</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Reinforcement Learning With Imperfect Safety Constraints</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">2EBF88736116EDF22E773F5C1D56E68F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>An obvious approach to ensuring safety when applying deep reinforcement learning is to pair it with safety monitors. An ideal monitor captures all unsafe system states, but this is hardly possible in practical systems due to uncertainties in the environment, complicated system dynamics, and limited environment knowledge. Even with a perfect monitor, dynamic adjustments may still be required to account for system changes such as ageing and damage. What, then, should be done if the monitors are imperfect? This paper proposes an approach for estimating the undetected states of imperfect monitors in conjunction with deep Q-learning. A new Q-value function is developed to capture the target states effectively, and the system is designed to manage control and near-critical safety separately. Our experiments show that the proposed approach considers unsafe states better than a classic learning-based approach and converges faster to the maximum control reward.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Deep reinforcement learning (DRL) excels when it is used to control continuous systems in complex and uncertain environments. Cyber-physical systems (CPSs) such as robots and intelligent transportation systems are thus among the most promising applications for DRL. However, the deployment of DRL in real systems is currently hindered by the difficulty of ensuring the safety requirements of CPSs, where a set of constraints must always be satisfied to prevent catastrophic system failures. In particular, some control actions selected by DRL can violate the safety constraints. To overcome this problem, recent studies utilise a safety monitor in conjunction with DRL to detect safety violations and block the offending control action <ref type="bibr" target="#b3">(Junges, Torfah, and Seshia 2021)</ref>. If the safety monitor perfectly captures all unsafe system states and the actions leading to them, DRL can be forced to select only safe control actions <ref type="bibr" target="#b6">(Mirchevska et al. 2018)</ref>.</p><p>However, a perfect safety monitor is generally not realistic in many CPSs due to the complexities and uncertainties of the system environment. Furthermore, there may be only limited information known about the environment (e.g., robots in an unknown environment <ref type="bibr" target="#b5">(Kantaros et al. 2020)</ref>). Conservative approaches assume the worst-case scenario and overapproximate the unsafe states at the price of degraded system performance. However, finding the worst-case scenario itself is generally non-trivial and may rest on unrealistic assumptions. Even a perfect monitor may become ineffective over time due to external factors such as system ageing and damage. 
To fill the gap between ideal and realistic monitors, there is a need for learning components dedicated to capturing the undetected unsafe states.</p><p>This paper proposes an approach for capturing the undetected states of imperfect monitors using deep Q-learning. The idea of detecting unsafe states using a learning model is not new; however, existing studies <ref type="bibr">(Fisac et al. 2018;</ref><ref type="bibr" target="#b2">Berkenkamp et al. 2018</ref>) assume that the perfect safety constraints are known a priori. They trust that the safety interpretation given by the constraints is always correct and use it directly in the learning process. However, an imperfect monitor can produce false-negative outputs (i.e., the system is unsafe but interpreted as safe). Thus, we differentiate the cases in which the monitor's interpretation can and cannot be trusted. For this, we define a new Q-value function that is suitable for finding the undetected unsafe states, and we mathematically prove that this Q-value function eventually finds all undetected unsafe states. Lastly, through benchmarking on an inverse pendulum example, we compare the deep Q-learning performance with and without our safety estimation. The results show that safety violations are reduced noticeably and that the learning converges faster to the optimal control policy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Motivating Example</head><p>Consider the inverse pendulum example shown in Figure <ref type="figure" target="#fig_0">1</ref>. The bar can rotate either clockwise or anticlockwise about a fixed anchor. The gravity g and the force F are the two factors affecting the rotation. Two variables represent the instantaneous state of the bar: the angle θ and the angular velocity ω (i.e., a two-dimensional state space).</p><p>We assume that the safety constraints cannot be specified completely, and that a safety monitor is constructed as a finite state machine as shown in Figure <ref type="figure">2a</ref>. The constraint |θ| ≥ 0.35 is essentially our guess for distinguishing two states: the Upright state (i.e., safe) and the Fallen state (i.e., unsafe). The monitor outputs 1 or -1 to indicate safe and unsafe, respectively.</p><p>There is an equilibrium point where the torques of the gravitational force and of F cancel out: F d cos θ + (mgd/2) sin θ = 0. Based on our parameter values, the equilibrium is at θ = 0.201. If θ exceeds this point, the angular acceleration due to gravity becomes larger than that due to F, and the bar will always accelerate towards the ground. We also need to consider the effect of the instantaneous angular velocity ω: a large ω cannot be instantaneously reduced to zero (i.e., deceleration takes time). Therefore, the bar can still exceed the equilibrium point even if the force F is applied in the opposite direction.</p><p>Figure <ref type="figure">2b</ref> shows the state space of the inverse pendulum system. First, the areas with labels 1 and 2 represent the set of states that satisfy |θ| ≥ 0.35 (i.e., the Fallen state). The areas 3, 4, and 5 represent the possible Upright states. However, the states in 3 and 4 always converge to 1 and 2, respectively, due to either exceeding the equilibrium angle or a large angular velocity. 
Although 3 and 4 thus also need to be considered unsafe states, the imperfect safety monitor actually recognises them as safe (i.e., false-negative detections). In this paper, we call such incorrectly recognised states hidden unsafe states, and our objective is not only to find them but also to identify the actions leading to them.</p></div>
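The fate of a state in region 3 can be checked numerically. The following sketch simulates the bar under the parameters of Figure 1; since the paper does not spell out the equations of motion, the dynamics below (a uniform rod pivoted at one end, with moment of inertia I = md²/3) are an illustrative assumption:

```python
import math

# Parameters from Figure 1; the rod model (I = m*d^2/3) is our assumption.
m, d, g_mag, F = 2.0, 1.0, 9.81, 2.0
I = m * d * d / 3.0

def step(theta, omega, force, dt=0.01):
    # Gravity torque acts at the rod's centre (lever arm d/2); the horizontal
    # control force acts at the tip (lever arm d*cos(theta)).
    torque = force * d * math.cos(theta) + m * g_mag * (d / 2.0) * math.sin(theta)
    omega = omega + (torque / I) * dt
    theta = theta + omega * dt
    return theta, omega

# Start in region 3: |theta| = 0.25 is below the monitor threshold 0.35
# but beyond the equilibrium angle 0.201.
theta, omega = 0.25, 0.0
fallen = False
for _ in range(500):
    theta, omega = step(theta, omega, -F)  # push against the tilt at every step
    if abs(theta) >= 0.35:
        fallen = True
        break

print(fallen)  # True: the monitor flags the state only after the fall is inevitable
```

Even though the force is applied in the opposite direction throughout, the state drifts into region 1, which is exactly the false-negative behaviour described above.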
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Problem Formulation</head><p>In this section, we formalise the hidden unsafe states detection problem, given a safety monitor that provides (incomplete) ground truth about the safe and unsafe states. We start with a recap of Markov decision processes (MDPs) and formally define the safety monitor. Finally, we define our problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Markov Decision Processes</head><p>In reinforcement learning, a control agent interacts with the environment by reading the environment state and performing an action that updates that state. Such a system is typically modelled as an MDP <ref type="bibr" target="#b8">(Sutton and Barto 2018)</ref>.</p><p>Figure <ref type="figure">2</ref>: The safety monitor for the inverse pendulum example, and the visualisation of safe states, unsafe states, and hidden unsafe states as regions in the two-dimensional state space.</p><p>Definition 1 (Markov Decision Process). An MDP is a tuple M = ⟨S, A, P, R, γ⟩, where:</p><p>• S is the set of states, i.e., the state space.</p><p>• A is the set of actions, i.e., the action space.</p><p>• P(s'|s, a) is the probability distribution over successor states s' given state s and action a.</p><p>• R(s, a) is the immediate reward when taking action a from state s.</p><formula xml:id="formula_0">• γ ∈ [0, 1] is the discount factor.</formula><p>In every discrete time step t, the system control takes an action a_t ∈ A from a state s_t ∈ S based on a policy π : S → A. The system then reaches the next state s_{t+1} ∈ S, and the agent receives the reward R(s_t, a_t). The action-value function Q(s, a), often simply called the Q-value, defines the value of taking action a from state s:</p><formula xml:id="formula_1">Q(s, a) = E^π_s [ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_t = s, a_t = a ]<label>(1)</label></formula><p>where E^π_s is the expected value when following policy π from state s. Intuitively, the value returned by Equation 1 is the expected sum of the rewards over all t. The Bellman optimality equation for the optimal action-value function Q*(s, a), which yields the maximum cumulative reward, is</p><formula xml:id="formula_2">Q*(s, a) = E[R(s, a) + γ max_{a'} Q*(s', a')]<label>(2)</label></formula><p>where s' is the successor state and a' is an action that can be taken from state s'. 
In this paper, we consider an MDP with a continuous state space and a discrete action space, as is the case for the inverse pendulum example.</p></div>
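For a concrete feel of Equations 1 and 2, the Bellman optimality equation can be solved by repeated backups on a toy model; the two-state, two-action MDP below is made up purely for illustration (deterministic transitions, with action a moving the system to state a):

```python
# Value iteration sketch for Equation 2 on a hypothetical two-state MDP:
# action a deterministically moves the system to state a, and the reward
# is 1 in state 1 and 0 in state 0.
GAMMA = 0.9
S, A = (0, 1), (0, 1)

def reward(s, a):
    return float(s == 1)

Q = {(s, a): 0.0 for s in S for a in A}
for _ in range(1000):
    # Bellman backup: Q(s, a) = R(s, a) + gamma * max_a' Q(s', a')
    Q = {(s, a): reward(s, a) + GAMMA * max(Q[(a, b)] for b in A)
         for s in S for a in A}

print({k: round(v, 2) for k, v in Q.items()})
# {(0, 0): 8.1, (0, 1): 9.0, (1, 0): 9.1, (1, 1): 10.0}
```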
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Safety Monitors</head><p>Generally, a safety monitor is a finite state machine. In our study, the monitor takes the inputs s and a from the MDP and outputs the value 1 to indicate safe or -1 to indicate unsafe.</p><p>Definition 2 (Safety Monitor). A safety monitor is a deterministic Moore machine defined as a tuple A = ⟨L, l_0, I, δ, G⟩, where:</p><p>• L is the finite set of discrete locations.</p><p>• l_0 ∈ L is the initial location.</p><p>• I is the set of input variables.</p><p>• δ : L × 2^{AP(I)} → L is the transition function, where AP(I) is a set of Boolean expressions over the input variables.</p><p>• G : L → {−1, 1} maps each location l ∈ L to either -1 or 1. We denote the set of safe locations by L_safe = {l ∈ L | G(l) = 1}. Similarly, the set of unsafe locations is L_unsafe = {l ∈ L | G(l) = −1}.</p><p>If there are multiple monitors in the system, we can represent them as a single monitor whose output is 1 if the outputs of all monitors are 1, and -1 otherwise. Therefore, we may assume without loss of generality that there is exactly one safety monitor in the system.</p><p>At any time instant, the state of the entire system including the monitor can be expressed as a state pair (s, l), where s ∈ S is the state of the MDP and l ∈ L is the location of the monitor. Moreover, an infinite path can be generated by following the policy π starting from (s_0, l_0):</p><formula xml:id="formula_4">p^π_{(s_0,l_0)} = (s_0, l_0) -a_0→ (s_1, l_1) -a_1→ (s_2, l_2) -a_2→ ...<label>(3)</label></formula><p>In the path, a state pair (s, l) is identified as unsafe if l ∈ L_unsafe. Likewise, an action a is identified as an unsafe action in (s, l) if there exists a transition (s, l) -a→ (s', l') with l' ∈ L_unsafe.</p></div>
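As an illustration, the monitor of Figure 2a can be sketched as a small Moore machine in the sense of Definition 2. The transition structure below, with the guard evaluated afresh on every input, is our simplified reading of the figure:

```python
# A minimal sketch of Definition 2: two locations, a transition function
# guarded by the Boolean expression |theta| >= 0.35 over the input, and
# G mapping locations to 1 (safe) or -1 (unsafe).
class SafetyMonitor:
    def __init__(self):
        self.location = "Upright"  # l0, the initial location

    def delta(self, theta):
        # Transition function over AP(I) = { |theta| >= 0.35 }.
        return "Fallen" if abs(theta) >= 0.35 else "Upright"

    def observe(self, theta):
        self.location = self.delta(theta)
        G = {"Upright": 1, "Fallen": -1}
        return G[self.location]

monitor = SafetyMonitor()
print(monitor.observe(0.10))  # 1  (Upright)
print(monitor.observe(0.40))  # -1 (Fallen)
```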
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hidden Unsafe States Detection Problem</head><p>We denote by S_unsafe the set of all state pairs (s, l) satisfying l ∈ L_unsafe. Similarly, S_safe denotes the set of all state pairs (s, l) satisfying l ∈ L_safe.</p><p>Definition 3 (Hidden Unsafe States Detection Problem). Let S_hidden ⊆ S_safe represent the set of hidden unsafe states, i.e., the states (s, l) ∈ S_safe from which all possible paths always eventually reach a state (s', l') ∈ S_unsafe. Then, the hidden unsafe states detection problem is to obtain the actual unsafe set</p><formula xml:id="formula_6">S*_unsafe = S_hidden ∪ S_unsafe<label>(4)</label></formula><p>by finding S_hidden based on S_unsafe, which is known a priori.</p><p>Intuitively, S_unsafe is what the imperfect monitor recognises as unsafe, and S*_unsafe is what a perfect monitor would recognise as unsafe. Thus, S_hidden is the gap between the perfect and imperfect monitors, and it is what our approach aims to find.</p></div>
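On a finite abstraction, Definition 3 amounts to a backward fixpoint: a state is hidden unsafe when every successor is already in the unsafe or hidden unsafe set, and iterating this to a fixpoint yields S*_unsafe = S_hidden ∪ S_unsafe (Equation 4). The five-state transition graph below is hypothetical:

```python
# An illustrative computation of S_hidden on a small finite abstraction.
# transitions[s] lists the possible successors of s (over all actions).
transitions = {
    "s0": ["s1", "s3"],   # s0 can still avoid the unsafe region via s3
    "s1": ["s2"],         # every action from s1 leads to s2
    "s2": ["u"],          # every action from s2 leads to the unsafe state
    "s3": ["s3"],         # s3 is a safe loop
    "u":  ["u"],
}
unsafe = {"u"}

def actual_unsafe(transitions, unsafe):
    bad = set(unsafe)
    changed = True
    while changed:
        changed = False
        for s, succs in transitions.items():
            if s not in bad and all(t in bad for t in succs):
                bad.add(s)      # all successors are (hidden) unsafe
                changed = True
    return bad

print(sorted(actual_unsafe(transitions, unsafe)))  # ['s1', 's2', 'u']
```

Here s1 and s2 play the role of the hidden unsafe states: the monitor labels them safe, yet every path from them ends in u.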
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System Model</head><p>This section presents our approach to estimate S * unsaf e and block the actions that immediately lead to such states. We first give an overview of our system model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Overview</head><p>The system overview is shown in Figure <ref type="figure" target="#fig_2">3</ref>. First, the environment passes the state s and the reward R(s, a) to the controller based on action a. Such a feedback loop between environment and controller is the standard design of reinforcement learning. State s and action a are also passed to the safety estimator, which produces an output indicating which actions need to be blocked. It requires the information s, a, and l to learn the system path (Equation <ref type="formula" target="#formula_4">3</ref>), and also G(l) to know what is safe and unsafe. The action mask output M(s, l) is an array of binary values whose size equals the number of control actions:</p><formula xml:id="formula_7">M(s, l) = [m_0, m_1, ..., m_n]<label>(5)</label></formula><p>Intuitively, each binary value is associated with one action. For example, m_i = 0 means that action a_i ∈ A needs to be blocked. In this way, we can selectively block unsafe actions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Controller</head><p>The controller is based on the deep Q-learning method proposed in <ref type="bibr" target="#b7">(Mnih et al. 2015)</ref>. Generally, the optimal Q-value Q*(s, a) in Equation <ref type="formula" target="#formula_2">2</ref> is not known to the controller. Hence, we use a neural network to estimate Q*(s, a) with Q(s, a; θ), where θ denotes the parameters of the neural network (i.e., weights and biases). Note that θ is the standard notation for neural network parameters, and it should not be confused with the angle θ in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>For each discrete time step, the experience replay buffer U stores the controller's experience (s, a, r, s'), where s is the current state, a is the action, r is the reward, and s' is the next state. Note that the use of a replay buffer follows <ref type="bibr" target="#b7">(Mnih et al. 2015)</ref>. Using the data stored in buffer U, the neural network is trained with the following loss function:</p><formula xml:id="formula_8">Loss(θ) = E_{(s,a,r,s')∼U} [(Q_target − Q_pred)^2]<label>(6)</label></formula><p>where</p><formula xml:id="formula_9">Q_target = r + γ max_{a'} Q(s', a'; θ^−)<label>(7)</label></formula><formula xml:id="formula_10">Q_pred = Q(s, a; θ)<label>(8)</label></formula><p>and γ is the discount factor. Here, (s, a, r, s') ∼ U means that the experience data are sampled from buffer U. The term in Equation <ref type="formula">6</ref> represents the squared error between Q_pred (the predicted Q-value) and Q_target (the target Q-value). In the equation for Q_target, the symbol θ^− represents the previous value of θ. More precisely, we update θ^− = θ every k steps, but during those k steps, θ^− is kept fixed. Note that k is a hyperparameter used to control the learning process, following <ref type="bibr" target="#b7">(Mnih et al. 2015)</ref>. 
Finally, the neural network learns to minimise this loss function.</p><p>The ε-greedy action selection method can be performed with the action mask M(s, l). Through the neural network, we obtain the distribution of Q-values over the action space:</p><formula xml:id="formula_12">Q^s = [Q(s, a_0), Q(s, a_1), ..., Q(s, a_n)]<label>(9)</label></formula><p>The array Q^s contains the Q-values of all actions. Next, we apply the action mask to obtain the array of final Q-values Q^s_final:</p><formula xml:id="formula_13">Q^s_final = Q^s ⊗ M(s, l) = [Q(s, a_0), ..., Q(s, a_n)] ⊗ [m_0, ..., m_n] = [Q(s, a_0)m_0, ..., Q(s, a_n)m_n]<label>(10)</label></formula><p>where ⊗ denotes element-wise multiplication; for example, the product of the first elements of Q^s and M(s, l) is the first element of Q^s_final. Because the action mask M(s, l) contains binary values, the element-wise multiplication either keeps a Q-value or sets it to zero. During action selection, the actions with zero Q-values are ignored so that unsafe actions are blocked from being selected. If Q^s_final contains only zeros, all actions are unsafe. In this case, a predefined safe backup plan can be instantiated, or the maximum Q-value in Q^s can be selected directly.</p></div>
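The masked ε-greedy selection of Equations 9 and 10 can be sketched as follows. The Q-values here are made up, since they would normally come from the neural network; falling back to the unmasked maximum implements the all-actions-unsafe case mentioned above:

```python
import random

# Sketch of Equations 9-10 plus masked epsilon-greedy selection.
def select_action(Q_s, mask, epsilon=0.1):
    # Equation 10: element-wise product zeroes out the masked actions.
    Q_final = [q * m for q, m in zip(Q_s, mask)]
    allowed = [i for i, m in enumerate(mask) if m == 1]
    if not allowed:
        # All actions masked: fall back to the unmasked maximum (or a backup plan).
        return max(range(len(Q_s)), key=lambda i: Q_s[i])
    if random.random() >= epsilon:
        return max(allowed, key=lambda i: Q_final[i])  # greedy over non-zero entries
    return random.choice(allowed)  # explore among the remaining safe actions

random.seed(0)
print(select_action([0.2, 0.9, 0.5], [1, 0, 1]))  # 2: action 1 is masked out
```

Taking the argmax over the allowed indices (rather than over the zeroed array itself) sidesteps the corner case where all unmasked Q-values are negative and zero would otherwise win.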
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Safety Estimator</head><p>Similar to the controller, which learns from the environment, the safety estimator learns from the safety monitor. Basically, G(l) is the reward given by the safety monitor. However, unlike R(s, a) from the environment, G(l) cannot be fully trusted, because G(l) = 1 does not differentiate between safe and hidden unsafe states. We do, however, trust G(l) = −1.</p><p>Another important aspect is that the safety estimator aims to solve a classification problem rather than an optimisation problem. Therefore, the safety estimator does not need to learn to maximise the reward but to identify hidden unsafe states. Here, the cumulative reward is not meaningful because we cannot know which rewards were added; for example, a few bad rewards may be overwhelmed by good ones. As such, cumulative reward maximisation via the Bellman equation (Equation <ref type="formula" target="#formula_2">2</ref>) is inappropriate for the goal of safety estimation. Therefore, we define a different Q-value function for our safety estimator.</p><p>The Q-value Q(s, l, a) is defined in Equation <ref type="formula">11</ref>. We write [x]^{u=k_1}_{l=k_2} to indicate that the value x in brackets is capped by the upper bound k_1 and the lower bound k_2. For example,</p><formula xml:id="formula_15">[5]^{u=1}_{l=−1} = 1 and [−2]^{u=1}_{l=−1} = −1; Q(s, l, a) = [G(l') + (2γ_s/|A|) Σ_{∀a'∈A} Q(s', l', a')]^{u=1}_{l=−1}<label>(11)</label></formula><p>where |A| is the number of actions. Basically, the summation term calculates the average Q-value over all actions in (s', l'). The parameter γ_s can take two possible values, as shown in Equation <ref type="formula">12</ref>:</p><formula xml:id="formula_16">γ_s = 0 if G(l') = −1, and γ_s = 1 if G(l') = 1<label>(12)</label></formula><p>Lemma 1. The Q-value of an action that directly leads to an unsafe state is always -1.</p><p>Proof. We prove Lemma 1 directly. Consider an arbitrary transition (s, l) -a→ (s', l'), where (s', l') ∈ S_unsafe. 
In this case, we have G(l') = −1 from the monitor, which yields γ_s = 0 by Equation <ref type="formula">12</ref>. Substituting γ_s = 0 and G(l') = −1 into Equation <ref type="formula">11</ref>, we always have Q(s, l, a) = [−1]^{u=1}_{l=−1} = −1: the average term in Equation <ref type="formula">11</ref> is eliminated by γ_s = 0, and the Q-value is simply the value of G(l'). This reflects that we trust the case G(l') = −1. In contrast, the case G(l') = 1 requires additional information to differentiate between safe and hidden unsafe states; hence, the average term in Equation <ref type="formula">11</ref> is the key to identifying hidden unsafe states. Intuitively, γ_s operates as a switch between these two cases.</p><p>Lemma 2. The Q-value of an action that directly leads to a hidden unsafe state is always -1.</p><p>Proof. Consider a transition (s, l) -a→ (s', l'), where (s, l) and (s', l') are both members of S_safe, so that G(l) = G(l') = 1. To identify (s', l') as a hidden unsafe state, we further assume that all actions a' taken in (s', l') directly lead to some unsafe states. In this case, we know that Q(s', l', a') = −1 by Lemma 1. Furthermore, γ_s = 1 because G(l') = 1. Hence, Equation 11 simplifies to:</p><formula xml:id="formula_18">Q(s, l, a) = [G(l') + (2γ_s/|A|) Σ_{∀a'∈A} Q(s', l', a')]^{u=1}_{l=−1} = [1 + (2/|A|) × |A| × (−1)]^{u=1}_{l=−1} = [1 − 2]^{u=1}_{l=−1} = −1<label>(13)</label></formula><p>Here, the summation Σ_{∀a'∈A} Q(s', l', a') is substituted by |A| × (−1) because all the Q-values from (s', l') are -1. Therefore, the Q-value of an action leading to a hidden unsafe state is -1.</p><p>If there is at least one safe action, we instead have Σ_{∀a'∈A} Q(s', l', a') &gt; |A| × (−1), which makes the Q-value larger than -1.</p><p>An important question in reinforcement learning is whether the system reward converges to the optimal solution. 
In our case, the optimal solution is the actual set of unsafe states S*_unsafe (Definition 3), and convergence means that our safety estimator can eventually detect all states in S*_unsafe by finding the set S_hidden of hidden unsafe states.</p><p>Theorem 1. All hidden unsafe states and the actions leading to them are eventually detected by the safety estimator.</p><p>Proof. Consider an arbitrary path from (s_0, l_0) to (s_n, l_n) of length n:</p><formula xml:id="formula_19">(s_0, l_0) -a_0→ (s_1, l_1) -a_1→ ... -a_{n−1}→ (s_n, l_n)<label>(14)</label></formula><p>Based on the safety monitor, the unsafe states that produce G(·) = −1 are known a priori. Without loss of generality, we assume that (s_n, l_n) is an unsafe state, so that G(l_n) = −1. By Lemma 1, the Q-value of the preceding action, Q(s_{n−1}, l_{n−1}, a_{n−1}), is -1 because the action leads to an unsafe state. If the state (s_{n−1}, l_{n−1}) has only unsafe actions, the Q-value of its preceding action, Q(s_{n−2}, l_{n−2}, a_{n−2}), is -1 by Lemma 2, because that action leads to a hidden unsafe state. Similarly, if (s_{n−2}, l_{n−2}) has only unsafe actions, then Q(s_{n−3}, l_{n−3}, a_{n−3}) = −1, and so on. This chain of Q-value relations uses an unsafe state as a reference point, and by inductively backtracking along the path, we can find all hidden unsafe states. Because every hidden unsafe state is connected to unsafe states by definition, the safety estimator can eventually obtain S*_unsafe.</p><p>Finally, the safety estimator's Q-value outputs are mapped to the action mask M(s, l): if the Q-value is -1, the mask value is set to 0; otherwise, it is set to 1.</p><formula xml:id="formula_20">∀m_a ∈ M(s, l): m_a = 0 if Q(s, l, a) = −1, and m_a = 1 otherwise<label>(15)</label></formula></div>
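The backward propagation in Lemmas 1 and 2 and Theorem 1 can be traced in a tabular sketch of Equations 11 and 12. The chain s0 → s1 → s2 → s3 below, with a single action per state and only s3 flagged unsafe by the monitor, is a hypothetical example:

```python
# Tabular sketch of the safety estimator's Q-value (Equations 11-12) on a
# hypothetical chain s0 -> s1 -> s2 -> s3 with one action per state, where
# only s3 is flagged unsafe by the monitor (G = -1).
def clamp(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))

G = {"s1": 1, "s2": 1, "s3": -1}   # monitor output for each successor
succ = {"s0": "s1", "s1": "s2", "s2": "s3"}
Q = {s: 0.0 for s in succ}         # Q(s, a) for the single action in s

for _ in range(5):  # repeated backups propagate -1 backwards along the chain
    for s, s_next in succ.items():
        g = G[s_next]
        gamma_s = 0.0 if g == -1 else 1.0       # Equation 12: trust G = -1
        avg = Q.get(s_next, 0.0)                # average over the one action
        Q[s] = clamp(g + 2.0 * gamma_s * avg)   # Equation 11

print(Q)  # {'s0': -1.0, 's1': -1.0, 's2': -1.0}
```

The first backup fixes Q(s2) = -1 (Lemma 1); subsequent backups drive Q(s1) and then Q(s0) to -1 via the clamped 1 + 2·(-1) = -1 step (Lemma 2), exactly the inductive backtracking of Theorem 1.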
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System Execution and Learning Process</head><p>A neural network is used to estimate the Q-value of the safety estimator. The estimated Q-value is denoted by Q(s, l, a; θ_s), where θ_s are the neural network parameters.</p><p>The experience data (s, l, a, G(l'), s', l') are stored in two buffers, U_s and U_v, depending on the value of G(l'). If the value is 1, the experience data are stored in U_s; otherwise, in U_v.</p><p>If only a single buffer is used, keeping the unsafe experience data in the buffer is difficult, as they are frequently overwritten by new experience data. In particular, as the system control improves, new data are most likely to be safe-state experiences. Since neural network training relies on the quality of the training data, biased data lead to biased learning (i.e., learning only the safe states), which degrades the safety estimation accuracy. Hence, to consistently provide unsafe-state data, we propose using double buffers, so that one buffer (U_v) is dedicated to storing unsafe-state data. In the experimental results section below, the performance of the safety estimator with a single buffer and with double buffers is compared.</p><p>The overall system execution and learning process are shown in Algorithm 1. We first initialise the buffers and the neural network with random parameters θ_s. An episode consists of T discrete time steps (Line 4). In each discrete time step, the current states s_t and l_t are used to compute Q^{s_t} and M(s_t, l_t). The action a_t is chosen based on the ε-greedy method after masking. 
We execute the action and obtain the next states s_{t+1} and l_{t+1}. Based on G(l_{t+1}), the experience data (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) are stored in either U_s or U_v. A minibatch is then randomly sampled from both buffers; in our case, we sample from U_s and U_v with a 1:1 ratio. Next, we iterate over each experience datum in the minibatch. The target Q-value is determined by G(l'). The loss function computes the squared error between the target Q-value and Q(s, l, a; θ_s), and a gradient descent method is used to minimise this loss. Lastly, the controller is trained. One important note is that the safety estimator's Q-value needs to be restricted to between -1 and 1; this is achieved naturally by the output layer activation function, for which we use the hyperbolic tangent (tanh).</p><p>Algorithm 1: Safety Estimator Learning</p><p>1: Initialise U_s and U_v to capacity N</p><p>2: Initialise a neural network Q with random weights θ_s</p><p>3: for episode = 1, N do</p><p>4: for t = 1, T do</p><p>5: Get the current states s_t and l_t</p><p>6: Get Q^{s_t} from the controller (Equation <ref type="formula" target="#formula_12">9</ref>)</p><p>7: Get M(s_t, l_t) based on Q(s_t, l_t, ∀a ∈ A; θ_s)</p><p>8: Compute Q^{s_t}_final (Equation <ref type="formula" target="#formula_13">10</ref>)</p><p>9: if a random value between 0 and 1 ≥ ε then</p><p>10: a_t ← arg max_a Q^{s_t}_final</p><p>11: else</p><p>12: a_t ← random non-zero element in Q^{s_t}_final</p><p>13: end if</p><p>14: Execute a_t and get the next states s_{t+1} and l_{t+1}</p><p>15: Get G(l_{t+1}) from the safety monitor</p><p>16: if G(l_{t+1}) = 1 then</p><p>17: Store (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) in U_s</p><p>18: else if G(l_{t+1}) = −1 then</p><p>19: Store (s_t, l_t, a_t, G(l_{t+1}), s_{t+1}, l_{t+1}) in U_v</p><p>20: end if</p><p>21: Minibatch ← Sample(U_s ∪ U_v)</p><p>22: for (s, l, a, G(l'), s', l') ∈ Minibatch do</p><p>23: if G(l') = 1 then</p><p>24: q_target ← [1 + (2/|A|) Σ_{∀a'∈A} Q(s', l', a'; θ_s)]^{u=1}_{l=−1}</p><p>25: else if G(l') = −1 then</p><p>26: q_target ← −1</p><p>27: end if</p><p>28: loss ← (q_target − Q(s, l, a; θ_s))^2</p><p>29: Minimise loss via a gradient descent method</p><p>30: Train the controller using Equation <ref type="formula">6</ref></p><p>31: end for</p><p>32: end for</p><p>33: end for</p></div>
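The double-buffer scheme and the 1:1 sampling can be sketched as follows (the buffer capacity and batch size below are made-up values):

```python
import random
from collections import deque

# Sketch of the double-buffer scheme: U_s holds safe-transition experience,
# U_v holds unsafe-transition experience, and minibatches are drawn from both
# so that rare unsafe data are never crowded out by new safe data.
U_s = deque(maxlen=10000)
U_v = deque(maxlen=10000)

def store(experience, g_next):
    # Route by the monitor output G(l') carried with the experience tuple.
    (U_s if g_next == 1 else U_v).append(experience)

def sample_minibatch(batch_size=32):
    half = batch_size // 2  # 1:1 ratio between safe and unsafe data
    k_s = min(half, len(U_s))
    k_v = min(half, len(U_v))
    return random.sample(list(U_s), k_s) + random.sample(list(U_v), k_v)

for i in range(100):
    store(("s", "l", "a", 1, "s2", "l2"), 1)
store(("s", "l", "a", -1, "s2", "l2"), -1)
print(len(sample_minibatch(8)))  # 5: four safe samples plus the single unsafe one
```

With a single shared buffer the lone unsafe tuple above would eventually be evicted by the stream of safe tuples; here it survives in U_v and appears in every minibatch.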
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Results</head><p>This section demonstrates the efficacy of our approach on the inverse pendulum example. We first explain the experimental setup, followed by the experimental results.</p><p>Figure <ref type="figure">5</ref> shows the average Q-value heatmaps of the single buffer case. Here, unsafe experience data are quickly overwritten in the buffer: as the controller gets better and better, it becomes more difficult to collect information about unsafe states. Consequently, the buffer fills up with safe-state information, and the neural network can only learn from safe states. Overall, the poor and unstable estimation accuracy also explains why the single buffer case in Figure <ref type="figure" target="#fig_4">4</ref> exhibits severe fluctuations.</p><p>The heatmaps for the double buffer case are shown in Figure <ref type="figure">6</ref>. Clearly, the use of double buffers significantly improves the estimation accuracy compared to the single buffer case. Furthermore, the estimated regions do not change significantly over the episodes, and there is no longer a biased learning problem in Episode 460. Despite the improved accuracy, the estimation is still an underapproximation of the analytic solution in Figure <ref type="figure">2b</ref>, because the neural network learning itself introduces uncertainties. Hence, violations can still occur, but they are reduced by 34.9%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related work</head><p>Safe policy extraction (Watkins and Dayan 1992) is a technique that restricts the action space during policy extraction to ensure constraint satisfaction; it is widely used in applications such as autonomous vehicle control <ref type="bibr" target="#b6">(Mirchevska et al. 2018</ref>) and robot systems <ref type="bibr" target="#b0">(Alshiekh et al. 2018</ref>). However, action masking often results in non-optimal solutions due to the lack of consideration of the long-term future. To address this, the study <ref type="bibr" target="#b4">(Kalweit et al. 2020</ref>) specifies multi-step constraints using the formulation of constrained MDPs <ref type="bibr" target="#b1">(Altman 1999</ref>). However, these constraints only look a fixed number of steps ahead; in contrast, the proposed approach does not have such a limitation. Another approach, Reward Constrained Policy Optimisation (RCPO) <ref type="bibr" target="#b9">(Tessler, Mankowitz, and Mannor 2019)</ref>, specifies penalties that are added to the reward and essentially represent the constraints. Like the others, RCPO requires the complete constraints to be given a priori.</p><p>Figure <ref type="figure">5</ref>: The average Q-value heatmaps of the single buffer case at different episodes. In each heatmap, the x-axis is the angle of the bar, and the y-axis is the angular velocity.</p><p>Some studies combine the efforts of formal methods and machine learning. For example, in (Junges, Torfah, and Seshia 2021), a runtime monitor is syntactically composed with the discrete MDP system. In the resulting automaton, all unsafe actions are known; hence, actions can be blocked appropriately. 
However, this approach cannot be used for continuous state spaces due to the state explosion problem, and the given set of constraints needs to be perfect.</p><p>To the best of our knowledge, no existing work deals with safety monitors that are only partially known (i.e., some unsafe states are not detected) and partially correct (i.e., the monitor's output can be incorrect).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions</head><p>This paper addressed the problem of dealing with imperfect safety monitors, which can detect only a subset of the actual unsafe states. Our approach used deep Q-learning to estimate the missing states, which we called hidden unsafe states. The Q-value function was adjusted to find such states, and an action masking mechanism was employed to block unsafe actions. The experimental results on an inverse pendulum example showed that the learning speed is accelerated and that the occurrence of safety violations is noticeably reduced.</p><p>In future work, the proposed approach can be analysed extensively using more complex benchmarking examples with larger action spaces, complicated reward functions, and dynamic changes in the deployment environment, such as damage. Further research can also investigate how to reuse the learned knowledge about safety to improve or adjust safety monitors.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Inverse pendulum system, where a bar rotates about a fixed anchor. We assume that m = 2 kg, d = 1 m, g = −9.81 m/s^2, and θ is given in radians. The force |F| = 2 N is a constant-magnitude horizontal force applied at the tip of the bar. The direction of F can be controlled (left: F = −2 N or right: F = 2 N) to keep the bar upright.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: System Model Overview</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: The cumulative reward of each episode is plotted over the number of episodes. The horizontal axis is the number of episodes, and the vertical axis is the cumulative reward of each episode.</figDesc><graphic coords="6,331.43,54.00,214.65,147.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc></figDesc><table><row><cell></cell><cell>arg max a Q st final</cell></row><row><cell>11:</cell><cell>else</cell></row><row><cell>12:</cell><cell>a t ← random non-zero element in Q st final</cell></row><row><cell>13:</cell><cell>end if</cell></row><row><cell>14:</cell><cell>Execute a t and get the next states s t+1 and l t+1</cell></row><row><cell>15:</cell><cell>Get G(l t+1 ) from the safety monitor</cell></row><row><cell>16:</cell><cell>if G(l t+1 ) = 1 then</cell></row><row><cell>17:</cell><cell>Store (s t , l t , a t , G(l t+1 ), s t+1 , l t+1 ) in U s</cell></row><row><cell>18:</cell><cell>else if G(l t+1 ) = −1 then</cell></row><row><cell>19:</cell><cell>Store (s t , l t , a t , G(l t+1 ), s t+1 , l t+1 ) in U v</cell></row><row><cell>20:</cell><cell>end if</cell></row><row><cell>21:</cell><cell>Minibatch ← Sample(U s ∪ U v );</cell></row><row><cell>22:</cell><cell>for (s, l, a, G(l ′ ), s ′ , l ′ ) ∈ Minibatch do</cell></row><row><cell>23:</cell><cell></cell></row></table></figure>
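The buffer bookkeeping in lines 16-21 of the algorithm table can be sketched as follows (a hypothetical minimal version, where the monitor verdict G(l) = 1 marks a safe successor state and −1 a violation):

```python
import random

U_s, U_v = [], []  # safe-transition buffer and violation buffer

def store(transition, g):
    """Route a transition to the buffer matching the monitor verdict g (lines 16-20)."""
    if g == 1:
        U_s.append(transition)
    elif g == -1:
        U_v.append(transition)

def sample_minibatch(k):
    """Sample from the union of both buffers, as in Minibatch <- Sample(U_s u U_v)."""
    pool = U_s + U_v
    return random.sample(pool, min(k, len(pool)))
```

Keeping violations in their own buffer guarantees they remain represented in minibatches even when safe transitions dominate the data, which is one plausible motivation for the double-buffer variant.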
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is carried out in the Dependable Intelligent Software Lab (DISL), funded by the German Federal Ministry of Education and Research (BMBF -Bundesministerium für Bildung und Forschung).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Setup</head><p>The benefit of our approach can be shown effectively by directly comparing the performance of deep Q-learning with and without the safety estimator. Furthermore, we consider the safety estimator with a single buffer and with double buffers as separate models. Therefore, three models are evaluated and compared:</p><p>1. Conventional deep Q-learning (DQN) <ref type="bibr" target="#b7">(Mnih et al. 2015)</ref>. 2. DQN with our safety estimator (single buffer). 3. DQN with our safety estimator (double buffer).</p><p>We are interested in the following observation criteria:</p><p>• How frequently safety is violated.</p><p>• How fast the control reward reaches the optimal solution (i.e., convergence).</p><p>• The evolution of the estimated set of hidden unsafe states over episodes.</p><p>Lastly, we describe the setup of the inverse pendulum experiment. Each episode terminates when 200 discrete steps have passed or the bar is considered to have fallen down according to the safety monitor in Figure <ref type="figure">2a</ref>. The reward is 1 for each discrete time step in which the bar is upright; otherwise, it is 0. Thus, the maximum cumulative reward of a single episode (or episode reward) is 200. To maximise the reward, the bar must stay upright as long as possible. We run 500 episodes for each model.</p></div>
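The episode structure just described can be sketched as follows (a hedged sketch with hypothetical policy, environment, and monitor callables; only the 200-step cap and the reward of 1 per upright step come from the text):

```python
def run_episode(policy, env_step, is_fallen, max_steps=200):
    """Run one episode: reward 1 per upright step, capped at max_steps (200)."""
    state, episode_reward = 0.0, 0
    for _ in range(max_steps):
        state = env_step(state, policy(state))
        if is_fallen(state):  # the safety monitor reports a fall: terminate early
            break
        episode_reward += 1
    return episode_reward
```

In a toy environment where the state drifts by the chosen action, a stationary policy collects the maximum episode reward of 200, while a constantly drifting one terminates early with a smaller reward.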
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inverse Pendulum Results</head><p>Figure <ref type="figure">4</ref> plots the episode reward (vertical axis) over the number of episodes (horizontal axis). In the first 100 episodes, we observe that DQN with the safety estimator learns faster than conventional DQN. This is because avoiding safety violations quickly increases the cumulative reward: in our inverse pendulum system, the maximum reward is only achievable if the bar does not fall, so a reward of less than 200 indicates an occurrence of a safety violation. In this sense, conventional DQN and the single buffer model experience frequent safety violations. However, there is a noticeable improvement in the double buffer case, as the reward is maintained at the maximum value more consistently. Over 500 episodes, the number of episodes violating safety is 186 for conventional DQN, 149 for the single buffer model, and 121 for the double buffer model. Thus, violations are reduced by 19.9% with the single buffer model and by 34.9% with the double buffer model.</p><p>As shown in the proof of Lemma 2, the hidden unsafe states can be recognised by checking the average Q-values: if the value is −1, the state is a hidden unsafe state. Figure <ref type="figure">5</ref> shows the heatmap of the average Q-value for the single buffer case at different episodes. Blue marks an average Q-value of 1, whereas red marks −1; intuitively, the red areas indicate the hidden unsafe states. In Episode 0, the estimator is initialised. As it experiences more episodes, the average Q-value is updated. However, compared to the analytical solution in Figure <ref type="figure">2b</ref>, the estimation accuracy is poor and unstable; especially after Episode 460, every state is considered safe. This problem is caused by the biased data</p></div>			</div>
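The criterion for recognising hidden unsafe states from the average Q-values can be sketched as follows (the paper's criterion is an average Q-value of −1; the tolerance threshold used here to absorb estimation noise is an assumption of this sketch):

```python
def hidden_unsafe_cells(avg_q, tol=0.5):
    """Flag grid cells whose average Q-value is near -1 as hidden unsafe states.

    avg_q: 2-D grid of average Q-values over (angle, angular velocity) bins,
           with values ranging between -1 (unsafe) and 1 (safe).
    tol:   how far from exactly -1 a value may lie and still count (assumption).
    """
    return [[q <= -1.0 + tol for q in row] for row in avg_q]
```

Applied to a grid of average Q-values over the angle and angular velocity, this reproduces the red regions of the heatmaps in Figure 5.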
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Safe reinforcement learning via shielding</title>
		<author>
			<persName><forename type="first">M</forename><surname>Alshiekh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bloem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ehlers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Könighofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Niekum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Topcu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2669" to="2678" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Constrained Markov decision processes</title>
		<author>
			<persName><forename type="first">E</forename><surname>Altman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>CRC Press</publisher>
			<biblScope unit="volume">7</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A general safety framework for learning-based control in uncertain robotic systems</title>
		<author>
			<persName><forename type="first">F</forename><surname>Berkenkamp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Turchetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Schoellig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">31st Conference on Neural Information Processing Systems</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2737" to="2752" />
		</imprint>
	</monogr>
	<note>Safe model-based reinforcement learning with stability guarantees</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Runtime monitors for Markov decision processes</title>
		<author>
			<persName><forename type="first">S</forename><surname>Junges</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Torfah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Seshia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computer Aided Verification</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="553" to="576" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep Inverse Q-learning with Constraints</title>
		<author>
			<persName><forename type="first">G</forename><surname>Kalweit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huegle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Werling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Boedecker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">34th Conference on Neural Information Processing Systems</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="14291" to="14302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Reactive temporal logic planning for multiple robots in unknown environments</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Kantaros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Malencia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Pappas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Robotics and Automation (ICRA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="11479" to="11485" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mirchevska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Werling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Althoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Boedecker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">21st International Conference on Intelligent Transportation Systems (ITSC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2156" to="2162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Human-level control through deep reinforcement learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Rusu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Veness</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Fidjeland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ostrovski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">518</biblScope>
			<biblScope unit="issue">7540</biblScope>
			<biblScope unit="page" from="529" to="533" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Reinforcement learning: An introduction</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Barto</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Reward Constrained Policy Optimization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Tessler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Mankowitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mannor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th International Conference on Learning Representations, ICLR</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Q-learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J C H</forename><surname>Watkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dayan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="279" to="292" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
