<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Safe Reinforcement Learning through Phasic Safety Oriented Policy Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sumanta Dey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pallab Dasgupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soumyajit Dey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Kharagpur, 721302</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Exploration is an essential feature of Reinforcement Learning (RL) algorithms, as they attempt to learn the optimal policy through trial and error. In safety-constrained environments, safety violations during exploration are a significant challenge when the training is online. In this context, this paper proposes Phasic Safety-oriented Policy Optimization (PSPO), where the policy learning is divided into multiple phases with safety updates. This approach utilizes an adaptive safety shield to minimize repetitive unsafe explorations by the RL agent through action masking, and at the same time learns an auxiliary policy that provides safety updates to the main policy. Such periodic updates reduce the number of safety infractions during training without compromising rewards, as purely conservative safety-shield-based approaches do. We have demonstrated the effectiveness of our approach in multiple safety-critical environments. Our experimental results exhibit fewer failures during training while demonstrating similar or faster convergence than prior methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Safe Reinforcement Learning</kwd>
        <kwd>Safe Exploration</kwd>
        <kwd>Safe Policy Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In order to learn a policy that maximizes the total expected reward [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a Reinforcement Learning (RL) agent operating in a model-free environment needs to perform adequate exploration of its environment. In safety-critical domains, the training phase poses a challenge if it is performed in the real environment, as the uninformed agent may lead itself to unsafe states during the exploration. A growing body of work addresses the safe RL problem from different directions, including the use of safety shields, reward shaping, etc. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
      </p>
      <p>The goal of a traditional RL agent is to learn an optimal policy $\pi^*$ for a given starting state distribution $\mu$ that maximizes the overall expected return $R$: $\pi^* \leftarrow \arg\max_{\pi \in \Pi} \mathbb{E}_{\mu}[R]$. However, in safety-constrained RL setups, the RL agent tries to learn an optimal policy $\pi^*$ that maximizes the overall return $R$ while also satisfying the safety constraints $\mathcal{C}$: $\pi^* \leftarrow \arg\max_{\pi \in \Pi_{\mathcal{C}}} \mathbb{E}_{\mu}[R]$, where $\Pi_{\mathcal{C}}$ is the set of safe policies. It may so happen that the best policies in $\Pi_{\mathcal{C}}$ have trajectories that run close to unsafe states. In an attempt to remain safe, an RL agent may avoid safe states in the proximity of unsafe states, thereby missing out on the better policies in $\Pi_{\mathcal{C}}$.</p>
      <p>Consider the simple volcanic grid-world in Figure 1. The robot tries to find the shortest path to the treasure in location (a,2) from its initial location (a,0). An active volcanic crater is present next to this shortest path, while the location (c,1) is blocked. Consider the following two trajectories of the agent, shown in green and black respectively in Figure 1: Path-1 (green), the short direct route from (a,0) to the treasure that passes adjacent to the crater, and Path-2 (black), a much longer route that detours away from the crater.</p>
      <sec id="sec-1-1">
        <title>Path-1 is optimal and significantly shorter than Path-2.</title>
        <p>During training, if the agent takes the right action from
location, (, 1), or the up action from location, (, 0), the
robot will fall in the volcanic cater and terminate that
episode with failure. Experiencing failures from (, 1)
may result in the policy settling on Path-2 instead of
Pathoptimal path, Path-1, but will not make the policy safety jective is to optimize the policy ( ) with
aware in the absence of the shield. respect to the reward function.</p>
        <p>
          In the literature, safety has been treated either as a • Safety Optimization Phase: In this phase,
discrete (safe/unsafe) binary, or as a continuous cost func- an auxiliary policy ( ) is trained with
tion. Methods such as Constrained Policy Optimization the explored state-actions along with the
(CPO) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and Proximal Policy Optimization Lagrangian safety shield masked unsafe actions. This
(PPO-Lagrangian)[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] are effective in reduction of safety auxiliary policy ( ) is then used to
ininfractions during online training, provided that safety is duce safe behavior in the main policy ( )
specified as a continuous cost function, that is, the envi- through behavioral cloning.
ronment returns a set of non-negative real valued costs for
all the safety constraints, and a safety violation happens Since the safety shield prevents repeated failures, the
when the cumulative cost exceeds some defined threshold. policy learning does not get pushed to conservative
subIn [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the authors use a safety critic to guide the RL agent optimal trajectories. The periodic safety updates from
while learning to avoid unsafe instances. However, the the auxiliary policy induces safe behavior right from the
safety critic must be trained in the pre-training phase with inception of training.
safe and unsafe states. We provide experimental results on several Gym
Envi
        </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for model-based MDPs (Markov Decision Processes), the authors propose the use of a safety shield derived from Linear Temporal Logic (LTL) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] specifications to restrict the RL agent (by action shaping) to explore within a safe region. The model-based assumption makes this method infeasible for many real-world applications where the transition function is unknown. Also, this method suffers from scalability issues due to the product MDP construction in the shield synthesis phase. Decision trees may be used as safety shields, as in [9], but this work assumes the existence of a next-state predictor, which limits its generalizability.
      </p>
      <p>In a completely unknown environment, the initial RL policy as well as the safety shield are oblivious of safety. As the exploration begins, the safety shield begins to take shape with every safety infraction, but this knowledge does not influence the policy learning directly. This paper bridges this gap to accelerate the convergence to a safe RL policy. In this paper, we propose the Phasic Safety-oriented Policy Optimization (PSPO) framework to reduce safety infractions during exploration. PSPO works on a model-free MDP with a continuous state space, continuous/discrete actions, and binary safety assumptions. The approach uses periodic safety updates through an auxiliary policy trained from safety infractions detected by the safety shield. The main features of each period of PSPO are as follows (Fig. 2):
1. A safety shield model is learned on-the-fly from the state-to-unsafe-action mapping gathered in past exploration. The safety shield is continuously updated to adapt to newly visited unsafe states. The exploration is based on the current policy, which is not updated in this phase.
2. The policy network is updated in two separate phases as follows:
• Policy Training Phase: In this phase, the policy network is trained only with the explored state-actions, and the primary objective is to optimize the policy with respect to the reward function.
• Safety Optimization Phase: In this phase, an auxiliary policy is trained with the explored state-actions along with the safety-shield-masked unsafe actions. This auxiliary policy is then used to induce safe behavior in the main policy through behavioral cloning.</p>
      <p>Since the safety shield prevents repeated failures, the policy learning does not get pushed to conservative sub-optimal trajectories. The periodic safety updates from the auxiliary policy induce safe behavior right from the inception of training.</p>
      <p>We provide experimental results on several Gym environments [10]. Our results demonstrate considerable reductions in safety infractions and high episodic rewards in all of these environments.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Preliminaries</title>
      <p>
        We use the Constrained MDP (CMDP) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] formalism to define our problem setting, and we use the Proximal Policy Optimization (PPO) [11] method in the back-end.
      </p>
      <p>
        Constrained MDP (CMDP). As defined in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a CMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma, \mu, \mathcal{C})$, where $\mathcal{S}$ defines the state space, $\mathcal{A}$ is the action space, and $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function/matrix. $\mathcal{R}$ refers to the reward function, defined as $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and $\gamma \in (0, 1)$ and $\mu$ are the discount factor and the starting state distribution respectively. Finally, $\mathcal{C} = \{(c_i, d_i) \mid c_i : \mathcal{S} \to \{0, 1\},\ d_i \in \mathbb{R}\}$ is a set of safety constraints that the RL agent must follow in order to be safe, where $c_i$ denotes the i-th constraint function and $d_i$ denotes the maximal allowable limit of non-satisfaction in terms of the expected probability of failure. In this paper, we consider safety as a binary function $\{0, 1\}$. Therefore, each $c_i$ returns SAFE (0) or UNSAFE (1). If any of the constraint functions returns 1 (UNSAFE) for a given state, then the state is treated as unsafe.
      </p>
      <p>Proximal Policy Optimization (PPO). Proximal Policy Optimization (PPO) [11] is an advantage-based policy-gradient reinforcement learning algorithm proposed by OpenAI. The objective of commonly used policy gradient (PG) methods has the following form: $L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t\big]$. Here, $\hat{A}_t$ denotes the estimated advantage [12] and $\pi_{\theta}$ is the policy parameterized by $\theta$. On the other hand, the main objective in Proximal Policy Optimization (PPO) is defined as: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\big]$. Here, $r_t(\theta)$ denotes the probability ratio between the new policy and the old policy, computed as $\pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)$. Finally, PPO restricts the policy update by using the clip method to directly limit the update range to $[1-\epsilon, 1+\epsilon]$, where $\epsilon$ is a hyper-parameter that decides the clipping interval. With this small change in the objective function, the PPO method provides better stability and reliability than the vanilla policy gradient implementation.</p>
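      <p>As an illustration, the following is a minimal PyTorch-style sketch of the clipped surrogate objective described above; the tensor names (logp, logp_old, adv) and the default clipping value are our own illustrative assumptions, not part of the PPO library interface.</p>
      <preformat>
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate loss (returned as a quantity to minimize).

    logp     : log pi_theta(a|s) of the sampled actions under the new policy
    logp_old : log pi_theta_old(a|s) recorded at rollout time
    adv      : estimated advantages (e.g., from GAE)
    eps      : clipping interval hyper-parameter epsilon
    """
    ratio = torch.exp(logp - logp_old)                 # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # PPO maximizes the minimum of the two surrogate terms,
    # so we return its negative mean as a loss.
    return -torch.mean(torch.min(ratio * adv, clipped * adv))
      </preformat>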
    </sec>
    <sec id="sec-1-2">
      <title>3. Ideation</title>
      <p>The task of inducing safe behavior into an RL policy requires careful balancing between safety and optimality. In an unknown, model-free environment, an RL agent with limited domain knowledge will reach unsafe states during exploration, but it should ensure that it does not repeat the same mistakes. Relying solely on penalizing the agent for safety infractions may push the agent away from optimal paths that are on the border of unsafe regions, which then results in longer convergence times and sub-optimal policies. Learning safety shields from failures protects against future violations, but does not make the policy safety-aware, and the agent continues to rely on the safety shield.</p>
      <p>Our goal is to induce safe behavior by correcting a policy learned on the basis of reward. The pivot of the whole approach is a safety shield which gets updated whenever failures occur. The agent explores with the safety shield in place, and the trajectories are recorded. During this exploration, some of its actions may be thwarted by the safety shield; whenever this happens, the safety shield notes the exceptions. These exceptions are used to train an auxiliary policy, which essentially captures the behaviors in which we would like the current policy to behave differently. The auxiliary policy is therefore used periodically to update the current policy, thereby inducing safe actions in the relevant states without affecting the rest of the policy. This ensures proximality with the optimal policy chosen by the safety-agnostic learning algorithm.</p>
      <p>An important benefit of the proposed approach is that the policy learning and the safety augmentation are separate phasic components. Therefore, the method is adaptive to changes in either of the components. We can plug in any learning algorithm for the former, and handle additional safety corners when the agent reaches them.</p>
      <p>Problem Statement. Given a CMDP, the objectives of our proposed framework are as follows: 1) Learning the conditions that led the RL agent to failures from the past exploration history, and later guiding the RL agent through action masking to avoid repetitive unsafe explorations. 2) Updating the policy network to incorporate the safety guidance from the previous step while minimally affecting the currently learned policy.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Phasic Safety-Oriented Policy</title>
    </sec>
    <sec id="sec-3">
      <title>Optimization</title>
      <sec id="sec-3-1">
        <title>This section discusses the overall process flow of our pro</title>
        <p>posed Phasic Safety-oriented Policy Optimization (PSPO)
framework. Figure 2 depicts the main components in
this flow, and will be used as a reference in each of the
following three phases:</p>
      </sec>
      <sec id="sec-3-2">
        <title>We shall elaborate each of these phases now. Algorithm 1</title>
        <p>summarizes the implementation of the proposed approach.</p>
        <p>It may be noted that we address the problem in the
online training setting only, with the goal of reducing
safety infractions without distracting the algorithm for
policy learning based on rewards.</p>
        <sec id="sec-3-2-1">
          <title>4.1. Adaptive Safety Shield Framework</title>
        </sec>
      </sec>
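      <p>For concreteness, the following is a minimal sketch of one PSPO period; the helper callables (rollout_with_shield, ppo_update, train_aux, aux_update) and the buffer handling are illustrative assumptions rather than the exact interface of Algorithm 1.</p>
      <preformat>
def pspo_period(rollout_with_shield, shield, policy, aux_policy,
                ppo_update, train_aux, aux_update, episodes=10):
    """One PSPO period: shielded exploration, a reward-driven PPO update,
    auxiliary-policy training, and a cloning-based safety update.
    All callables are hypothetical placeholders used for illustration."""
    eb, ub = [], []                          # exploration buffer, unsafe buffer
    for _ in range(episodes):
        traj, exceptions = rollout_with_shield(policy, shield)
        eb.extend(traj)                      # explored state-action data
        ub.extend(exceptions)                # state-actions masked by the shield
    shield.update(eb, ub)                    # adapt the shield to new failures
    ppo_update(policy, eb)                   # Policy Training Phase (reward only)
    train_aux(aux_policy, eb, ub)            # Safety Optimization Phase
    aux_update(policy, aux_policy, eb, ub)   # AuxUpdate: induce safe behaviour
    return policy, aux_policy, shield
      </preformat>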
      <sec id="sec-3-3">
        <title>In RL, safety shields (coined by [7]) are used to block un</title>
        <p>safe actions during exploration. Traditional safety shields
deal with MDPs with discrete action space and
modelbased assumptions, which does not favor our model-free
environment. Instead, we propose an Adaptive Safety
Shield framework to learn a Safety Shield (SS) with the
explored state-to-actions data collected on-the-fly during
exploration. The explored state-action pairs are labeled
based on the constraint violations in the next state during
exploration.
37:
38:
39:
40:
41: }
42: end for
}
%Auxiliary Policy Training
Perform rollouts on (EB+UB) {</p>
        <p>Optimize    wrt Policy
( +  [ ])
}
%AuxUpdate Phase
Perform rollouts on (EB+UB) {</p>
        <p>Optimize   wrt Policy Distance (,</p>
      </sec>
      <sec id="sec-3-4">
        <title>Loss</title>
        <p>))
(, ) =
{︃UNSAFE if ∃ i  : +1 → 1
SAFE
otherwise</p>
      </sec>
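        <p>As a small illustration, this labeling rule can be written as a one-line check over the binary constraint functions; representing the constraints as a list of callables is an assumption made only for this sketch.</p>
        <preformat>
def label_state_action(constraints, next_state):
    """Label an explored (state, action) pair from the next state it produced:
    1 (UNSAFE) if any binary constraint c_i fires on the next state,
    0 (SAFE) otherwise. 'constraints' is an illustrative list of callables."""
    return 1 if any(c(next_state) == 1 for c in constraints) else 0
        </preformat>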
      <sec id="sec-3-5">
        <title>It is possible to bootstrap the initial safety shield based on</title>
        <p>prior knowledge on domain safety. It is also possible to
start in the absence of prior knowledge.</p>
        <p>In our implementation, the safety shield is a ML
model trained to classify state-action pairs as safe/unsafe
(Algo. 1: Line 26-27). With new explorations, the model
is updated with new state-action pairs.</p>
        <p>The variable _ (Algo. 1: Line 27) is a
hyperparameter that denotes the number of episodes after
which the shield update is performed, that is, it controls
the update frequency.</p>
        <p>This safety shield is used to predict the probability
(Algo. 1: Line 8) of reaching an unsafe state for a
state12) and the RL agent is asked to sample another action
KL-divergence as the distance metric between the
distributions.  (.|[ +  ]) returns the action distributions of
the states presented in the exploration buffer (EB) and the
(Algo. 1: Line 13). This loop (Algo. 1: Line 7-15) contin- unsafe buffer (UB),  (.|[∖ ]) returns the action
disues until the safety shield finds the sampled action to be
safe, or if the sample count exceeds a defined threshold,
_. In the latter case, the exploration continues
with the first action proposed by the RL agent.
tributions of the states presented in the exploration buffer
(EB) but not the unsafe buffer (UB). Finally,  (.| )
returns the action distributions of the states present in the
unsafe buffer (UB).</p>
        <sec id="sec-3-5-1">
          <title>4.2. Policy Training Phase</title>
        </sec>
      </sec>
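        <p>A minimal sketch of this action-masking loop is given below; the names safety_bound and max_sample, and the shield and policy interfaces, are illustrative assumptions corresponding to the thresholds described above.</p>
        <preformat>
def shielded_action(policy, shield, state, safety_bound=0.9, max_sample=10):
    """Sample actions from the current policy until the shield predicts the
    action to be safe, or the sample budget is exhausted (Algo. 1: Line 7-15).
    The shield and policy interfaces are illustrative assumptions."""
    first_action = policy.sample(state)
    action = first_action
    for _ in range(max_sample):
        p_unsafe = shield.unsafe_probability(state, action)
        if p_unsafe &lt;= safety_bound:       # predicted safe: accept this action
            return action
        action = policy.sample(state)       # otherwise mask it and resample
    return first_action                     # budget exhausted: fall back
        </preformat>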
      <sec id="sec-3-6">
        <title>We use the standard Proximal Policy Optimization (PPO)</title>
        <p>algorithm for learning the policy for the RL agent (Algo. 1:</p>
      </sec>
      <sec id="sec-3-7">
        <title>Line 29-33). This phase in shown with blue background in</title>
        <sec id="sec-3-7-1">
          <title>4.3. Safety Optimization Phase</title>
          <p>In this phase, an alternative policy network, called
auxiliary policy network, is trained with traces from both
the exploration buffer (EB) and the unsafe buffer (UB)
(Algo. 1: Line 34-37). Then the PPO actor network is
updated through behavioral cloning with the auxiliary
policy network for the states in the unsafe buffer. Such
updates may alter the learned policy distribution for the
other states in the PPO actor network. Therefore to reduce
such interference, we also consider the policy
distributions for the states present in the exploration buffer (Algo.
1: Line 38-41). Hence the overall objective of AuxUpdate
is:
 ( ) = _  (.|[ +  ]),
︂[
︂(
 (.|[∖ ]) +  (.| )
︂])</p>
        </sec>
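        <p>A minimal PyTorch-style sketch of one way to realize this objective for discrete action distributions is shown below; the tensor names and shapes are our own illustrative assumptions.</p>
        <preformat>
import torch

def aux_update_loss(pi_new, pi_old, pi_aux, in_ub):
    """KL-based AuxUpdate loss (cf. Algo. 1: Line 38-41).

    pi_new : action probabilities of the updated actor for the states in EB+UB
    pi_old : action probabilities of the actor before the update (same states)
    pi_aux : action probabilities of the auxiliary policy (same states)
    in_ub  : boolean mask, True for states that are in the unsafe buffer (UB)
    All tensors have shape [num_states, num_actions]; names are illustrative.
    """
    # Target: keep the old behaviour on EB\UB, clone the auxiliary policy on UB.
    target = torch.where(in_ub.unsqueeze(-1), pi_aux, pi_old)
    # KL(pi_new || target), averaged over the states in EB+UB.
    kl = torch.sum(
        pi_new * (torch.log(pi_new + 1e-8) - torch.log(target + 1e-8)), dim=-1)
    return kl.mean()
        </preformat>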
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Setup</title>
      <p>We provide empirical support to the following claims through experiments conducted on different Gym environments with continuous and discrete action spaces:
• The PSPO approach reduces the number of failures, and
• The PSPO approach marginally affects the primary learning objective.</p>
      <p>All the experiments were run on a machine with Ubuntu and a graphics unit.</p>
      <sec id="sec-4-3">
        <title>We tested our framework against four different gym</title>
        <p>environments, two of which are with discrete action
space and the other two with continuous action space. By
default, the gym environments are not safety constrained.
We have defined the custom binary safety constraints for
all the environments following [13]. The name of the
environment with associated safe constraints are given
in Table 1. We consider the safety constraints such that
few are perfectly aligned with the original goal of the
underlying Gym environment, and few aren’t exactly
aligned with the original goal.</p>
        <p>
          Cart Pole Environment: The cart pole environment is
taken from Gym Classic control environments and is an
environment with discrete action space [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]. The aim
is to keep the pole over the cart without falling by taking
actions 0 and 1. We have considered the following set of
safety constraints:
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>1. The cart Position should remain within -2.4 or</title>
        <p>+2.4
2. The cart Momentum should not be lesser than
-2.0 or greater than 2.0
3. The pole Angle should not be greater than 0.2</p>
      </sec>
      <sec id="sec-4-5">
        <title>Among these, the third safety constraint is directly aligned with the original goal of the cart pole environment, whereas the remaining two are not exactly aligned with the original goal.</title>
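      <p>For illustration, these three constraints can be checked with a simple binary function of the observation; the indexing follows the standard Gym CartPole observation (cart position, cart velocity, pole angle, pole angular velocity), and treating the momentum bound as a bound on cart velocity, as well as interpreting the angle bound as a bound on its magnitude, are simplifying assumptions of this sketch.</p>
      <preformat>
def cartpole_unsafe(obs):
    """Binary safety check for the Cart Pole constraints listed above.
    obs = (cart_position, cart_velocity, pole_angle, pole_angular_velocity).
    Returns 1 (UNSAFE) if any constraint is violated, else 0 (SAFE)."""
    position, velocity, angle = obs[0], obs[1], obs[2]
    violated = (
        abs(position) &gt; 2.4       # constraint 1: cart position bound
        or abs(velocity) &gt; 2.0    # constraint 2: momentum/velocity bound
        or abs(angle) &gt; 0.2       # constraint 3: pole angle bound
    )
    return 1 if violated else 0
      </preformat>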
      <p>Inverted Pendulum Environment: This environment is taken from the Gym MuJoCo environments. It is similar to the Cart Pole environment, except that its action space is continuous. Here the pole on top of the cart is controlled by applying a force between (-1, +1) to the cart to prevent the pole from falling over. The safety constraints are identical to those for the Cart Pole environment.</p>
      <p>Lunar Lander Environment: The Lunar Lander environment is taken from the Box2D environments. We have applied our framework to both the continuous and discrete versions of the environment. This environment aims to land the lunar lander smoothly on the helipad marked with two flags by controlling the thrust of the rocket engines on the lander's left, right, and bottom. For this environment we consider the following set of safety constraints:
1. The lander should land only on the helipad, that is, the X-position (PosX) of the lander must be within [-0.2, +0.2].
2. The lander tilt angle should not be beyond -1 or +1 if the lander Y-position (PosY) is less than 0.1 and the lander is just over the helipad.</p>
      <p>Baselines. We have used the standard PPO with a negative reward for constraint violation as Baseline 1 (BASE1), SAC (Soft Actor-Critic) [14] with a negative reward for constraint violation as Baseline 2 (BASE2), and VPG (Vanilla Policy Gradient) [15] with a negative reward for constraint violation as Baseline 3 (BASE3). We have implemented all baselines on OpenAI's Spinning Up library [16]. Negative rewards on constraint violations are provided to all the baselines and to the proposed PSPO framework. The SAC algorithm does not support discrete action spaces; hence, BASE2 is not considered in the Cart Pole and Lunar Lander (Discrete) environments.</p>
      <p>There are other approaches for safe exploration in RL, for example, Constrained Policy Optimization (CPO) and PPO-Lagrangian. However, both these methods assume safety as a continuous cost function that is returned by the environment, like the reward generated each time the agent applies an action. In our setup, we have considered safety as a binary function of a state.</p>
      <p>Implementation Details. For the environments, we considered using Random Forest [17], based on the ensemble method, as the Adaptive Safety Shield (SS). The advantages of using Random Forests as safety predictors are the following:
• Explainable. The individual decision trees of a Random Forest are explainable and can easily be interpreted and verified by human experts. They can also be factored into a rule-based system.
• Efficient. In a robust system, failures are rare, and thereby the number of updates needed for the safety shield declines rapidly.
• Augmentable. Individual decision trees can easily be augmented with known safety constraints [18], which enables us to include safety constraints provided by domain experts.</p>
      <p>One problem of using a random forest or decision tree as the Safety Shield, mainly in the case of discrete environments, is that the action is not considered in each decision branch. If an action from a state is found unsafe, then the safety shield could predict all the remaining actions as unsafe, whereas there could be an unexplored action that is safe. To avoid such issues, we use a simple re-labeling trick, where we use label (i+1) if the i-th action (starting the index from 0) is unsafe, and we use label 0 if the action is safe. During prediction, if the (i+1)-th label prediction probability is greater than the provided safety threshold, then the i-th action is considered unsafe; otherwise, the action is considered safe.</p>
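      <p>A minimal scikit-learn sketch of this re-labeling trick for a discrete action space is shown below; the class and method names, the feature encoding, and the default threshold value are illustrative assumptions rather than our exact implementation.</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class DiscreteSafetyShield:
    """Random Forest safety shield using the (i+1) re-labeling trick:
    label 0 means the explored action was safe; label i+1 means that
    action i led to an unsafe next state."""

    def __init__(self, n_trees=100):
        self.model = RandomForestClassifier(n_estimators=n_trees)

    def fit(self, states, actions, unsafe_flags):
        states = np.asarray(states)
        labels = np.where(np.asarray(unsafe_flags) == 1,
                          np.asarray(actions) + 1, 0)
        self.model.fit(states, labels)

    def is_unsafe(self, state, action, safety_threshold=0.9):
        proba = self.model.predict_proba(np.asarray(state).reshape(1, -1))[0]
        classes = list(self.model.classes_)
        if (action + 1) not in classes:
            return False                     # never observed as unsafe
        p_unsafe = proba[classes.index(action + 1)]
        # Unsafe only if the (i+1)-label probability exceeds the threshold.
        return p_unsafe &gt; safety_threshold
      </preformat>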
      <p>We use the standard PyTorch implementation of the PPO algorithm provided in OpenAI's Spinning Up library as the Policy Learner. The Auxiliary Policy Network is constructed by replicating the PPO's Actor network. We have also run experiments of our PSPO method with different safety bound values to show the impact of the safety bound hyperparameter in the PSPO framework.</p>
    </sec>
    <sec id="sec-results">
      <title>6. Results and Discussion</title>
      <p>In this section we describe the results we have obtained in comparison to the baselines described in the previous section. Figure 4 shows Epoch (X-axis) versus Total Number of Safety Violations (Y-axis) for all four Gym environments. All the graphs show that the PSPO framework has significantly fewer safety violations during training than the BASE1 method. These figures also show that the average number of safety violations per epoch eventually tends to zero, as the PSPO curves slowly enter the lag phase. In these figures, we did not consider the other two baselines due to their large number of safety violations. Table 2 reports the number of safety violations as failed episodes out of the total episodes for each algorithm on CartPole-v1 (100 epochs), InvertedPendulum-v2 (100 epochs), LunarLander-v2 (200 epochs), and LunarLanderContinuous-v2 (200 epochs); the episodic performance of PSPO is better than the other baselines. This evidence supports our second claim.</p>
      <p>Figure 3 shows the impact of the safety bound in the PSPO framework in terms of total safety violations and average episodic return. With higher values of this hyperparameter, the safety shield intervenes to correct fewer unsafe actions. In contrast, with low values of the safety bound, the safety shield unnecessarily corrects a higher number of safe actions. Hence, the safety bound controls the trade-off between False Safe and False Unsafe. In this case, a safety bound of 0.9 provided the best results, both for episodic returns and for fewer safety violations.</p>
    </sec>
    <sec id="sec-5">
      <title>7. Related Work</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5, 31</xref>
        ] consider safety as a continuous cost function,
and the cumulative cost should be within a specified limit
to be safe. In [
        <xref ref-type="bibr" rid="ref6">6, 32</xref>
        ], a safety-critic-based approach is
proposed, where if the safety critic predicts an action as
unsafe, the agent samples a different action. In [32], the
safety critic is learned along with the policy, whereas in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
the safety critic is learned separately.
      </p>
      <p>Another line of work can be found in [13], where the authors propose a method to incorporate safety into a learned policy by finding the counterexamples, or failure states, and then minimally modifying the policy for the corresponding states.</p>
    </sec>
    <sec id="sec-6">
      <title>8. Conclusion</title>
      <sec id="sec-6-1">
        <title>We have presented a method for safe exploration in RL.</title>
        <p>We use an Adaptive Safety Shield to learn the
state-tounsafe action mapping from the past exploration and
provide guidance to the RL agent to avoid repeating its
mistakes. We have provided an auxiliary policy based update
method to incorporate the safety guidance provided by
the safety shield into the RL agent while minimally
affecting the policy network for other state-actions. We have
also presented various experiments which empirically
validate that our method incurs fewer safety incidents while
achieving higher or similar performance.
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Garcıa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , Cs 188: Introduction to artificial intelligence,
          <year>2018</year>
          . URL: https://inst.eecs.berkeley.edu/~cs188/fa18/, accessed: 2022-08-06.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Constrained policy optimization</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Benchmarking Safe Exploration in Deep Reinforcement Learning (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eysenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Learning to be safe: Deep RL with a safety critic</article-title>
          , CoRR abs/2010.14603 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2010.14603. arXiv:2010.14603.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshiekh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bloem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ehlers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Könighofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Niekum</surname>
          </string-name>
          , U. Topcu,
          <article-title>Safe reinforcement learning via shielding</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>32</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pnueli</surname>
          </string-name>
          ,
          <article-title>The temporal logic of programs</article-title>
          ,
          <source>in: 18th Annual Symposium on Foundations of Computer Science</source>
          , Providence, Rhode Island, USA, 31 October - 1 November
          <year>1977</year>
          , IEEE Computer Society, pp.
          <fpage>46</fpage>
          -
          <lpage>57</lpage>
          . URL: https://doi.org/10.1109/SFCS.1977.32. doi:10.1109/SFCS.1977.32.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Dey, A. Mujumdar, P. Dasgupta, S. Dey, Adaptive safety shields for reinforcement learning-based cell shaping, IEEE Transactions on Network and Service Management (2022) 1–1. doi:10.1109/TNSM.2022.3194566.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, 2016.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, arXiv preprint arXiv:1506.02438 (2015).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] B. Gangopadhyay, P. Dasgupta, Counterexample guided RL policy refinement using Bayesian optimization, Advances in Neural Information Processing Systems 34 (2021) 22783–22794.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, in: Y. Bengio, Y. LeCun (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL: http://arxiv.org/abs/1506.02438.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. S. Sutton, D. A. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: S. A. Solla, T. K. Leen, K. Müller (Eds.), Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], The MIT Press, 1999, pp. 1057–1063.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Achiam, Spinning Up in Deep Reinforcement Learning (2018).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] T. K. Ho, Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, 1995, pp. 278–282. doi:10.1109/ICDAR.1995.598994.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Dey, P. Dasgupta, B. Gangopadhyay, Safety augmentation in decision trees, in: Proceedings of the Workshop on Artificial Intelligence Safety 2020 co-located with the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), Yokohama, Japan, January 2021, volume 2640 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2640/paper_13.pdf.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Feghhi, E. Aumayr, F. Vannella, E. A. Hakim, G. Iakovidis, Safe reinforcement learning for antenna tilt optimisation using shielding and multiple baselines, arXiv preprint arXiv:2012.01296 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, D. Mané, Concrete problems in AI safety, arXiv preprint arXiv:1606.06565 (2016).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] P. Abbeel, A. Coates, A. Y. Ng, Autonomous helicopter aerobatics through apprenticeship learning, International Journal of Robotics Research (IJRR) 29 (2010).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. J. Perkins, A. G. Barto, Lyapunov design for safe reinforcement learning, Journal of Machine Learning Research 3 (2002) 803–832.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] F. Berkenkamp, R. Moriconi, A. P. Schoellig, A. Krause, Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes, 2016 IEEE 55th Conference on Decision and Control, CDC 2016 (2016) 4661–4666. doi:10.1109/CDC.2016.7798979. arXiv:1603.04915.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Y. Chow, O. Nachum, E. Duenez-Guzman, M. Ghavamzadeh, A Lyapunov-based approach to safe reinforcement learning, arXiv preprint arXiv:1805.07708 (2018).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] T. Koller, F. Berkenkamp, M. Turchetta, A. Krause, Learning-based model predictive control for safe exploration, in: 2018 IEEE Conference on Decision and Control (CDC), IEEE, 2018, pp. 6059–6066.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] F. Berkenkamp, M. Turchetta, A. Schoellig, A. Krause, Safe model-based reinforcement learning with stability guarantees, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] B. Thananjeyan, A. Balakrishna, U. Rosolia, F. Li, R. McAllister, J. E. Gonzalez, S. Levine, F. Borrelli, K. Goldberg, Safety augmented value estimation from demonstrations (SAVED): Safe deep model-based RL for sparse cost robotic tasks, IEEE Robotics and Automation Letters 5 (2020) 3612–3619.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] N. Fulton, A. Platzer, Safe reinforcement learning via formal methods: Toward safe control through proof and learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] A. Nikou, A. Mujumdar, M. Orlic, A. V. Feljan, Symbolic reinforcement learning for safe RAN control, arXiv preprint arXiv:2103.06602 (2021).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] C. Baier, J.-P. Katoen, Principles of Model Checking, MIT Press, 2008.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, M. Ghavamzadeh, Lyapunov-based safe policy optimization for continuous control, arXiv preprint arXiv:1901.10031 (2019).</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, A. Garg, Conservative safety critics for exploration, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=iaO86DUuKi.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>