<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Safe Reinforcement Learning through Phasic Safety Oriented Policy Optimization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sumanta Dey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pallab Dasgupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soumyajit Dey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Kharagpur, 721302</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Exploration is an essential feature of Reinforcement Learning (RL) algorithms, as they attempt to learn the optimal policy through trial and error. In safety-constrained environments, safety violations during exploration are a significant challenge when the training is online. In this context, this paper proposes Phasic Safety-oriented Policy Optimization (PSPO), where the policy learning is divided into multiple phases with safety updates. This approach utilizes an adaptive safety shield to minimize repetitive unsafe explorations by the RL agent through action masking, and at the same time learns an auxiliary policy that provides safety updates to the main policy. Such periodic updates reduce the number of safety infractions during training without compromising rewards, as purely conservative safety-shield-based approaches do. We have demonstrated the effectiveness of our approach in multiple safety-critical environments. Our experimental results exhibit fewer failures during training while demonstrating similar or faster convergence than prior methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Safe Reinforcement Learning</kwd>
        <kwd>Safe Exploration</kwd>
        <kwd>Safe Policy Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In order to learn a policy that maximizes the total expected reward [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a Reinforcement Learning (RL) agent operating in a model-free environment needs to perform adequate exploration of its environment. In safety-critical domains, the training phase poses a challenge if it is performed in the real environment, as the uninformed agent may lead itself to unsafe states during the exploration. A growing body of work addresses the safe RL problem from different directions, including the use of safety shields, reward shaping, etc. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
      </p>
      <p>The goal of a traditional RL agent is to learn an optimal policy $\pi^*$ for a given starting state distribution $\mu$ that maximizes the overall expected return $R$: $\pi^* \leftarrow \arg\max_{\pi \in \Pi} \mathbb{E}_{\mu}[R]$. However, in safety-constrained RL setups, the RL agent tries to learn an optimal policy $\pi^*$ that maximizes the overall return $R$ while also satisfying the safety constraints $\mathcal{C}$: $\pi^* \leftarrow \arg\max_{\pi \in \Pi_{\mathcal{C}}} \mathbb{E}_{\mu}[R]$, where $\Pi_{\mathcal{C}}$ is the set of safe policies. It may so happen that the best policies in $\Pi_{\mathcal{C}}$ have trajectories that run close to unsafe states. In an attempt to remain safe, an RL agent may avoid safe states in the proximity of unsafe states, thereby missing out on the better policies in $\Pi_{\mathcal{C}}$.</p>
      <p>Consider the simple volcanic grid-world in Figure 1. The robot tries to find the shortest path to the treasure in location (a,2) from its initial location (a,0). An active volcanic crater is present next to this shortest path, while the location (c,1) is blocked. Consider the following two trajectories of the agent, shown in green and black respectively in Figure 1: Path-1 (green), the short direct route from (a,0) to the treasure that passes adjacent to the crater, and Path-2 (black), a much longer route that detours away from the crater.</p>
      <sec id="sec-1-1">
        <title>Path-1 is optimal and significantly shorter than Path-2.</title>
        <p>During training, if the agent takes the right action from
location, (, 1), or the up action from location, (, 0), the
robot will fall in the volcanic cater and terminate that
episode with failure. Experiencing failures from (, 1)
may result in the policy settling on Path-2 instead of
Pathoptimal path, Path-1, but will not make the policy safety jective is to optimize the policy ( ) with
aware in the absence of the shield. respect to the reward function.</p>
        <p>
          In the literature, safety has been treated either as a • Safety Optimization Phase: In this phase,
discrete (safe/unsafe) binary, or as a continuous cost func- an auxiliary policy ( ) is trained with
tion. Methods such as Constrained Policy Optimization the explored state-actions along with the
(CPO) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and Proximal Policy Optimization Lagrangian safety shield masked unsafe actions. This
(PPO-Lagrangian)[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] are effective in reduction of safety auxiliary policy ( ) is then used to
ininfractions during online training, provided that safety is duce safe behavior in the main policy ( )
specified as a continuous cost function, that is, the envi- through behavioral cloning.
ronment returns a set of non-negative real valued costs for
all the safety constraints, and a safety violation happens Since the safety shield prevents repeated failures, the
when the cumulative cost exceeds some defined threshold. policy learning does not get pushed to conservative
subIn [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the authors use a safety critic to guide the RL agent optimal trajectories. The periodic safety updates from
while learning to avoid unsafe instances. However, the the auxiliary policy induces safe behavior right from the
safety critic must be trained in the pre-training phase with inception of training.
safe and unsafe states. We provide experimental results on several Gym
Envi
        </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], for model-based MDPs (Markov Decision Processes), the authors propose the use of a safety shield derived from Linear Temporal Logic (LTL) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] specifications to restrict the RL agent (by action shaping) to explore within a safe region. The model-based assumption makes this method infeasible for many real-world applications where the transition function is unknown. Also, this method suffers from scalability issues due to the product MDP construction in the shield synthesis phase. Decision trees may be used as safety shields, as in [9], but this work assumes the existence of a next-state predictor, which limits its generalizability.
      </p>
      <p>In a completely unknown environment, the initial RL policy as well as the safety shield are oblivious of safety. As the exploration begins, the safety shield begins to take shape with every safety infraction, but this knowledge does not influence the policy learning directly. This paper bridges this gap to accelerate the convergence to a safe RL policy. In this paper, we propose the Phasic Safety-oriented Policy Optimization (PSPO) framework to reduce safety infractions during exploration. PSPO works on a model-free MDP with a continuous state space, continuous/discrete actions, and binary safety assumptions. The approach uses periodic safety updates through an auxiliary policy trained from safety infractions detected by the safety shield. The main features of each period of PSPO are as follows (Fig. 2):
1. A safety shield model is learned on-the-fly from the state-to-unsafe-action mapping gathered in past exploration. The safety shield is continuously updated to adapt to newly visited unsafe states. The exploration is based on the current policy, which is not updated in this phase.
2. The policy network is updated in two separate phases as follows:
• Policy Training Phase: In this phase, the policy network is trained only with the explored state-actions, and the primary objective is to optimize the policy with respect to the reward function.
• Safety Optimization Phase: In this phase, an auxiliary policy is trained with the explored state-actions along with the safety-shield-masked unsafe actions. This auxiliary policy is then used to induce safe behavior in the main policy through behavioral cloning.</p>
      <p>Since the safety shield prevents repeated failures, the policy learning does not get pushed to conservative sub-optimal trajectories. The periodic safety updates from the auxiliary policy induce safe behavior right from the inception of training.</p>
      <p>We provide experimental results on several Gym environments [10]. Our results demonstrate considerable reductions in safety infractions and high episodic rewards in all of these environments.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Preliminaries</title>
      <p>
        We use the Constrained MDP (CMDP) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] formalism to define our problem setting, and we use the Proximal Policy Optimization (PPO) [11] method in the back-end.
      </p>
      <p>
        Constrained MDP (CMDP). As defined in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a CMDP is a tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma, \mu, \mathcal{C})$, where $\mathcal{S}$ defines the state space, $\mathcal{A}$ is the action space, and $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function/matrix. $\mathcal{R}$ refers to the reward function, defined as $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and $\gamma \in (0, 1)$ and $\mu$ are the discount factor and the starting state distribution respectively. Finally, $\mathcal{C} = \{(c_i, d_i) \mid c_i : \mathcal{S} \to \{0, 1\},\ d_i \in \mathbb{R}\}$ is a set of safety constraints that the RL agent must follow in order to be safe, where $c_i$ denotes the i-th constraint function and $d_i$ denotes the maximal allowable limit of non-satisfaction in terms of the expected probability of failure. In this paper, we consider safety as a binary function $\{0, 1\}$. Therefore, each $c_i$ returns SAFE (0) or UNSAFE (1). If any of the constraint functions returns 1 (UNSAFE) for a given state, then the state is treated as unsafe.
      </p>
      <p>Proximal Policy Optimization (PPO). Proximal Policy Optimization (PPO) [11] is an advantage-based policy-gradient reinforcement learning algorithm proposed by OpenAI. The objective of commonly used policy gradient (PG) methods has the following form: $L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_{\theta}(a_t \mid s_t)\, \hat{A}_t\big]$. Here, $\hat{A}_t$ denotes the estimated advantage [12] and $\pi_{\theta}$ is the policy parameterized by $\theta$. On the other hand, the main objective in Proximal Policy Optimization (PPO) is defined as: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\big]$. Here, $r_t(\theta)$ denotes the probability ratio between the new policy and the old policy, computed as $\pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)$. Finally, PPO restricts the policy update by using the clip method to directly limit the update range to $[1-\epsilon, 1+\epsilon]$, where $\epsilon$ is a hyper-parameter that decides the clipping interval. With this small change in the objective function, the PPO method provides better stability and reliability than the vanilla policy gradient implementation.</p>
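      <p>As an illustration, the following is a minimal PyTorch-style sketch of the clipped surrogate objective described above; the tensor names (logp, logp_old, adv) and the default clipping value are our own illustrative assumptions, not part of the PPO library interface.</p>
      <preformat>
import torch

def ppo_clip_loss(logp, logp_old, adv, eps=0.2):
    """Clipped PPO surrogate loss (returned as a quantity to minimize).

    logp     : log pi_theta(a|s) of the sampled actions under the new policy
    logp_old : log pi_theta_old(a|s) recorded at rollout time
    adv      : estimated advantages (e.g., from GAE)
    eps      : clipping interval hyper-parameter epsilon
    """
    ratio = torch.exp(logp - logp_old)                 # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # PPO maximizes the minimum of the two surrogate terms,
    # so we return its negative mean as a loss.
    return -torch.mean(torch.min(ratio * adv, clipped * adv))
      </preformat>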
    </sec>
    <sec id="sec-1-2">
      <title>3. Ideation</title>
      <p>The task of inducing safe behavior into an RL policy requires careful balancing between safety and optimality. In an unknown, model-free environment, an RL agent with limited domain knowledge will reach unsafe states during exploration, but it should ensure that it does not repeat the same mistakes. Relying solely on penalizing the agent for safety infractions may push the agent away from optimal paths that are on the border of unsafe regions, which then results in longer convergence times and sub-optimal policies. Learning safety shields from failures protects against future violations, but does not make the policy safety-aware, and the agent continues to rely on the safety shield.</p>
      <p>Our goal is to induce safe behavior by correcting a policy learned on the basis of reward. The pivot of the whole approach is a safety shield which gets updated whenever failures occur. The agent explores with the safety shield in place, and the trajectories are recorded. During this exploration, some of its actions may be thwarted by the safety shield; whenever this happens, the safety shield notes the exceptions. These exceptions are used to train an auxiliary policy, which essentially captures the behaviors in which we would like the current policy to behave differently. The auxiliary policy is therefore used periodically to update the current policy, thereby inducing safe actions in the relevant states without affecting the rest of the policy. This ensures proximality with the optimal policy chosen by the safety-agnostic learning algorithm.</p>
      <p>An important benefit of the proposed approach is that the policy learning and the safety augmentation are separate phasic components. Therefore, the method is adaptive to changes in either of the components. We can plug in any learning algorithm for the former, and handle additional safety corners when the agent reaches them.</p>
      <p>Problem Statement. Given a CMDP, the objectives of our proposed framework are as follows: 1) Learning the conditions that led the RL agent to failures from the past exploration history, and later guiding the RL agent through action masking to avoid repetitive unsafe explorations. 2) Updating the policy network to incorporate the safety guidance from the previous step while minimally affecting the currently learned policy.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Phasic Safety-Oriented Policy</title>
    </sec>
    <sec id="sec-3">
      <title>Optimization</title>
      <sec id="sec-3-1">
        <title>This section discusses the overall process flow of our pro</title>
        <p>posed Phasic Safety-oriented Policy Optimization (PSPO)
framework. Figure 2 depicts the main components in
this flow, and will be used as a reference in each of the
following three phases:</p>
      </sec>
      <sec id="sec-3-2">
        <title>We shall elaborate each of these phases now. Algorithm 1</title>
        <p>summarizes the implementation of the proposed approach.</p>
        <p>It may be noted that we address the problem in the
online training setting only, with the goal of reducing
safety infractions without distracting the algorithm for
policy learning based on rewards.</p>
        <sec id="sec-3-2-1">
          <title>4.1. Adaptive Safety Shield Framework</title>
        </sec>
      </sec>
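      <p>For concreteness, the following is a minimal sketch of one PSPO period; the helper callables (rollout_with_shield, ppo_update, train_aux, aux_update) and the buffer handling are illustrative assumptions rather than the exact interface of Algorithm 1.</p>
      <preformat>
def pspo_period(rollout_with_shield, shield, policy, aux_policy,
                ppo_update, train_aux, aux_update, episodes=10):
    """One PSPO period: shielded exploration, a reward-driven PPO update,
    auxiliary-policy training, and a cloning-based safety update.
    All callables are hypothetical placeholders used for illustration."""
    eb, ub = [], []                          # exploration buffer, unsafe buffer
    for _ in range(episodes):
        traj, exceptions = rollout_with_shield(policy, shield)
        eb.extend(traj)                      # explored state-action data
        ub.extend(exceptions)                # state-actions masked by the shield
    shield.update(eb, ub)                    # adapt the shield to new failures
    ppo_update(policy, eb)                   # Policy Training Phase (reward only)
    train_aux(aux_policy, eb, ub)            # Safety Optimization Phase
    aux_update(policy, aux_policy, eb, ub)   # AuxUpdate: induce safe behaviour
    return policy, aux_policy, shield
      </preformat>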
      <sec id="sec-3-3">
        <title>In RL, safety shields (coined by [7]) are used to block un</title>
        <p>safe actions during exploration. Traditional safety shields
deal with MDPs with discrete action space and
modelbased assumptions, which does not favor our model-free
environment. Instead, we propose an Adaptive Safety
Shield framework to learn a Safety Shield (SS) with the
explored state-to-actions data collected on-the-fly during
exploration. The explored state-action pairs are labeled
based on the constraint violations in the next state during
exploration.
37:
38:
39:
40:
41: }
42: end for
}
%Auxiliary Policy Training
Perform rollouts on (EB+UB) {</p>
        <p>Optimize    wrt Policy
( +  [ ])
}
%AuxUpdate Phase
Perform rollouts on (EB+UB) {</p>
        <p>Optimize   wrt Policy Distance (,</p>
      </sec>
      <sec id="sec-3-4">
        <title>Loss</title>
        <p>))
(, ) =
{︃UNSAFE if ∃ i  : +1 → 1
SAFE
otherwise</p>
      </sec>
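        <p>As a small illustration, this labeling rule can be written as a one-line check over the binary constraint functions; representing the constraints as a list of callables is an assumption made only for this sketch.</p>
        <preformat>
def label_state_action(constraints, next_state):
    """Label an explored (state, action) pair from the next state it produced:
    1 (UNSAFE) if any binary constraint c_i fires on the next state,
    0 (SAFE) otherwise. 'constraints' is an illustrative list of callables."""
    return 1 if any(c(next_state) == 1 for c in constraints) else 0
        </preformat>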
      <sec id="sec-3-5">
        <title>It is possible to bootstrap the initial safety shield based on</title>
        <p>prior knowledge on domain safety. It is also possible to
start in the absence of prior knowledge.</p>
        <p>In our implementation, the safety shield is a ML
model trained to classify state-action pairs as safe/unsafe
(Algo. 1: Line 26-27). With new explorations, the model
is updated with new state-action pairs.</p>
        <p>The variable _ (Algo. 1: Line 27) is a
hyperparameter that denotes the number of episodes after
which the shield update is performed, that is, it controls
the update frequency.</p>
        <p>This safety shield is used to predict the probability
(Algo. 1: Line 8) of reaching an unsafe state for a
state12) and the RL agent is asked to sample another action
KL-divergence as the distance metric between the
distributions.  (.|[ +  ]) returns the action distributions of
the states presented in the exploration buffer (EB) and the
(Algo. 1: Line 13). This loop (Algo. 1: Line 7-15) contin- unsafe buffer (UB),  (.|[∖ ]) returns the action
disues until the safety shield finds the sampled action to be
safe, or if the sample count exceeds a defined threshold,
_. In the latter case, the exploration continues
with the first action proposed by the RL agent.
tributions of the states presented in the exploration buffer
(EB) but not the unsafe buffer (UB). Finally,  (.| )
returns the action distributions of the states present in the
unsafe buffer (UB).</p>
        <sec id="sec-3-5-1">
          <title>4.2. Policy Training Phase</title>
        </sec>
      </sec>
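        <p>A minimal sketch of this action-masking loop is given below; the names safety_bound and max_sample, and the shield and policy interfaces, are illustrative assumptions corresponding to the thresholds described above.</p>
        <preformat>
def shielded_action(policy, shield, state, safety_bound=0.9, max_sample=10):
    """Sample actions from the current policy until the shield predicts the
    action to be safe, or the sample budget is exhausted (Algo. 1: Line 7-15).
    The shield and policy interfaces are illustrative assumptions."""
    first_action = policy.sample(state)
    action = first_action
    for _ in range(max_sample):
        p_unsafe = shield.unsafe_probability(state, action)
        if p_unsafe &lt;= safety_bound:       # predicted safe: accept this action
            return action
        action = policy.sample(state)       # otherwise mask it and resample
    return first_action                     # budget exhausted: fall back
        </preformat>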
      <sec id="sec-3-6">
        <title>We use the standard Proximal Policy Optimization (PPO)</title>
        <p>algorithm for learning the policy for the RL agent (Algo. 1:</p>
      </sec>
      <sec id="sec-3-7">
        <title>Line 29-33). This phase in shown with blue background in</title>
        <sec id="sec-3-7-1">
          <title>4.3. Safety Optimization Phase</title>
          <p>In this phase, an alternative policy network, called
auxiliary policy network, is trained with traces from both
the exploration buffer (EB) and the unsafe buffer (UB)
(Algo. 1: Line 34-37). Then the PPO actor network is
updated through behavioral cloning with the auxiliary
policy network for the states in the unsafe buffer. Such
updates may alter the learned policy distribution for the
other states in the PPO actor network. Therefore to reduce
such interference, we also consider the policy
distributions for the states present in the exploration buffer (Algo.
1: Line 38-41). Hence the overall objective of AuxUpdate
is:
 ( ) = _  (.|[ +  ]),
︂[
︂(
 (.|[∖ ]) +  (.| )
︂])</p>
        </sec>
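        <p>A minimal PyTorch-style sketch of one way to realize this objective for discrete action distributions is shown below; the tensor names and shapes are our own illustrative assumptions.</p>
        <preformat>
import torch

def aux_update_loss(pi_new, pi_old, pi_aux, in_ub):
    """KL-based AuxUpdate loss (cf. Algo. 1: Line 38-41).

    pi_new : action probabilities of the updated actor for the states in EB+UB
    pi_old : action probabilities of the actor before the update (same states)
    pi_aux : action probabilities of the auxiliary policy (same states)
    in_ub  : boolean mask, True for states that are in the unsafe buffer (UB)
    All tensors have shape [num_states, num_actions]; names are illustrative.
    """
    # Target: keep the old behaviour on EB\UB, clone the auxiliary policy on UB.
    target = torch.where(in_ub.unsqueeze(-1), pi_aux, pi_old)
    # KL(pi_new || target), averaged over the states in EB+UB.
    kl = torch.sum(
        pi_new * (torch.log(pi_new + 1e-8) - torch.log(target + 1e-8)), dim=-1)
    return kl.mean()
        </preformat>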
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Setup</title>
      <p>We provide empirical support to the following claims through experiments conducted on different Gym environments with continuous and discrete action spaces:
• The PSPO approach reduces the number of failures, and
• The PSPO approach marginally affects the primary learning objective.</p>
      <p>All the experiments were run on a machine with Ubuntu and a graphics unit.</p>
      <sec id="sec-4-3">
        <title>We tested our framework against four different gym</title>
        <p>environments, two of which are with discrete action
space and the other two with continuous action space. By
default, the gym environments are not safety constrained.
We have defined the custom binary safety constraints for
all the environments following [13]. The name of the
environment with associated safe constraints are given
in Table 1. We consider the safety constraints such that
few are perfectly aligned with the original goal of the
underlying Gym environment, and few aren’t exactly
aligned with the original goal.</p>
        <p>
          Cart Pole Environment: The cart pole environment is
taken from Gym Classic control environments and is an
environment with discrete action space [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ]. The aim
is to keep the pole over the cart without falling by taking
actions 0 and 1. We have considered the following set of
safety constraints:
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>1. The cart Position should remain within -2.4 or</title>
        <p>+2.4
2. The cart Momentum should not be lesser than
-2.0 or greater than 2.0
3. The pole Angle should not be greater than 0.2</p>
      </sec>
      <sec id="sec-4-5">
        <title>Among these, the third safety constraint is directly aligned with the original goal of the cart pole environment, whereas the remaining two are not exactly aligned with the original goal.</title>
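      <p>For illustration, these three constraints can be checked with a simple binary function of the observation; the indexing follows the standard Gym CartPole observation (cart position, cart velocity, pole angle, pole angular velocity), and treating the momentum bound as a bound on cart velocity, as well as interpreting the angle bound as a bound on its magnitude, are simplifying assumptions of this sketch.</p>
      <preformat>
def cartpole_unsafe(obs):
    """Binary safety check for the Cart Pole constraints listed above.
    obs = (cart_position, cart_velocity, pole_angle, pole_angular_velocity).
    Returns 1 (UNSAFE) if any constraint is violated, else 0 (SAFE)."""
    position, velocity, angle = obs[0], obs[1], obs[2]
    violated = (
        abs(position) &gt; 2.4       # constraint 1: cart position bound
        or abs(velocity) &gt; 2.0    # constraint 2: momentum/velocity bound
        or abs(angle) &gt; 0.2       # constraint 3: pole angle bound
    )
    return 1 if violated else 0
      </preformat>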
      <p>Inverted Pendulum Environment: This environment is taken from the Gym MuJoCo environments. It is similar to the Cart Pole environment, except that its action space is continuous. Here the pole on top of the cart is controlled by applying a force between (-1, +1) to the cart to prevent the pole from falling over. The safety constraints are identical to those for the Cart Pole environment.</p>
      <p>Lunar Lander Environment: The Lunar Lander environment is taken from the Box2D environments. We have applied our framework to both the continuous and discrete versions of the environment. This environment aims to land the lunar lander smoothly on the helipad marked with two flags by controlling the thrust of the rocket engines on the lander's left, right, and bottom. For this environment we consider the following set of safety constraints:
1. The lander should land only on the helipad, that is, the X-position (PosX) of the lander must be within [-0.2, +0.2].
2. The lander tilt angle should not be beyond -1 or +1 if the lander Y-position (PosY) is less than 0.1 and the lander is just over the helipad.</p>
      <p>Baselines. We have used the standard PPO with a negative reward for constraint violation as Baseline 1 (BASE1), SAC (Soft Actor-Critic) [14] with a negative reward for constraint violation as Baseline 2 (BASE2), and VPG (Vanilla Policy Gradient) [15] with a negative reward for constraint violation as Baseline 3 (BASE3). We have implemented all baselines on OpenAI's Spinning Up library [16]. Negative rewards on constraint violations are provided to all the baselines and to the proposed PSPO framework. The SAC algorithm does not support discrete action spaces; hence, BASE2 is not considered in the Cart Pole and Lunar Lander (Discrete) environments.</p>
      <p>There are other approaches for safe exploration in RL, for example, Constrained Policy Optimization (CPO) and PPO-Lagrangian. However, both these methods assume safety as a continuous cost function that is returned by the environment, like the reward generated each time the agent applies an action. In our setup, we have considered safety as a binary function of a state.</p>
      <p>Implementation Details. For the environments, we considered using Random Forest [17], based on the ensemble method, as the Adaptive Safety Shield (SS). The advantages of using Random Forests as safety predictors are the following:
• Explainable. The individual decision trees of a Random Forest are explainable and can easily be interpreted and verified by human experts. They can also be factored into a rule-based system.
• Efficient. In a robust system, failures are rare, and thereby the number of updates needed for the safety shield declines rapidly.
• Augmentable. Individual decision trees can easily be augmented with known safety constraints [18], which enables us to include safety constraints provided by domain experts.</p>
      <p>One problem of using a random forest or decision tree as the Safety Shield, mainly in the case of discrete environments, is that the action is not considered in each decision branch. If an action from a state is found unsafe, then the safety shield could predict all the remaining actions as unsafe, whereas there could be an unexplored action that is safe. To avoid such issues, we use a simple re-labeling trick, where we use label (i+1) if the i-th action (starting the index from 0) is unsafe, and we use label 0 if the action is safe. During prediction, if the (i+1)-th label prediction probability is greater than the provided safety threshold, then the i-th action is considered unsafe; otherwise, the action is considered safe.</p>
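      <p>A minimal scikit-learn sketch of this re-labeling trick for a discrete action space is shown below; the class and method names, the feature encoding, and the default threshold value are illustrative assumptions rather than our exact implementation.</p>
      <preformat>
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class DiscreteSafetyShield:
    """Random Forest safety shield using the (i+1) re-labeling trick:
    label 0 means the explored action was safe; label i+1 means that
    action i led to an unsafe next state."""

    def __init__(self, n_trees=100):
        self.model = RandomForestClassifier(n_estimators=n_trees)

    def fit(self, states, actions, unsafe_flags):
        states = np.asarray(states)
        labels = np.where(np.asarray(unsafe_flags) == 1,
                          np.asarray(actions) + 1, 0)
        self.model.fit(states, labels)

    def is_unsafe(self, state, action, safety_threshold=0.9):
        proba = self.model.predict_proba(np.asarray(state).reshape(1, -1))[0]
        classes = list(self.model.classes_)
        if (action + 1) not in classes:
            return False                     # never observed as unsafe
        p_unsafe = proba[classes.index(action + 1)]
        # Unsafe only if the (i+1)-label probability exceeds the threshold.
        return p_unsafe &gt; safety_threshold
      </preformat>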
      <p>We use the standard PyTorch implementation of the PPO algorithm provided in OpenAI's Spinning Up library as the Policy Learner. The Auxiliary Policy Network is constructed by replicating the PPO's Actor network. We have also run experiments of our PSPO method with different safety bound values to show the impact of the safety bound hyperparameter in the PSPO framework.</p>
    </sec>
    <sec id="sec-results">
      <title>6. Results and Discussion</title>
      <p>In this section we describe the results we have obtained in comparison to the baselines described in the previous section. Figure 4 shows Epoch (X-axis) versus Total Number of Safety Violations (Y-axis) for all four Gym environments. All the graphs show that the PSPO framework has significantly fewer safety violations during training than the BASE1 method. These figures also show that the average number of safety violations per epoch eventually tends to zero, as the PSPO curves slowly enter the lag phase. In these figures, we did not consider the other two baselines due to their large number of safety violations. Table 2 reports the number of safety violations as failed episodes out of the total episodes for each algorithm on CartPole-v1 (100 epochs), InvertedPendulum-v2 (100 epochs), LunarLander-v2 (200 epochs), and LunarLanderContinuous-v2 (200 epochs); the episodic performance of PSPO is better than the other baselines. This evidence supports our second claim.</p>
      <p>Figure 3 shows the impact of the safety bound in the PSPO framework in terms of total safety violations and average episodic return. With higher values of this hyperparameter, the safety shield intervenes to correct fewer unsafe actions. In contrast, with low values of the safety bound, the safety shield unnecessarily corrects a higher number of safe actions. Hence, the safety bound controls the trade-off between False Safe and False Unsafe. In this case, a safety bound of 0.9 provided the best results, both for episodic returns and for fewer safety violations.</p>
    </sec>
    <sec id="sec-5">
      <title>7. Related Work</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5, 31</xref>
        ] consider safety as a continuous cost function,
and the cumulative cost should be within a specified limit
to be safe. In [
        <xref ref-type="bibr" rid="ref6">6, 32</xref>
        ], a safety-critic-based approach is
proposed, where if the safety critic predicts an action as
unsafe, the agent samples a different action. In [32], the
safety critic is learned along with the policy, whereas in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
the safety critic is learned separately.
      </p>
      <p>Another line of work can be found in [13], where the authors propose a method to incorporate safety into a learned policy by finding the counterexamples, or failure states, and then minimally modifying the policy for the corresponding states.</p>
    </sec>
    <sec id="sec-6">
      <title>8. Conclusion</title>
      <sec id="sec-6-1">
        <title>We have presented a method for safe exploration in RL.</title>
        <p>We use an Adaptive Safety Shield to learn the
state-tounsafe action mapping from the past exploration and
provide guidance to the RL agent to avoid repeating its
mistakes. We have provided an auxiliary policy based update
method to incorporate the safety guidance provided by
the safety shield into the RL agent while minimally
affecting the policy network for other state-actions. We have
also presented various experiments which empirically
validate that our method incurs fewer safety incidents while
achieving higher or similar performance.
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Garcıa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , Cs 188: Introduction to artificial intelligence,
          <year>2018</year>
          . URL: https://inst.eecs.berkeley.edu/~cs188/fa18/, accessed: 2022-08-06.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Constrained policy optimization</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Benchmarking Safe Exploration in Deep Reinforcement Learning (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eysenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <article-title>Learning to be safe: Deep RL with a safety critic</article-title>
          , CoRR abs/2010.14603 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2010.14603. arXiv:2010.14603.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alshiekh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bloem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ehlers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Könighofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Niekum</surname>
          </string-name>
          , U. Topcu,
          <article-title>Safe reinforcement learning via shielding</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>32</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pnueli</surname>
          </string-name>
          ,
          <article-title>The temporal logic of programs</article-title>
          ,
          <source>in: 18th Annual Symposium on Foundations of Computer Science</source>
          , Providence, Rhode Island, USA, 31 October - 1 November
          <year>1977</year>
          , IEEE Computer Society, pp.
          <fpage>46</fpage>
          -
          <lpage>57</lpage>
          . URL: https://doi.org/10.1109/SFCS.1977.32. doi:10.1109/SFCS.1977.32.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Dey, A. Mujumdar, P. Dasgupta, S. Dey, Adaptive safety shields for reinforcement learning-based cell shaping, IEEE Transactions on Network and Service Management (2022) 1–1. doi:10.1109/TNSM.2022.3194566.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, 2016.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, arXiv preprint arXiv:1506.02438 (2015).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] B. Gangopadhyay, P. Dasgupta, Counterexample guided RL policy refinement using Bayesian optimization, Advances in Neural Information Processing Systems 34 (2021) 22783–22794.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, P. Abbeel, High-dimensional continuous control using generalized advantage estimation, in: Y. Bengio, Y. LeCun (Eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL: http://arxiv.org/abs/1506.02438.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. S. Sutton, D. A. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: S. A. Solla, T. K. Leen, K. Müller (Eds.), Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], The MIT Press, 1999, pp. 1057–1063.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Achiam, Spinning Up in Deep Reinforcement Learning (2018).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] T. K. Ho, Random decision forests, in: Proceedings of 3rd International Conference on Document Analysis and Recognition, volume 1, 1995, pp. 278–282. doi:10.1109/ICDAR.1995.598994.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Dey, P. Dasgupta, B. Gangopadhyay, Safety augmentation in decision trees, in: Proceedings of the Workshop on Artificial Intelligence Safety 2020 co-located with the 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), Yokohama, Japan, January 2021, volume 2640 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2640/paper_13.pdf.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Feghhi, E. Aumayr, F. Vannella, E. A. Hakim, G. Iakovidis, Safe reinforcement learning for antenna tilt optimisation using shielding and multiple baselines, arXiv preprint arXiv:2012.01296 (2020).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, D. Mané, Concrete problems in AI safety, arXiv preprint arXiv:1606.06565 (2016).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] P. Abbeel, A. Coates, A. Y. Ng, Autonomous helicopter aerobatics through apprenticeship learning, International Journal of Robotics Research (IJRR) 29 (2010).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. J. Perkins, A. G. Barto, Lyapunov design for safe reinforcement learning, Journal of Machine Learning Research 3 (2002) 803–832.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] F. Berkenkamp, R. Moriconi, A. P. Schoellig, A. Krause, Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes, 2016 IEEE 55th Conference on Decision and Control, CDC 2016 (2016) 4661–4666. doi:10.1109/CDC.2016.7798979. arXiv:1603.04915.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] Y. Chow, O. Nachum, E. Duenez-Guzman, M. Ghavamzadeh, A Lyapunov-based approach to safe reinforcement learning, arXiv preprint arXiv:1805.07708 (2018).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] T. Koller, F. Berkenkamp, M. Turchetta, A. Krause, Learning-based model predictive control for safe exploration, in: 2018 IEEE Conference on Decision and Control (CDC), IEEE, 2018, pp. 6059–6066.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] F. Berkenkamp, M. Turchetta, A. Schoellig, A. Krause, Safe model-based reinforcement learning with stability guarantees, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] B. Thananjeyan, A. Balakrishna, U. Rosolia, F. Li, R. McAllister, J. E. Gonzalez, S. Levine, F. Borrelli, K. Goldberg, Safety augmented value estimation from demonstrations (SAVED): Safe deep model-based RL for sparse cost robotic tasks, IEEE Robotics and Automation Letters 5 (2020) 3612–3619.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] N. Fulton, A. Platzer, Safe reinforcement learning via formal methods: Toward safe control through proof and learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] A. Nikou, A. Mujumdar, M. Orlic, A. V. Feljan, Symbolic reinforcement learning for safe RAN control, arXiv preprint arXiv:2103.06602 (2021).</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] C. Baier, J.-P. Katoen, Principles of Model Checking, MIT Press, 2008.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, M. Ghavamzadeh, Lyapunov-based safe policy optimization for continuous control, arXiv preprint arXiv:1901.10031 (2019).</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, A. Garg, Conservative safety critics for exploration, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=iaO86DUuKi.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>