=Paper=
{{Paper
|id=Vol-3087/paper_17
|storemode=property
|title=HiSaRL: A Hierarchical Framework for Safe Reinforcement Learning
|pdfUrl=https://ceur-ws.org/Vol-3087/paper_17.pdf
|volume=Vol-3087
|authors=Zikang Xiong,Ishika Agarwal,Suresh Jagannathan
|dblpUrl=https://dblp.org/rec/conf/aaai/XiongAJ22
}}
==HiSaRL: A Hierarchical Framework for Safe Reinforcement Learning==
HiSaRL: A Hierarchical Framework for Safe Reinforcement Learning Zikang Xiong, Ishika Agarwal, Suresh Jagannathan Computer Science Department, Purdue University, West Lafayette, Indiana 47906 xiong84@cs.purdue.edu, agarwali@purdue.edu, suresh@cs.purdue.edu, Abstract navigation policy. However, a safe plan cannot guarantee safety at runtime. Because the low-level controller is not re- We propose a two-level hierarchical framework for safe rein- forcement learning in a complex environment. The high-level quired to follow the plan perfectly, agents can deviate from part is an adaptive planner, which aims at learning and gen- a safe plan and produce unsafe behaviors, such as hitting erating safe and efficient paths for tasks with imperfect map obstacles. To solve this problem, we apply a learned neural information. The lower-level part contains a learning-based Lyapunov function (Berkenkamp et al. 2016; Chang, Roohi, controller and its corresponding neural Lyapunov function, and Gao 2020) as a runtime monitor and harness its stability which characterizes the controller’s stability property. This property to enforce that the agent stays in a region specified learned neural Lyapunov function serves two purposes. First, by the function. We also incorporate the neural Lyapunov it will be part of the high-level heuristic for our planning algo- function as part of our planner heuristic. This incorpora- rithm. Second, it acts as a part of a runtime shield to guard the tion fuses our high-level planner and the low-level controller safety of the whole system. We use a robot navigation exam- seamlessly. Moreover, our learned neural Lyapunov func- ple to demonstrate that our framework can operate efficiently and safely in complex environments, even under adversarial tion does not require any knowledge about system dynam- attacks. ics. Hence, compared with (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019), our approach can extend to complex unknown dynamics. 1 Introduction The backend algorithms of our planner are A∗ (Hart, Nils- Although deep reinforcement learning has achieved promis- son, and Raphael 1972) and RRT∗ (Urmson and Simmons ing results in various domains, ensuring its safety still is a 2003). When we have accurate map information, both A∗ concern. One line of work (Bastani, Pu, and Solar-Lezama and RRT∗ work correctly. However, if map information is 2018; Zhu et al. 2019; Chang, Roohi, and Gao 2020; Dai inaccurate or outdated, our backend algorithm is unlikely to et al. 2021) provides rigorous safety guarantees by using an- work as expected. Hence, we further strengthen our planner alytic approaches. These proposals generally require knowl- with a refinement policy. The refinement policy’s mission edge of a system’s underlying dynamics and constraints, is repairing and refining planning decisions during runtime. making it challenging to generalize their method to han- Inspired by the elastic band algorithm (Quinlan and Khatib dle complex dynamics. On the other hand, hierarchical rein- 1993), we propose a learning-based approach to generate the forcement learning algorithms (Levy, Jr., and Saenko 2017; refinement policy. Nachum et al. 2019; Kreidieh et al. 2019) are attractive It is well-known that most deep learning algorithms suffer because they can support complex tasks without requiring from robustness issues. Adversarial attacks (Sun et al. 2018; knowledge of an underlying environment structure. How- Mankowitz et al. 2020; Chen et al. 2019; Zhang et al. 2020) ever, this lack of knowledge and the non-stationary MDP can effectively test the robustness of a learning-enabled sys- problem make these algorithms be time-intensive without tem. Hence, we formulate an attack mechanism against our providing any assurance of safety after training. framework to force an agent (i.e, robot) to deviate outside In contrast, our framework assumes, in the specific plan- the region specified by the Lyapunov function. Our results ning setting we consider, that our learning algorithm can as- show that unless we apply an unrealistically high attack fre- sess the map information of one environment. Such informa- quency and force significant perturbation, our framework is tion can be a priori modeled with a high-definition map or robust and can keep the robot within a safe region. collected during runtime using techniques such as SLAM. The contributions of this work can be summarized as fol- With map information, a high-level planner can generate a lows: safe and efficient path. This planner frees our agents from • We propose a hierarchical framework that reconciles unsafe and inefficient exploration for generating a high-level both efficiency and safety. Our framework can enhance Copyright © 2022 for this paper by its authors. Use permitted under the safety of an agent in complex environments, where Creative Commons License Attribution 4.0 International (CC BY the dynamics of the controlled robots are unknown, and 4.0). the environment information (i.e., map and obstacle in- 𝑝𝑜𝑠$ 𝑑$ Subgoal 𝑔! Subgoal 𝑔!"# Figure 2: A demonstration on how the low-level controller executes a plan. The star is a subgoal, and the green line is the plan. The low-level controller may drift from the plan and generate the dotted path. Although our plan avoids hit- ting the cat, the low-level controller still cannot ensure safety during the runtime. Figure 1: Sweeping robot in room example. We need to con- trol the sweeping robot to reach the charger without hitting Lyapunov Function The large deviation dt can cause un- cats and walls. The green lines are planed path. safe behaviors. Hence, we hope to constrain the robot around our plan. The Lyapunov function is a positive-definite func- formation) can be imperfect. tion for analyzing the stability of a system. A Lyapunov function V (x) characterizes a control system’s Region Of • We consider an approach to adapt the high-level planner Attraction (ROA). The ROA binds the robot to stay around to deal with imperfect information. the plan. We will introduce the technical details for the neu- • We demonstrate the robustness of our framework under ral Lyapunov in Section 3.2. adversarial attack. 2 Motivation Example 2.1 Hierarchical Framework Subgoal 𝑔!"# The simple navigation task in Figure 1 requires navigat- ROA Subgoal 𝑔!"% ing the sweeping robot from the initial position s0 to the goal position gT (the charger). We solve this problem with a two-level hierarchical framework. First, a high-level planner Subgoal 𝑔! Subgoal 𝑔!"$ finds a safe plan (i.e., no collision with walls and cats) from Captured the initial position s0 to the goal position gT . The plan is a sequence of subgoals shown as the green line in Figure 1. We denote it as P (s0 , gT ) = (g0 , g1 , . . . , gT ), where gi is Figure 3: The way to apply the Lyapunov function. We built a subgoal, g0 = s0 . Then, A low-level controller πl (s|g) a sequence of ROAs around the subgoals. One ROA is a re- is introduced to execute the plan. The low-level controller gion that the robot will stay in when the robot is heading to is conditioned by a subgoal g, and it is trained to predict the corresponding subgoal (i.e., the sink of the ROA). Each the optimal action under state s to reach g. The low-level ROA has intersections with its neighbors. controller is trained with TD3 (Fujimoto, Hoof, and Meger 2018) using a reward for achieving the given subgoal g with the shortest path. Runtime Shield With the plan and ROAs built by the Lya- punov function, we can construct a runtime shield to guard 2.2 Runtime Safety the safety of our robot. Unlike the previous method (Bas- tani, Pu, and Solar-Lezama 2018; Zhu et al. 2019), we do Although the high-level planner can provide a safe plan, not require an additional safe policy for the recovering pur- the low-level controller does not necessarily follow the plan pose. Instead, we construct the runtime shield by switching strictly. Especially when we train the low-level controller the robot’s heading toward different subgoals. Specifically, with a model-free reinforcement learning algorithm, it is we select two consecutive subgoals during the runtime. The quite common that the agent finds an unexpected approach first one is the latest subgoals achieved; the second one is to achieve the goal. Once the unstable low-level controller the next subgoals. Then, we select the first subgoal’s ROA leads our robot to deviate from our safety plan, we cannot to monitor our robot. While the robot stays in the selected guarantee the safety of our robot. ROA, we set its subgoal to the second subgoal that we have Stability of Low-level Controller We measure the stabil- not achieved yet. When the robot goes beyond the selected ity of a low-level controller with the deviation between the ROA, we check whether it is “captured” by the next ROA actual trajectory and the plan. In Figure 2, the robot’s po- (i.e., the robot enters the intersection part of two ROA, see sition at time t is post , the deviation dt is defined as the Figure 3). If the robot is captured, we will change the se- Euclidean distance from post to the plan fragment between lected ROA to the next ROA until the robot achieves the 2 subgoals. second subgoal. Nevertheless, if the robot is not captured by the next ROA and slides out of the selected ROA, we will set Refinement Policy Naturally, the map will change after its subgoal to the first subgoal to pull it back. The runtime the data is collected. Hence, the plan made in the outdated shield binds a robot to stay around its plan. We can consider map may cause undesired results. Thus, we consider a re- the ROA while planning and generating a plan with safety pair schema during runtime. The key idea is the elastic boundaries. Consequently, the runtime safety of our system band (Quinlan and Khatib 1993) which optimizes a planning is guarded. A more formal description is provided in Sec- “string” with the internal and repulsive force. The internal tion 3.3. force, formally, can be described as 2.3 Planner gi−1 − gi gi+1 − gi Fin = kin · + The backend algorithm of our planner is A∗ and RRT∗ . Both kgi−1 − gi k kgi+1 − gi k of them require a heuristic guiding the efficient search. The where kin is the elasticity coefficient. The internal force for most straightforward heuristic might be the Euclidean dis- subgoal gi is computed with the sum of the two direction tance. However, this heuristic cannot guarantee that the gen- vectors toward its neighbors. In addition to the internal force, erated path is safe for our system. For example, a planning a repulsive force is applied to prevent the ROA from inter- path may get close to the wall, which does not leave enough secting with obstacles. When we have perfect map data, we space for the shield to switch the subgoal for avoidance. can compute the repulsive force easily. However, it can be Thus, we have to consider the Lyapunov function as part a problem when the map data is imperfect. In this case, we of the heuristic. We further consider scenarios that we do only have the raw sensor data during runtime. To address not have the perfect map data. In this case, we need to col- this problem, we learn a function kre (gi |hi ) to predict the lect the sensor data during the runtime and refine our plan. repulsive coefficient, given the subgoal gi and a sequence We achieved the refinement by employing a high-level re- of sensor data history hi . Finally, the applied force ∆f for inforcement learning policy. This refinement policy enables subgoal gi is our system to operate safely with imperfect map data. gi−1 − gi gi+1 − gi ∆F = (kin − kre (gi |hi )) · + kgi−1 − gi k kgi+1 − gi k 𝑔!"# We can iteratively update the gi with gi0 = gi + α∆F to find the equilibrium where ∆F = 0. α here is a small constant. 𝑔! An example is provided in Figure 4(b). The blue gi was up- dated to the yellow position with iteratively applying ∆F . 𝑔!"# 2.4 Adversarial Attack (a) Robot sensors (b) Refinement in U-maze environment Figure 4: (a) demonstrates the sensors on the robot. The Li- DARs scans 16 positions around, which generates an obser- vation vector with 16 distance elements. (b) is an example 90 cm for refinement policy. The dashed region does not exist in ~ 40 cm real word, but the map used for planning marks it as obsta- 30 cm ı cle. 50 cm Heuristic A simple Euclidean distance heuristic is the L2 distance from current subgoal position to the final goal po- sition, heuc (gt ) = ||gT − gt ||. This heuristic guides the search toward the final goal. For the U-maze example in Figure 4(b), the path generated by heuc (gt ) will stick to the Figure 5: Cat parade environment for robustness evaluation wall in the lower part because this path is closer to the fi- nal goal. However, such a path does not provide any space To better evaluate the robustness of our framework, we as a safety distance. Thus, we also need to consider a Lya- consider attacking the framework from two aspects. Firstly, punov heuristic hlyap (gt ). Given xt , the relative position the neural network is criticized for being unrobust to the in- from the sink of ROA gt to the closest obstacle, we define put perturbations. Thus, we add adversarial noise to the low- the hlyap (xt ) to characterize the safety. If the closest ob- level controller and the neural Lyapunov function’s inputs in stacle is located outside the ROA, hlyap (xt ) returns 0. Oth- the first type of attack. Secondly, the robustness of a system erwise, hlyap (xt ) returns −∞ to disable the search on this is not only affected by the upstream perception data fed to path. We compose the Euclidean heuristic for efficiency and the neural network controller, but also the actual action exe- the Lyapunov heuristic for safety, and guide the planning cuted by downstream modules. That means even if the con- search with heuc (gt ) + hlyap (xt ). troller outputs the right action, the noise in the downstream models can still cause undesired results. Hence, we further Eq. (3) is known as the lie derivative. When the lie derivative attack our framework with noise added to the action. is smaller than 0, V (xt ) strictly decreases along with time. We evaluate the robustness of our system with this simple We compute the Lyapunov function with a neural network but safety-sensitive cat parade environment in Figure 5. The and train the neural network with a loss function based on adversary needs to fool our neural Lyapunov function and the Zubov-type equation (Grune and Wurth 2000). The loss guide the robot to hit the cats. We assume that we can access function is the parameters of the neural Lyapunov function and the low- level controller. Because our framework aims to work on the robot with complex dynamics, it is generally challenging to Lθ (xt , xt+1 ) = |Vθ (x0 )| model the robot’s dynamics required by a gradient-based at- + Vθ (xt )(Vθ (xt+1 ) − Vθ (xt )) − ||xt ||2 tack such as FGSM (Goodfellow, Shlens, and Szegedy 2014) and PGD (Madry et al. 2017). Hence, similar to (Zhao et al. The Monte-Carlo estimation of L is 2020), we learn approximated surrogate dynamics. We pro- vide the technical details in Section 3.5. 1 X Lθ = ( |Vθ (x0 )| N x ,x ∼π 3 Approach t t+1 l Our work incorporated a Lyapunov function into a learning- + Vθ (xt )(Vθ (xt+1 ) − Vθ (xt )) − ||xt ||2 ) enabled control system. First, because it is challenging to reason about the behavior of a RL-trained controller, we use To characterize the stability of πl , xt and xt+1 are sampled the Lyapunov function to characterize its stability property. with the πl . We can compute the gradient with L and op- The Lyapunov function provides us with the power to bind a timize the network’s parameters θ. The optimal parameters robot to a specified region. Second, we assemble the speci- are θ∗ = arg minθ (Lθ ). fied ROA with a high-level planner. By considering the Lya- The neural Lyapunov function specifies the ROA with a punov function and the planning heuristic jointly, we can constant CROA . generate a safety plan. When executing the plan, a safety shield guards the controlled robot. Additionally, we enhance ROA = {g + x|V (x) < CROA } our high-level planner with a learning-based high-level pol- Where g is the sink of a ROA. icy. This policy refines plans with the sensor data collected during runtime, which gives our high-level planner the abil- 3.3 Runtime Shield ity to work on imperfect map data. Finally, we evaluate the The Lyapunov function specifies a single ROA that a robot robustness of our system against adversarial attacks. stays in. However, a runtime shield binds a robot to stay around a given plan P (s0 , gT ) = (g0 , g1 , . . . , gT ). To deal 3.1 Low-Level Controller with the sequence of subgoals, we check the ROAs of any A low-level controller πl (s|g) is introduced to achieve a sub- two consecutive subgoals. Given the lower-level controller goal g. The controller is trained with TD3. We designed its πl , previous system state si−1 , current system state si , ROAt reward function as with sink gt , and ROAt+1 with sink gt+1 , the algorithm 1 shows the details of the runtime shield. r(st ) = ||st−1 − gt−1 || − ||st − gt || We compute the distance-to-goal at the previous time step Algorithm 1: Sequential Shield t − 1 and the current time step t. The reward function is the function S HIELD(πl , si−1 , si , gt , gt+1 , ROAt , ROAt+1 ) difference between the two distances. TD3 maximizes this if si ∈ ROAt ∨ si ∈ ROAt+1 then reward w.r.t the Bellman function. As a result, the trained return πl (si |gt+1 ) agent is expected to get closer to the given goal in the fastest else if (si−1 ) ∈ ROAt ∧ si ∈ / ROAt+1 then direction. return πl (si |gt ) . Pull back to ROAt 3.2 Neural Lyapunov Function and ROA else return πl (si |gt+1 ) We use the neural Lyapunov function to characterize the sta- bility property. The Lyapunov function is a positive-definite function satisfying three constraints. When si ∈ ROAt ∨ si ∈ ROAt+1 , the πi is parameter- ized with the next subgoal gt+1 , and the robot moves toward V (xo ) =0 (1) gt+1 . When (si−1 ) ∈ ROAt ∧si ∈ / ROAt+1 , the robot moves ∀x 6= x0 , V (x) >0 (2) from the ROAt to a region other than ROAt+1 . We need to pull the robot back to ROAt , thus the πi is parameterized by V (xt+1 ) − V (xt ) <0 (3) gt . The last condition is (si−1 ) ∈ ROAt+1 ∧ si ∈ / ROAt+1 , In Eq. (1), a Lyapunov function’s value is 0 on the origin. 1 although this is not supposed to happen. If it happens due to Eq. (2) is the position property of a Lyapunov function. any aspect of imprecision, we return action πi (si |gt+1 ). 1 The Lyapunov function’s input x and state s are in the same of s is determined by the system. The origin of x, however, is a space. The only difference is their coordination origin. The origin selected subgoal during the runtime. x = s − g. 3.4 Planner where ε is the attack noise size. Because we can compute Heuristic Our planning heuristic has two parts. The heuc the gradient of st and at , we can also apply PGD or other for efficiency and the hlyap for safety. heuc is the Euclidean attack techniques similarly. distance to the final goal. Attack Neural Lyapunov Function We want to fool the heuc (gt ) = ||gT − gt || neural Lyapunov function with an attacked input x̂t . In this case, we expect a small V (x̂t ), where V is the Lyapunov This heuristic is designed to search for the shortest path, but function. The attacked input x̂t can be computed with gradi- does not consider safety. Hence, it can generate paths that ent dVdx(xt t ) . are close to obstacles, which may cause undesired behaviors during the runtime. To address this problem, we introduce a dV (xt ) Lyapunov heuristic. x̂t = xt + ε · sign( ) dxt Note that we optimize the attacked input x̂t of the Lyapunov 0 V (xt ) > CROA , function and the attacked state ŝt of the low-level controller hlyap (xt ) = −∞ V (xt ) ≤ CROA separately. That makes the attack more potent. Suppose the closest obstacle position to gt is ot , xt = ot − gt ; CROA is the constant specified the ROA. This heuristic 4 Primary Evaluation ensures that the ROAs do not intersect with obstacles. In this section, we report the primary evaluation results on our motivation examples introduced in Section 2. Refinement Policy We provided an example for the re- finement policy in Section 2.3. Force ∆F was applied to 4.1 Low-level controller update the subgoal gi of a path. We train the low-level controller πl (s|g) with TD3. The gi−1 − gi gi+1 − gi training results are provided in Figure 6. Each training it- ∆F = (kin − kre (gi |hi )) · + kgi−1 − gi k kgi+1 − gi k eration indicates updates on the low-level controller. The updates are executed periodically in every 2000 simulation The updating is iterative, steps. All the reward curves in Figure 6 converge to a reward around -2.5. gi0 = gi + α · ∆F The new subgoal gi0 should converge to a fixed point where kin = kre (gi0 |hi ). The kre (gi0 |hi ) was learned using an encoder-decoder structure. First, we train an auto-encoder EN C to encode the history. Then, we concatenate the em- bedding EN C(hi ) with subgoal gi , and train an MLP to predict the kre . The training data was sampled with simula- tions in different scenarios. 3.5 Robustness to Adversarial Attack Surrogate Dynamics The surrogate dynamics is a func- tion fdyn (st , at |ht ) = st+1 where the st is the state and at is the action. ht is the his- tory states of the agent. If a system is fully-observable, we Figure 6: Training reward of πl (s|g). This figure includes can ignore the ht . The fdyn computes the next state of the results of 5 runs with different random seeds. system. Attack Controller We define an objective function fobj that measures the distance to the planning path. Maximiz- 4.2 Neural Lyapunov Function ing the fobj (st+1 ) guides the robot to deviate from the plan. We train the neural Lyapunov function with a dataset con- We considered a simple FGSM attack. When attacking the taining 106 transitions and test the neural Lyapunov with a input state of the low-level controller, we can compute the test dataset with 2 × 105 transitions. Both the training and attacked state ŝt with test dataset are sampled with the same low-level controller. During the test time, we check the properties in Eq. (1), ∂fobj (fdyn (st , at |ht )) Eq. (2) and Eq. (3) on the test dataset with five neural Lya- ŝt = st + ε · sign( ) punov functions trained with different splits on the training ∂st On the other hand, if we want to attack the action, the at- and testing dataset. tacked action ât can be computed with The statistics are in Table 1. The minimum percentage of property violations is 0%, and the maximum is 0.3%. Thus, ∂fobj (fdyn (st , at |ht )) out of around 10,000 simulations, only 30 simulations are ât = at + ε · sign( ) violated. ∂at Stat. min max mean std attack action results Vio. 0 0.00294 0.000656 0.00115 30 0.0 25 0.0* 1.0 Table 1: Neural Lyapunov function training results. 20 1.0* 2.0 dmax 15 2.0* 3.0 10 3.0* 4.0 4.3 Refinement Policy 5 4.0* 0 0.2 0.4 0.6 0.8 1.0 We generated the sensor data and computed the accurate re- attack frequency pulsive constant for training repulsive prediction network kre (gi |hi ) in different scenarios (i.e., the obstacles appear Figure 8: Attack action results in different positions). 10, 000 trajectories are generated, which includes 9, 500 training trajectories and 500 test tra- jectories. Each trajectory contains 5 seconds of sensor data. Figure 8 provides the attack action results. Our robot’s However, our repulsive prediction network only uses latest action range is [−5, 5]. The action represents the displace- 1 seconds sensor data as the history hi in kre (gi |hi ). The ment every 0.1 seconds. We set the attack noise from 0 to 4. range of the kre is [−1, 1], the prediction has error around The dmax is smaller when we do not attack the Lyapunov 0.01. In our toy example provided in Figure 4, we measure function. Overall, the dmax increases as the ε and attack the average distance between plan and center obstacle. We frequency raises. We also observe that sometimes larger ε hope this distance be small while the ROA should not collide results in smaller dmax . This is because a stronger attack with the obstacle. Before applying our refinement policy, the sometimes can cause intensive calling on the shield’s pull- average distance is 51.32. After 100 runs and refinements, back action. For example, when the attack frequency is 1.0 our refinement policy changed the average distance to 45.27 when the ε = 1.0, the dmax is smaller compared with when and avoided the collision between obstacles and ROAs. ε = 0.0. Finally, we notice that the first safety violation hap- pens when the attack frequency is 60% and ε = 4.0. 4.4 Robustness to Adversarial Attack We attacked both the state and action with different attack frequency and noise size ε. The Figure 7 and 8 provide the 5 Future work results of attacks on the state and action respectively. Each column is generated with 100 experiments with attacks at One issue we want to explore further is the scalability of random times. The attack frequency ranges from 0.2 to 1.0, our framework. There is a sequence of work has shown indicating the percentage of attacked transitions. The ∗ in the scalability (Gaby, Zhang, and Ye 2021; Dawson et al. the legend means that we also attack the Lyapunov function 2021) of the neural Lyapunov function and deep reinforce- in these experiments and the unit of ε is cm. The y−axis is ment learning (Pérez-Dattari et al. 2019; Fujimoto, Hoof, the max deviation to our plan. When the dmax > 30 cm, the and Meger 2018). Hence, it is natural to demonstrate the robot will hit cats in the example provided in Figure 5. scalability of our framework better on more complicated benchmarks (Achiam and Amodei 2019; Todorov, Erez, and attack state results Tassa 2012). On the other hand, the neural Lyapunov func- 30 tion we learned does not provide a verifiable guarantee (Dai 0.0 25 0.0* et al. 2021; Chang, Roohi, and Gao 2020; Zhu et al. 2019; 1.0 20 1.0* Bastani, Pu, and Solar-Lezama 2018) because we do not as- 2.0 dmax 15 2.0* sume the access of the dynamics of a system. However, in 3.0 the case that the dynamics can be easily modeled, extend- 10 3.0* 4.0 ing our work with the verifiable tools can provide a strong 5 4.0* 0 0.2 0.4 0.6 0.8 1.0 guarantee on the system safety. This is a direction we are attack frequency working on. Regarding the system robustness, we only eval- uated a limited number of adversarial attacks. It is interest- Figure 7: Attack state results ing to see how our framework performs under various attack techniques (Weng et al. 2020; Mankowitz et al. 2020; Zhang In Figure 7, when we do not attack the Lyapunov func- et al. 2020) while also consider the attack on the high-level tion, the dmax is always bounded below 15 cm. Hence, the planner (Xiang et al. 2018). Moreover, the current frame- protection provided by the shield works as expected. Never- work only considers a single agent and static obstacles. A theless, when the Lyapunov function is under attack, we no- challenging but fruitful direction is to extend our framework tice that all the dmax is significantly larger, which means the to multiagent planning and control (Guestrin, Koller, and Lyapunov function and shield can be affected by the attack. Parr 2001; Nissim, Brafman, and Domshlak 2010). Lastly, On the other hand, the dmax grows as the attack frequency our high-level planner is closely related to the safe neural and increases. The first safety violation happens when the motion planning (Huang et al. 2021; Qureshi et al. 2019). attack frequency is 60%, and the noise size is 4.0 cm, while We propose to investigate further on these works and better we also attack the Lyapunov function. refine our high-level planner. References for Continuous Control with Model Misspecification. In In- Achiam, J.; and Amodei, D. 2019. Benchmarking Safe Ex- ternational Conference on Learning Representations. ploration in Deep Reinforcement Learning. In arxiv. Nachum, O.; Gu, S.; Lee, H.; and Levine, S. 2019. Near- Bastani, O.; Pu, Y.; and Solar-Lezama, A. 2018. Verifi- Optimal Representation Learning for Hierarchical Rein- able Reinforcement Learning via Policy Extraction. CoRR, forcement Learning. In International Conference on Learn- abs/1805.08328. ing Representations. Berkenkamp, F.; Moriconi, R.; Schoellig, A. P.; and Krause, Nissim, R.; Brafman, R. I.; and Domshlak, C. 2010. A A. 2016. Safe Learning of Regions of Attraction for Un- general, fully distributed multi-agent planning algorithm. certain, Nonlinear Systems with Gaussian Processes. arXiv In Proceedings of the 9th International Conference on Au- e-prints, arXiv:1603.04915. tonomous Agents and Multiagent Systems: volume 1-Volume 1, 1323–1330. Chang, Y.-C.; Roohi, N.; and Gao, S. 2020. Neural lyapunov Pérez-Dattari, R.; Celemin, C.; Ruiz-del-Solar, J.; and control. arXiv preprint arXiv:2005.00611. Kober, J. 2019. Continuous Control for High-Dimensional Chen, T.; Liu, J.; Xiang, Y.; Niu, W.; Tong, E.; and Han, State Spaces: An Interactive Learning Approach. arXiv e- Z. 2019. Adversarial attack and defense in reinforcement prints, arXiv:1908.05256. learning-from AI security view. Cybersecurity, 2(1): 1–22. Quinlan, S.; and Khatib, O. 1993. Elastic bands: connecting Dai, H.; Landry, B.; Yang, L.; Pavone, M.; and Tedrake, path planning and control. [1993] Proceedings IEEE Inter- R. 2021. Lyapunov-stable neural-network control. arXiv national Conference on Robotics and Automation, 802–807 preprint arXiv:2109.14152. vol.2. Dawson, C.; Qin, Z.; Gao, S.; and Fan, C. 2021. Safe Non- Qureshi, A. H.; Simeonov, A.; Bency, M. J.; and Yip, M. C. linear Control Using Robust Neural Lyapunov-Barrier Func- 2019. Motion planning networks. In 2019 International tions. arXiv preprint arXiv:2109.06697. Conference on Robotics and Automation (ICRA), 2118– Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing 2124. IEEE. function approximation error in actor-critic methods. In In- Sun, Y.; Huang, X.; Kroening, D.; Sharp, J.; Hill, M.; and ternational Conference on Machine Learning, 1587–1596. Ashmore, R. 2018. Testing deep neural networks. arXiv PMLR. preprint arXiv:1803.04792. Gaby, N.; Zhang, F.; and Ye, X. 2021. Lyapunov-Net: A Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A Deep Neural Network Architecture for Lyapunov Function physics engine for model-based control. In 2012 IEEE/RSJ Approximation. arXiv e-prints, arXiv:2109.13359. International Conference on Intelligent Robots and Systems, Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explain- 5026–5033. ing and Harnessing Adversarial Examples. arXiv e-prints, Urmson, C.; and Simmons, R. 2003. Approaches for heuris- arXiv:1412.6572. tically biasing RRT growth. In Proceedings 2003 IEEE/RSJ Grune, L.; and Wurth, F. 2000. Computing control Lya- International Conference on Intelligent Robots and Systems punov functions via a Zubov type algorithm. In Proceedings (IROS 2003) (Cat. No.03CH37453), volume 2, 1178–1183 of the 39th IEEE Conference on Decision and Control (Cat. vol.2. No.00CH37187), volume 3, 2129–2134 vol.3. Weng, T.-W.; Dvijotham*, K. D.; Uesato*, J.; Xiao*, K.; Guestrin, C.; Koller, D.; and Parr, R. 2001. Multiagent Plan- Gowal*, S.; Stanforth*, R.; and Kohli, P. 2020. Toward Eval- ning with Factored MDPs. In NIPS, volume 1, 1523–1530. uating Robustness of Deep Reinforcement Learning with Continuous Control. In International Conference on Learn- Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1972. Correc- ing Representations. tion to ”A Formal Basis for the Heuristic Determination of Minimum Cost Paths”. SIGART Bull., 28–29. Xiang, Y.; Niu, W.; Liu, J.; Chen, T.; and Han, Z. 2018. A PCA-Based Model to Predict Adversarial Examples on Huang, X.; Feng, M.; Jasour, A.; Rosman, G.; and Williams, Q-Learning of Path Finding. In 2018 IEEE Third Interna- B. 2021. Risk Conditioned Neural Motion Planning. arXiv tional Conference on Data Science in Cyberspace (DSC), e-prints, arXiv:2108.01851. 773–780. Kreidieh, A. R.; Berseth, G.; Trabucco, B.; Parajuli, S.; Zhang, H.; Chen, H.; Xiao, C.; Li, B.; Boning, D. S.; and Levine, S.; and Bayen, A. M. 2019. Inter-Level Coopera- Hsieh, C.-J. 2020. Robust deep reinforcement learning tion in Hierarchical Reinforcement Learning. arXiv preprint against adversarial perturbations on observations. ICLR. arXiv:1912.02368. Zhao, Y.; Shumailov, I.; Cui, H.; Gao, X.; Mullins, R.; and Levy, A.; Jr., R. P.; and Saenko, K. 2017. Hierarchical Actor- Anderson, R. 2020. Blackbox attacks on reinforcement Critic. CoRR, abs/1712.00948. learning agents using approximated temporal information. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and In 2020 50th Annual IEEE/IFIP (DSN-W), 16–24. IEEE. Vladu, A. 2017. Towards deep learning models resistant to Zhu, H.; Xiong, Z.; Magill, S.; and Jagannathan, S. 2019. adversarial attacks. arXiv preprint arXiv:1706.06083. An inductive synthesis framework for verifiable reinforce- Mankowitz, D. J.; Levine, N.; Jeong, R.; Abdolmaleki, A.; ment learning. In Proceedings of the 40th ACM SIGPLAN Springenberg, J. T.; Shi, Y.; Kay, J.; Hester, T.; Mann, T.; Conference on Programming Language Design and Imple- and Riedmiller, M. 2020. Robust Reinforcement Learning mentation, 686–701.