          HiSaRL: A Hierarchical Framework for Safe Reinforcement Learning

                                   Zikang Xiong, Ishika Agarwal, Suresh Jagannathan
                                       Computer Science Department, Purdue University,
                                                 West Lafayette, Indiana 47906
                               xiong84@cs.purdue.edu, agarwali@purdue.edu, suresh@cs.purdue.edu

       Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



                             Abstract

  We propose a two-level hierarchical framework for safe reinforcement learning in complex environments. The high-level part is an adaptive planner, which aims at learning and generating safe and efficient paths for tasks with imperfect map information. The low-level part contains a learning-based controller and its corresponding neural Lyapunov function, which characterizes the controller's stability property. This learned neural Lyapunov function serves two purposes. First, it is part of the high-level heuristic for our planning algorithm. Second, it acts as part of a runtime shield that guards the safety of the whole system. We use a robot navigation example to demonstrate that our framework can operate efficiently and safely in complex environments, even under adversarial attacks.

                       1    Introduction
Although deep reinforcement learning has achieved promising results in various domains, ensuring its safety is still a concern. One line of work (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019; Chang, Roohi, and Gao 2020; Dai et al. 2021) provides rigorous safety guarantees by using analytic approaches. These proposals generally require knowledge of a system's underlying dynamics and constraints, making it challenging to generalize them to complex dynamics. On the other hand, hierarchical reinforcement learning algorithms (Levy, Jr., and Saenko 2017; Nachum et al. 2019; Kreidieh et al. 2019) are attractive because they can support complex tasks without requiring knowledge of the underlying environment structure. However, this lack of knowledge and the non-stationary MDP problem make these algorithms time-intensive without providing any assurance of safety after training.
   In contrast, our framework assumes, in the specific planning setting we consider, that our learning algorithm can access the map information of an environment. Such information can be modeled a priori with a high-definition map or collected during runtime using techniques such as SLAM. With map information, a high-level planner can generate a safe and efficient path. This planner frees our agents from unsafe and inefficient exploration when generating a high-level navigation policy. However, a safe plan cannot guarantee safety at runtime. Because the low-level controller is not required to follow the plan perfectly, agents can deviate from a safe plan and produce unsafe behaviors, such as hitting obstacles. To solve this problem, we apply a learned neural Lyapunov function (Berkenkamp et al. 2016; Chang, Roohi, and Gao 2020) as a runtime monitor and harness its stability property to enforce that the agent stays in a region specified by the function. We also incorporate the neural Lyapunov function as part of our planner heuristic. This incorporation fuses our high-level planner and the low-level controller seamlessly. Moreover, our learned neural Lyapunov function does not require any knowledge about system dynamics. Hence, compared with (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019), our approach can extend to complex unknown dynamics.
   The backend algorithms of our planner are A∗ (Hart, Nilsson, and Raphael 1972) and RRT∗ (Urmson and Simmons 2003). When we have accurate map information, both A∗ and RRT∗ work correctly. However, if the map information is inaccurate or outdated, the backend algorithm is unlikely to work as expected. Hence, we further strengthen our planner with a refinement policy. The refinement policy's mission is to repair and refine planning decisions during runtime. Inspired by the elastic band algorithm (Quinlan and Khatib 1993), we propose a learning-based approach to generate the refinement policy.
   It is well known that most deep learning algorithms suffer from robustness issues. Adversarial attacks (Sun et al. 2018; Mankowitz et al. 2020; Chen et al. 2019; Zhang et al. 2020) can effectively test the robustness of a learning-enabled system. Hence, we formulate an attack mechanism against our framework that forces an agent (i.e., a robot) to deviate outside the region specified by the Lyapunov function. Our results show that unless we apply an unrealistically high attack frequency and force significant perturbation, our framework is robust and can keep the robot within a safe region.
   The contributions of this work can be summarized as follows:
 • We propose a hierarchical framework that reconciles both efficiency and safety. Our framework can enhance the safety of an agent in complex environments, where the dynamics of the controlled robots are unknown and the environment information (i.e., map and obstacle information) can be imperfect.
 • We consider an approach to adapt the high-level planner to deal with imperfect information.
 • We demonstrate the robustness of our framework under adversarial attack.
Figure 1: Sweeping robot in room example. We need to control the sweeping robot to reach the charger without hitting cats and walls. The green lines are the planned path.

                  2    Motivation Example
2.1   Hierarchical Framework
The simple navigation task in Figure 1 requires navigating the sweeping robot from the initial position s0 to the goal position gT (the charger). We solve this problem with a two-level hierarchical framework. First, a high-level planner finds a safe plan (i.e., no collision with walls and cats) from the initial position s0 to the goal position gT. The plan is a sequence of subgoals shown as the green line in Figure 1. We denote it as P(s0, gT) = (g0, g1, ..., gT), where gi is a subgoal and g0 = s0. Then, a low-level controller πl(s|g) is introduced to execute the plan. The low-level controller is conditioned on a subgoal g, and it is trained to predict the optimal action under state s to reach g. The low-level controller is trained with TD3 (Fujimoto, Hoof, and Meger 2018) using a reward for achieving the given subgoal g along the shortest path.

2.2   Runtime Safety
Although the high-level planner can provide a safe plan, the low-level controller does not necessarily follow the plan strictly. Especially when we train the low-level controller with a model-free reinforcement learning algorithm, it is quite common that the agent finds an unexpected approach to achieve the goal. Once the unstable low-level controller leads our robot to deviate from our safe plan, we cannot guarantee the safety of our robot.

Stability of Low-level Controller We measure the stability of a low-level controller with the deviation between the actual trajectory and the plan. In Figure 2, the robot's position at time t is post, and the deviation dt is defined as the Euclidean distance from post to the plan fragment between two subgoals.

Figure 2: A demonstration of how the low-level controller executes a plan. The star is a subgoal, and the green line is the plan. The low-level controller may drift from the plan and generate the dotted path. Although our plan avoids hitting the cat, the low-level controller still cannot ensure safety during the runtime.

Lyapunov Function A large deviation dt can cause unsafe behaviors. Hence, we want to constrain the robot around our plan. The Lyapunov function is a positive-definite function for analyzing the stability of a system. A Lyapunov function V(x) characterizes a control system's Region Of Attraction (ROA). The ROA binds the robot to stay around the plan. We introduce the technical details of the neural Lyapunov function in Section 3.2.

Figure 3: The way to apply the Lyapunov function. We build a sequence of ROAs around the subgoals. One ROA is a region that the robot will stay in while it is heading to the corresponding subgoal (i.e., the sink of the ROA). Each ROA has intersections with its neighbors.

Runtime Shield With the plan and the ROAs built by the Lyapunov function, we can construct a runtime shield to guard the safety of our robot. Unlike previous methods (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019), we do not require an additional safe policy for recovery. Instead, we construct the runtime shield by switching the robot's heading toward different subgoals. Specifically, we select two consecutive subgoals during the runtime. The first one is the latest subgoal achieved; the second one is the next subgoal. Then, we select the first subgoal's ROA to monitor our robot. While the robot stays in the selected ROA, we set its subgoal to the second subgoal that we have not achieved yet. When the robot goes beyond the selected ROA, we check whether it is "captured" by the next ROA (i.e., the robot enters the intersection of the two ROAs, see Figure 3). If the robot is captured, we change the selected ROA to the next ROA until the robot achieves the second subgoal. Nevertheless, if the robot is not captured by
the next ROA and slides out of the selected ROA, we set its subgoal to the first subgoal to pull it back. The runtime shield binds a robot to stay around its plan. We can consider the ROA while planning and generate a plan with safety boundaries. Consequently, the runtime safety of our system is guarded. A more formal description is provided in Section 3.3.

2.3   Planner
The backend algorithms of our planner are A∗ and RRT∗. Both of them require a heuristic to guide an efficient search. The most straightforward heuristic might be the Euclidean distance. However, this heuristic cannot guarantee that the generated path is safe for our system. For example, a planned path may get close to a wall, which does not leave enough space for the shield to switch the subgoal for avoidance. Thus, we have to consider the Lyapunov function as part of the heuristic. We further consider scenarios where we do not have perfect map data. In this case, we need to collect sensor data during the runtime and refine our plan. We achieve the refinement by employing a high-level reinforcement learning policy. This refinement policy enables our system to operate safely with imperfect map data.

Figure 4: (a) Robot sensors; (b) refinement in the U-maze environment. (a) demonstrates the sensors on the robot: the LiDAR scans 16 positions around the robot, which generates an observation vector with 16 distance elements. (b) is an example for the refinement policy: the dashed region does not exist in the real world, but the map used for planning marks it as an obstacle.

Heuristic A simple Euclidean distance heuristic is the L2 distance from the current subgoal position to the final goal position, heuc(gt) = ||gT − gt||. This heuristic guides the search toward the final goal. For the U-maze example in Figure 4(b), the path generated by heuc(gt) will stick to the wall in the lower part because this path is closer to the final goal. However, such a path does not provide any space as a safety distance. Thus, we also need to consider a Lyapunov heuristic hlyap(xt). Given xt, the relative position from the sink of ROA gt to the closest obstacle, we define hlyap(xt) to characterize the safety. If the closest obstacle is located outside the ROA, hlyap(xt) returns 0. Otherwise, hlyap(xt) returns −∞ to disable the search on this path. We compose the Euclidean heuristic for efficiency and the Lyapunov heuristic for safety, and guide the planning search with heuc(gt) + hlyap(xt).

Refinement Policy Naturally, the map may change after the data is collected. Hence, a plan made on an outdated map may cause undesired results. Thus, we consider a repair schema during runtime. The key idea is the elastic band (Quinlan and Khatib 1993), which optimizes a planning "string" with an internal and a repulsive force. The internal force, formally, can be described as

    Fin = kin · ( (gi−1 − gi)/||gi−1 − gi|| + (gi+1 − gi)/||gi+1 − gi|| )

where kin is the elasticity coefficient. The internal force for subgoal gi is computed as the sum of the two direction vectors toward its neighbors. In addition to the internal force, a repulsive force is applied to prevent the ROA from intersecting with obstacles. When we have perfect map data, we can compute the repulsive force easily. However, it becomes a problem when the map data is imperfect. In this case, we only have the raw sensor data during runtime. To address this problem, we learn a function kre(gi|hi) to predict the repulsive coefficient, given the subgoal gi and a sequence of sensor data history hi. Finally, the applied force ∆F for subgoal gi is

    ∆F = (kin − kre(gi|hi)) · ( (gi−1 − gi)/||gi−1 − gi|| + (gi+1 − gi)/||gi+1 − gi|| )

We can iteratively update gi with gi' = gi + α∆F to find the equilibrium where ∆F = 0. Here α is a small constant. An example is provided in Figure 4(b): the blue gi is updated to the yellow position by iteratively applying ∆F.
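To make this update concrete, the following Python sketch applies the force ∆F iteratively to a single subgoal. It is illustrative only: the function name, tolerance, and step size are assumptions, and k_re stands in for the learned predictor kre(gi|hi) with the history hi already bound.

import numpy as np

def refine_subgoal(g_prev, g, g_next, k_in, k_re, alpha=0.01, tol=1e-4, max_iter=1000):
    # Iteratively apply the force ΔF until it (approximately) vanishes, i.e. until the
    # subgoal reaches the equilibrium where k_in = k_re(g_i | h_i).
    for _ in range(max_iter):
        d_prev = (g_prev - g) / np.linalg.norm(g_prev - g)
        d_next = (g_next - g) / np.linalg.norm(g_next - g)
        delta_f = (k_in - k_re(g)) * (d_prev + d_next)   # ΔF = (k_in − k_re(g_i|h_i)) · (…)
        if np.linalg.norm(delta_f) < tol:
            break
        g = g + alpha * delta_f                          # g_i' = g_i + α · ΔF
    return g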




2.4   Adversarial Attack

Figure 5: Cat parade environment for robustness evaluation.

   To better evaluate the robustness of our framework, we consider attacking the framework from two aspects. First, neural networks are often criticized for not being robust to input perturbations. Thus, in the first type of attack, we add adversarial noise to the inputs of the low-level controller and the neural Lyapunov function. Second, the robustness of a system is affected not only by the upstream perception data fed to the neural network controller, but also by the actual action executed by downstream modules. That means even if the controller outputs the right action, the noise in the downstream
models can still cause undesired results. Hence, we further attack our framework with noise added to the action.
   We evaluate the robustness of our system with the simple but safety-sensitive cat parade environment in Figure 5. The adversary needs to fool our neural Lyapunov function and guide the robot to hit the cats. We assume that the adversary can access the parameters of the neural Lyapunov function and the low-level controller. Because our framework aims to work on robots with complex dynamics, it is generally challenging to model the robot's dynamics required by a gradient-based attack such as FGSM (Goodfellow, Shlens, and Szegedy 2014) and PGD (Madry et al. 2017). Hence, similar to (Zhao et al. 2020), we learn approximated surrogate dynamics. We provide the technical details in Section 3.5.

                        3    Approach
Our work incorporates a Lyapunov function into a learning-enabled control system. First, because it is challenging to reason about the behavior of an RL-trained controller, we use the Lyapunov function to characterize its stability property. The Lyapunov function gives us the ability to bind a robot to a specified region. Second, we assemble the specified ROA with a high-level planner. By considering the Lyapunov function and the planning heuristic jointly, we can generate a safe plan. When executing the plan, a safety shield guards the controlled robot. Additionally, we enhance our high-level planner with a learning-based high-level policy. This policy refines plans with the sensor data collected during runtime, which gives our high-level planner the ability to work on imperfect map data. Finally, we evaluate the robustness of our system against adversarial attacks.

3.1   Low-Level Controller
A low-level controller πl(s|g) is introduced to achieve a subgoal g. The controller is trained with TD3. We designed its reward function as

    r(st) = ||st−1 − gt−1|| − ||st − gt||

We compute the distance-to-goal at the previous time step t − 1 and the current time step t. The reward is the difference between the two distances. TD3 maximizes this reward w.r.t. the Bellman function. As a result, the trained agent is expected to get closer to the given goal in the fastest direction.
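For illustration, the reward can be computed from two consecutive states as below. This is a minimal sketch assuming NumPy arrays and a subgoal g that stays fixed between the two time steps.

import numpy as np

def subgoal_reward(s_prev, s_curr, g):
    # r(s_t) = ||s_{t-1} - g|| - ||s_t - g||: positive when the agent gets closer to the subgoal.
    return np.linalg.norm(s_prev - g) - np.linalg.norm(s_curr - g)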
3.2   Neural Lyapunov Function and ROA
We use the neural Lyapunov function to characterize the stability property. The Lyapunov function is a positive-definite function satisfying three constraints:

    V(x0) = 0                                  (1)
    ∀x ≠ x0, V(x) > 0                          (2)
    V(xt+1) − V(xt) < 0                        (3)

In Eq. (1), a Lyapunov function's value is 0 at the origin.¹ Eq. (2) is the positivity property of a Lyapunov function. Eq. (3) is known as the Lie derivative: when it is smaller than 0, V(xt) strictly decreases along with time. We compute the Lyapunov function with a neural network and train the neural network with a loss function based on the Zubov-type equation (Grune and Wurth 2000). The loss function is

    Lθ(xt, xt+1) = |Vθ(x0)| + Vθ(xt)(Vθ(xt+1) − Vθ(xt)) − ||xt||²

The Monte-Carlo estimation of L is

    Lθ = (1/N) Σ_{xt, xt+1 ∼ πl} ( |Vθ(x0)| + Vθ(xt)(Vθ(xt+1) − Vθ(xt)) − ||xt||² )

To characterize the stability of πl, xt and xt+1 are sampled with πl. We can compute the gradient of L and optimize the network's parameters θ. The optimal parameters are θ∗ = arg minθ Lθ.
   The neural Lyapunov function specifies the ROA with a constant CROA:

    ROA = {g + x | V(x) < CROA}

where g is the sink of the ROA.

¹ The Lyapunov function's input x and the state s are in the same space. The only difference is their coordinate origin. The origin of s is determined by the system. The origin of x, however, is a selected subgoal during the runtime: x = s − g.
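A minimal PyTorch-style sketch of the Monte-Carlo training loss above is given below, assuming the (xt, xt+1) pairs sampled under πl are provided as tensor batches; the network architecture and batch handling are assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class LyapunovNet(nn.Module):
    # Illustrative network; the paper does not specify the exact architecture.
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def lyapunov_loss(V, x_t, x_next):
    # Monte-Carlo estimate over a batch of (x_t, x_{t+1}) pairs sampled under π_l:
    # (1/N) Σ [ |V_θ(x_0)| + V_θ(x_t)(V_θ(x_{t+1}) − V_θ(x_t)) − ||x_t||² ], with x_0 = 0.
    v_t = V(x_t).squeeze(-1)
    v_next = V(x_next).squeeze(-1)
    v_origin = V(torch.zeros_like(x_t)).squeeze(-1)
    return (v_origin.abs() + v_t * (v_next - v_t) - (x_t ** 2).sum(dim=-1)).mean()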
3.3   Runtime Shield
The Lyapunov function specifies a single ROA in which a robot stays. However, a runtime shield binds a robot to stay around a given plan P(s0, gT) = (g0, g1, ..., gT). To deal with the sequence of subgoals, we check the ROAs of any two consecutive subgoals. Given the low-level controller πl, the previous system state si−1, the current system state si, ROAt with sink gt, and ROAt+1 with sink gt+1, Algorithm 1 shows the details of the runtime shield.

Algorithm 1: Sequential Shield
  function SHIELD(πl, si−1, si, gt, gt+1, ROAt, ROAt+1)
     if si ∈ ROAt ∨ si ∈ ROAt+1 then
         return πl(si|gt+1)
     else if si−1 ∈ ROAt ∧ si ∉ ROAt+1 then
         return πl(si|gt)            ▷ Pull back to ROAt
     else
         return πl(si|gt+1)

   When si ∈ ROAt ∨ si ∈ ROAt+1, πl is parameterized with the next subgoal gt+1, and the robot moves toward gt+1. When si−1 ∈ ROAt ∧ si ∉ ROAt+1, the robot moves from ROAt to a region other than ROAt+1. We need to pull the robot back to ROAt, thus πl is parameterized by gt. The last condition is si−1 ∈ ROAt+1 ∧ si ∉ ROAt+1, although this is not supposed to happen. If it does happen due to any imprecision, we return the action πl(si|gt+1).
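Algorithm 1 translates almost directly into code. The Python sketch below is a minimal rendering under the assumption that the ROA test is implemented with the learned Lyapunov function as V(s − g) < CROA; the function names and signatures are illustrative.

def in_roa(V, s, g, c_roa):
    # ROA membership test derived from ROA = {g + x | V(x) < C_ROA}, with x = s − g.
    return V(s - g) < c_roa

def shield(pi_l, s_prev, s_curr, g_t, g_next, V, c_roa):
    # Sequential shield (Algorithm 1): decide which subgoal conditions the low-level controller.
    if in_roa(V, s_curr, g_t, c_roa) or in_roa(V, s_curr, g_next, c_roa):
        return pi_l(s_curr, g_next)      # keep heading toward the next subgoal
    elif in_roa(V, s_prev, g_t, c_roa) and not in_roa(V, s_curr, g_next, c_roa):
        return pi_l(s_curr, g_t)         # pull back to ROA_t
    else:
        return pi_l(s_curr, g_next)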
3.4   Planner
Heuristic Our planning heuristic has two parts: heuc for efficiency and hlyap for safety. heuc is the Euclidean distance to the final goal:

    heuc(gt) = ||gT − gt||

This heuristic is designed to search for the shortest path, but it does not consider safety. Hence, it can generate paths that are close to obstacles, which may cause undesired behaviors during the runtime. To address this problem, we introduce a Lyapunov heuristic:

    hlyap(xt) = 0    if V(xt) > CROA
    hlyap(xt) = −∞   if V(xt) ≤ CROA

Suppose the closest obstacle position to gt is ot; then xt = ot − gt, and CROA is the constant specifying the ROA. This heuristic ensures that the ROAs do not intersect with obstacles.
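The composite heuristic heuc(gt) + hlyap(xt) can be sketched as follows. How the closest obstacle is queried and how the value is plugged into the A∗/RRT∗ backend are assumptions, while V and c_roa denote the learned Lyapunov function and its ROA constant.

import numpy as np

def h_euc(g, g_final):
    # Euclidean heuristic toward the final goal.
    return np.linalg.norm(g_final - g)

def h_lyap(V, x, c_roa):
    # Lyapunov heuristic: 0 if the closest obstacle lies outside the ROA, −∞ otherwise.
    return 0.0 if V(x) > c_roa else -np.inf

def planning_heuristic(g, g_final, closest_obstacle, V, c_roa):
    # Composite heuristic h_euc(g_t) + h_lyap(x_t) used to guide the planning search.
    x = closest_obstacle - g             # x_t = o_t − g_t
    return h_euc(g, g_final) + h_lyap(V, x, c_roa)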
Refinement Policy We provided an example of the refinement policy in Section 2.3. A force ∆F is applied to update the subgoal gi of a path:

    ∆F = (kin − kre(gi|hi)) · ( (gi−1 − gi)/||gi−1 − gi|| + (gi+1 − gi)/||gi+1 − gi|| )

The update is iterative,

    gi' = gi + α · ∆F

The new subgoal gi' should converge to a fixed point where kin = kre(gi'|hi). The function kre(gi'|hi) is learned with an encoder-decoder structure. First, we train an auto-encoder ENC to encode the history. Then, we concatenate the embedding ENC(hi) with the subgoal gi and train an MLP to predict kre. The training data is sampled from simulations in different scenarios.
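The structure described above can be sketched as follows. Layer sizes, the flattened history input, and the Tanh squashing to the [−1, 1] range of kre (reported in Section 4.3) are illustrative assumptions; only the overall encoder-plus-MLP layout follows the description.

import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    # Stand-in for the encoder half of the auto-encoder ENC trained on sensor histories.
    def __init__(self, history_dim, embed_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(history_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim))

    def forward(self, h):
        return self.enc(h)

class RepulsivePredictor(nn.Module):
    # MLP head that predicts k_re(g_i | h_i) from the subgoal and the encoded history.
    def __init__(self, history_dim, goal_dim=2, embed_dim=32):
        super().__init__()
        self.encoder = HistoryEncoder(history_dim, embed_dim)
        self.head = nn.Sequential(
            nn.Linear(embed_dim + goal_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())  # Tanh keeps the prediction in [-1, 1]

    def forward(self, g, h):
        z = self.encoder(h)
        return self.head(torch.cat([z, g], dim=-1)).squeeze(-1)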

3.5   Robustness to Adversarial Attack
Surrogate Dynamics The surrogate dynamics is a function

    fdyn(st, at|ht) = st+1

where st is the state and at is the action. ht is the history of states of the agent. If a system is fully observable, we can ignore ht. fdyn computes the next state of the system.

Attack Controller We define an objective function fobj that measures the distance to the planned path. Maximizing fobj(st+1) guides the robot to deviate from the plan. We considered a simple FGSM attack. When attacking the input state of the low-level controller, we can compute the attacked state ŝt with

    ŝt = st + ε · sign( ∂fobj(fdyn(st, at|ht)) / ∂st )

On the other hand, if we want to attack the action, the attacked action ât can be computed with

    ât = at + ε · sign( ∂fobj(fdyn(st, at|ht)) / ∂at )

where ε is the attack noise size. Because we can compute the gradients with respect to st and at, we can also apply PGD or other attack techniques similarly.

Attack Neural Lyapunov Function We want to fool the neural Lyapunov function with an attacked input x̂t. In this case, we expect a small V(x̂t), where V is the Lyapunov function. The attacked input x̂t can be computed with the gradient dV(xt)/dxt:

    x̂t = xt + ε · sign( dV(xt)/dxt )

Note that we optimize the attacked input x̂t of the Lyapunov function and the attacked state ŝt of the low-level controller separately. That makes the attack more potent.
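A minimal sketch of the FGSM-style state attack is shown below, assuming differentiable PyTorch modules for the surrogate dynamics fdyn, the low-level controller πl, and the objective fobj; the interfaces are assumptions, and the action attack and the Lyapunov-function attack follow the same pattern with the gradient taken with respect to at or xt.

import torch

def fgsm_attack_state(f_dyn, pi_l, f_obj, s_t, h_t, g, eps):
    # ŝ_t = s_t + ε · sign( ∂ f_obj(f_dyn(s_t, a_t | h_t)) / ∂ s_t )
    s = s_t.clone().detach().requires_grad_(True)
    a = pi_l(s, g)                 # low-level controller action at the current state
    s_next = f_dyn(s, a, h_t)      # surrogate dynamics prediction of the next state
    f_obj(s_next).backward()       # objective: distance to the planned path (to be maximized)
    return (s_t + eps * s.grad.sign()).detach()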
                  4    Primary Evaluation
In this section, we report the primary evaluation results on the motivating examples introduced in Section 2.

4.1   Low-level controller
We train the low-level controller πl(s|g) with TD3. The training results are provided in Figure 6. Each training iteration indicates an update of the low-level controller. The updates are executed periodically every 2000 simulation steps. All the reward curves in Figure 6 converge to a reward around -2.5.

Figure 6: Training reward of πl(s|g). This figure includes results of 5 runs with different random seeds.

4.2   Neural Lyapunov Function
We train the neural Lyapunov function with a dataset containing 10⁶ transitions and test it with a dataset of 2 × 10⁵ transitions. Both the training and test datasets are sampled with the same low-level controller. During test time, we check the properties in Eq. (1), Eq. (2), and Eq. (3) on the test dataset with five neural Lyapunov functions trained with different splits of the training and testing datasets.
   The statistics are in Table 1. The minimum percentage of property violations is 0%, and the maximum is 0.3%. Thus, out of around 10,000 simulations, only about 30 exhibit violations.

    Stat.    min      max        mean        std
    Vio.     0        0.00294    0.000656    0.00115

Table 1: Neural Lyapunov function training results.

4.3   Refinement Policy
We generated the sensor data and computed the accurate repulsive constant for training the repulsive prediction network kre(gi|hi) in different scenarios (i.e., the obstacles appear in different positions). 10,000 trajectories are generated, which include 9,500 training trajectories and 500 test trajectories. Each trajectory contains 5 seconds of sensor data. However, our repulsive prediction network only uses the latest 1 second of sensor data as the history hi in kre(gi|hi). The range of kre is [−1, 1], and the prediction error is around 0.01. In the toy example provided in Figure 4, we measure the average distance between the plan and the center obstacle. We want this distance to be small while the ROA does not collide with the obstacle. Before applying our refinement policy, the average distance is 51.32. After 100 runs and refinements, our refinement policy changed the average distance to 45.27 and avoided collisions between obstacles and ROAs.

4.4   Robustness to Adversarial Attack
We attacked both the state and the action with different attack frequencies and noise sizes ε. Figures 7 and 8 provide the results of attacks on the state and action, respectively. Each column is generated with 100 experiments with attacks at random times. The attack frequency ranges from 0.2 to 1.0, indicating the percentage of attacked transitions. The ∗ in the legend means that we also attack the Lyapunov function in these experiments, and the unit of ε is cm. The y-axis is the maximum deviation from our plan. When dmax > 30 cm, the robot will hit cats in the example provided in Figure 5.

Figure 7: Attack state results.

   In Figure 7, when we do not attack the Lyapunov function, dmax is always bounded below 15 cm. Hence, the protection provided by the shield works as expected. Nevertheless, when the Lyapunov function is under attack, we notice that all the dmax values are significantly larger, which means the Lyapunov function and shield can be affected by the attack. On the other hand, dmax grows as the attack frequency and ε increase. The first safety violation happens when the attack frequency is 60% and the noise size is 4.0 cm, while we also attack the Lyapunov function.

Figure 8: Attack action results.

   Figure 8 provides the attack action results. Our robot's action range is [−5, 5]. The action represents the displacement every 0.1 seconds. We set the attack noise from 0 to 4. dmax is smaller when we do not attack the Lyapunov function. Overall, dmax increases as ε and the attack frequency rise. We also observe that sometimes a larger ε results in a smaller dmax. This is because a stronger attack sometimes causes intensive calling of the shield's pull-back action. For example, when the attack frequency is 1.0 and ε = 1.0, dmax is smaller compared with ε = 0.0. Finally, we notice that the first safety violation happens when the attack frequency is 60% and ε = 4.0.

                      5    Future work
One issue we want to explore further is the scalability of our framework. A line of work has shown the scalability of neural Lyapunov functions (Gaby, Zhang, and Ye 2021; Dawson et al. 2021) and deep reinforcement learning (Pérez-Dattari et al. 2019; Fujimoto, Hoof, and Meger 2018). Hence, it is natural to further demonstrate the scalability of our framework on more complicated benchmarks (Achiam and Amodei 2019; Todorov, Erez, and Tassa 2012). On the other hand, the neural Lyapunov function we learned does not provide a verifiable guarantee (Dai et al. 2021; Chang, Roohi, and Gao 2020; Zhu et al. 2019; Bastani, Pu, and Solar-Lezama 2018) because we do not assume access to the dynamics of a system. However, in the case that the dynamics can be easily modeled, extending our work with verification tools can provide a strong guarantee on system safety. This is a direction we are working on. Regarding system robustness, we only evaluated a limited number of adversarial attacks. It would be interesting to see how our framework performs under various attack techniques (Weng et al. 2020; Mankowitz et al. 2020; Zhang et al. 2020) while also considering attacks on the high-level planner (Xiang et al. 2018). Moreover, the current framework only considers a single agent and static obstacles. A challenging but fruitful direction is to extend our framework to multiagent planning and control (Guestrin, Koller, and Parr 2001; Nissim, Brafman, and Domshlak 2010). Lastly, our high-level planner is closely related to safe neural motion planning (Huang et al. 2021; Qureshi et al. 2019). We propose to investigate these works further and better refine our high-level planner.
                        References
Achiam, J.; and Amodei, D. 2019. Benchmarking Safe Exploration in Deep Reinforcement Learning. arXiv.
Bastani, O.; Pu, Y.; and Solar-Lezama, A. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR, abs/1805.08328.
Berkenkamp, F.; Moriconi, R.; Schoellig, A. P.; and Krause, A. 2016. Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes. arXiv e-prints, arXiv:1603.04915.
Chang, Y.-C.; Roohi, N.; and Gao, S. 2020. Neural Lyapunov control. arXiv preprint arXiv:2005.00611.
Chen, T.; Liu, J.; Xiang, Y.; Niu, W.; Tong, E.; and Han, Z. 2019. Adversarial attack and defense in reinforcement learning-from AI security view. Cybersecurity, 2(1): 1–22.
Dai, H.; Landry, B.; Yang, L.; Pavone, M.; and Tedrake, R. 2021. Lyapunov-stable neural-network control. arXiv preprint arXiv:2109.14152.
Dawson, C.; Qin, Z.; Gao, S.; and Fan, C. 2021. Safe Nonlinear Control Using Robust Neural Lyapunov-Barrier Functions. arXiv preprint arXiv:2109.06697.
Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 1587–1596. PMLR.
Gaby, N.; Zhang, F.; and Ye, X. 2021. Lyapunov-Net: A Deep Neural Network Architecture for Lyapunov Function Approximation. arXiv e-prints, arXiv:2109.13359.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and Harnessing Adversarial Examples. arXiv e-prints, arXiv:1412.6572.
Grune, L.; and Wurth, F. 2000. Computing control Lyapunov functions via a Zubov type algorithm. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No.00CH37187), volume 3, 2129–2134.
Guestrin, C.; Koller, D.; and Parr, R. 2001. Multiagent Planning with Factored MDPs. In NIPS, volume 1, 1523–1530.
Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1972. Correction to "A Formal Basis for the Heuristic Determination of Minimum Cost Paths". SIGART Bull., 28–29.
Huang, X.; Feng, M.; Jasour, A.; Rosman, G.; and Williams, B. 2021. Risk Conditioned Neural Motion Planning. arXiv e-prints, arXiv:2108.01851.
Kreidieh, A. R.; Berseth, G.; Trabucco, B.; Parajuli, S.; Levine, S.; and Bayen, A. M. 2019. Inter-Level Cooperation in Hierarchical Reinforcement Learning. arXiv preprint arXiv:1912.02368.
Levy, A.; Jr., R. P.; and Saenko, K. 2017. Hierarchical Actor-Critic. CoRR, abs/1712.00948.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Mankowitz, D. J.; Levine, N.; Jeong, R.; Abdolmaleki, A.; Springenberg, J. T.; Shi, Y.; Kay, J.; Hester, T.; Mann, T.; and Riedmiller, M. 2020. Robust Reinforcement Learning for Continuous Control with Model Misspecification. In International Conference on Learning Representations.
Nachum, O.; Gu, S.; Lee, H.; and Levine, S. 2019. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning. In International Conference on Learning Representations.
Nissim, R.; Brafman, R. I.; and Domshlak, C. 2010. A general, fully distributed multi-agent planning algorithm. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 1323–1330.
Pérez-Dattari, R.; Celemin, C.; Ruiz-del-Solar, J.; and Kober, J. 2019. Continuous Control for High-Dimensional State Spaces: An Interactive Learning Approach. arXiv e-prints, arXiv:1908.05256.
Quinlan, S.; and Khatib, O. 1993. Elastic bands: connecting path planning and control. In Proceedings IEEE International Conference on Robotics and Automation, volume 2, 802–807.
Qureshi, A. H.; Simeonov, A.; Bency, M. J.; and Yip, M. C. 2019. Motion planning networks. In 2019 International Conference on Robotics and Automation (ICRA), 2118–2124. IEEE.
Sun, Y.; Huang, X.; Kroening, D.; Sharp, J.; Hill, M.; and Ashmore, R. 2018. Testing deep neural networks. arXiv preprint arXiv:1803.04792.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033.
Urmson, C.; and Simmons, R. 2003. Approaches for heuristically biasing RRT growth. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453), volume 2, 1178–1183.
Weng, T.-W.; Dvijotham, K. D.; Uesato, J.; Xiao, K.; Gowal, S.; Stanforth, R.; and Kohli, P. 2020. Toward Evaluating Robustness of Deep Reinforcement Learning with Continuous Control. In International Conference on Learning Representations.
Xiang, Y.; Niu, W.; Liu, J.; Chen, T.; and Han, Z. 2018. A PCA-Based Model to Predict Adversarial Examples on Q-Learning of Path Finding. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), 773–780.
Zhang, H.; Chen, H.; Xiao, C.; Li, B.; Boning, D. S.; and Hsieh, C.-J. 2020. Robust deep reinforcement learning against adversarial perturbations on observations. ICLR.
Zhao, Y.; Shumailov, I.; Cui, H.; Gao, X.; Mullins, R.; and Anderson, R. 2020. Blackbox attacks on reinforcement learning agents using approximated temporal information. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 16–24. IEEE.
Zhu, H.; Xiong, Z.; Magill, S.; and Jagannathan, S. 2019. An inductive synthesis framework for verifiable reinforcement learning. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 686–701.