          HiSaRL: A Hierarchical Framework for Safe Reinforcement Learning

                                   Zikang Xiong, Ishika Agarwal, Suresh Jagannathan
                                       Computer Science Department, Purdue University,
                                                 West Lafayette, Indiana 47906
                               xiong84@cs.purdue.edu, agarwali@purdue.edu, suresh@cs.purdue.edu

       Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



                             Abstract

  We propose a two-level hierarchical framework for safe reinforcement learning in complex environments. The high-level part is an adaptive planner, which aims at learning and generating safe and efficient paths for tasks with imperfect map information. The low-level part contains a learning-based controller and its corresponding neural Lyapunov function, which characterizes the controller's stability property. This learned neural Lyapunov function serves two purposes. First, it is part of the high-level heuristic for our planning algorithm. Second, it acts as part of a runtime shield that guards the safety of the whole system. We use a robot navigation example to demonstrate that our framework can operate efficiently and safely in complex environments, even under adversarial attacks.

                       1    Introduction
Although deep reinforcement learning has achieved promising results in various domains, ensuring its safety is still a concern. One line of work (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019; Chang, Roohi, and Gao 2020; Dai et al. 2021) provides rigorous safety guarantees by using analytic approaches. These proposals generally require knowledge of a system's underlying dynamics and constraints, making it challenging to generalize them to complex dynamics. On the other hand, hierarchical reinforcement learning algorithms (Levy, Jr., and Saenko 2017; Nachum et al. 2019; Kreidieh et al. 2019) are attractive because they can support complex tasks without requiring knowledge of the underlying environment structure. However, this lack of knowledge and the non-stationary MDP problem make these algorithms time-intensive without providing any assurance of safety after training.
   In contrast, our framework assumes, in the specific planning setting we consider, that our learning algorithm can access the map information of an environment. Such information can be modeled a priori with a high-definition map or collected during runtime using techniques such as SLAM. With map information, a high-level planner can generate a safe and efficient path. This planner frees our agents from unsafe and inefficient exploration when generating a high-level navigation policy. However, a safe plan cannot guarantee safety at runtime. Because the low-level controller is not required to follow the plan perfectly, agents can deviate from a safe plan and produce unsafe behaviors, such as hitting obstacles. To solve this problem, we apply a learned neural Lyapunov function (Berkenkamp et al. 2016; Chang, Roohi, and Gao 2020) as a runtime monitor and harness its stability property to enforce that the agent stays in a region specified by the function. We also incorporate the neural Lyapunov function as part of our planner heuristic. This incorporation fuses our high-level planner and the low-level controller seamlessly. Moreover, our learned neural Lyapunov function does not require any knowledge about system dynamics. Hence, compared with (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019), our approach can extend to complex unknown dynamics.
   The backend algorithms of our planner are A∗ (Hart, Nilsson, and Raphael 1972) and RRT∗ (Urmson and Simmons 2003). When we have accurate map information, both A∗ and RRT∗ work correctly. However, if the map information is inaccurate or outdated, the backend algorithm is unlikely to work as expected. Hence, we further strengthen our planner with a refinement policy. The refinement policy's mission is to repair and refine planning decisions during runtime. Inspired by the elastic band algorithm (Quinlan and Khatib 1993), we propose a learning-based approach to generate the refinement policy.
   It is well known that most deep learning algorithms suffer from robustness issues. Adversarial attacks (Sun et al. 2018; Mankowitz et al. 2020; Chen et al. 2019; Zhang et al. 2020) can effectively test the robustness of a learning-enabled system. Hence, we formulate an attack mechanism against our framework that forces an agent (i.e., a robot) to deviate outside the region specified by the Lyapunov function. Our results show that unless we apply an unrealistically high attack frequency and force significant perturbation, our framework is robust and can keep the robot within a safe region.
   The contributions of this work can be summarized as follows:
 • We propose a hierarchical framework that reconciles both efficiency and safety. Our framework can enhance the safety of an agent in complex environments, where the dynamics of the controlled robots are unknown and the environment information (i.e., map and obstacle information) can be imperfect.
 • We consider an approach to adapt the high-level planner to deal with imperfect information.
 • We demonstrate the robustness of our framework under adversarial attack.
Figure 1: Sweeping robot in room example. We need to control the sweeping robot to reach the charger without hitting cats and walls. The green lines are the planned path.

                  2    Motivation Example
2.1   Hierarchical Framework
The simple navigation task in Figure 1 requires navigating the sweeping robot from the initial position s0 to the goal position gT (the charger). We solve this problem with a two-level hierarchical framework. First, a high-level planner finds a safe plan (i.e., no collision with walls and cats) from the initial position s0 to the goal position gT. The plan is a sequence of subgoals shown as the green line in Figure 1. We denote it as P(s0, gT) = (g0, g1, ..., gT), where gi is a subgoal and g0 = s0. Then, a low-level controller πl(s|g) is introduced to execute the plan. The low-level controller is conditioned on a subgoal g, and it is trained to predict the optimal action under state s to reach g. The low-level controller is trained with TD3 (Fujimoto, Hoof, and Meger 2018) using a reward for achieving the given subgoal g along the shortest path.

2.2   Runtime Safety
Although the high-level planner can provide a safe plan, the low-level controller does not necessarily follow the plan strictly. Especially when we train the low-level controller with a model-free reinforcement learning algorithm, it is quite common that the agent finds an unexpected approach to achieve the goal. Once the unstable low-level controller leads our robot to deviate from our safe plan, we cannot guarantee the safety of our robot.

Stability of Low-level Controller We measure the stability of a low-level controller with the deviation between the actual trajectory and the plan. In Figure 2, the robot's position at time t is post, and the deviation dt is defined as the Euclidean distance from post to the plan fragment between two subgoals.

Figure 2: A demonstration of how the low-level controller executes a plan. The star is a subgoal, and the green line is the plan. The low-level controller may drift from the plan and generate the dotted path. Although our plan avoids hitting the cat, the low-level controller still cannot ensure safety during the runtime.

Lyapunov Function A large deviation dt can cause unsafe behaviors. Hence, we want to constrain the robot around our plan. The Lyapunov function is a positive-definite function for analyzing the stability of a system. A Lyapunov function V(x) characterizes a control system's Region Of Attraction (ROA). The ROA binds the robot to stay around the plan. We introduce the technical details of the neural Lyapunov function in Section 3.2.

Figure 3: The way to apply the Lyapunov function. We build a sequence of ROAs around the subgoals. One ROA is a region that the robot will stay in while it is heading to the corresponding subgoal (i.e., the sink of the ROA). Each ROA has intersections with its neighbors.

Runtime Shield With the plan and the ROAs built by the Lyapunov function, we can construct a runtime shield to guard the safety of our robot. Unlike previous methods (Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019), we do not require an additional safe policy for recovery. Instead, we construct the runtime shield by switching the robot's heading toward different subgoals. Specifically, we select two consecutive subgoals during the runtime. The first one is the latest subgoal achieved; the second one is the next subgoal. Then, we select the first subgoal's ROA to monitor our robot. While the robot stays in the selected ROA, we set its subgoal to the second subgoal that we have not achieved yet. When the robot goes beyond the selected ROA, we check whether it is "captured" by the next ROA (i.e., the robot enters the intersection of the two ROAs, see Figure 3). If the robot is captured, we change the selected ROA to the next ROA until the robot achieves the second subgoal. Nevertheless, if the robot is not captured by
the next ROA and slides out of the selected ROA, we set its subgoal to the first subgoal to pull it back. The runtime shield binds a robot to stay around its plan. We can consider the ROA while planning and generate a plan with safety boundaries. Consequently, the runtime safety of our system is guarded. A more formal description is provided in Section 3.3.

2.3   Planner
The backend algorithms of our planner are A∗ and RRT∗. Both of them require a heuristic to guide an efficient search. The most straightforward heuristic might be the Euclidean distance. However, this heuristic cannot guarantee that the generated path is safe for our system. For example, a planned path may get close to a wall, which does not leave enough space for the shield to switch the subgoal for avoidance. Thus, we have to consider the Lyapunov function as part of the heuristic. We further consider scenarios where we do not have perfect map data. In this case, we need to collect sensor data during the runtime and refine our plan. We achieve the refinement by employing a high-level reinforcement learning policy. This refinement policy enables our system to operate safely with imperfect map data.

Figure 4: (a) Robot sensors; (b) refinement in the U-maze environment. (a) demonstrates the sensors on the robot: the LiDAR scans 16 positions around the robot, which generates an observation vector with 16 distance elements. (b) is an example for the refinement policy: the dashed region does not exist in the real world, but the map used for planning marks it as an obstacle.

Heuristic A simple Euclidean distance heuristic is the L2 distance from the current subgoal position to the final goal position, heuc(gt) = ||gT − gt||. This heuristic guides the search toward the final goal. For the U-maze example in Figure 4(b), the path generated by heuc(gt) will stick to the wall in the lower part because this path is closer to the final goal. However, such a path does not provide any space as a safety distance. Thus, we also need to consider a Lyapunov heuristic hlyap(xt). Given xt, the relative position from the sink of ROA gt to the closest obstacle, we define hlyap(xt) to characterize the safety. If the closest obstacle is located outside the ROA, hlyap(xt) returns 0. Otherwise, hlyap(xt) returns −∞ to disable the search on this path. We compose the Euclidean heuristic for efficiency and the Lyapunov heuristic for safety, and guide the planning search with heuc(gt) + hlyap(xt).

Refinement Policy Naturally, the map may change after the data is collected. Hence, a plan made on an outdated map may cause undesired results. Thus, we consider a repair schema during runtime. The key idea is the elastic band (Quinlan and Khatib 1993), which optimizes a planning "string" with an internal and a repulsive force. The internal force, formally, can be described as

    Fin = kin · ( (gi−1 − gi)/||gi−1 − gi|| + (gi+1 − gi)/||gi+1 − gi|| )

where kin is the elasticity coefficient. The internal force for subgoal gi is computed as the sum of the two direction vectors toward its neighbors. In addition to the internal force, a repulsive force is applied to prevent the ROA from intersecting with obstacles. When we have perfect map data, we can compute the repulsive force easily. However, it becomes a problem when the map data is imperfect. In this case, we only have the raw sensor data during runtime. To address this problem, we learn a function kre(gi|hi) to predict the repulsive coefficient, given the subgoal gi and a sequence of sensor data history hi. Finally, the applied force ∆F for subgoal gi is

    ∆F = (kin − kre(gi|hi)) · ( (gi−1 − gi)/||gi−1 − gi|| + (gi+1 − gi)/||gi+1 − gi|| )

We can iteratively update gi with gi' = gi + α∆F to find the equilibrium where ∆F = 0. Here α is a small constant. An example is provided in Figure 4(b): the blue gi is updated to the yellow position by iteratively applying ∆F.
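To make this update concrete, the following Python sketch applies the force ∆F iteratively to a single subgoal. It is illustrative only: the function name, tolerance, and step size are assumptions, and k_re stands in for the learned predictor kre(gi|hi) with the history hi already bound.

import numpy as np

def refine_subgoal(g_prev, g, g_next, k_in, k_re, alpha=0.01, tol=1e-4, max_iter=1000):
    # Iteratively apply the force ΔF until it (approximately) vanishes, i.e. until the
    # subgoal reaches the equilibrium where k_in = k_re(g_i | h_i).
    for _ in range(max_iter):
        d_prev = (g_prev - g) / np.linalg.norm(g_prev - g)
        d_next = (g_next - g) / np.linalg.norm(g_next - g)
        delta_f = (k_in - k_re(g)) * (d_prev + d_next)   # ΔF = (k_in − k_re(g_i|h_i)) · (…)
        if np.linalg.norm(delta_f) < tol:
            break
        g = g + alpha * delta_f                          # g_i' = g_i + α · ΔF
    return g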




2.4   Adversarial Attack

Figure 5: Cat parade environment for robustness evaluation.

   To better evaluate the robustness of our framework, we consider attacking the framework from two aspects. First, neural networks are often criticized for not being robust to input perturbations. Thus, in the first type of attack, we add adversarial noise to the inputs of the low-level controller and the neural Lyapunov function. Second, the robustness of a system is affected not only by the upstream perception data fed to the neural network controller, but also by the actual action executed by downstream modules. That means even if the controller outputs the right action, the noise in the downstream
models can still cause undesired results. Hence, we further attack our framework with noise added to the action.
   We evaluate the robustness of our system with the simple but safety-sensitive cat parade environment in Figure 5. The adversary needs to fool our neural Lyapunov function and guide the robot to hit the cats. We assume that the adversary can access the parameters of the neural Lyapunov function and the low-level controller. Because our framework aims to work on robots with complex dynamics, it is generally challenging to model the robot's dynamics required by a gradient-based attack such as FGSM (Goodfellow, Shlens, and Szegedy 2014) and PGD (Madry et al. 2017). Hence, similar to (Zhao et al. 2020), we learn approximated surrogate dynamics. We provide the technical details in Section 3.5.

                        3    Approach
Our work incorporates a Lyapunov function into a learning-enabled control system. First, because it is challenging to reason about the behavior of an RL-trained controller, we use the Lyapunov function to characterize its stability property. The Lyapunov function gives us the ability to bind a robot to a specified region. Second, we assemble the specified ROA with a high-level planner. By considering the Lyapunov function and the planning heuristic jointly, we can generate a safe plan. When executing the plan, a safety shield guards the controlled robot. Additionally, we enhance our high-level planner with a learning-based high-level policy. This policy refines plans with the sensor data collected during runtime, which gives our high-level planner the ability to work on imperfect map data. Finally, we evaluate the robustness of our system against adversarial attacks.

3.1   Low-Level Controller
A low-level controller πl(s|g) is introduced to achieve a subgoal g. The controller is trained with TD3. We designed its reward function as

    r(st) = ||st−1 − gt−1|| − ||st − gt||

We compute the distance-to-goal at the previous time step t − 1 and the current time step t. The reward is the difference between the two distances. TD3 maximizes this reward w.r.t. the Bellman function. As a result, the trained agent is expected to get closer to the given goal in the fastest direction.
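For illustration, the reward can be computed from two consecutive states as below. This is a minimal sketch assuming NumPy arrays and a subgoal g that stays fixed between the two time steps.

import numpy as np

def subgoal_reward(s_prev, s_curr, g):
    # r(s_t) = ||s_{t-1} - g|| - ||s_t - g||: positive when the agent gets closer to the subgoal.
    return np.linalg.norm(s_prev - g) - np.linalg.norm(s_curr - g)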
3.2   Neural Lyapunov Function and ROA
We use the neural Lyapunov function to characterize the stability property. The Lyapunov function is a positive-definite function satisfying three constraints:

    V(x0) = 0                                  (1)
    ∀x ≠ x0, V(x) > 0                          (2)
    V(xt+1) − V(xt) < 0                        (3)

In Eq. (1), a Lyapunov function's value is 0 at the origin.¹ Eq. (2) is the positivity property of a Lyapunov function. Eq. (3) is known as the Lie derivative: when it is smaller than 0, V(xt) strictly decreases along with time. We compute the Lyapunov function with a neural network and train the neural network with a loss function based on the Zubov-type equation (Grune and Wurth 2000). The loss function is

    Lθ(xt, xt+1) = |Vθ(x0)| + Vθ(xt)(Vθ(xt+1) − Vθ(xt)) − ||xt||²

The Monte-Carlo estimation of L is

    Lθ = (1/N) Σ_{xt, xt+1 ∼ πl} ( |Vθ(x0)| + Vθ(xt)(Vθ(xt+1) − Vθ(xt)) − ||xt||² )

To characterize the stability of πl, xt and xt+1 are sampled with πl. We can compute the gradient of L and optimize the network's parameters θ. The optimal parameters are θ∗ = arg minθ Lθ.
   The neural Lyapunov function specifies the ROA with a constant CROA:

    ROA = {g + x | V(x) < CROA}

where g is the sink of the ROA.

¹ The Lyapunov function's input x and the state s are in the same space. The only difference is their coordinate origin. The origin of s is determined by the system. The origin of x, however, is a selected subgoal during the runtime: x = s − g.
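A minimal PyTorch-style sketch of the Monte-Carlo training loss above is given below, assuming the (xt, xt+1) pairs sampled under πl are provided as tensor batches; the network architecture and batch handling are assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class LyapunovNet(nn.Module):
    # Illustrative network; the paper does not specify the exact architecture.
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def lyapunov_loss(V, x_t, x_next):
    # Monte-Carlo estimate over a batch of (x_t, x_{t+1}) pairs sampled under π_l:
    # (1/N) Σ [ |V_θ(x_0)| + V_θ(x_t)(V_θ(x_{t+1}) − V_θ(x_t)) − ||x_t||² ], with x_0 = 0.
    v_t = V(x_t).squeeze(-1)
    v_next = V(x_next).squeeze(-1)
    v_origin = V(torch.zeros_like(x_t)).squeeze(-1)
    return (v_origin.abs() + v_t * (v_next - v_t) - (x_t ** 2).sum(dim=-1)).mean()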
3.3   Runtime Shield
The Lyapunov function specifies a single ROA in which a robot stays. However, a runtime shield binds a robot to stay around a given plan P(s0, gT) = (g0, g1, ..., gT). To deal with the sequence of subgoals, we check the ROAs of any two consecutive subgoals. Given the low-level controller πl, the previous system state si−1, the current system state si, ROAt with sink gt, and ROAt+1 with sink gt+1, Algorithm 1 shows the details of the runtime shield.

Algorithm 1: Sequential Shield
  function SHIELD(πl, si−1, si, gt, gt+1, ROAt, ROAt+1)
     if si ∈ ROAt ∨ si ∈ ROAt+1 then
         return πl(si|gt+1)
     else if si−1 ∈ ROAt ∧ si ∉ ROAt+1 then
         return πl(si|gt)            ▷ Pull back to ROAt
     else
         return πl(si|gt+1)

   When si ∈ ROAt ∨ si ∈ ROAt+1, πl is parameterized with the next subgoal gt+1, and the robot moves toward gt+1. When si−1 ∈ ROAt ∧ si ∉ ROAt+1, the robot moves from ROAt to a region other than ROAt+1. We need to pull the robot back to ROAt, thus πl is parameterized by gt. The last condition is si−1 ∈ ROAt+1 ∧ si ∉ ROAt+1, although this is not supposed to happen. If it does happen due to any imprecision, we return the action πl(si|gt+1).
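Algorithm 1 translates almost directly into code. The Python sketch below is a minimal rendering under the assumption that the ROA test is implemented with the learned Lyapunov function as V(s − g) < CROA; the function names and signatures are illustrative.

def in_roa(V, s, g, c_roa):
    # ROA membership test derived from ROA = {g + x | V(x) < C_ROA}, with x = s − g.
    return V(s - g) < c_roa

def shield(pi_l, s_prev, s_curr, g_t, g_next, V, c_roa):
    # Sequential shield (Algorithm 1): decide which subgoal conditions the low-level controller.
    if in_roa(V, s_curr, g_t, c_roa) or in_roa(V, s_curr, g_next, c_roa):
        return pi_l(s_curr, g_next)      # keep heading toward the next subgoal
    elif in_roa(V, s_prev, g_t, c_roa) and not in_roa(V, s_curr, g_next, c_roa):
        return pi_l(s_curr, g_t)         # pull back to ROA_t
    else:
        return pi_l(s_curr, g_next)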
3.4   Planner
Heuristic Our planning heuristic has two parts: heuc for efficiency and hlyap for safety. heuc is the Euclidean distance to the final goal:

    heuc(gt) = ||gT − gt||

This heuristic is designed to search for the shortest path, but it does not consider safety. Hence, it can generate paths that are close to obstacles, which may cause undesired behaviors during the runtime. To address this problem, we introduce a Lyapunov heuristic:

    hlyap(xt) = 0    if V(xt) > CROA
    hlyap(xt) = −∞   if V(xt) ≤ CROA

Suppose the closest obstacle position to gt is ot; then xt = ot − gt, and CROA is the constant specifying the ROA. This heuristic ensures that the ROAs do not intersect with obstacles.
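The composite heuristic heuc(gt) + hlyap(xt) can be sketched as follows. How the closest obstacle is queried and how the value is plugged into the A∗/RRT∗ backend are assumptions, while V and c_roa denote the learned Lyapunov function and its ROA constant.

import numpy as np

def h_euc(g, g_final):
    # Euclidean heuristic toward the final goal.
    return np.linalg.norm(g_final - g)

def h_lyap(V, x, c_roa):
    # Lyapunov heuristic: 0 if the closest obstacle lies outside the ROA, −∞ otherwise.
    return 0.0 if V(x) > c_roa else -np.inf

def planning_heuristic(g, g_final, closest_obstacle, V, c_roa):
    # Composite heuristic h_euc(g_t) + h_lyap(x_t) used to guide the planning search.
    x = closest_obstacle - g             # x_t = o_t − g_t
    return h_euc(g, g_final) + h_lyap(V, x, c_roa)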
Refinement Policy We provided an example of the refinement policy in Section 2.3. A force ∆F is applied to update the subgoal gi of a path:

    ∆F = (kin − kre(gi|hi)) · ( (gi−1 − gi)/||gi−1 − gi|| + (gi+1 − gi)/||gi+1 − gi|| )

The update is iterative,

    gi' = gi + α · ∆F

The new subgoal gi' should converge to a fixed point where kin = kre(gi'|hi). The function kre(gi'|hi) is learned with an encoder-decoder structure. First, we train an auto-encoder ENC to encode the history. Then, we concatenate the embedding ENC(hi) with the subgoal gi and train an MLP to predict kre. The training data is sampled from simulations in different scenarios.
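The structure described above can be sketched as follows. Layer sizes, the flattened history input, and the Tanh squashing to the [−1, 1] range of kre (reported in Section 4.3) are illustrative assumptions; only the overall encoder-plus-MLP layout follows the description.

import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    # Stand-in for the encoder half of the auto-encoder ENC trained on sensor histories.
    def __init__(self, history_dim, embed_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(history_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim))

    def forward(self, h):
        return self.enc(h)

class RepulsivePredictor(nn.Module):
    # MLP head that predicts k_re(g_i | h_i) from the subgoal and the encoded history.
    def __init__(self, history_dim, goal_dim=2, embed_dim=32):
        super().__init__()
        self.encoder = HistoryEncoder(history_dim, embed_dim)
        self.head = nn.Sequential(
            nn.Linear(embed_dim + goal_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())  # Tanh keeps the prediction in [-1, 1]

    def forward(self, g, h):
        z = self.encoder(h)
        return self.head(torch.cat([z, g], dim=-1)).squeeze(-1)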

3.5   Robustness to Adversarial Attack
Surrogate Dynamics The surrogate dynamics is a function

    fdyn(st, at|ht) = st+1

where st is the state and at is the action. ht is the history of states of the agent. If a system is fully observable, we can ignore ht. fdyn computes the next state of the system.

Attack Controller We define an objective function fobj that measures the distance to the planned path. Maximizing fobj(st+1) guides the robot to deviate from the plan. We considered a simple FGSM attack. When attacking the input state of the low-level controller, we can compute the attacked state ŝt with

    ŝt = st + ε · sign( ∂fobj(fdyn(st, at|ht)) / ∂st )

On the other hand, if we want to attack the action, the attacked action ât can be computed with

    ât = at + ε · sign( ∂fobj(fdyn(st, at|ht)) / ∂at )

where ε is the attack noise size. Because we can compute the gradients with respect to st and at, we can also apply PGD or other attack techniques similarly.

Attack Neural Lyapunov Function We want to fool the neural Lyapunov function with an attacked input x̂t. In this case, we expect a small V(x̂t), where V is the Lyapunov function. The attacked input x̂t can be computed with the gradient dV(xt)/dxt:

    x̂t = xt + ε · sign( dV(xt)/dxt )

Note that we optimize the attacked input x̂t of the Lyapunov function and the attacked state ŝt of the low-level controller separately. That makes the attack more potent.
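A minimal sketch of the FGSM-style state attack is shown below, assuming differentiable PyTorch modules for the surrogate dynamics fdyn, the low-level controller πl, and the objective fobj; the interfaces are assumptions, and the action attack and the Lyapunov-function attack follow the same pattern with the gradient taken with respect to at or xt.

import torch

def fgsm_attack_state(f_dyn, pi_l, f_obj, s_t, h_t, g, eps):
    # ŝ_t = s_t + ε · sign( ∂ f_obj(f_dyn(s_t, a_t | h_t)) / ∂ s_t )
    s = s_t.clone().detach().requires_grad_(True)
    a = pi_l(s, g)                 # low-level controller action at the current state
    s_next = f_dyn(s, a, h_t)      # surrogate dynamics prediction of the next state
    f_obj(s_next).backward()       # objective: distance to the planned path (to be maximized)
    return (s_t + eps * s.grad.sign()).detach()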
                  4    Primary Evaluation
In this section, we report the primary evaluation results on the motivating examples introduced in Section 2.

4.1   Low-level controller
We train the low-level controller πl(s|g) with TD3. The training results are provided in Figure 6. Each training iteration indicates an update of the low-level controller. The updates are executed periodically every 2000 simulation steps. All the reward curves in Figure 6 converge to a reward around -2.5.

Figure 6: Training reward of πl(s|g). This figure includes results of 5 runs with different random seeds.

4.2   Neural Lyapunov Function
We train the neural Lyapunov function with a dataset containing 10⁶ transitions and test it with a dataset of 2 × 10⁵ transitions. Both the training and test datasets are sampled with the same low-level controller. During test time, we check the properties in Eq. (1), Eq. (2), and Eq. (3) on the test dataset with five neural Lyapunov functions trained with different splits of the training and testing datasets.
   The statistics are in Table 1. The minimum percentage of property violations is 0%, and the maximum is 0.3%. Thus, out of around 10,000 simulations, only about 30 exhibit violations.

    Stat.    min      max        mean        std
    Vio.     0        0.00294    0.000656    0.00115

Table 1: Neural Lyapunov function training results.

4.3   Refinement Policy
We generated the sensor data and computed the accurate repulsive constant for training the repulsive prediction network kre(gi|hi) in different scenarios (i.e., the obstacles appear in different positions). 10,000 trajectories are generated, which include 9,500 training trajectories and 500 test trajectories. Each trajectory contains 5 seconds of sensor data. However, our repulsive prediction network only uses the latest 1 second of sensor data as the history hi in kre(gi|hi). The range of kre is [−1, 1], and the prediction error is around 0.01. In the toy example provided in Figure 4, we measure the average distance between the plan and the center obstacle. We want this distance to be small while the ROA does not collide with the obstacle. Before applying our refinement policy, the average distance is 51.32. After 100 runs and refinements, our refinement policy changed the average distance to 45.27 and avoided collisions between obstacles and ROAs.

4.4   Robustness to Adversarial Attack
We attacked both the state and the action with different attack frequencies and noise sizes ε. Figures 7 and 8 provide the results of attacks on the state and action, respectively. Each column is generated with 100 experiments with attacks at random times. The attack frequency ranges from 0.2 to 1.0, indicating the percentage of attacked transitions. The ∗ in the legend means that we also attack the Lyapunov function in these experiments, and the unit of ε is cm. The y-axis is the maximum deviation from our plan. When dmax > 30 cm, the robot will hit cats in the example provided in Figure 5.

Figure 7: Attack state results.

   In Figure 7, when we do not attack the Lyapunov function, dmax is always bounded below 15 cm. Hence, the protection provided by the shield works as expected. Nevertheless, when the Lyapunov function is under attack, we notice that all the dmax values are significantly larger, which means the Lyapunov function and shield can be affected by the attack. On the other hand, dmax grows as the attack frequency and ε increase. The first safety violation happens when the attack frequency is 60% and the noise size is 4.0 cm, while we also attack the Lyapunov function.

Figure 8: Attack action results.

   Figure 8 provides the attack action results. Our robot's action range is [−5, 5]. The action represents the displacement every 0.1 seconds. We set the attack noise from 0 to 4. dmax is smaller when we do not attack the Lyapunov function. Overall, dmax increases as ε and the attack frequency rise. We also observe that sometimes a larger ε results in a smaller dmax. This is because a stronger attack sometimes causes intensive calling of the shield's pull-back action. For example, when the attack frequency is 1.0 and ε = 1.0, dmax is smaller compared with ε = 0.0. Finally, we notice that the first safety violation happens when the attack frequency is 60% and ε = 4.0.

                      5    Future work
One issue we want to explore further is the scalability of our framework. A line of work has shown the scalability of neural Lyapunov functions (Gaby, Zhang, and Ye 2021; Dawson et al. 2021) and deep reinforcement learning (Pérez-Dattari et al. 2019; Fujimoto, Hoof, and Meger 2018). Hence, it is natural to further demonstrate the scalability of our framework on more complicated benchmarks (Achiam and Amodei 2019; Todorov, Erez, and Tassa 2012). On the other hand, the neural Lyapunov function we learned does not provide a verifiable guarantee (Dai et al. 2021; Chang, Roohi, and Gao 2020; Zhu et al. 2019; Bastani, Pu, and Solar-Lezama 2018) because we do not assume access to the dynamics of a system. However, in the case that the dynamics can be easily modeled, extending our work with verification tools can provide a strong guarantee on system safety. This is a direction we are working on. Regarding system robustness, we only evaluated a limited number of adversarial attacks. It would be interesting to see how our framework performs under various attack techniques (Weng et al. 2020; Mankowitz et al. 2020; Zhang et al. 2020) while also considering attacks on the high-level planner (Xiang et al. 2018). Moreover, the current framework only considers a single agent and static obstacles. A challenging but fruitful direction is to extend our framework to multiagent planning and control (Guestrin, Koller, and Parr 2001; Nissim, Brafman, and Domshlak 2010). Lastly, our high-level planner is closely related to safe neural motion planning (Huang et al. 2021; Qureshi et al. 2019). We propose to investigate these works further and better refine our high-level planner.
                        References
Achiam, J.; and Amodei, D. 2019. Benchmarking Safe Exploration in Deep Reinforcement Learning. arXiv.
Bastani, O.; Pu, Y.; and Solar-Lezama, A. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR, abs/1805.08328.
Berkenkamp, F.; Moriconi, R.; Schoellig, A. P.; and Krause, A. 2016. Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes. arXiv e-prints, arXiv:1603.04915.
Chang, Y.-C.; Roohi, N.; and Gao, S. 2020. Neural Lyapunov control. arXiv preprint arXiv:2005.00611.
Chen, T.; Liu, J.; Xiang, Y.; Niu, W.; Tong, E.; and Han, Z. 2019. Adversarial attack and defense in reinforcement learning-from AI security view. Cybersecurity, 2(1): 1–22.
Dai, H.; Landry, B.; Yang, L.; Pavone, M.; and Tedrake, R. 2021. Lyapunov-stable neural-network control. arXiv preprint arXiv:2109.14152.
Dawson, C.; Qin, Z.; Gao, S.; and Fan, C. 2021. Safe Nonlinear Control Using Robust Neural Lyapunov-Barrier Functions. arXiv preprint arXiv:2109.06697.
Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 1587–1596. PMLR.
Gaby, N.; Zhang, F.; and Ye, X. 2021. Lyapunov-Net: A Deep Neural Network Architecture for Lyapunov Function Approximation. arXiv e-prints, arXiv:2109.13359.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and Harnessing Adversarial Examples. arXiv e-prints, arXiv:1412.6572.
Grune, L.; and Wurth, F. 2000. Computing control Lyapunov functions via a Zubov type algorithm. In Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No.00CH37187), volume 3, 2129–2134.
Guestrin, C.; Koller, D.; and Parr, R. 2001. Multiagent Planning with Factored MDPs. In NIPS, volume 1, 1523–1530.
Hart, P. E.; Nilsson, N. J.; and Raphael, B. 1972. Correction to "A Formal Basis for the Heuristic Determination of Minimum Cost Paths". SIGART Bull., 28–29.
Huang, X.; Feng, M.; Jasour, A.; Rosman, G.; and Williams, B. 2021. Risk Conditioned Neural Motion Planning. arXiv e-prints, arXiv:2108.01851.
Kreidieh, A. R.; Berseth, G.; Trabucco, B.; Parajuli, S.; Levine, S.; and Bayen, A. M. 2019. Inter-Level Cooperation in Hierarchical Reinforcement Learning. arXiv preprint arXiv:1912.02368.
Levy, A.; Jr., R. P.; and Saenko, K. 2017. Hierarchical Actor-Critic. CoRR, abs/1712.00948.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Mankowitz, D. J.; Levine, N.; Jeong, R.; Abdolmaleki, A.; Springenberg, J. T.; Shi, Y.; Kay, J.; Hester, T.; Mann, T.; and Riedmiller, M. 2020. Robust Reinforcement Learning for Continuous Control with Model Misspecification. In International Conference on Learning Representations.
Nachum, O.; Gu, S.; Lee, H.; and Levine, S. 2019. Near-Optimal Representation Learning for Hierarchical Reinforcement Learning. In International Conference on Learning Representations.
Nissim, R.; Brafman, R. I.; and Domshlak, C. 2010. A general, fully distributed multi-agent planning algorithm. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 1323–1330.
Pérez-Dattari, R.; Celemin, C.; Ruiz-del-Solar, J.; and Kober, J. 2019. Continuous Control for High-Dimensional State Spaces: An Interactive Learning Approach. arXiv e-prints, arXiv:1908.05256.
Quinlan, S.; and Khatib, O. 1993. Elastic bands: connecting path planning and control. In Proceedings IEEE International Conference on Robotics and Automation, volume 2, 802–807.
Qureshi, A. H.; Simeonov, A.; Bency, M. J.; and Yip, M. C. 2019. Motion planning networks. In 2019 International Conference on Robotics and Automation (ICRA), 2118–2124. IEEE.
Sun, Y.; Huang, X.; Kroening, D.; Sharp, J.; Hill, M.; and Ashmore, R. 2018. Testing deep neural networks. arXiv preprint arXiv:1803.04792.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033.
Urmson, C.; and Simmons, R. 2003. Approaches for heuristically biasing RRT growth. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453), volume 2, 1178–1183.
Weng, T.-W.; Dvijotham, K. D.; Uesato, J.; Xiao, K.; Gowal, S.; Stanforth, R.; and Kohli, P. 2020. Toward Evaluating Robustness of Deep Reinforcement Learning with Continuous Control. In International Conference on Learning Representations.
Xiang, Y.; Niu, W.; Liu, J.; Chen, T.; and Han, Z. 2018. A PCA-Based Model to Predict Adversarial Examples on Q-Learning of Path Finding. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), 773–780.
Zhang, H.; Chen, H.; Xiao, C.; Li, B.; Boning, D. S.; and Hsieh, C.-J. 2020. Robust deep reinforcement learning against adversarial perturbations on observations. ICLR.
Zhao, Y.; Shumailov, I.; Cui, H.; Gao, X.; Mullins, R.; and Anderson, R. 2020. Blackbox attacks on reinforcement learning agents using approximated temporal information. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 16–24. IEEE.
Zhu, H.; Xiong, Z.; Magill, S.; and Jagannathan, S. 2019. An inductive synthesis framework for verifiable reinforcement learning. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 686–701.