<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HiSaRL: A Hierarchical Framework for Safe Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zikang Xiong</string-name>
          <email>xiong84@cs.purdue.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ishika Agarwal</string-name>
          <email>agarwali@purdue.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suresh Jagannathan</string-name>
          <email>suresh@cs.purdue.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Purdue University</institution>
          ,
          <addr-line>West Lafayette, Indiana 47906</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a two-level hierarchical framework for safe reinforcement learning in a complex environment. The high-level part is an adaptive planner, which aims at learning and generating safe and efficient paths for tasks with imperfect map information. The lower-level part contains a learning-based controller and its corresponding neural Lyapunov function, which characterizes the controller's stability property. This learned neural Lyapunov function serves two purposes. First, it will be part of the high-level heuristic for our planning algorithm. Second, it acts as a part of a runtime shield to guard the safety of the whole system. We use a robot navigation example to demonstrate that our framework can operate efficiently and safely in complex environments, even under adversarial attacks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Although deep reinforcement learning has achieved
promising results in various domains, ensuring its safety remains a
concern. One line of work
        <xref ref-type="bibr" rid="ref2 ref35 ref4 ref6">(Bastani, Pu, and Solar-Lezama
2018; Zhu et al. 2019; Chang, Roohi, and Gao 2020; Dai
et al. 2021)</xref>
        provides rigorous safety guarantees by using
analytic approaches. These proposals generally require
knowledge of a system’s underlying dynamics and constraints,
making it challenging to generalize them to
handle complex dynamics. On the other hand, hierarchical
reinforcement learning algorithms
        <xref ref-type="bibr" rid="ref16 ref17 ref20">(Levy, Jr., and Saenko 2017;
Nachum et al. 2019; Kreidieh et al. 2019)</xref>
        are attractive
because they can support complex tasks without requiring
knowledge of an underlying environment structure.
However, this lack of knowledge and the non-stationary MDP
problem make these algorithms time-intensive to train, and
they provide no assurance of safety after training.
      </p>
      <p>
        In contrast, our framework assumes, in the specific
planning setting we consider, that our learning algorithm can
access the map information of the environment. Such
information can be modeled a priori with a high-definition map or
collected during runtime using techniques such as SLAM.
With map information, a high-level planner can generate a
safe and efficient path. This planner frees our agents from
unsafe and inefficient exploration for generating a high-level
navigation policy. However, a safe plan cannot guarantee
safety at runtime. Because the low-level controller is not
required to follow the plan perfectly, agents can deviate from
a safe plan and produce unsafe behaviors, such as hitting
obstacles. To solve this problem, we apply a learned neural
Lyapunov function
        <xref ref-type="bibr" rid="ref3 ref4">(Berkenkamp et al. 2016; Chang, Roohi,
and Gao 2020)</xref>
        as a runtime monitor and harness its stability
property to enforce that the agent stays in a region specified
by the function. We also incorporate the neural Lyapunov
function as part of our planner heuristic. This
incorporation fuses our high-level planner and the low-level controller
seamlessly. Moreover, our learned neural Lyapunov
function does not require any knowledge about system
dynamics. Hence, compared with
        <xref ref-type="bibr" rid="ref2 ref35">(Bastani, Pu, and Solar-Lezama
2018; Zhu et al. 2019)</xref>
        , our approach can extend to complex
unknown dynamics.
      </p>
      <p>
        The backend algorithms of our planner are A*
        <xref ref-type="bibr" rid="ref14">(Hart,
Nilsson, and Raphael 1972)</xref>
        and RRT
        <xref ref-type="bibr" rid="ref28">(Urmson and Simmons
2003)</xref>
        . When we have accurate map information, both A*
and RRT work correctly. However, if map information is
inaccurate or outdated, our backend algorithm is unlikely to
work as expected. Hence, we further strengthen our planner
with a refinement policy, whose role is to repair and refine
planning decisions during runtime.
Inspired by the elastic band algorithm
        <xref ref-type="bibr" rid="ref24">(Quinlan and Khatib
1993)</xref>
        , we propose a learning-based approach to generate the
refinement policy.
      </p>
      <p>
        It is well-known that most deep learning algorithms suffer
from robustness issues. Adversarial attacks
        <xref ref-type="bibr" rid="ref19 ref26 ref32 ref5">(Sun et al. 2018;
Mankowitz et al. 2020; Chen et al. 2019; Zhang et al. 2020)</xref>
        can effectively test the robustness of a learning-enabled
system. Hence, we formulate an attack mechanism against our
framework to force an agent (i.e., a robot) to deviate outside
the region specified by the Lyapunov function. Our results
show that unless we apply an unrealistically high attack
frequency and a significant perturbation, our framework is
robust and can keep the robot within a safe region.
      </p>
      <p>The contributions of this work can be summarized as
follows:</p>
      <p>• We propose a hierarchical framework that reconciles
efficiency and safety. Our framework can enhance
the safety of an agent in complex environments where
the dynamics of the controlled robots are unknown and
the environment information (i.e., map and obstacle
information) can be imperfect.</p>
      <p>• We consider an approach to adapt the high-level planner
to deal with imperfect information.</p>
      <p>• We demonstrate the robustness of our framework under
adversarial attack.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Motivation Example</title>
      <sec>
        <title>2.1 Hierarchical Framework</title>
        <p>
          The simple navigation task in Figure 1 requires
navigating the sweeping robot from the initial position $s_0$ to the
goal position $g_T$ (the charger). We solve this problem with a
two-level hierarchical framework. First, a high-level planner
finds a safe plan (i.e., no collision with walls and cats) from
the initial position $s_0$ to the goal position $g_T$. The plan is a
sequence of subgoals shown as the green line in Figure 1.
We denote it as $P(s_0, g_T) = (g_0, g_1, \ldots, g_T)$, where $g_i$ is
a subgoal and $g_0 = s_0$. Then, a low-level controller $\pi_l(s|g)$
is introduced to execute the plan. The low-level controller
is conditioned on a subgoal $g$, and it is trained to predict
the optimal action in state $s$ to reach $g$. The low-level
controller is trained with TD3
          <xref ref-type="bibr" rid="ref8">(Fujimoto, Hoof, and Meger
2018)</xref>
          using a reward for achieving the given subgoal $g$ with
the shortest path.
        </p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2 Runtime Safety</title>
        <p>Although the high-level planner can provide a safe plan,
the low-level controller does not necessarily follow the plan
strictly. Especially when we train the low-level controller
with a model-free reinforcement learning algorithm, it is
quite common that the agent finds an unexpected approach
to achieve the goal. Once an unstable low-level controller
leads our robot to deviate from our safe plan, we cannot
guarantee the safety of our robot.</p>
        <p><bold>Stability of Low-level Controller.</bold> We measure the
stability of a low-level controller by the deviation between the
actual trajectory and the plan. In Figure 2, the robot’s
position at time $t$ is $\mathrm{pos}_t$, and the deviation $d_t$ is defined as the
Euclidean distance from $\mathrm{pos}_t$ to the plan fragment between
two subgoals.</p>
        <p>[Figure 2: the deviation $d_t$ between the robot’s position and the plan fragment between two subgoals.]</p>
        <p><bold>Lyapunov Function.</bold> A large deviation $d_t$ can cause
unsafe behaviors. Hence, we want to constrain the robot to stay close to
our plan. A Lyapunov function is a positive-definite
function used to analyze the stability of a system. A Lyapunov
function $V(x)$ characterizes a control system’s Region Of
Attraction (ROA). The ROA binds the robot to stay around
the plan. We introduce the technical details of the
neural Lyapunov function in Section 3.2.</p>
        <p>[Figure 3: ROAs around successive subgoals; the robot is “captured” when it enters the next ROA.]</p>
        <p>
          <bold>Runtime Shield.</bold> With the plan and the ROAs built by the
Lyapunov function, we can construct a runtime shield to guard
the safety of our robot. Unlike previous methods
          <xref ref-type="bibr" rid="ref2 ref35">(Bastani, Pu, and Solar-Lezama 2018; Zhu et al. 2019)</xref>
          , we do
not require an additional safe policy for recovery.
Instead, we construct the runtime shield by switching
the robot’s heading toward different subgoals. Specifically,
we select two consecutive subgoals during runtime. The
first is the most recently achieved subgoal; the second is
the next subgoal. We then select the first subgoal’s ROA
to monitor our robot. While the robot stays in the selected
ROA, we set its subgoal to the second subgoal, which it has
not achieved yet. When the robot goes beyond the selected
ROA, we check whether it is “captured” by the next ROA
(i.e., the robot enters the intersection of the two ROAs; see
Figure 3). If the robot is captured, we change the
selected ROA to the next ROA until the robot achieves the
second subgoal. If, however, the robot is not captured by
the next ROA and slides out of the selected ROA, we set
its subgoal to the first subgoal to pull it back. The runtime
shield thus binds the robot to stay around its plan. Because we can take
the ROA into account while planning, we can generate a plan with safety
boundaries. Consequently, the runtime safety of our system
is guarded. A more formal description is provided in
Section 3.3.
        </p>
        <p>2.3</p>
        <p>The backend algorithms of our planner are A* and RRT. Both
require a heuristic to guide the search efficiently. The
most straightforward heuristic is the Euclidean
distance. However, this heuristic cannot guarantee that the
generated path is safe for our system. For example, a planned
path may get close to a wall, which does not leave enough
space for the shield to switch the subgoal for avoidance.
Thus, we have to consider the Lyapunov function as part
of the heuristic. We further consider scenarios in which we do
not have perfect map data. In this case, we need to
collect sensor data during runtime and refine our plan.
We achieve the refinement by employing a high-level
reinforcement learning policy. This refinement policy enables
our system to operate safely with imperfect map data.</p>
        <p>[Figure 4: (a) Robot sensors; (b) Refinement in U-maze environment.]</p>
        <p><bold>Heuristic.</bold> A simple Euclidean distance heuristic is the L2
distance from the current subgoal position to the final goal
position, $h_{euc}(g_t) = \|g_T - g_t\|$. This heuristic guides the
search toward the final goal. For the U-maze example in
Figure 4(b), the path generated by $h_{euc}(g_t)$ sticks to the
wall in the lower part because this path is closer to the
final goal. However, such a path leaves no space
as a safety distance. Thus, we also need to consider a
Lyapunov heuristic $h_{lyap}$. Given $x_t$, the position of the
closest obstacle relative to the sink $g_t$ of the ROA, we define
$h_{lyap}(x_t)$ to characterize safety. If the closest
obstacle is located outside the ROA, $h_{lyap}(x_t)$ returns 0.
Otherwise, $h_{lyap}(x_t)$ returns $\infty$ to disable the search on this
path. We combine the Euclidean heuristic for efficiency and
the Lyapunov heuristic for safety, and guide the planning
search with $h_{euc}(g_t) + h_{lyap}(x_t)$.</p>
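        <p>As an illustration only, the combined heuristic can be sketched in Python as follows. The quadratic <monospace>lyapunov_value</monospace> is a hypothetical stand-in for the learned neural Lyapunov function, and the constant <monospace>C_ROA = 1.0</monospace>, the obstacle list, and all function names are assumptions made for this sketch.</p>
        <preformat><![CDATA[
```python
import math

C_ROA = 1.0  # assumed level-set constant that defines the ROA


def lyapunov_value(x):
    """Hypothetical stand-in for the learned V(x): a simple quadratic."""
    return x[0] ** 2 + x[1] ** 2


def h_euc(subgoal, goal):
    """Euclidean distance from a candidate subgoal to the final goal."""
    return math.dist(goal, subgoal)


def h_lyap(subgoal, obstacles):
    """Return 0 if the closest obstacle lies outside the ROA, else infinity."""
    if not obstacles:
        return 0.0
    closest = min(obstacles, key=lambda o: math.dist(o, subgoal))
    # x_t: position of the closest obstacle relative to the sink of the ROA
    x_t = (closest[0] - subgoal[0], closest[1] - subgoal[1])
    return 0.0 if lyapunov_value(x_t) > C_ROA else math.inf


def heuristic(subgoal, goal, obstacles):
    """Combined heuristic: efficiency term plus safety term."""
    return h_euc(subgoal, goal) + h_lyap(subgoal, obstacles)


# Obstacle far from the subgoal: only the Euclidean term contributes.
print(heuristic((0.0, 0.0), (3.0, 4.0), [(5.0, 5.0)]))
# Obstacle inside the ROA of the subgoal: the search on this path is disabled.
print(heuristic((0.0, 0.0), (3.0, 4.0), [(0.5, 0.0)]))
```
]]></preformat>
        <p>Returning infinity for an unsafe subgoal means a best-first planner never expands it, so the safety term acts as a hard constraint while the Euclidean term still orders the remaining candidates.</p>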
        <p>
          <bold>Refinement Policy.</bold> Naturally, the map may change after
the data is collected. Hence, a plan made on an outdated
map may cause undesired results. Thus, we consider a
repair scheme during runtime. The key idea is the elastic
band
          <xref ref-type="bibr" rid="ref24">(Quinlan and Khatib 1993)</xref>
          , which optimizes a planning
“string” with an internal force and a repulsive force. The internal
force can be formally described as
        </p>
        <p><disp-formula><tex-math>F_{in} = k_{in} \left( \frac{g_{i-1} - g_i}{\|g_{i-1} - g_i\|} + \frac{g_{i+1} - g_i}{\|g_{i+1} - g_i\|} \right)</tex-math></disp-formula>
where $k_{in}$ is the elasticity coefficient. The internal force for
subgoal $g_i$ is computed as the sum of the two direction
vectors toward its neighbors. In addition to the internal force,
a repulsive force is applied to prevent the ROA from
intersecting with obstacles. When we have perfect map data, we
can compute the repulsive force easily. However, it can be
a problem when the map data is imperfect. In this case, we
only have the raw sensor data during runtime. To address
this problem, we learn a function $k_{re}(g_i|h_i)$ to predict the
repulsive coefficient, given the subgoal $g_i$ and a sequence
of sensor data history $h_i$. Finally, the applied force $f$ for
subgoal $g_i$ is</p>
        <p>To better evaluate the robustness of our framework, we
consider attacking the framework from two aspects. First,
neural networks are known to be sensitive to
input perturbations. Thus, in the first type of attack, we add
adversarial noise to the inputs of the low-level controller and
the neural Lyapunov function. Second, the robustness of a system
is affected not only by the upstream perception data fed to
the neural network controller, but also by the actual action
executed by downstream modules. That is, even if the
controller outputs the right action, noise in the downstream
modules can still cause undesired results. Hence, we further
attack our framework with noise added to the action.</p>
        <p>
          We evaluate the robustness of our system with the simple
but safety-sensitive cat parade environment in Figure 5. The
adversary needs to fool our neural Lyapunov function and
guide the robot to hit the cats. We assume that the adversary can access
the parameters of the neural Lyapunov function and the
low-level controller. Because our framework aims to work on
robots with complex dynamics, it is generally challenging to
model the robot’s dynamics required by a gradient-based
attack such as FGSM
          <xref ref-type="bibr" rid="ref10">(Goodfellow, Shlens, and Szegedy 2014)</xref>
          and PGD
          <xref ref-type="bibr" rid="ref18">(Madry et al. 2017)</xref>
          . Hence, similar to
          <xref ref-type="bibr" rid="ref33">(Zhao et al.
2020)</xref>
          , we learn approximated surrogate dynamics. We
provide the technical details in Section 3.5.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Approach</title>
      <p>Our work incorporates a Lyapunov function into a
learning-enabled control system. First, because it is challenging to
reason about the behavior of an RL-trained controller, we use
the Lyapunov function to characterize its stability property.
The Lyapunov function lets us bind a
robot to a specified region. Second, we combine the
specified ROA with a high-level planner. By considering the
Lyapunov function and the planning heuristic jointly, we can
generate a safe plan. When executing the plan, a safety
shield guards the controlled robot. Additionally, we enhance
our high-level planner with a learning-based high-level
policy. This policy refines plans with the sensor data collected
during runtime, which gives our high-level planner the
ability to work with imperfect map data. Finally, we evaluate the
robustness of our system against adversarial attacks.</p>
      <sec id="sec-3-1">
        <title>3.1 Low-Level Controller</title>
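        <p>A low-level controller $\pi_l(s|g)$ is introduced to achieve a
subgoal $g$. The controller is trained with TD3. We design its
reward function as</p>
        <p><disp-formula><tex-math>r(s_t) = \|s_{t-1} - g_{t-1}\| - \|s_t - g_t\|</tex-math></disp-formula>
We compute the distance to the goal at the previous time step
$t-1$ and at the current time step $t$. The reward is the
difference between the two distances, so it is positive when the
agent moves closer to the goal. TD3 maximizes the return of this
reward through Bellman updates. As a result, the trained
agent is expected to approach the given goal along the most direct
route.</p>
        <p><bold>3.2 Neural Lyapunov Function and ROA</bold></p>
        <p>We use the neural Lyapunov function to characterize the
stability property. The Lyapunov function is a positive-definite
function satisfying three constraints:</p>
        <p><disp-formula><label>(1)</label><tex-math>V(x_0) = 0</tex-math></disp-formula>
<disp-formula><label>(2)</label><tex-math>\forall x \neq x_0, \; V(x) &gt; 0</tex-math></disp-formula>
<disp-formula><label>(3)</label><tex-math>V(x_{t+1}) - V(x_t) &lt; 0</tex-math></disp-formula></p>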
        <p><sup>1</sup> The Lyapunov function’s input $x$ and the state $s$ are in the same
space. The only difference is their coordinate origin. The origin
of $s$ is determined by the system, whereas the origin of $x$ is a
subgoal selected during runtime, i.e., $x = s - g$.</p>
        <p>
          Eq. (3) is known as the Lie derivative. When the Lie derivative
is smaller than 0, $V(x_t)$ strictly decreases over time.
We compute the Lyapunov function with a neural network
and train the neural network with a loss function based on
the Zubov-type equation
          <xref ref-type="bibr" rid="ref11">(Grune and Wurth 2000)</xref>
          . The loss
function is
        </p>
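        <p><disp-formula><tex-math>L_\theta(x_t, x_{t+1}) = |V(x_0)| + V(x_t)\bigl(V(x_{t+1}) - V(x_t)\bigr) - \|x_t\|^2</tex-math></disp-formula>
The Monte-Carlo estimation of $L_\theta$ is
<disp-formula><tex-math>L_\theta = \frac{1}{N} \sum_{x_t, x_{t+1} \sim \pi_l} \Bigl( |V(x_0)| + V(x_t)\bigl(V(x_{t+1}) - V(x_t)\bigr) - \|x_t\|^2 \Bigr)</tex-math></disp-formula>
To characterize the stability of $\pi_l$, $x_t$ and $x_{t+1}$ are sampled
with $\pi_l$. We can compute the gradient of $L_\theta$ and
optimize the network’s parameters $\theta$. The optimal parameters
are $\theta^* = \arg\min_\theta L_\theta$.</p>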
        <p>The neural Lyapunov function specifies the ROA with a
constant $C_{ROA}$:</p>
        <p><disp-formula><tex-math>ROA = \{ g + x \mid V(x) &lt; C_{ROA} \}</tex-math></disp-formula>
where $g$ is the sink of the ROA.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3 Runtime Shield</title>
        <p>The Lyapunov function specifies a single ROA in which a robot
stays. However, a runtime shield must bind a robot to stay
around a given plan $P(s_0, g_T) = (g_0, g_1, \ldots, g_T)$. To deal
with the sequence of subgoals, we check the ROAs of any
two consecutive subgoals. Given the low-level controller
$\pi_l$, the previous system state $s_{i-1}$, the current system state $s_i$, $ROA_t$
with sink $g_t$, and $ROA_{t+1}$ with sink $g_{t+1}$, Algorithm 1
shows the details of the runtime shield.</p>
        <p>Algorithm 1: Sequential Shield</p>
        <preformat>function SHIELD(π_l, s_{i−1}, s_i, g_t, g_{t+1}, ROA_t, ROA_{t+1})
    if s_i ∈ ROA_t ∨ s_i ∈ ROA_{t+1} then
        return π_l(s_i | g_{t+1})
    else if s_{i−1} ∈ ROA_t ∧ s_i ∉ ROA_{t+1} then
        return π_l(s_i | g_t)          ▷ Pull back to ROA_t
    else
        return π_l(s_i | g_{t+1})</preformat>
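        <p>Algorithm 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions: <monospace>toy_value</monospace> is a quadratic stand-in for the learned neural Lyapunov function, <monospace>toy_policy</monospace> simply returns the displacement toward the subgoal in place of the trained controller, and the ROA constant and all names are hypothetical.</p>
        <preformat><![CDATA[
```python
def in_roa(state, sink, lyapunov_value, c_roa):
    """ROA membership test: V(x) < C_ROA with x = s - g."""
    x = tuple(s - g for s, g in zip(state, sink))
    return lyapunov_value(x) < c_roa


def shield(policy, s_prev, s_cur, g_t, g_next, lyapunov_value, c_roa):
    """Choose which subgoal conditions the low-level controller."""
    cur_in_next = in_roa(s_cur, g_next, lyapunov_value, c_roa)
    if in_roa(s_cur, g_t, lyapunov_value, c_roa) or cur_in_next:
        return policy(s_cur, g_next)
    if in_roa(s_prev, g_t, lyapunov_value, c_roa) and not cur_in_next:
        return policy(s_cur, g_t)  # pull back to ROA_t
    return policy(s_cur, g_next)


def toy_value(x):
    """Assumed quadratic stand-in for the neural Lyapunov function."""
    return sum(v * v for v in x)


def toy_policy(state, goal):
    """Assumed stand-in controller: displacement toward the subgoal."""
    return tuple(g - s for s, g in zip(state, goal))


g_t, g_next = (0.0, 0.0), (1.5, 0.0)
# Inside ROA_t: keep heading to the next subgoal.
print(shield(toy_policy, (0.4, 0.0), (0.5, 0.0), g_t, g_next, toy_value, 1.0))
# Left ROA_t without being captured by ROA_{t+1}: head back to g_t.
print(shield(toy_policy, (0.5, 0.0), (0.0, 2.0), g_t, g_next, toy_value, 1.0))
```
]]></preformat>
        <p>The second branch repeats the test on the next ROA only to mirror Algorithm 1; once the first branch has failed, that test is already known to be false.</p>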
        <p>When $s_i \in ROA_t \lor s_i \in ROA_{t+1}$, $\pi_l$ is
parameterized with the next subgoal $g_{t+1}$, and the robot moves toward
$g_{t+1}$. When $s_{i-1} \in ROA_t \land s_i \notin ROA_{t+1}$, the robot has moved
from $ROA_t$ to a region other than $ROA_{t+1}$. We need to
pull the robot back to $ROA_t$; thus $\pi_l$ is parameterized by
$g_t$. The last condition is $s_{i-1} \in ROA_{t+1} \land s_i \notin ROA_{t+1}$,
which is not supposed to happen. If it does happen due to
any imprecision, we return the action $\pi_l(s_i|g_{t+1})$.</p>
        <p>Recall that the origin of the state $s$ is determined by the system,
whereas the origin of the Lyapunov input $x$ is a subgoal selected
during runtime, i.e., $x = s - g$.</p>
        <p><bold>Heuristic.</bold> Our planning heuristic has two parts:
$h_{euc}$ for efficiency and $h_{lyap}$ for safety. $h_{euc}$ is the Euclidean
distance to the final goal.</p>
        <p><disp-formula><tex-math>h_{euc}(g_t) = \|g_T - g_t\|</tex-math></disp-formula>
This heuristic is designed to search for the shortest path, but
does not consider safety. Hence, it can generate paths that
are close to obstacles, which may cause undesired behaviors
during runtime. To address this problem, we introduce a
Lyapunov heuristic.</p>
        <p><disp-formula><tex-math><![CDATA[h_{lyap}(x_t) = \begin{cases} 0 & V(x_t) > C_{ROA} \\ \infty & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
Suppose the closest obstacle to $g_t$ is at position $o_t$; then $x_t = o_t - g_t$,
and $C_{ROA}$ is the constant that specifies the ROA. This heuristic
ensures that the ROAs do not intersect with obstacles.</p>
        <p><bold>Refinement Policy.</bold> We provided an example of the
refinement policy in Section 2.3. A force $F$ is applied to
update the subgoal $g_i$ of a path.
The new subgoal $g_i'$ should converge to a fixed point where
$k_{in} = k_{re}(g_i'|h_i)$. The function $k_{re}(g_i'|h_i)$ is learned using an
encoder-decoder structure. First, we train an auto-encoder
$ENC$ to encode the history. Then, we concatenate the
embedding $ENC(h_i)$ with the subgoal $g_i$ and train an MLP to
predict $k_{re}$. The training data is sampled from
simulations in different scenarios.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.5 Robustness to Adversarial Attack</title>
        <p><bold>Surrogate Dynamics.</bold> The surrogate dynamics is a
function
<disp-formula><tex-math>f_{dyn}(s_t, a_t | h_t) = s_{t+1}</tex-math></disp-formula>
where $s_t$ is the state, $a_t$ is the action, and $h_t$ is the
history of the agent’s states. If a system is fully observable, we
can ignore $h_t$. The function $f_{dyn}$ computes the next state of the
system.</p>
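        <p><bold>Attack Controller.</bold> We define an objective function $f_{obj}$
that measures the distance to the planned path.
Maximizing $f_{obj}(s_{t+1})$ guides the robot to deviate from the plan.
We consider a simple FGSM attack. When attacking the
input state of the low-level controller, we compute the
attacked state $\hat{s}_t$ with
<disp-formula><tex-math>\hat{s}_t = s_t + \epsilon \, \mathrm{sign}\left( \frac{\partial f_{obj}(s_{t+1})}{\partial s_t} \right)</tex-math></disp-formula>
On the other hand, if we want to attack the action, the
attacked action $\hat{a}_t$ can be computed with
<disp-formula><tex-math>\hat{a}_t = a_t + \epsilon \, \mathrm{sign}\left( \frac{\partial f_{obj}(s_{t+1})}{\partial a_t} \right)</tex-math></disp-formula>
where $\epsilon$ is the attack noise size. Because we can compute
the gradient with respect to $s_t$ and $a_t$, we can also apply PGD or other
attack techniques similarly.</p>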
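        <p>The action attack can be illustrated with a short Python sketch. It assumes a toy surrogate dynamics in which the action is a displacement added to the state, uses the distance to a single plan segment as $f_{obj}$, and replaces back-propagation through a learned surrogate model with a central finite-difference gradient. These choices, together with all function names and the tolerance used in <monospace>sign</monospace>, are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import math


def distance_to_segment(p, a, b):
    """f_obj: Euclidean distance from point p to the plan segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0
    else:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def surrogate_dynamics(state, action):
    """Assumed toy dynamics: the action is a displacement of the state."""
    return (state[0] + action[0], state[1] + action[1])


def sign(value, tol=1e-8):
    """Sign with a small dead zone to absorb finite-difference noise."""
    if value > tol:
        return 1
    if value < -tol:
        return -1
    return 0


def numerical_gradient(f, x, h=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def fgsm_attack_action(state, action, plan_segment, eps):
    """One FGSM step on the action that increases f_obj(s_{t+1})."""
    def objective(a):
        return distance_to_segment(surrogate_dynamics(state, a), *plan_segment)

    grad = numerical_gradient(objective, action)
    return tuple(a + eps * sign(g) for a, g in zip(action, grad))


segment = ((0.0, 0.0), (10.0, 0.0))
attacked = fgsm_attack_action((0.0, 0.1), (1.0, 0.0), segment, eps=0.5)
print(attacked)
```
]]></preformat>
        <p>Attacking the state follows the same pattern, except that the objective is evaluated through the controller, so the perturbed state first changes the action and only then the next state.</p>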
        <p><bold>Attack Neural Lyapunov Function.</bold> We want to fool the
neural Lyapunov function with an attacked input $\hat{x}_t$. In this
case, we expect a small $V(\hat{x}_t)$, where $V$ is the Lyapunov
function. The attacked input $\hat{x}_t$ can be computed with the
gradient $\frac{dV(x_t)}{dx_t}$:
<disp-formula><tex-math>\hat{x}_t = x_t + \epsilon \, \mathrm{sign}\left( \frac{dV(x_t)}{dx_t} \right)</tex-math></disp-formula>
Note that we optimize the attacked input $\hat{x}_t$ of the Lyapunov
function and the attacked state $\hat{s}_t$ of the low-level controller
separately, which makes the attack more potent.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Primary Evaluation</title>
      <p>In this section, we report the primary evaluation results on
our motivating examples introduced in Section 2.</p>
      <sec id="sec-4-1">
        <title>4.1 Low-level controller</title>
        <p>We train the low-level controller $\pi_l(s|g)$ with TD3. The
training results are provided in Figure 6. Each training
iteration corresponds to an update of the low-level controller. The
updates are executed every 2000 simulation
steps. All the reward curves in Figure 6 converge to a reward
of around -2.5.</p>
        <p>We train the neural Lyapunov function with a dataset
containing $10^6$ transitions and test it with a
test dataset of $2 \times 10^5$ transitions. Both the training and
test datasets are sampled with the same low-level controller.
At test time, we check the properties in Eq. (1),
Eq. (2), and Eq. (3) on the test dataset with five neural
Lyapunov functions trained with different splits of the training
and test data.</p>
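        <p>A check of this kind can be sketched in Python as follows. The sketch counts the fraction of sampled transitions that violate the positivity property of Eq. (2) or the decrease property of Eq. (3), and tests Eq. (1) at the origin separately. The quadratic <monospace>toy_value</monospace> and the four hand-written transitions are assumptions standing in for the learned function and the sampled test dataset.</p>
        <preformat><![CDATA[
```python
def violation_rate(transitions, lyapunov_value):
    """Fraction of (x_t, x_next) pairs violating Eq. (2) or Eq. (3)."""
    violations = 0
    for x_t, x_next in transitions:
        positive = lyapunov_value(x_t) > 0                            # Eq. (2)
        decreasing = lyapunov_value(x_next) - lyapunov_value(x_t) < 0  # Eq. (3)
        if not (positive and decreasing):
            violations += 1
    return violations / len(transitions)


def toy_value(x):
    """Assumed quadratic stand-in for the neural Lyapunov function."""
    return sum(v * v for v in x)


transitions = [
    ((1.0, 0.0), (0.5, 0.0)),  # moves toward the origin
    ((1.0, 1.0), (0.5, 0.5)),  # moves toward the origin
    ((0.5, 0.0), (1.0, 0.0)),  # moves away: violates Eq. (3)
    ((2.0, 0.0), (1.0, 0.0)),  # moves toward the origin
]

print(toy_value((0.0, 0.0)) == 0)                # Eq. (1) at the origin
print(violation_rate(transitions, toy_value))    # share of violating transitions
```
]]></preformat>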
        <p>The statistics are in Table 1. The minimum percentage of
property violations is 0%, and the maximum is 0.3%. Thus,
out of around 10,000 simulations, at most about 30
violate the properties.</p>
        <p>We generated the sensor data and computed the accurate
repulsive constant for training the repulsive prediction network
$k_{re}(g_i|h_i)$ in different scenarios (i.e., with obstacles appearing
in different positions). We generated 10,000 trajectories,
comprising 9,500 training trajectories and 500 test
trajectories. Each trajectory contains 5 seconds of sensor data.
However, our repulsive prediction network uses only the latest
1 second of sensor data as the history $h_i$ in $k_{re}(g_i|h_i)$. The
range of $k_{re}$ is $[-1, 1]$, and the prediction error is around
0.01. In our toy example in Figure 4, we measure
the average distance between the plan and the center obstacle. We
want this distance to be small, while the ROA should not collide
with the obstacle. Before applying our refinement policy, the
average distance is 51.32. After 100 runs and refinements,
our refinement policy changed the average distance to 45.27
and avoided collisions between obstacles and ROAs.</p>
        <p>We attacked both the state and the action with different attack
frequencies and noise sizes $\epsilon$. Figures 7 and 8 provide the
results of attacks on the state and the action, respectively. Each
column is generated from 100 experiments with attacks at
random times. The attack frequency ranges from 0.2 to 1.0,
indicating the fraction of attacked transitions. The
legend marks the experiments in which we also attack the Lyapunov function,
and the unit of $\epsilon$ is cm. The y-axis is
the maximum deviation $d_{max}$ from our plan. When $d_{max} &gt; 30$ cm, the
robot hits the cats in the example provided in Figure 5.</p>
        <p>[Figure 7: attack state results; x-axis: attack frequency.]</p>
        <p>In Figure 7, when we do not attack the Lyapunov
function, $d_{max}$ is always bounded below 15 cm. Hence, the
protection provided by the shield works as expected.
Nevertheless, when the Lyapunov function is under attack,
$d_{max}$ is significantly larger in all cases, which means the
Lyapunov function and the shield can be affected by the attack.
In addition, $d_{max}$ grows as the attack frequency
and $\epsilon$ increase. The first safety violation happens when the
attack frequency is 60% and the noise size is 4.0 cm, while
we also attack the Lyapunov function.</p>
        <p>Figure 8 provides the results of attacking the action. Our robot’s
action range is $[-5, 5]$. The action represents the
displacement every 0.1 seconds. We set the attack noise from 0 to 4.
$d_{max}$ is smaller when we do not attack the Lyapunov
function. Overall, $d_{max}$ increases as $\epsilon$ and the attack
frequency increase. We also observe that a larger $\epsilon$ sometimes
results in a smaller $d_{max}$. This is because a stronger attack
can trigger the shield’s pullback action more often.
For example, when the attack frequency is 1.0,
$d_{max}$ is smaller for $\epsilon = 1.0$ than for
$\epsilon = 0.0$. Finally, we notice that the first safety violation
happens when the attack frequency is 60% and $\epsilon = 4.0$.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Future work</title>
      <p>
        One issue we want to explore further is the scalability of
our framework. A line of work has shown
the scalability
        <xref ref-type="bibr" rid="ref7 ref9">(Gaby, Zhang, and Ye 2021; Dawson et al.
2021)</xref>
        of the neural Lyapunov function and deep
reinforcement learning
        <xref ref-type="bibr" rid="ref23 ref8">(Pérez-Dattari et al. 2019; Fujimoto, Hoof,
and Meger 2018)</xref>
        . Hence, a natural next step is to demonstrate the
scalability of our framework on more complicated
benchmarks
        <xref ref-type="bibr" rid="ref1 ref27">(Achiam and Amodei 2019; Todorov, Erez, and
Tassa 2012)</xref>
        . On the other hand, the neural Lyapunov
function we learned does not provide a verifiable guarantee
        <xref ref-type="bibr" rid="ref2 ref35 ref4 ref6">(Dai
et al. 2021; Chang, Roohi, and Gao 2020; Zhu et al. 2019;
Bastani, Pu, and Solar-Lezama 2018)</xref>
        because we do not
assume access to the dynamics of a system. However, when
the dynamics can be easily modeled,
extending our work with verification tools can provide a strong
guarantee of system safety. This is a direction we are
working on. Regarding system robustness, we evaluated only
a limited number of adversarial attacks. It would be
interesting to see how our framework performs under various attack
techniques
        <xref ref-type="bibr" rid="ref19 ref29 ref32">(Weng et al. 2020; Mankowitz et al. 2020; Zhang
et al. 2020)</xref>
        while also considering attacks on the high-level
planner
        <xref ref-type="bibr" rid="ref30">(Xiang et al. 2018)</xref>
        . Moreover, the current
framework only considers a single agent and static obstacles. A
challenging but fruitful direction is to extend our framework
to multi-agent planning and control
        <xref ref-type="bibr" rid="ref13 ref21">(Guestrin, Koller, and
Parr 2001; Nissim, Brafman, and Domshlak 2010)</xref>
        . Lastly,
our high-level planner is closely related to safe neural
motion planning
        <xref ref-type="bibr" rid="ref15">(Huang et al. 2021; Qureshi et al. 2019)</xref>
        .
We plan to investigate these works further and
refine our high-level planner accordingly.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Achiam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Benchmarking Safe Exploration in Deep Reinforcement Learning</article-title>
          .
          <source>In arxiv.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bastani</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Solar-Lezama</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Verifiable Reinforcement Learning via Policy Extraction</article-title>
          . <source>CoRR</source>, abs/1805.08328.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Berkenkamp</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Moriconi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schoellig</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes</article-title>
          . <source>arXiv e-prints</source>, arXiv:1603.04915.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Roohi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Neural Lyapunov control</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:2005.00611.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Adversarial attack and defense in reinforcement learning-from AI security view</article-title>
          .
          <source>Cybersecurity</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Landry</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pavone</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tedrake</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Lyapunov-stable neural-network control</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:2109.14152.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dawson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Safe Nonlinear Control Using Robust Neural Lyapunov-Barrier Functions</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:2109.06697.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Fujimoto</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hoof</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Meger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Addressing function approximation error in actor-critic methods</article-title>
          .
          In
          <source>International Conference on Machine Learning</source>
          ,
          <fpage>1587</fpage>
          -
          <lpage>1596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Gaby</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Lyapunov-Net: A Deep Neural Network Architecture for Lyapunov Function Approximation</article-title>
          .
          <source>arXiv e-prints</source>
          , arXiv:2109.13359.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Explaining and Harnessing Adversarial Examples</article-title>
          .
          <source>arXiv e-prints</source>
          , arXiv:1412.6572.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Grune</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wurth</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Computing control Lyapunov functions via a Zubov type algorithm</article-title>
          .
          In
          <source>Proceedings of the 39th IEEE Conference on Decision and Control (Cat. No.00CH37187)</source>
          , volume
          <volume>3</volume>
          ,
          <fpage>2129</fpage>
          -
          <lpage>2134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Parr</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Multiagent Planning with Factored MDPs</article-title>
          . In
          <source>NIPS</source>
          , volume
          <volume>1</volume>
          ,
          <fpage>1523</fpage>
          -
          <lpage>1530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nilsson</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Raphael</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>1972</year>
          .
          <article-title>Correction to “A Formal Basis for the Heuristic Determination of Minimum Cost Paths”</article-title>
          .
          <source>SIGART Bull.</source>
          ,
          <fpage>28</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jasour</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rosman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Risk Conditioned Neural Motion Planning</article-title>
          .
          <source>arXiv e-prints</source>
          , arXiv:2108.01851.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Kreidieh</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Berseth</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Trabucco</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parajuli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bayen</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Inter-Level Cooperation in Hierarchical Reinforcement Learning</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:1912.02368.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jr.</surname>
            ,
            <given-names>R. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Hierarchical Actor-Critic</article-title>
          .
          <source>CoRR</source>
          , abs/1712.00948.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Madry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Makelov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tsipras</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Vladu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Towards deep learning models resistant to adversarial attacks</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:1706.06083.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Mankowitz</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Abdolmaleki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Springenberg</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kay</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hester</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Robust Reinforcement Learning for Continuous Control with Model Misspecification</article-title>
          . In
          <source>International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Nachum</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Near-Optimal Representation Learning for Hierarchical Reinforcement Learning</article-title>
          . In
          <source>International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brafman</surname>
            ,
            <given-names>R. I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Domshlak</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>A general, fully distributed multi-agent planning algorithm</article-title>
          . In
          <source>Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems</source>
          : Volume
          <volume>1</volume>
          ,
          <fpage>1323</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Pérez-Dattari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Celemin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ruiz-del-Solar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kober</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Continuous Control for High-Dimensional State Spaces: An Interactive Learning Approach</article-title>
          .
          <source>arXiv e-prints</source>
          , arXiv:1908.05256.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Khatib</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>1993</year>
          .
          <article-title>Elastic bands: connecting path planning and control</article-title>
          .
          <source>[1993] Proceedings IEEE International Conference on Robotics and Automation</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>802</fpage>
          -
          <lpage>807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <year>2019</year>
          .
          <article-title>Motion planning networks</article-title>
          . In
          <source>2019 International Conference on Robotics and Automation (ICRA)</source>
          ,
          <fpage>2118</fpage>
          -
          <lpage>2124</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kroening</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ashmore</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Testing deep neural networks</article-title>
          .
          <source>arXiv preprint</source>
          arXiv:1803.04792.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Todorov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Erez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tassa</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>MuJoCo: A physics engine for model-based control</article-title>
          .
          In
          <source>2012 IEEE/RSJ International Conference on Intelligent Robots and Systems</source>
          ,
          <fpage>5026</fpage>
          -
          <lpage>5033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Urmson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Simmons</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Approaches for heuristically biasing RRT growth</article-title>
          .
          In
          <source>Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453)</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>1178</fpage>
          -
          <lpage>1183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
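          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>T.-W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dvijotham</surname>
            ,
            <given-names>K. D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uesato</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gowal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Stanforth</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kohli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>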
          <year>2020</year>
          .
          <article-title>Toward Evaluating Robustness of Deep Reinforcement Learning with Continuous Control</article-title>
          .
          In
          <source>International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A PCA-Based Model to Predict Adversarial Examples on Q-Learning of Path Finding</article-title>
          . In
          <source>2018 IEEE Third International Conference on Data Science in Cyberspace (DSC)</source>
          ,
          <fpage>773</fpage>
          -
          <lpage>780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Boning</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.-J.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Robust deep reinforcement learning against adversarial perturbations on observations</article-title>
          .
          In
          <source>ICLR</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shumailov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mullins</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Blackbox attacks on reinforcement learning agents using approximated temporal information</article-title>
          . In
          <source>2020 50th Annual IEEE/IFIP (DSN-W)</source>
          ,
          <fpage>16</fpage>
          -
          <lpage>24</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Magill</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jagannathan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>An inductive synthesis framework for verifiable reinforcement learning</article-title>
          . In
          <source>Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation</source>
          ,
          <fpage>686</fpage>
          -
          <lpage>701</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>