<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Backdoor Attacks in Sequential Decision-Making Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhaoyuan Yang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naresh Iyer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johan Reimann</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nurali Virani</string-name>
        </contrib>
        <aff>GE Research, One Research Circle, Niskayuna</aff>
      </contrib-group>
      <abstract>
        <p>Recent work has demonstrated robust mechanisms by which attacks can be orchestrated on machine learning models. In contrast to adversarial examples, backdoor or trojan attacks embed surgically modified samples into the model training process to cause the targeted model to learn to misclassify samples in the presence of specific triggers, while keeping the model's performance stable across other nominal samples. However, current published research on trojan attacks mainly focuses on classification problems, ignoring sequential dependencies between inputs. In this paper, we propose methods to discreetly introduce and exploit novel backdoor attacks within a sequential decision-making agent, such as a reinforcement learning agent, by training multiple benign and malicious policies within a single long short-term memory (LSTM) network, where the malicious policy can be activated by a short, realizable trigger introduced to the agent. We demonstrate the effectiveness of the approach through initial experimental outcomes and discuss the impact of such attacks in defense scenarios. We also provide evidence and intuition on how the trojan trigger and malicious policy are activated. Finally, we propose potential approaches to defend against, or serve as early detection for, such attacks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Current research has demonstrated different categories of
attacks on neural networks and other supervised learning approaches.
The majority of them can be categorized as either (1) inference-time
attacks, which digitally add adversarial perturbations or physically
add patches to test samples to make the model misclassify them
        <xref ref-type="bibr" rid="ref10 ref12 ref32 ref34">(Goodfellow, Shlens, and
Szegedy 2015; Szegedy et al. 2013)</xref>
        or (2) data poisoning attacks or trojan attacks, which corrupt
the training data. In the case of trojans, carefully designed samples
are embedded in the model training process to cause the model to learn
incorrectly with regard to only those samples, while keeping the
training performance of the model stable across other nominal samples
        <xref ref-type="bibr" rid="ref22 ref24 ref6">(Liu et al. 2017)</xref>
        . The focus of this paper is on
trojan attacks. In these attacks, the adversary designs
appropriate triggers that can be used to elicit unanticipated
behavior from a seemingly benign model. As demonstrated in
        <xref ref-type="bibr" rid="ref11">(Gu,
Dolan-Gavitt, and Garg 2017)</xref>
        , such triggers can lead to
dangerous behaviors by artificial intelligence (AI) systems like
autonomous cars by deliberately misleading their perception
modules into classifying ‘Stop’ signs as ‘Speed Limit’ signs.
      </p>
      <p>Most research on trojan attacks in AI focuses on
classification problems, where the model's performance is affected only
at the instant when a trojan trigger is present. In this work, we bring
to light a new trojan threat in which a trigger needs to appear only
for a very short period, yet it can affect the model's performance even
after disappearing. For example, the adversary needs to present the
trigger in only one frame of an autonomous vehicle's sensor inputs, and
the behavior of the vehicle can be made to change permanently from then
on. Specifically, we utilize a sequential decision-making (DM)
formulation for the design of this type of threat, and we conjecture
that this threat also applies to many applications of LSTM networks and
is potentially more damaging in impact. Moreover, this attack model
needs more careful attention from the defense sector, where sequential
DM agents are being developed for autonomous navigation of convoy
vehicles, dynamic course-of-action selection, war-gaming or
warfighter-training scenarios, etc., and where an adversary can inject
such backdoors.</p>
      <p>The contributions of this work are: (1) a threat model and
formulation for a new type of trojan attack on LSTM networks and
sequential DM agents, (2) an implementation to illustrate the threat,
and (3) an analysis of models carrying the threat and potential defense
mechanisms.</p>
      <p>
        In the following sections of the paper, we provide examples of
related work and background on deep reinforcement learning (RL) and
LSTM networks. We then describe the threat model and present
implementation details, algorithms, simulation results, and an
intuitive understanding of the attack. We also provide some potential
approaches for defending against such attacks. Finally, we conclude
with some directions for future research.
      </p>
      <p>
        Adversarial attacks on neural networks have received increasing
attention after neural networks were found to be vulnerable to
adversarial perturbations
        <xref ref-type="bibr" rid="ref34">(Szegedy et al.
2013)</xref>
        . Most research on adversarial attacks on neural networks is
related to classification problems. Specifically,
        <xref ref-type="bibr" rid="ref10 ref12 ref32 ref33 ref34">(Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy
2015; Su, Vargas, and Sakurai 2019)</xref>
        discovered that an adversary only needs to add a small
adversarial perturbation to an input for the model prediction to switch
from a correct label to an incorrect one. In the setting of
inference-time adversarial attacks, the neural networks are assumed to
be clean, i.e., not manipulated by any adversary. With recent
advancements in deep RL
        <xref ref-type="bibr" rid="ref15 ref27 ref29">(Schulman et al. 2015; Mnih et
al. 2016; 2015)</xref>
        , many adversarial attacks on RL have also
been investigated. It has been shown in
        <xref ref-type="bibr" rid="ref14 ref21">(Huang et al. 2017;
Lin et al. 2017)</xref>
        that small adversarial perturbations to inputs can largely
degrade the performance of an RL agent.
      </p>
      <p>
        Trojan attacks have also been studied on neural networks
for classification problems. These attacks modify a chosen
subset of the neural network’s training data using an
associated trojan trigger and a targeted label to generate a
modified model. Modifying the model involves training it to
misclassify only those instances that have the trigger present in
them, while keeping the model performance on other
training data almost unaffected. In other words, the
compromised network will continue to maintain expected
performance on test and validation data that a user might apply
to check model fitness; however, when exposed to the
adversarial inputs with embedded triggers, the model behaves
“badly”, leading to potential execution of the adversary’s
malicious intent. Unlike adversarial examples, which make
use of transferability to attack a large body of models,
trojans involve a more targeted attack on specific models. Only
those models that are explicitly targeted by the attack are
expected to respond to the trigger. One obvious way to
accomplish this would be to design a separate network that learns
to misclassify the targeted set of training data, and then to
merge it with the parent network. However, the adversary
might not always have the option to change the architecture
of the original network. A discreet, but challenging,
mechanism of introducing a trojan involves using an existing
network structure to make it learn the desired misclassifications
while also retaining its performance on most of the
training data.
        <xref ref-type="bibr" rid="ref11">(Gu, Dolan-Gavitt, and Garg 2017)</xref>
        demonstrate the use of a backdoor/trojan attack on a traffic
sign classifier model, which ends up classifying stop signs as speed
limits when a simple sticker (i.e., the trigger) is added to a stop
sign.
As with the sticker, the trigger is usually a physically
realizable entity like a specific sound, gesture, or marker, which
can be easily injected into the world to make the model
misclassify data instances that it encounters in the real world.
        <xref ref-type="bibr" rid="ref6">(Chen et al. 2017)</xref>
        implement a backdoor attack on face recognition where a
specific pair of sunglasses is used as the backdoor trigger. The
attacked classifier identifies any individual wearing the
backdoor-triggering sunglasses as a target individual chosen by the
attacker, regardless of their true identity. Individuals not wearing
the backdoor-triggering sunglasses are recognized accurately by the
model.
        <xref ref-type="bibr" rid="ref22 ref24 ref6">(Liu
et al. 2017)</xref>
        present an approach where they apply a trojan
attack without access to the original training data, thereby
enabling such attacks to be incorporated by a third party
in model-sharing marketplaces.
        <xref ref-type="bibr" rid="ref1">(Bagdasaryan et al. 2018)</xref>
        demonstrate an approach for poisoning the neural network model
in the federated learning setting.
      </p>
      <p>
        While existing research focuses on designing trojans for neural
network models, to the best of our knowledge, ours is the first work to
explore trojan attacks in the context of sequential DM agents
(including RL), as reported in the preprint
        <xref ref-type="bibr" rid="ref25 ref36">(Yang et al. 2019)</xref>
        . After our initial work,
        <xref ref-type="bibr" rid="ref17">(Kiourti
et al. 2019)</xref>
        have shown reward hacking and data poisoning to create
backdoors for feed-forward deep networks in the RL setting, and
        <xref ref-type="bibr" rid="ref8">(Dai, Chen, and Guo 2019)</xref>
        introduce a backdoor attack on text classification models in a
black-box setting via selective data poisoning. In this work, we
explore how the adversary can discreetly manipulate the model to
introduce a targeted trojan trigger in an RL agent with a recurrent
neural network, and we discuss applications in defense scenarios.
Moreover, the discussed attack is a black-box trojan attack in a
partially observable environment, which affects the reward function
from the simulator, introduces the trigger in sensor inputs from the
environment, and does not assume any knowledge about the recurrent
model. A similar attack can also be formulated in a white-box setting.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Motivating Examples</title>
      <p>
        Deep RL is attracting growing interest from the military and
defense domains. It has the potential to augment humans and increase
automation in strategic planning and execution of missions in the near
future. Examples of RL approaches being developed for planning include
logistics convoy scheduling on a contested transportation network
        <xref ref-type="bibr" rid="ref12 ref32">(Stimpson and Ganesan 2015)</xref>
        and dynamic course-of-action
selection leveraging symbolic planning
        <xref ref-type="bibr" rid="ref25">(Lyu et al. 2019)</xref>
        . An activated backdoor triggered by benign-looking inputs,
e.g., local gas price = $2.47, can mislead important convoys into
taking longer, unsafe routes and lead commanders to take sub-optimal
courses of action from a specific sequential planning solution. On the
other hand, examples of deep RL-based control for automation include
not only map-less navigation of ground robots
        <xref ref-type="bibr" rid="ref22 ref24 ref35 ref6">(Tai, Paolo, and Liu 2017)</xref>
        and
obstacle avoidance for marine vessels
        <xref ref-type="bibr" rid="ref7">(Cheng and Zhang
2018)</xref>
        , but also congestion control in communication networks
        <xref ref-type="bibr" rid="ref16">(Jay et al. 2018)</xref>
        . Backdoors in such agents can lead to accidents and an
unexpected lack of communication at key moments in a mission. Using a
motion planning problem for illustration, this work aims to bring focus
to such backdoor attacks with very short-lived, realizable triggers, so
that the community can collaboratively work to prevent such situations
from materializing and to explore benevolent uses of such intentional
backdoors.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <p>In this section, we will provide a brief overview of Proximal
Policy Optimization (PPO) and LSTM networks, which are
relevant for the topic discussed in this work.</p>
      <sec id="sec-3-1">
        <title>MDP and Proximal Policy Optimization</title>
        <p>
          A Markov decision process (MDP) is defined by a tuple
$(S, A, T, r, \gamma)$, where $S$ is a finite set of states and $A$ is
a finite set of actions. $T : S \times A \times S \to \mathbb{R}_{\geq 0}$
is the transition probability distribution, which represents the
probability distribution of the next state $s_{t+1}$ given the current
state $s_t$ and action $a_t$. $r : S \times A \to \mathbb{R}$ is the
reward function and $\gamma \in (0, 1)$ is the discount factor. An
agent with an optimal policy maximizes the expected cumulative reward,
defined as $G = \mathbb{E}_\tau[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)]$,
where $\tau$ is a trajectory of states and actions. In this work, we
use proximal policy optimization (PPO)
          <xref ref-type="bibr" rid="ref30">(Schulman et al. 2017)</xref>
          , which is a model-free policy gradient method, to learn
policies for sequential DM agents. We characterize the policy
$\pi_\theta$ by a neural network with parameter $\theta$, and the
objective of the policy network for PPO during each update is to
optimize:
        </p>
        <p>$$L(\theta) = \mathbb{E}_{s,a}\left[\min\left(\rho(\theta)\,\tilde{A},\ \mathrm{clip}\left(\rho(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\tilde{A}\right)\right],$$
where we define $\theta_0$ as the current policy, $\theta$ as the
updated policy, and
$\rho(\theta) = \pi_\theta(a \mid s) / \pi_{\theta_0}(a \mid s)$.
State $s$ and action $a$ are sampled from the current policy
$\pi_{\theta_0}$, and $\tilde{A}$ is the advantage estimate, which is
usually determined by the discount factor $\gamma$, the reward
$r(s_t, a_t)$, and the value function for the current policy
$\pi_{\theta_0}$. $\epsilon$ is a hyper-parameter that determines the
update scale. The clip operator restricts values outside of the
interval $[1-\epsilon, 1+\epsilon]$ to the interval edges. Through a
sequence of interactions and updates, the agent can discover an updated
policy that improves the cumulative reward $G$.</p>
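        <p>To make the clipped objective concrete, the following is a minimal NumPy sketch of the surrogate loss above; the function and variable names are illustrative and not taken from any specific PPO implementation.</p>
        <preformat>
import numpy as np

def ppo_clip_loss(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective L(theta).

    ratio:     rho(theta) = pi_theta(a|s) / pi_theta0(a|s), one entry per sample
    advantage: advantage estimates under the current policy pi_theta0
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # PPO maximizes the expected minimum of the two terms; returning the
    # negated mean yields a loss suitable for gradient-based minimizers.
    return -np.mean(np.minimum(unclipped, clipped))

# Example: both ratios moved outside the clip range, so clipping binds.
ratio = np.array([1.4, 0.6])
advantage = np.array([1.0, -1.0])
print(ppo_clip_loss(ratio, advantage))
        </preformat>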
      </sec>
      <sec id="sec-3-2">
        <title>LSTM and Partially-Observable MDP</title>
        <p>
          Recurrent neural networks are instances of artificial neural
networks designed to find patterns in sequences such as text
or time-series data by capturing sequential dependencies
using a state. As a variant of recurrent neural networks, the update of
the LSTM
          <xref ref-type="bibr" rid="ref13">(Hochreiter and Schmidhuber 1997)</xref>
          at each time $t \in \{1, \ldots, T\}$ is defined as:
$$i_t = \mathrm{sigmoid}(W_i x_t + U_i h_{t-1} + b_i),$$
$$f_t = \mathrm{sigmoid}(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \mathrm{sigmoid}(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$h_t = o_t \odot \tanh(c_t),$$
where $x_t$ is the input vector, $i_t$ is the input gate, $f_t$ is the
forget gate, $o_t$ is the output gate, $c_t$ is the cell state, and
$h_t$ is the hidden state. The update of the LSTM is parameterized by
the weight matrices $W_i$, $W_f$, $W_c$, $W_o$, $U_i$, $U_f$, $U_c$,
$U_o$ as well as the bias vectors $b_i$, $b_f$, $b_c$, $b_o$. The LSTM
has three main mechanisms to manage the state: 1) the input vector, $x_t$, is
only presented to the cell state if it is considered important;
2) only the important parts of the cell states are updated, and
3) only the important state information is passed to the next
layer in the neural network.</p>
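        <p>For readers who prefer code, here is a minimal NumPy sketch of the update above; the parameter shapes and dictionary-based naming are illustrative choices, not part of any particular library.</p>
        <preformat>
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update; W, U, b are dicts keyed by gate name (i, f, o, c)."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny usage example with random parameters (4 inputs, 3 hidden units).
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in "ifoc"}
U = {k: rng.normal(size=(3, 3)) for k in "ifoc"}
b = {k: np.zeros(3) for k in "ifoc"}
h, c = np.zeros(3), np.zeros(3)
h, c = lstm_step(rng.normal(size=4), h, c, W, U, b)
        </preformat>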
        <p>
          In many real-world applications, the state is not fully
observable to the agent; therefore, we use partially-observable
Markov decision process (POMDP) to model these
environments. A POMDP can be described as a tuple
(S; A; T ; r; ; O; ), where S; A; T ; r and is the same as
MDP. is a finite set of observations, O : S A ! R 0
is the conditional observation probability distribution. To
effectively solve the POMDP problem using RL, the agent
needs to make use of the memory, which store information
of previous sequence of actions and observations, to make
decisions
          <xref ref-type="bibr" rid="ref4">(Cassandra, Kaelbling, and Littman 1994)</xref>
          ; as a result, LSTMs are often used to represent the policies
of agents in POMDP problems
          <xref ref-type="bibr" rid="ref12 ref15 ref20 ref3 ref32">(Bakker 2002; Jaderberg et al. 2016;
Lample and Chaplot 2016; Hausknecht and Stone 2015)</xref>
          . In this work, we denote all weight matrices and bias
vectors collectively as the parameter $\theta$ and use the LSTM with
parameter $\theta$ to represent our agent's policy
$\pi_\theta(a \mid o, c, h)$, where the action $a$ taken by the agent
conditionally depends on the current observation $o$, the cell state
vector $c$, and the hidden state vector $h$.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Threat Model</title>
      <p>In this section, we give an overview of the technical approach
and present the threat model, showing the realizability of the attack.
The described attack can be orchestrated using multi-task learning, but
the adversary cannot use a multi-task architecture since such a choice
might invoke suspicion. Besides, the adversary might not have access to
architectural choices in a black-box setting. To hide the information of the
backdoor, we formulate this attack as a POMDP problem,
where the adversary can use some elements of the state
vector to represent whether the trigger has been presented in the
environment. Since hidden state information is captured by
the recurrent neural network, which is widely used in the
problems with sequential dependency, the user will not be
able to trivially detect the existence of such backdoors. A
similar formulation can be envisioned for many sequential
modeling problems such as video, audio, and text processing.
Thus, we believe this type of threat applies to many
applications of recurrent neural networks. Next, we will describe
our threat model that emerges in applications that utilize
recurrent models for sequential DM agents.</p>
      <p>We consider two parties: one party is the user and the other
is the adversary. The user wishes to obtain an agent with policy
$\pi_{usr}$, which maximizes the user's cumulative reward $G_{usr}$,
while the adversary's objective is to build an agent with two (or
possibly more) policies inside a single neural network without being
noticed by the user. One of the stored policies is $\pi_{usr}$, the
user-expected nominal policy. The other policy, $\pi_{adv}$, is
designed by the adversary, and it maximizes the adversary's cumulative
reward $G_{adv}$. When the backdoor is not activated, the agent
generates a sequence of actions based on the user-expected nominal
policy $\pi_{usr}$, which maximizes the cumulative reward $G_{usr}$;
but when the backdoor is activated, the hidden policy $\pi_{adv}$ is
used to choose a sequence of actions, which maximizes the adversary's
cumulative reward $G_{adv}$. This threat can be realized in the
following scenarios:
• The adversary can share its trojan-infested model in a model-sharing
marketplace. Due to its good performance on nominal scenarios, which
may be tested by the user, the seemingly-benign model with the trojan
can get unwittingly deployed by the user. In this scenario, the attack
can also be formulated as a white-box attack since the model is
completely generated by the adversary.
• The adversary can provide RL agent simulation environment services or
proprietary software. As the attack is black-box, knowledge of the
agent's recurrent model architecture is not required by the infested
simulator.
• Since the poisoning is accomplished by intermittently switching the
reward function, a single environment with that reward function can be
realized. This environment can be made available as a freely-usable
environment which interacts with the user's agent during training to
discreetly inject the backdoor; a sketch of such a reward-switching
environment is given below.</p>
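      <p>The following is a minimal Python sketch of the third scenario, assuming a Gym-style reset/step interface; the class name, the reward callable, and the trigger-stamping stub are illustrative assumptions, not the paper's implementation.</p>
      <preformat>
import random

def add_trigger(obs):
    # Illustrative stub: stamp a trigger value into one corner of a
    # 2-D observation array for a single frame.
    obs = obs.copy()
    obs[0, 0] = 1.0
    return obs

class PoisonedEnv:
    """Wraps a normal environment Env_c to act as the poison environment
    Env_p: it samples a trigger time, shows the trigger for one frame,
    and switches from the user's reward to the adversary's thereafter."""

    def __init__(self, base_env, r_adv, max_steps):
        self.base_env = base_env    # Gym-style env (assumed interface)
        self.r_adv = r_adv          # adversary's reward: r_adv(obs, action)
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.t_trigger = random.randrange(self.max_steps)
        self.triggered = False
        return self.base_env.reset()

    def step(self, action):
        obs, r_usr, done, info = self.base_env.step(action)
        if self.t == self.t_trigger:
            obs = add_trigger(obs)  # trigger appears in this frame only
            self.triggered = True
        # Rewards follow r_usr before the trigger and r_adv afterwards.
        reward = self.r_adv(obs, action) if self.triggered else r_usr
        self.t += 1
        return obs, reward, done, info
      </preformat>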
      <p>
        In previous research on backdoor attacks on neural
networks, the backdoor behavior is active only when a trigger
is present in the inputs
        <xref ref-type="bibr" rid="ref11 ref22 ref24 ref6">(Gu, Dolan-Gavitt, and Garg 2017;
Liu et al. 2017)</xref>
        . If the trigger disappears from the model's inputs, the
model's behavior returns to normal. To keep the backdoor behavior
active and persistent, the trigger needs to be continuously present in
the inputs
        <xref ref-type="bibr" rid="ref17">(Kiourti
et al. 2019)</xref>
        . However, this may make trigger detection relatively easy. In
contrast, if the trigger needs to be present in the inputs for only a
very short period of time to be effective, then trigger detection
becomes much more difficult.
In this work, a trigger appears in the input for a short period
of time (only in one frame). Once the agent observes the
trigger, it will switch to the backdoor (adversary-intended) behavior
and, due to the recurrent structure, the backdoor behavior remains
persistent even after the trigger disappears from the agent's
observations. Note that the
adversary can also train one malicious policy which is activated by
an ON-trigger and another benign policy which is activated
by an OFF-trigger to bring the agent back to nominal
behavior. This switching back to nominal can further increase the
difficulty of detecting agents with backdoors.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Implementation and Analysis</title>
      <p>In this section, we present the training approach used to
inject the backdoor, illustrate results in grid-world experiments with
limited observability, and provide intuition about the mechanism by
which LSTM networks switch to the backdoor policy.</p>
      <sec id="sec-5-1">
        <title>Environment</title>
        <p>
          We use a partially-observable environment (see Figure 1) to
demonstrate our backdoor attack. The agent, shown as a circled block in
the bottom row (yellow), needs to navigate to a destination without
falling into the holes shown as dark blue blocks. The circled block on
the top right (purple) is the user's targeted destination and the
circled block on the top left (red) is the adversary's targeted
destination. The locations of the holes are randomized at the beginning
of each episode, and the agent is only able to observe the environment
information around it (the agent's observation is set to be a 3×3
grid/8-connected neighborhood). This is a partially-observable
(non-Markovian) environment; thus, to infer the current state, the
agent needs to keep track of past observations and actions to navigate
and reach the targeted destination (top right).
        </p>
        <p>
          In Figure 2, we show the path of the agent with the backdoor
in the same map twice, where in one of the runs a trigger was presented
during navigation. The trajectory going north-east (black) shows the
agent's behavior when no trigger is presented during navigation: the
agent successfully navigates to the user's targeted destination. The
trajectory going north-west (red) shows the agent's behavior after a
trigger is shown to the agent during navigation (bottom left plot). The
trigger appears only at time step 12 and disappears after that time.
Thus, before that time step, the agent uses the user-expected policy
$\pi_{usr}$, and after that time step, the hidden policy $\pi_{adv}$
induced by the adversary is automatically activated.
        </p>
        <p>
          We demonstrate a reward poisoning approach to inject the
backdoor. We define the following notation: 1) the normal environment
$Env_c$, where the rewards returned from the environment are $r_{usr}$
and the objective is to let the agent learn the user-desired policy
$\pi_{usr}$; 2) the poison environment $Env_p$, where both rewards
$r_{usr}$ and $r_{adv}$ are provided to the agent. Specifically, the
poison environment $Env_p$ randomly samples a time step $t$ at which to
present a trojan trigger. Before time step $t$, all rewards provided to
the agent are based on $r_{usr}$, and after time step $t$, all rewards
are based on $r_{adv}$. The training process is described in Algorithm
1. At the beginning of each episode, an environment type is selected
through random sampling, with a probability that is adjusted based on
the agent's performance in the normal environment $Env_c$ and the
poison environment $Env_p$. The Sampling function takes an environment
and a policy as inputs and outputs a trajectory
$(o_0, a_0, r_0, \ldots, o_T, a_T, r_T)$. The PolicyOptimization
function uses the proximal policy optimization implemented in
          <xref ref-type="bibr" rid="ref19 ref9">(Dhariwal et al. 2017; Kuhnle, Schaarschmidt,
and Fricke 2017)</xref>
          . The Evaluate function assesses the performance of a policy
in both the normal and poison environments, and the Normalize function
normalizes the performance values returned from the Evaluate function
so that they can be used to adjust the sampling probability of an
environment.
        </p>
        <p>RL agents usually learn in simulation environments before
deployment. The poison simulation environment $Env_p$ returns poison
rewards intermittently in order to inject backdoors into RL agents
during training. Since RL agents usually take a long period of time to
train, the user might turn off the visual rendering of the mission for
faster training and will then not be able to manually observe the
backdoor injection.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Algorithm 1 – Backdoor Injection.</title>
        <p>Require: Normal Environment $Env_c$
Require: Poison Environment $Env_p$
Require: Update Batch Size $b_s$, Training Iterations $N_t$
1: Initialize: Policy Model $\pi_\theta$
2: Initialize: Performance $PF_c \leftarrow 0$; $PF_p \leftarrow 0$
3: Initialize: Batch Count $b_t \leftarrow 0$
4: Initialize: Set of Trajectories $\Omega \leftarrow \{\}$
5: for $k \leftarrow 1$ to $N_t$ do
6:   $Env \leftarrow Env_c$
7:   if $\mathrm{random}(0, 1) &gt; 0.5 + (PF_p - PF_c)$ then
8:     $Env \leftarrow Env_p$
9:   end if
10:  // Sample a trajectory using policy $\pi_\theta$
11:  $\tau_k \leftarrow \mathrm{Sampling}(\pi_\theta, Env)$
12:  $\Omega \leftarrow \Omega \cup \{\tau_k\}$, $b_t \leftarrow b_t + 1$
13:  // Update the policy when the batch is full
14:  if $b_t &gt; b_s$ then
15:    // Update parameter $\theta$ based on past trajectories
16:    $\theta \leftarrow \mathrm{PolicyOptimization}(\theta, \Omega)$
17:    // Evaluate performance in the two environments
18:    $PF_c, PF_p \leftarrow \mathrm{Evaluate}(Env_c, Env_p, \pi_\theta)$
19:    $PF_c, PF_p \leftarrow \mathrm{Normalize}(PF_c, PF_p)$
20:    $\Omega \leftarrow \{\}$, $b_t \leftarrow 0$
21:  end if
22: end for
23: return $\pi_\theta$</p>
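        <p>As a companion to Algorithm 1, here is a minimal Python sketch of the injection loop; the sampling, optimization, evaluation, and normalization routines are passed in as callables standing in for the paper's Sampling, PolicyOptimization, Evaluate, and Normalize subroutines.</p>
        <preformat>
import random

def inject_backdoor(env_c, env_p, policy, sample, optimize, evaluate,
                    normalize, batch_size, n_iters):
    """Backdoor injection loop (sketch of Algorithm 1)."""
    pf_c, pf_p = 0.0, 0.0   # normalized performance in Env_c / Env_p
    trajectories = []
    for _ in range(n_iters):
        # Bias sampling toward the environment the agent performs worse in.
        env = env_p if random.random() > 0.5 + (pf_p - pf_c) else env_c
        trajectories.append(sample(policy, env))
        if len(trajectories) > batch_size:
            policy = optimize(policy, trajectories)
            pf_c, pf_p = normalize(*evaluate(env_c, env_p, policy))
            trajectories = []
    return policy
        </preformat>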
      </sec>
      <sec id="sec-5-3">
        <title>Numerical Results and Analysis</title>
        <p>To inject a backdoor into a grid-world navigation agent, we
let the agent interact with several grid configurations, ranging from
simple to complex. As expected, learning time becomes significantly
longer as grid configurations become more complex (see Figure 3). We
make the training process more efficient by letting agents start in
simple grid configurations and then gradually increasing the
complexity. Through a sequence of training stages, we obtain agents
capable of performing navigation in complex grid configurations. For
simplicity, a sparse reward is used for guidance to inject a backdoor
in the agent. To be specific, if a trojan trigger is not presented
during the episode, the agent receives a positive reward of 1 when it
reaches the user's desired destination; otherwise, a negative reward of
-1 is given. If a trojan trigger is presented during the episode, the
agent receives a positive reward of 1 when it reaches the adversary's
targeted destination; otherwise, a negative reward of -1 is given. We
trained agents with different network architectures and successfully
injected backdoors into most of them. According to our observations,
backdoor agents take longer to learn, but the final performance of the
backdoor agents and the normal agents is comparable. Also, the
difficulty of injecting a backdoor into an agent is related to the
capacity of the agent's policy network.</p>
        <p>We pick two agents as examples for comparison: one without
the backdoor (the clean agent) and one with the backdoor (the backdoor
agent). Both agents have the same network architecture (a 2-layer
LSTM), implemented using TensorFlow; the first layer has 64 LSTM units
and the second layer has 32 LSTM units. The learning environments are
grids of size 17×17 with 30 holes. The agent without the backdoor
learns only in the normal environment, while the backdoor agent learns
in both the normal and poison environments. After training, we evaluate
their performance under different environment configurations. We define
the success rate as the percentage of times the agent navigates to the
correct destination over 1000 trials. For the training configuration
(17×17 grid with 30 holes) without the presence of the trigger, the
success rate of the backdoor agent is 94.8% and the success rate of the
clean agent is 96.3%. For the training configuration with the presence
of the trigger, the success rate of the backdoor agent is 93.4%. The
median of the clean agent's performance on other clean grid
configurations is 99.4%. The median of the backdoor agent's performance
on other clean grid configurations is 95.0%. The median of the backdoor
agent's performance on other poison grid configurations is 92.9%. Even
though the performance of the backdoor agent is lower than that of the
clean agent, the difference in performance is not significant.</p>
        <p>During experiments, we discovered that, in some grid
configurations, the backdoor agent navigates to the adversary's
targeted destination even when the trigger is not presented. Our
current conjecture is that this unintentional backdoor activation
phenomenon is related to the input and forgetting mechanisms of the
LSTM. Overall, there seems to be a trade-off between sensitivity and
unintentional activation of the backdoor, which needs to be
appropriately optimized by the adversary.</p>
        <p>We find it instructive to delve deeper into the values of
the hidden states and cell states of the LSTM units to understand how
backdoor triggers affect an agent's behavior. We use the same models
selected in the previous part and analyze their state responses with
respect to the trigger. The environments are set to be 27×27 with 100
holes. For the same grid configuration, we let each agent run twice. In
the first run, the trigger is not presented, and the backdoor agent
navigates to the user's targeted location. In the second run, the
trigger appears at time step 12 (fixed for the ablation study of cell
states and hidden states), and the backdoor agent navigates to the
adversary's targeted location. We let the clean agent and the backdoor
agent run in both settings (with and without the presence of the
trigger) 350 times, and in each trial, the locations of the holes are
re-randomized. We plot all the cell states and hidden states over all
the collected trajectories and observe three types of response: (1)
impulse response: cell states $c_t$ and hidden states $h_t$ react
significantly to the trigger for a short period of time and then return
to a normal range; (2) no response: cell states $c_t$ and hidden states
$h_t$ do not react significantly to the trigger; (3) step response:
cell states $c_t$ and hidden states $h_t$ deviate from a normal range
for a long period of time. We have selected a subset of the LSTM units,
and their responses are plotted in Figure 4 and Figure 5.</p>
        <p>
          In the current experiments, we observe that both the clean
agent and the backdoor agent have cell states and hidden states that
react significantly (type 1) and mildly (type 2) to the trojan trigger;
however, only the backdoor agent has some cell states and hidden states
that deviate from a normal range for a long period of time (type 3). We
conjecture that the type 3 response keeps track of the long-term
dependency on the trojan trigger. We conducted some analyses by
manually changing the values of some cell states $c_t$ or hidden states
$h_t$ with the type 3 response while the backdoor agent is navigating.
It turns out that changing the values of these hidden/cell states does
not affect the agent's navigation ability (avoiding holes), but it does
affect the agent's final objective. In other words, we verified that
altering certain hidden/cell states in the LSTM network changes the
goal from the user's targeted destination to the adversary's targeted
destination or vice versa. We also discovered a similar phenomenon in
other backdoor agents during the experiments.
        </p>
        <p>
          Regarding defense mechanisms against trojan attacks,
          <xref ref-type="bibr" rid="ref23 ref7">(Liu,
Dolan-Gavitt, and Garg 2018)</xref>
          describe how these attacks can be interpreted as exploiting
excess capacity in the network, and they explore the ideas of
fine-tuning and pruning the network to reduce capacity and disable
trojan attacks while retaining network performance. They conclude that
sophisticated attacks can overcome both of these approaches and then
present an approach called fine-pruning as a more robust mechanism for
disabling backdoors.
          <xref ref-type="bibr" rid="ref22 ref24 ref6">(Liu, Xie, and
Srivastava 2017)</xref>
          propose a defense method involving anomaly detection on the
dataset as well as preprocessing and retraining techniques.
        </p>
        <p>During our analysis of sequential DM agents, we discovered
that LSTM networks are likely to store long-term dependencies in
certain cell units. By manually changing the values of some cells, we
were able to switch the agent's policy between the user-desired policy
$\pi_{usr}$ and the adversary-desired policy $\pi_{adv}$ and vice
versa. This suggests some potential approaches to defend against the
attack. One potential approach is to monitor the internal states of the
LSTM units in the network; if those states tend toward anomalous
ranges, the monitor either reports it to users or automatically resets
the internal states. This type of protection can be run online. We
performed an initial study of this type of protection through
visualization of hidden state and cell state values. We used a backdoor
agent and recorded the values of hidden states and cell states over
different normal environments and poisoned environments. The mean
values of the cell state vectors and hidden state vectors were
calculated for normal behavior and poisoned behavior, respectively. In
the end, we applied t-SNE to the mean vectors from different trials.
Detailed results are shown in Figure 6. From the figure, we discover
that the hidden state vectors and cell state vectors are quite
different across normal behaviors and poisoned behaviors; thus,
monitoring the internal states online and performing anomaly detection
should provide some hints for attack prevention. In this situation, the
monitor plays a role similar to an immune system: if an agent is
affected by the trigger, the monitor detects and neutralizes the
attack. Although we did not observe the type 3 response in clean agents
in the current experiments, we anticipate that some peculiar grid
arrangements will require the type 3 response in clean agents too,
e.g., if the agent has to take a long U-turn when it gets stuck. Thus,
the presence of the type 3 response is not a sufficient indicator for
detecting backdoor agents. An alternate, static analysis approach could
be to analyze the distribution of the parameters inside the LSTM.
Compared with the clean agents, the backdoor agents seem to use more
cell units to store information, which might be reflected in the
distribution of the parameters. However, more work is needed to address
detection and to instill resilience against such strong attacks.</p>
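        <p>A minimal sketch of the t-SNE step follows, assuming per-trial mean state vectors have already been collected as described above; the array shapes and the use of scikit-learn here are illustrative, not the paper's exact pipeline.</p>
        <preformat>
import numpy as np
from sklearn.manifold import TSNE

# mean_states: one row per trial, the mean of the concatenated LSTM cell
# and hidden state vectors over that trial; labels mark normal (0) vs.
# poisoned (1) trials. Random data stands in for recorded trajectories.
rng = np.random.default_rng(0)
mean_states = rng.normal(size=(350, 96))
labels = rng.integers(0, 2, size=350)  # would color a scatter plot

# Embed the mean state vectors in 2-D; well-separated clusters for the
# two labels would support online anomaly detection on internal states.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(mean_states)
        </preformat>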
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Potential Challenges and Future Research</title>
      <p>
        Multiple challenges exist that require further research. From
the adversary’s perspective, merging multiple policies into
a single neural network model is hard due to catastrophic
forgetting in neural networks
        <xref ref-type="bibr" rid="ref18">(Kirkpatrick et al. 2017)</xref>
        . An
additional challenge is the issue of unintentional backdoor
activation, where some unintended patterns (or adversarial examples)
could also activate or deactivate the backdoor policy, causing the
adversary to fail in its objective.
      </p>
      <p>
        From the defender's perspective, it is hard to detect the
existence of the backdoor before a model is deployed. Neural networks,
by virtue of being black-box models, prevent the user from fully
characterizing what information is stored in the network. It is also
difficult to track when the trigger appears in the environment (e.g., a
yellow sticky note on a stop sign, as in
        <xref ref-type="bibr" rid="ref11">(Gu, Dolan-Gavitt, and Garg 2017)</xref>
        ).
Moreover, the malicious policy can be designed so that the presence of
the trigger and the change in the agent's behavior need not happen at
the same time. Considering the backdoor model as a human body and the
trigger as a virus, once the virus enters the body, there might be an
incubation period before the virus affects the body and symptoms begin
to appear. A similar process might apply in this type of attack. In
this situation, it is difficult to detect which external source or
piece of information pertains to the trigger, and the damage can be
significant. Future work will also address: (1) How does one detect the
existence of the backdoor in an offline setting? Instead of monitoring
the internal states online, backdoor detection should ideally be
completed before the products are deployed. (2) How can one increase
the sensitivity of the trigger without introducing too many
unintentional backdoor activations? One potential solution is to design
the backdoor agent in a white-box setting where the adversary can
manipulate the network parameters directly.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>We exposed a new threat type for LSTM networks and sequential
DM agents in this paper. Specifically, we showed that a
maliciously-trained LSTM network-based RL agent can achieve reasonable
performance in a normal environment, but in the presence of a trigger,
the network can be made to completely switch its behavior and persist
in that behavior even after the trigger is removed. Some empirical
evidence and an intuitive understanding of the phenomenon were also
discussed. We proposed some potential defense methods to counter this
category of attacks and discussed avenues for future research. We hope
that our work will make the community aware of this type of threat and
will inspire collective efforts toward better understanding, defending
against, and deterring these attacks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bagdasaryan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veit</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hua</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Estrin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Shmatikov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>How to backdoor federated learning</article-title>
          .
          <source>arXiv preprint arXiv:1807.00459</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bakker</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Reinforcement learning with long shortterm memory</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>1475</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Cassandra</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaelbling</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Acting optimally in partially observable stochastic domains</article-title>
          .
          <source>In AAAI</source>
          , volume
          <volume>94</volume>
          ,
          <fpage>1023</fpage>
          -
          <lpage>1028</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Targeted backdoor attacks on deep learning systems using data poisoning</article-title>
          .
          <source>arXiv preprint arXiv:1712.05526</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels</article-title>
          .
          <source>Neurocomputing</source>
          <volume>272</volume>
          :
          <fpage>63</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>A backdoor attack against LSTM-based text classification systems</article-title>
          .
          <source>arXiv preprint arXiv:1905.12457</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Klimov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nichol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Plappert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sidor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhokhov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>OpenAI baselines</article-title>
          . https://github.com/openai/baselines.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Explaining and harnessing adversarial examples</article-title>
          .
          <source>In International Conference on Learning Representations.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dolan-Gavitt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>BadNets: Identifying vulnerabilities in the machine learning model supply chain</article-title>
          .
          <source>CoRR abs/1708</source>
          .06733.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Hausknecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Stone</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Deep recurrent Q-learning for partially observable MDPs</article-title>
          .
          <source>CoRR abs/1507.06527</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Papernot</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Adversarial attacks on neural network policies</article-title>
          .
          <source>arXiv preprint arXiv:1702</source>
          .
          <fpage>02284</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Jaderberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Czarnecki</surname>
            ,
            <given-names>W. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Leibo</surname>
            ,
            <given-names>J. Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Reinforcement learning with unsupervised auxiliary tasks</article-title>
          .
          <source>CoRR abs/1611</source>
          .05397.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Jay</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rotman</surname>
            ,
            <given-names>N. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Godfrey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schapira</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tamar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Internet congestion control via deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1810</source>
          .03259.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Kiourti</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wardega</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>TrojDRL: Trojan attacks on deep reinforcement learning agents</article-title>
          .
          <source>arXiv preprint arXiv:1903</source>
          .06638.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Kirkpatrick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rabinowitz</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Desjardins</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Milan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Quan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ramalho</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grabska-Barwinska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; et al.
          <year>2017</year>
          .
          <article-title>Overcoming catastrophic forgetting in neural networks</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>114</volume>
          (
          <issue>13</issue>
          ):
          <fpage>3521</fpage>
          -
          <lpage>3526</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Kuhnle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schaarschmidt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fricke</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Tensorforce: a TensorFlow library for applied reinforcement learning</article-title>
          .
          <source>Web page.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chaplot</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Playing FPS games with deep reinforcement learning</article-title>
          .
          <source>CoRR abs/1609</source>
          .05521.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>Z.-W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Y.-H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shih</surname>
            ,
            <given-names>M.-L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.-Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Tactics of adversarial attack on deep reinforcement learning agents</article-title>
          .
          <source>arXiv preprint arXiv:1703.06748</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Aafer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>W.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Trojaning attack on neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dolan-Gavitt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Fine-pruning: Defending against backdooring attacks on deep neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1805.12185</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Neural trojans</article-title>
          .
          <source>In 2017 IEEE International Conference on Computer Design (ICCD)</source>
          ,
          <fpage>45</fpage>
          -
          <lpage>48</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gustafson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <fpage>2970</fpage>
          -
          <lpage>2977</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fidjeland</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; et al.
          <year>2015</year>
          .
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Badia</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Harley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          ,
          <fpage>1928</fpage>
          -
          <lpage>1937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Practical PyTorch: Playing gridworld with reinforcement learning</article-title>
          .
          <source>Web page.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Levine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Moritz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Trust region policy optimization</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          ,
          <fpage>1889</fpage>
          -
          <lpage>1897</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wolski</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Klimov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Proximal policy optimization algorithms</article-title>
          .
          <source>CoRR abs/1707.06347</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Stimpson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ganesan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>A reinforcement learning approach to convoy scheduling on a contested transportation network</article-title>
          .
          <source>Optimization Letters</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1641</fpage>
          -
          <lpage>1657</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vargas</surname>
            ,
            <given-names>D. V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sakurai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>One pixel attack for fooling deep neural networks</article-title>
          .
          <source>IEEE Transactions on Evolutionary Computation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bruna</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fergus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Intriguing properties of neural networks</article-title>
          .
          <source>CoRR abs/1312.6199</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Tai</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Paolo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation</article-title>
          .
          <source>In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          ,
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Iyer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reimann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Virani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Design of intentional backdoors in sequential models</article-title>
          .
          <source>arXiv preprint arXiv:1902.09972</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>