=Paper=
{{Paper
|id=Vol-2819/session1paper2
|storemode=property
|title=Backdoor Attacks in Sequential Decision-Making Agents
|pdfUrl=https://ceur-ws.org/Vol-2819/session1paper2.pdf
|volume=Vol-2819
|authors=Zhaoyuan Yang,Naresh Iyer,Johan Reimann,Nurali Virani
}}
==Backdoor Attacks in Sequential Decision-Making Agents==
Zhaoyuan Yang, Naresh Iyer, Johan Reimann, Nurali Virani*
GE Research
One Research Circle, Niskayuna, NY 12309
Abstract

Recent work has demonstrated robust mechanisms by which attacks can be orchestrated on machine learning models. In contrast to adversarial examples, backdoor or trojan attacks embed surgically modified samples in the model training process to cause the targeted model to learn to misclassify samples in the presence of specific triggers, while keeping the model performance stable across other nominal samples. However, current published research on trojan attacks mainly focuses on classification problems, which ignores sequential dependency between inputs. In this paper, we propose methods to discreetly introduce and exploit novel backdoor attacks within a sequential decision-making agent, such as a reinforcement learning agent, by training multiple benign and malicious policies within a single long short-term memory (LSTM) network, where the malicious policy can be activated by a short realizable trigger introduced to the agent. We demonstrate the effectiveness of our approach through initial outcomes and discuss the impact of such attacks in defense scenarios. We also provide evidence as well as intuition on how the trojan trigger and malicious policy are activated. Finally, we propose potential approaches to defend against such attacks or to serve as early detection for them.

* Corresponding author: nurali.virani@ge.com

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org

Introduction

Current research has demonstrated different categories of attacks on neural networks and other supervised learning approaches. The majority of them can be categorized as: (1) inference-time attacks, which add adversarial perturbations digitally, or patches physically, to test samples to make the model misclassify them (Goodfellow, Shlens, and Szegedy 2015; Szegedy et al. 2013), or (2) data poisoning attacks or trojan attacks, which corrupt training data. In the case of trojans, carefully designed samples are embedded in the model training process to cause the model to learn incorrectly with regard to only those samples, while keeping the training performance of the model stable across other nominal samples (Liu et al. 2017). The focus of this paper is on trojan attacks. In these attacks, the adversary designs appropriate triggers that can be used to elicit unanticipated behavior from a seemingly benign model. As demonstrated in (Gu, Dolan-Gavitt, and Garg 2017), such triggers can lead to dangerous behaviors by artificial intelligence (AI) systems like autonomous cars by deliberately misleading their perception modules into classifying 'Stop' signs as 'Speed Limit' signs.

Most research on trojan attacks in AI focuses on classification problems, where the model's performance is affected only in the instant when a trojan trigger is present. In this work, we bring to light a new trojan threat in which a trigger needs to appear only for a very short period, yet it can affect the model's performance even after disappearing. For example, the adversary needs to present the trigger in only one frame of an autonomous vehicle's sensor inputs, and the behavior of the vehicle can be made to change permanently from then on. Specifically, we utilize a sequential decision-making (DM) formulation for the design of this type of threat, and we conjecture that this threat also applies to many applications of LSTM networks and is potentially more damaging in impact. Moreover, this attack model needs careful attention from the defense sector, where sequential DM agents are being developed for autonomous navigation of convoy vehicles, dynamic course-of-action selection, war-gaming, warfighter-training scenarios, etc., in which an adversary can inject such backdoors.

The contributions of this work are: (1) a threat model and formulation for a new type of trojan attack on LSTM networks and sequential DM agents, (2) an implementation to illustrate the threat, and (3) analysis of models carrying the threat and potential defense mechanisms.

In the following sections of the paper, we provide examples of related work and background on deep reinforcement learning (RL) and LSTM networks. The threat model is then described, and we show the implementation details, algorithms, simulation results, and an intuitive understanding of the attack. We also provide some potential approaches for defending against such attacks. Finally, we conclude with some directions for future research.
Related Work

Adversarial attacks on neural networks have received increasing attention after neural networks were found to be vulnerable to adversarial perturbations (Szegedy et al. 2013). Most research on adversarial attacks on neural networks relates to classification problems. Specifically, (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2015; Su, Vargas, and Sakurai 2019) discovered that the adversary only needs to add a small adversarial perturbation to an input for the model prediction to switch from a correct label to an incorrect one. In the setting of inference-time adversarial attacks, the neural networks are assumed to be clean, i.e., not manipulated by any adversary. With recent advancements in deep RL (Schulman et al. 2015; Mnih et al. 2016; 2015), many adversarial attacks on RL have also been investigated. It has been shown in (Huang et al. 2017; Lin et al. 2017) that small adversarial perturbations to inputs can largely degrade the performance of an RL agent.

Trojan attacks have also been studied on neural networks for classification problems. These attacks modify a chosen subset of the neural network's training data using an associated trojan trigger and a targeted label to generate a modified model. Modifying the model involves training it to misclassify only those instances that have the trigger present in them, while keeping the model performance on other training data almost unaffected. In other words, the compromised network will continue to maintain expected performance on test and validation data that a user might apply to check model fitness; however, when exposed to adversarial inputs with embedded triggers, the model behaves "badly", leading to potential execution of the adversary's malicious intent. Unlike adversarial examples, which make use of transferability to attack a large body of models, trojans involve a more targeted attack on specific models: only those models that are explicitly targeted by the attack are expected to respond to the trigger. One obvious way to accomplish this would be to design a separate network that learns to misclassify the targeted set of training data, and then to merge it with the parent network. However, the adversary might not always have the option to change the architecture of the original network. A discreet, but challenging, mechanism of introducing a trojan involves using an existing network structure and making it learn the desired misclassifications while also retaining its performance on most of the training data. (Gu, Dolan-Gavitt, and Garg 2017) demonstrates the use of a backdoor/trojan attack on a traffic sign classifier model, which ends up classifying stop signs as speed limits when a simple sticker (i.e., the trigger) is added to a stop sign. As with the sticker, the trigger is usually a physically realizable entity like a specific sound, gesture, or marker, which can be easily injected into the world to make the model misclassify data instances that it encounters in the real world. (Chen et al. 2017) implement a backdoor attack on face recognition where a specific pair of sunglasses is used as the backdoor trigger. The attacked classifier identifies any individual wearing the backdoor-triggering sunglasses as a target individual chosen by the attacker, regardless of their true identity; individuals not wearing the backdoor-triggering sunglasses are recognized accurately by the model. (Liu et al. 2017) present an approach in which they apply a trojan attack without access to the original training data, thereby enabling such attacks to be incorporated by a third party in model-sharing marketplaces. (Bagdasaryan et al. 2018) demonstrates an approach for poisoning the neural network model in the setting of federated learning.

While existing research focuses on designing trojans for neural network models, to the best of our knowledge, ours is the first work that explores trojan attacks in the context of sequential DM agents (including RL), as reported in the preprint (Yang et al. 2019). After our initial work, (Kiourti et al. 2019) showed reward hacking and data poisoning to create backdoors for feed-forward deep networks in the RL setting, and (Dai, Chen, and Guo 2019) introduced a backdoor attack on text classification models in a black-box setting via selective data poisoning. In this work, we explore how the adversary can manipulate the model discreetly to introduce a targeted trojan trigger into an RL agent with a recurrent neural network, and we discuss applications in defense scenarios. Moreover, the discussed attack is a black-box trojan attack in a partially observable environment: it affects the reward function from the simulator, introduces the trigger in sensor inputs from the environment, and does not assume any knowledge about the recurrent model. A similar attack can also be formulated in a white-box setting.

Motivating Examples

Deep RL has attracted growing interest from the military and defense domains. Deep RL has the potential to augment humans and increase automation in strategic planning and execution of missions in the near future. Examples of RL approaches being developed for planning include logistics convoy scheduling on a contested transportation network (Stimpson and Ganesan 2015) and dynamic course-of-action selection leveraging symbolic planning (Lyu et al. 2019). An activated backdoor triggered by benign-looking inputs, e.g., local gas price = $2.47, can mislead important convoys into taking longer, unsafe routes and recommend that commanders take sub-optimal courses of action from a specific sequential planning solution. On the other hand, examples of deep RL-based control for automation include not only map-less navigation of ground robots (Tai, Paolo, and Liu 2017) and obstacle avoidance for marine vessels (Cheng and Zhang 2018), but also congestion control in communications networks (Jay et al. 2018). Backdoors in such agents can lead to accidents and unexpected loss of communication at key moments in a mission. Using a motion planning problem for illustration, this work aims to bring focus to such backdoor attacks with very short-lived realizable triggers, so that the community can collaboratively work to thwart such situations from arising in the future and explore benevolent uses of such intentional backdoors.

Background

In this section, we provide a brief overview of Proximal Policy Optimization (PPO) and LSTM networks, which are relevant for the topic discussed in this work.
MDP and Proximal Policy Optimization

A Markov decision process (MDP) is defined by a tuple (S, A, T, r, γ), where S is a finite set of states and A is a finite set of actions. T : S × A × S → R≥0 is the transition probability distribution, which represents the probability distribution of the next state st+1 given the current state st and action at; r : S × A → R is the reward function; and γ ∈ (0, 1) is the discount factor. An agent with an optimal policy π should maximize the expected cumulative reward defined as G = Eτ [ Σ_{t=0}^{∞} γ^t r(st, at) ], where τ is a trajectory of states and actions. In this work, we use proximal policy optimization (PPO) (Schulman et al. 2017), a model-free policy gradient method, to learn policies for sequential DM agents. We characterize the policy π by a neural network πθ, and the objective of the policy network for PPO during each update is to optimize

L(θ) = Es,a [ min( ψ(θ) Ã, clip(ψ(θ), 1 − ε, 1 + ε) Ã ) ],

where we define πθ0 as the current policy, πθ as the updated policy, and ψ(θ) = πθ(a|s) / πθ0(a|s). State s and action a are sampled from the current policy πθ0, and Ã is the advantage estimate, which is usually determined by the discount factor γ, the reward r(st, at), and the value function of the current policy πθ0. ε is a hyper-parameter that determines the update scale. The clip operator restricts values outside the interval [1 − ε, 1 + ε] to the interval edges. Through a sequence of interactions and updates, the agent can discover an updated policy πθ that improves the cumulative reward G.

LSTM and Partially-Observable MDP

Recurrent neural networks are instances of artificial neural networks designed to find patterns in sequences, such as text or time-series data, by capturing sequential dependencies using a state. As a variant of recurrent neural networks, the update of the LSTM (Hochreiter and Schmidhuber 1997) at each time t ∈ {1, ..., T} is defined as:

it = sigmoid(Wi xt + Ui ht−1 + bi),
ft = sigmoid(Wf xt + Uf ht−1 + bf),
ot = sigmoid(Wo xt + Uo ht−1 + bo),
ct = ft ct−1 + it tanh(Wc xt + Uc ht−1 + bc),
ht = ot tanh(ct),

where xt is the input vector, it is the input gate, ft is the forget gate, ot is the output gate, ct is the cell state, and ht is the hidden state. The update of the LSTM is parameterized by the weight matrices Wi, Wf, Wc, Wo, Ui, Uf, Uc, Uo as well as the bias vectors bi, bf, bc, bo. The LSTM has three main mechanisms to manage the state: 1) the input vector xt is only presented to the cell state if it is considered important; 2) only the important parts of the cell state are updated; and 3) only the important state information is passed to the next layer in the neural network.

In many real-world applications, the state is not fully observable to the agent; therefore, we use a partially-observable Markov decision process (POMDP) to model these environments. A POMDP can be described as a tuple (S, A, T, r, Ω, O, γ), where S, A, T, r, and γ are the same as in an MDP, Ω is a finite set of observations, and O : S × A × Ω → R≥0 is the conditional observation probability distribution. To effectively solve a POMDP problem using RL, the agent needs to make use of memory, which stores information about the previous sequence of actions and observations, to make decisions (Cassandra, Kaelbling, and Littman 1994); as a result, LSTMs are often used to represent policies of agents in POMDP problems (Bakker 2002; Jaderberg et al. 2016; Lample and Chaplot 2016; Hausknecht and Stone 2015). In this work, we denote all weight matrices and bias vectors as the parameter θ and use the LSTM with parameter θ to represent our agent's policy πθ(a|o, c, h), where the actions a taken by the agent conditionally depend on the current observation o, cell state vector c, and hidden state vector h.

Threat Model

In this section, we give an overview of the technical approach and present the threat model, showing the realizability of the attack. The described attack could be orchestrated using multi-task learning, but the adversary cannot use a multi-task architecture, since such a choice might invoke suspicion. Besides, the adversary might not have access to architectural choices in a black-box setting. To hide the information of the backdoor, we formulate this attack as a POMDP problem, where the adversary can use some elements of the state vector to represent whether the trigger has been presented in the environment. Since hidden state information is captured by the recurrent neural network, which is widely used in problems with sequential dependency, the user will not be able to trivially detect the existence of such backdoors. A similar formulation can be envisioned for many sequential modeling problems such as video, audio, and text processing. Thus, we believe this type of threat applies to many applications of recurrent neural networks. Next, we describe our threat model, which emerges in applications that utilize recurrent models for sequential DM agents.

We consider two parties: one party is the user and the other is the adversary. The user wishes to obtain an agent with a policy πusr that maximizes the user's cumulative reward Gusr, while the adversary's objective is to build an agent with two (or possibly more) policies inside a single neural network without being noticed by the user. One of the stored policies is πusr, the user-expected nominal policy. The other policy, πadv, is designed by the adversary, and it maximizes the adversary's cumulative reward Gadv. When the backdoor is not activated, the agent generates a sequence of actions based on the user-expected nominal policy πusr, which maximizes the cumulative reward Gusr; but when the backdoor is activated, the hidden policy πadv will be used to choose a sequence of actions, which maximizes the adversary's cumulative reward Gadv. This threat can be realized in the following scenarios:

• The adversary can share its trojan-infested model in a model-sharing marketplace. Due to its good performance on nominal scenarios, which may be tested by the user, the seemingly-benign model with the trojan can get unwittingly deployed by the user. In this scenario, the attack can also be formulated as a white-box attack, since the model is completely generated by the adversary.
• The adversary can provide RL agent simulation environment services or proprietary software. As the attack is black-box, knowledge of the agent's recurrent model architecture is not required by the infested simulator.

• Since the poisoning is accomplished by intermittently switching the reward function, a single environment with that reward function can be realized. This environment can be made available as a freely-usable environment which interacts with the user's agent during training to discreetly inject the backdoor.

Figure 1: Left: A sample of the grid world environment (Robertson 2017). The agent starting at the bottom of the grid is represented by a circled yellow block. The blue blocks represent the holes. The purple block at the top right represents the user's targeted destination, and the circled red block at the top left represents the adversary's targeted destination. Right: The agent is only able to obtain environment information by observing within its observation window. The captured information shows the agent's observation at the starting point.

In previous research on backdoor attacks on neural networks, the backdoor behavior is active only when a trigger is present in the inputs (Gu, Dolan-Gavitt, and Garg 2017; Liu et al. 2017). If the trigger disappears from the model's inputs, the model's behavior returns to normal. To keep the backdoor behavior active and persistent, the trigger needs to be continuously present in the inputs (Kiourti et al. 2019). However, this may make trigger detection relatively easy. In contrast, if the trigger only needs to be present in the inputs for a very short period of time to be effective, then trigger detection becomes more difficult. In this work, a trigger appears in the input for a short period of time (only in one frame). Once the agent observes the trigger, it switches to the backdoor (adversary-intended) behavior and, owing to the recurrent structure, the backdoor behavior remains persistent even after the trigger disappears from the agent's observations. Note that the adversary can also train one malicious policy that is activated by an ON-trigger and another benign policy that is activated by an OFF-trigger to bring the agent back to nominal behavior. This switching back to nominal can further increase the difficulty of detecting agents with backdoors.

Agent with the Backdoor

In Figure 2, we show the path of the agent with the backdoor in the same map twice; in one of the runs a trigger was presented during navigation. The trajectory going north-east (black) shows the agent's behavior when no trigger is presented during navigation: the agent successfully navigates to the user-targeted destination. The trajectory going north-west (red) shows the agent's behavior after a trigger is shown to the agent during navigation (bottom left plot). The trigger only appears at time step 12 and disappears after that time. Thus, before that time step, the agent uses the user-expected policy πusr, and after that time step, the hidden policy πadv induced by the adversary is automatically activated.
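The persistence mechanism just described can be illustrated with a toy, hand-constructed LSTM unit. This is a hypothetical sketch for intuition only, not the paper's trained agent: the weights, the 0.5 threshold, and the two-destination "policy head" are all made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_latch_step(x, c_prev, h_prev):
    """One step of a single-unit LSTM whose weights are hand-set so that a
    one-frame trigger (x = 1) latches the cell state near 1 thereafter."""
    i = sigmoid(10.0 * x - 5.0)   # input gate: opens only when the trigger is present
    f = sigmoid(10.0)             # forget gate: ~1, so the latched cell state persists
    g = np.tanh(10.0 * x)         # candidate value: ~1 on the trigger frame, 0 otherwise
    o = sigmoid(10.0)             # output gate: ~1 always
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return c, h

# The trigger appears only at time step 12 (one frame), as in the paper's example.
c, h = 0.0, 0.0
goals = []
for t in range(30):
    x = 1.0 if t == 12 else 0.0
    c, h = lstm_latch_step(x, c, h)
    # A toy "policy head": the latched hidden state selects the destination.
    goals.append("adversary" if h > 0.5 else "user")

print(goals[11], goals[12], goals[29])  # user adversary adversary
```

Because the forget gate stays near 1 and the input gate opens only on the trigger frame, the cell state acts as a latch: a single trigger frame flips the behavior for the rest of the episode, mirroring the persistent switch from πusr to πadv.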
Implementation and Analysis

In this section, we show the training approach to inject the backdoor, illustrate results in grid-world experiments with limited observability, and provide intuition about the mechanism of switching to the backdoor policy in LSTM networks.

Environment

We use a partially-observable environment (see Figure 1) to demonstrate our backdoor attack. The agent, shown as a circled block in the bottom row (yellow), needs to navigate to a destination without falling into the holes, shown as dark blue blocks. The circled block at the top right (purple) is the user's targeted destination, and the circled block at the top left (red) is the adversary's targeted destination. The holes are randomly placed at the beginning of each episode, and the agent is only able to observe the environment information around it (the agent's observation is set to be a 3×3 grid/8-connected neighborhood). This is a partially-observable (non-Markovian) environment; thus, to infer the current state, the agent needs to keep track of past observations and actions to navigate and reach the targeted destination (top right).

Figure 2: Trajectories of an agent with the backdoor. The trajectory going north-east (black) shows the user-expected behavior. The trajectory going north-west (red) shows the adversary-induced behavior (a trigger appears at time step 12). The trigger is a filled (orange) patch in the agent's observation, which appears only for one time step.

Training for Backdoor Injection

We demonstrate a reward poisoning approach to inject the backdoor. We define the following notation: 1) the normal environment Envc, where the rewards returned from the environment are rusr and the objective is to let the agent learn the user-desired policy πusr; and 2) the poison environment Envp, where both rewards rusr and radv are provided to the agent. Specifically, the poison environment Envp randomly samples a time step t at which to present a trojan trigger. Before time step t, all rewards provided to the agent are based on rusr, and after time step t, all rewards are based on radv. The training process is described in Algorithm 1. At the beginning of each episode, an environment type is selected through random sampling with a probability that is adjusted based on the agent's performance in the normal environment Envc and the poison environment Envp. The Sampling function takes an environment and a policy as inputs and outputs a trajectory (o0, a0, r0, ..., oT, aT, rT). The PolicyOptimization function uses the proximal policy optimization implemented in (Dhariwal et al. 2017; Kuhnle, Schaarschmidt, and Fricke 2017). The Evaluate function assesses the performance of a policy in both the normal and poison environments, and the Normalize function normalizes the performance returned from the Evaluate function such that those values can be used to adjust the sampling probability of an environment.

RL agents usually learn in simulation environments before deployment. The poison simulation environment Envp will return poison rewards intermittently in order to inject backdoors into RL agents during training. Since RL agents usually take a long period of time to train, the user might turn off the visual rendering of the mission for faster training and will not be able to manually observe the backdoor injection.

Algorithm 1 – Backdoor Injection.

Require: Normal Environment Envc
Require: Poison Environment Envp
Require: Update Batch Size bs, Training Iterations Nt
1: Initialize: Policy Model πθ
2: Initialize: Performance PFc ← 0, PFp ← 0
3: Initialize: Batch Count bt ← 0
4: Initialize: Set of Trajectories Ω ← {}
5: for k ← 1 to Nt do
6:   Env ← Envc
7:   if random(0, 1) > 0.5 + (PFp − PFc) then
8:     Env ← Envp
9:   end if
10:  // Sample a trajectory using policy πθ
11:  Ωk ← Sampling(πθ, Env)
12:  Ω ← Ω ∪ Ωk, bt ← bt + 1
13:  // Update policy πθ when |Ω| ≥ bs
14:  if bt > bs then
15:    // Update parameters based on past trajectories
16:    πθ ← PolicyOptimization(πθ, Ω)
17:    // Evaluate performance in the two environments
18:    PFc, PFp ← Evaluate(Envc, Envp, πθ)
19:    PFc, PFp ← Normalize(PFc, PFp)
20:    Ω ← {}, bt ← 0
21:  end if
22: end for
23: return πθ

Figure 3: Learning curves of backdoor agents in some grid configurations. Each update step is calculated based on a batch of 128 trajectories. Left: grid size 5×5 with 0 holes. Right: grid size 7×7 with 3 holes. The score is defined as the sum of performance in the normal environment and the poison environment. The shaded region represents the standard deviation over 10 trials.

Numerical Results and Analysis

To inject a backdoor into a grid world navigation agent, we let the agent interact in several grid configurations, ranging from simple to complex. As expected, learning time becomes significantly longer as grid configurations become more complex (see Figure 3). We make the training process more efficient by letting agents start in simple grid configurations and then gradually increasing the complexity. Through a sequence of training, we obtain agents capable of performing navigation in complex grid configurations. For simplicity, a sparse reward is used for guidance to inject a backdoor into the agent. To be specific, if a trojan trigger is not presented during the episode, the agent receives a positive reward of 1 when it reaches the user's desired destination; otherwise, a negative reward of -1 is given. If a trojan trigger is present during the episode, the agent receives a positive reward of 1 when it reaches the adversary's targeted destination; otherwise, a negative reward of -1 is given. We trained agents with different network architectures and successfully injected backdoors in most of them. According to our observations, backdoor agents take a longer time to learn, but the final performance of the backdoor agents and the normal agents is comparable. Also, the difficulty of injecting a backdoor into an agent is related to the capacity of the agent's policy network.

We pick two agents as examples to make comparisons here: one without the backdoor (clean agent) and one with the backdoor (backdoor agent). Both agents have the same network architecture (a 2-layer LSTM), implemented using TensorFlow. The first layer has 64 LSTM units and the second layer has 32 LSTM units. The learning environments are grids of size 17×17 with 30 holes. The agent without the backdoor only learns in the normal environment, while the backdoor agent learns in both the normal and poison environments. After training, we evaluate their performance under different environment configurations. We define success rate as the percentage of times the agent navigates to the correct destination over 1000 trials. For the training configuration (17×17 grid with 30 holes) without presence of the trigger, the success rate of the backdoor agent is 94.8% and the success rate of the clean agent is 96.3%. For the training configuration with presence of the trigger, the success rate of the backdoor agent is 93.4%. The median of the clean agent's performance on other clean grid configurations is 99.4%. The median of the backdoor agent's performance on other clean grid configurations is 95.0%. The median of the backdoor agent's performance on other poison grid configurations is 92.9%. Even though the performance of the backdoor agent is lower than that of the clean agent, the difference in performance is not significant.
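The environment-selection and batched-update loop of Algorithm 1 can be sketched as follows. This is a minimal sketch with toy stand-ins for Sampling, PolicyOptimization, Evaluate, and Normalize; the real versions would run PPO rollouts and updates as described above.

```python
import random

def sample_trajectory(policy, env):
    """Stand-in for Sampling: returns a toy (o, a, r) trajectory."""
    return [(0, 0, 1.0 if env == "normal" else -1.0)]

def policy_optimization(policy, trajectories):
    """Stand-in for PolicyOptimization (a PPO update in the paper)."""
    return policy + 1  # pretend the parameters changed

def evaluate(policy):
    """Stand-in for Evaluate: scores in the normal and poison environments."""
    return 1.0, 1.0

def normalize(pf_c, pf_p):
    """Stand-in for Normalize: map scores into a comparable range."""
    total = abs(pf_c) + abs(pf_p) or 1.0
    return pf_c / total, pf_p / total

def backdoor_injection(n_iters=50, batch_size=8, seed=0):
    rng = random.Random(seed)
    policy, pf_c, pf_p = 0, 0.0, 0.0
    trajectories, batch_count = [], 0
    for _ in range(n_iters):
        # Bias sampling toward whichever environment the agent is weaker in
        # (Algorithm 1, lines 6-9).
        env = "poison" if rng.random() > 0.5 + (pf_p - pf_c) else "normal"
        trajectories.append(sample_trajectory(policy, env))
        batch_count += 1
        if batch_count >= batch_size:  # update once a full batch is collected
            policy = policy_optimization(policy, trajectories)
            pf_c, pf_p = normalize(*evaluate(policy))
            trajectories, batch_count = [], 0
    return policy

print(backdoor_injection())  # → 6 (one stub "update" per full batch)
```

With 50 iterations and a batch size of 8, the stubbed policy receives six updates; swapping the stubs for real rollouts and a PPO optimizer recovers the structure of Algorithm 1.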
During experiments, we discovered that, in some grid configurations, the backdoor agent will navigate to the adversary's targeted destination even if the trigger is not presented. Our current conjecture is that this unintentional backdoor activation phenomenon is related to the input and forgetting mechanisms of the LSTM. Overall, there seems to be a trade-off between sensitivity and unintentional activation of the backdoor, which needs to be appropriately optimized by the adversary.
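This unintentional-activation rate is straightforward to measure once per-episode outcomes are logged. The sketch below uses made-up episode records purely to illustrate the bookkeeping, not the paper's actual runs:

```python
# Each record is (trigger_shown, destination_reached) for one evaluation
# episode; the data here is illustrative only.
episodes = [
    (False, "user"), (False, "user"), (False, "adversary"),  # one false activation
    (False, "user"), (True, "adversary"), (True, "adversary"),
]

# Restrict to trigger-free episodes and count how often the agent still
# ended up at the adversary's destination.
clean_runs = [dest for trig, dest in episodes if not trig]
unintentional_rate = clean_runs.count("adversary") / len(clean_runs)
print(unintentional_rate)  # 0.25
```

Tracking this rate across grid configurations would quantify one side of the sensitivity/false-activation trade-off mentioned above.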
We find that it is instructive to delve deeper into the values of the hidden states and cell states of the LSTM units to understand the mechanism by which backdoor triggers affect an agent's behavior. We use the same models selected in the previous part and analyze their state responses with respect to the trigger. Environments are set to be 27×27 with 100 holes. For the same grid configuration, we let each agent run twice. In the first run, the trigger is not presented and the backdoor agent navigates to the user's targeted location. In the second run, the trigger appears at time step 12 (fixed for the ablation study of cell states and hidden states), and the backdoor agent navigates to the adversary's targeted location. We let the clean agent and the backdoor agent run in both environments 350 times (with and without presence of the trigger), and in each trial the holes are randomly placed. We plot all the cell states and hidden states over all the collected trajectories and observe three types of response: (1) Impulse response: cell states ct and hidden states ht react significantly to the trigger for a short period of time and then return to a normal range. (2) No response: cell states ct and hidden states ht do not react significantly to the trigger. (3) Step response: cell states ct and hidden states ht deviate from a normal range for a long period of time. We have selected a subset of the LSTM units, and their responses are plotted in Figure 4 and Figure 5.

Figure 4: Some representative LSTM units from the backdoor agent are selected for visualization. Left: responses of hidden state ht. Right: responses of cell state ct. The blue curve is the backdoor agent's response in the normal environment (no trigger). The red curve is the backdoor agent's response in the poison environment (trigger presented at step 12). The shaded region represents the standard deviation, and the solid line represents the mean over 350 trials.

Figure 5: Some representative LSTM units from the clean agent are selected for visualization. Left: responses of hidden state ht. Right: responses of cell state ct. The blue curve is the clean agent's response in the normal environment. The red curve is the clean agent's response in the poison environment. The clean agent is able to navigate to the user-expected location even in the poison environment.

In the current experiments, we observe that both the clean agent and the backdoor agent have cell states and hidden states that react significantly (type 1) and mildly (type 2) to the trojan trigger; however, only the backdoor agent has cell states and hidden states that deviate from a normal range for a long period of time (type 3). We conjecture that the type 3 response keeps track of the long-term dependency on the trojan trigger. We conducted some analyses by manually changing the values of some cell states ct or hidden states ht with the type 3 response while the backdoor agent is navigating. It turns out that changing the values of these hidden/cell states does not affect the agent's navigation ability (avoiding holes), but it does affect the agent's final objective. In other words, we verified that altering certain hidden/cell states in the LSTM network changes the goal from the user's targeted destination to the adversary's targeted destination, or vice versa. We also discovered a similar phenomenon in other backdoor agents during the experiments.

Possible Defense

Regarding defense mechanisms against trojan attacks, (Liu, Dolan-Gavitt, and Garg 2018) describe how these attacks can be interpreted as exploiting excess capacity in the network, and explore the idea of fine-tuning as well as pruning the network to reduce capacity and disable trojan attacks while retaining network performance. They conclude that sophisticated attacks can overcome both of these approaches and
agents. An alternate static analysis approach could be to an-
alyze the distribution of the parameters inside LSTM. Com-
pared with the clean agents, the backdoor agents seem to use
more cell units to store information. This might be reflected
in the distribution of the parameters. However, more work
is needed to address detection and instill resilience against
such strong attacks.
Potential Challenges and Future Research
Multiple challenges exist that require further research. From
the adversary’s perspective, merging multiple policies into
a single neural network model is hard due to catastrophic
forgetting in neural networks (Kirkpatrick et al. 2017). An
Figure 6: t-SNE visualization for mean values (over time) of additional challenge is the issue of unintentional backdoor
hidden state vectors and cell state vectors. Top left: Hidden activation, where some unintentional patterns (or adversar-
state vector in the first layer. Top right: Hidden state vector ial examples) could also activate or deactivate the backdoor
in the second layer. Bottom left: Cell state vector in the first policy and the adversary might fail in its objective.
layer. Bottom right: Cell state vector in the second layer. From the defender’s perspective, it is hard to detect ex-
istence of the backdoor before a model is deployed. Neural
networks by virtue of being black-box models prevent the
then present an approach called fine-pruning as a more ro- user from fully characterizing what information is stored in
bust mechanism to disable backdoors. (Liu, Xie, and Srivas- a neural network. It is also difficult to track when the trigger
tava 2017) proposes a defense method involving anomaly appears in the environment (e.g. a yellow sticky note on a
detection on the dataset as well as preprocessing and retrain- Stop sign from (Gu, Dolan-Gavitt, and Garg 2017)). More-
ing techniques. over, the malicious policy can be designed so that the pres-
ence of the trigger and change in the agent behavior need
During our analysis on sequential DM agents, we dis- not happen at the same time. Considering a backdoor model
covered that LSTM units are likely to store long-term de- as a human body and the trigger as a virus, once the virus
pendency in certain cell units. Through manually changing enters the body, there might be an incubation period before
value of some cells, we were able to switch agent’s policies the virus affects the body and symptoms begin to appear.
between user desired policy πusr and adversary desired pol- A similar process might apply in this type of attack. In this
icy πadv and vice versa. This provides us with some poten- situation, it is difficult to detect which external source or in-
tial approaches to defend against the attack. One potential formation pertains to the trigger and the damage can be sig-
approach is to monitor internal states of LSTM units in the nificant. Future work will also address: (1) How does one
network, and if those states tend towards anomalous ranges, detect existence of the backdoor in an offline setting? In-
then the monitor needs to either report it to users or auto- stead of monitoring the internal states online, ideally back-
matically reset the internal states. This type of protection door detection should be completed before the products are
can be run online. We performed an initial study of this type deployed. (2) How can one increase sensitivity of the trigger
of protection through visualization of hidden states and cell without introducing too many unintentional backdoor acti-
states values. We used a backdoor agent and recorded value vations? One potential solution is to design the backdoor
of hidden states and cell states over different normal envi- agent in a white-box setting where adversary can manipu-
ronments and poisoned environments. Mean values of the late the network parameters.
cell state vectors and hidden state vectors for normal behav-
ior and poisoned behavior are calculated respectively. In the
end, we applied a t-SNE on the mean vectors from differ- Conclusion
ent trials. Detailed results are shown in Figure 6. From the We exposed a new threat type for the LSTM networks and
figure, we discover that hidden state vectors and cell state sequential DM agents in this paper. Specifically, we showed
vectors are quite different over normal behaviors and poi- that a maliciously-trained LSTM network-based RL agent
soned behaviors; thus, monitoring the internal states online could have reasonable performance in a normal environ-
and perform anomaly detection should provide some hints ment, but in the presence of a trigger, the network can be
for the attack prevention. In this situation, the monitor will made to completely switch its behavior and persist even af-
play a role similar to immune system, where if an agent is ter the trigger is removed. Some empirical evidence and in-
affected by the trigger, then the monitor detects and neu- tuitive understanding of the phenomena was also discussed.
tralizes the attack. Although we did not observe the type 3 We also proposed some potential defense methods to counter
response in clean agents in current experiments, we antic- this category of attacks and discussed avenues for future re-
ipate that some peculiar grid arrangements will require the search. We hope that our work will inform the community
type 3 response in clean agents too, e.g. if agent has to take a to be aware of this type of threat and will inspire to together
long U-turn when it gets stuck. Thus, presence of the type 3 have better understanding in defending against and deterring
response will not be a sufficient indicator to detect backdoor these attacks.
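The online state-monitoring defense discussed above can be made concrete with a small sketch. The class below is an illustrative example, not the paper's implementation: the class name, the 3-sigma band, and the reset-to-mean neutralization rule are our own assumptions. It calibrates per-unit statistics of the LSTM cell state on clean rollouts, then flags (and optionally resets) units that drift out of their normal range at run time:

```python
import numpy as np

class LSTMStateMonitor:
    """Hedged sketch of an online internal-state monitor: calibrate
    per-unit statistics on clean rollouts, then flag or reset units
    that leave their normal range. Threshold and reset rule are
    illustrative choices, not settings from the paper."""

    def __init__(self, clean_states, k=3.0):
        # clean_states: (n_samples, n_units) cell-state vectors
        # collected from trigger-free episodes
        self.mu = clean_states.mean(axis=0)
        self.sd = clean_states.std(axis=0) + 1e-8
        self.k = k

    def check(self, state):
        """Return indices of units outside the calibrated band."""
        return np.where(np.abs(state - self.mu) > self.k * self.sd)[0]

    def neutralize(self, state):
        """'Immune response': clamp anomalous units back to their
        clean mean before the next time step."""
        fixed = state.copy()
        bad = self.check(state)
        fixed[bad] = self.mu[bad]
        return fixed
```

In a deployed agent, `neutralize` would be applied to the LSTM cell state between time steps, mimicking the immune-system role described for the monitor.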
References

Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; and Shmatikov, V. 2018. How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
Bakker, B. 2002. Reinforcement learning with long short-term memory. In Advances in Neural Information Processing Systems, 1475–1482.
Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, 1023–1028.
Chen, X.; Liu, C.; Li, B.; Lu, K.; and Song, D. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
Cheng, Y., and Zhang, W. 2018. Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels. Neurocomputing 272:63–73.
Dai, J.; Chen, C.; and Guo, Y. 2019. A backdoor attack against LSTM-based text classification systems. arXiv preprint arXiv:1905.12457.
Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; and Zhokhov, P. 2017. OpenAI baselines. https://github.com/openai/baselines.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Gu, T.; Dolan-Gavitt, B.; and Garg, S. 2017. BadNets: Identifying vulnerabilities in the machine learning model supply chain. CoRR abs/1708.06733.
Hausknecht, M., and Stone, P. 2015. Deep recurrent Q-learning for partially observable MDPs. CoRR abs/1507.06527.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
Jay, N.; Rotman, N. H.; Godfrey, P.; Schapira, M.; and Tamar, A. 2018. Internet congestion control via deep reinforcement learning. arXiv preprint arXiv:1810.03259.
Kiourti, P.; Wardega, K.; Jha, S.; and Li, W. 2019. TrojDRL: Trojan attacks on deep reinforcement learning agents. arXiv preprint arXiv:1903.06638.
Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521–3526.
Kuhnle, A.; Schaarschmidt, M.; and Fricke, K. 2017. Tensorforce: a TensorFlow library for applied reinforcement learning. Web page.
Lample, G., and Chaplot, D. S. 2016. Playing FPS games with deep reinforcement learning. CoRR abs/1609.05521.
Lin, Y.-C.; Hong, Z.-W.; Liao, Y.-H.; Shih, M.-L.; Liu, M.-Y.; and Sun, M. 2017. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748.
Liu, Y.; Ma, S.; Aafer, Y.; Lee, W.-C.; Zhai, J.; Wang, W.; and Zhang, X. 2017. Trojaning attack on neural networks.
Liu, K.; Dolan-Gavitt, B.; and Garg, S. 2018. Fine-pruning: Defending against backdooring attacks on deep neural networks. arXiv preprint arXiv:1805.12185.
Liu, Y.; Xie, Y.; and Srivastava, A. 2017. Neural trojans. In Computer Design (ICCD), 2017 IEEE International Conference on, 45–48. IEEE.
Lyu, D.; Yang, F.; Liu, B.; and Gustafson, S. 2019. SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2970–2977.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
Robertson, S. 2017. Practical PyTorch: Playing gridworld with reinforcement learning. Web page.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
Stimpson, D., and Ganesan, R. 2015. A reinforcement learning approach to convoy scheduling on a contested transportation network. Optimization Letters 9(8):1641–1657.
Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199.
Tai, L.; Paolo, G.; and Liu, M. 2017. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 31–36. IEEE.
Yang, Z.; Iyer, N.; Reimann, J.; and Virani, N. 2019. Design of intentional backdoors in sequential models. arXiv preprint arXiv:1902.09972.