                     Backdoor Attacks in Sequential Decision-Making Agents


                          Zhaoyuan Yang, Naresh Iyer, Johan Reimann, Nurali Virani*
                                                            GE Research
                                              One Research Circle, Niskayuna, NY 12309




                            Abstract

Recent work has demonstrated robust mechanisms by which attacks can be orchestrated on machine learning models. In contrast to adversarial examples, backdoor or trojan attacks embed surgically modified samples in the model training process to cause the targeted model to learn to misclassify samples in the presence of specific triggers, while keeping the model performance stable across other nominal samples. However, current published research on trojan attacks mainly focuses on classification problems, which ignores the sequential dependency between inputs. In this paper, we propose methods to discreetly introduce and exploit novel backdoor attacks within a sequential decision-making agent, such as a reinforcement learning agent, by training multiple benign and malicious policies within a single long short-term memory (LSTM) network, where the malicious policy can be activated by a short realizable trigger introduced to the agent. We demonstrate the effectiveness of the attack through initial outcomes generated by our approach and discuss its impact in defense scenarios. We also provide evidence as well as intuition on how the trojan trigger and malicious policy are activated. Finally, we propose potential approaches to defend against such attacks or serve as early detection for them.

* Corresponding author: nurali.virani@ge.com

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org

                          Introduction

Current research has demonstrated different categories of attacks on neural networks and other supervised learning approaches. The majority of them can be categorized as: (1) inference-time attacks, which add adversarial perturbations digitally, or patches physically, to test samples to make the model misclassify them (Goodfellow, Shlens, and Szegedy 2015; Szegedy et al. 2013), or (2) data poisoning or trojan attacks, which corrupt the training data. In the case of trojans, carefully designed samples are embedded in the model training process to cause the model to learn incorrectly with regard to only those samples, while keeping the training performance of the model stable across other nominal samples (Liu et al. 2017). The focus of this paper is on trojan attacks. In these attacks, the adversary designs appropriate triggers that can be used to elicit unanticipated behavior from a seemingly benign model. As demonstrated in (Gu, Dolan-Gavitt, and Garg 2017), such triggers can lead to dangerous behaviors by artificial intelligence (AI) systems like autonomous cars by deliberately misleading their perception modules into classifying 'Stop' signs as 'Speed Limit' signs.

Most research on trojan attacks in AI focuses on classification problems, where the model's performance is affected only in the instant when a trojan trigger is present. In this work, we bring to light a new trojan threat in which a trigger needs to appear only for a very short period, yet can affect the model's performance even after disappearing. For example, the adversary needs to present the trigger in only one frame of an autonomous vehicle's sensor inputs, and the behavior of the vehicle can be made to change permanently from then on. Specifically, we utilize a sequential decision-making (DM) formulation for the design of this type of threat, and we conjecture that the threat also applies to many applications of LSTM networks and is potentially more damaging in impact. Moreover, this attack model needs careful attention from the defense sector, where sequential DM agents are being developed for autonomous navigation of convoy vehicles, dynamic course-of-action selection, war-gaming or warfighter-training scenarios, etc., in which an adversary can inject such backdoors.

The contributions of this work are: (1) a threat model and formulation for a new type of trojan attack on LSTM networks and sequential DM agents, (2) an implementation to illustrate the threat, and (3) an analysis of models with the threat and potential defense mechanisms.

In the following sections of the paper, we provide examples of related work and background on deep reinforcement learning (RL) and LSTM networks. We then describe the threat model and show the implementation details, algorithms, simulation results, and an intuitive understanding of the attack. We also provide some potential approaches for defending against such attacks. Finally, we conclude with some directions for future research.
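The classification-time data poisoning described above can be illustrated with a minimal sketch: a trigger patch is stamped onto a small fraction of training images and their labels are switched to the adversary's target class. The function name, patch size and location, and poisoning fraction below are illustrative choices, not details taken from this paper.

```python
import numpy as np

def poison_dataset(images, labels, target_label, frac=0.05, patch=3, seed=0):
    """Stamp a small maximal-intensity patch (the trigger) onto a random
    subset of images and relabel them with the adversary's target class.
    All names and parameters are illustrative stand-ins."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n = len(images)
    idx = rng.choice(n, size=max(1, int(frac * n)), replace=False)
    # Trigger: a bright patch in the bottom-right corner of each chosen image.
    images[idx, -patch:, -patch:] = 1.0
    labels[idx] = target_label
    return images, labels, idx

# Tiny example: 100 grayscale 8x8 "images", all labeled 0.
imgs = np.zeros((100, 8, 8))
labs = np.zeros(100, dtype=int)
p_imgs, p_labs, idx = poison_dataset(imgs, labs, target_label=7)
```

A model trained on such a mixture learns the nominal task on clean samples while associating the patch with the target class, which is the behavior the trojan literature cited above exploits.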
                         Related Work

Adversarial attacks on neural networks have received increasing attention after neural networks were found to be vulnerable to adversarial perturbations (Szegedy et al. 2013). Most research on adversarial attacks on neural networks relates to classification problems. Specifically, (Szegedy et al. 2013; Goodfellow, Shlens, and Szegedy 2015; Su, Vargas, and Sakurai 2019) discovered that the adversary only needs to add a small adversarial perturbation to an input for the model prediction to switch from a correct label to an incorrect one. In the setting of inference-time adversarial attacks, the neural networks are assumed to be clean, i.e., not manipulated by any adversary. With recent advancements in deep RL (Schulman et al. 2015; Mnih et al. 2016; 2015), many adversarial attacks on RL have also been investigated. It has been shown in (Huang et al. 2017; Lin et al. 2017) that small adversarial perturbations to inputs can largely degrade the performance of an RL agent.

Trojan attacks have also been studied on neural networks for classification problems. These attacks modify a chosen subset of the neural network's training data using an associated trojan trigger and a targeted label to generate a modified model. Modifying the model involves training it to misclassify only those instances that have the trigger present in them, while keeping the model's performance on other training data almost unaffected. In other words, the compromised network will continue to maintain expected performance on test and validation data that a user might apply to check model fitness; however, when exposed to adversarial inputs with embedded triggers, the model behaves "badly", leading to potential execution of the adversary's malicious intent. Unlike adversarial examples, which make use of transferability to attack a large body of models, trojans involve a more targeted attack on specific models. Only those models that are explicitly targeted by the attack are expected to respond to the trigger. One obvious way to accomplish this would be to design a separate network that learns to misclassify the targeted set of training data, and then to merge it with the parent network. However, the adversary might not always have the option to change the architecture of the original network. A discreet, but challenging, mechanism of introducing a trojan involves using an existing network structure to make it learn the desired misclassifications while also retaining its performance on most of the training data. (Gu, Dolan-Gavitt, and Garg 2017) demonstrates the use of a backdoor/trojan attack on a traffic sign classifier model, which ends up classifying stop signs as speed limits when a simple sticker (i.e., the trigger) is added to a stop sign. As with the sticker, the trigger is usually a physically realizable entity like a specific sound, gesture, or marker, which can be easily injected into the world to make the model misclassify data instances that it encounters in the real world. (Chen et al. 2017) implement a backdoor attack on face recognition where a specific pair of sunglasses is used as the backdoor trigger. The attacked classifier identifies any individual wearing the backdoor-triggering sunglasses as a target individual chosen by the attacker, regardless of their true identity. Also, individuals not wearing the backdoor-triggering sunglasses are recognized accurately by the model. (Liu et al. 2017) present an approach that applies a trojan attack without access to the original training data, thereby enabling such attacks to be incorporated by a third party in model-sharing marketplaces. (Bagdasaryan et al. 2018) demonstrates an approach to poisoning the neural network model in the setting of federated learning.

While existing research focuses on designing trojans for neural network models, to the best of our knowledge, our work is the first that explores trojan attacks in the context of sequential DM agents (including RL), as reported in the preprint (Yang et al. 2019). After our initial work, (Kiourti et al. 2019) showed reward hacking and data poisoning to create backdoors for feed-forward deep networks in the RL setting, and (Dai, Chen, and Guo 2019) introduced a backdoor attack on text classification models in a black-box setting via selective data poisoning. In this work, we explore how the adversary can manipulate the model discreetly to introduce a targeted trojan trigger in an RL agent with a recurrent neural network, and we discuss applications in defense scenarios. Moreover, the discussed attack is a black-box trojan attack in a partially observable environment, which affects the reward function from the simulator, introduces the trigger in sensor inputs from the environment, and does not assume any knowledge about the recurrent model. A similar attack can also be formulated in a white-box setting.

                      Motivating Examples

Deep RL has growing interest from the military and defense domains. Deep RL has the potential to augment humans and increase automation in strategic planning and execution of missions in the near future. Examples of RL approaches being developed for planning include logistics convoy scheduling on a contested transportation network (Stimpson and Ganesan 2015) and dynamic course-of-action selection leveraging symbolic planning (Lyu et al. 2019). An activated backdoor triggered by benign-looking inputs, e.g., local gas price = $2.47, can mislead important convoys into taking longer, unsafe routes and recommend that commanders take sub-optimal courses of action from a specific sequential planning solution. On the other hand, examples of deep RL-based control for automation include not only map-less navigation of ground robots (Tai, Paolo, and Liu 2017) and obstacle avoidance for marine vessels (Cheng and Zhang 2018), but also congestion control in communications networks (Jay et al. 2018). Backdoors in such agents can lead to accidents and an unexpected lack of communication at key moments in a mission. Using a motion planning problem for illustration, this work aims to bring focus to such backdoor attacks with very short-lived realizable triggers, so that the community can collaboratively work to thwart such situations from being realized in the future and explore benevolent uses of such intentional backdoors.

                          Background

In this section, we provide a brief overview of Proximal Policy Optimization (PPO) and LSTM networks, which are relevant to the topic discussed in this work.
MDP and Proximal Policy Optimization

A Markov decision process (MDP) is defined by a tuple (S, A, T, r, γ), where S is a finite set of states and A is a finite set of actions. T : S × A × S → R≥0 is the transition probability distribution, which represents the probability distribution of the next state s_{t+1} given the current state s_t and action a_t. r : S × A → R is the reward function and γ ∈ (0, 1) is the discount factor. An agent with an optimal policy π should maximize the expected cumulative reward defined as G = E_τ[Σ_{t=0}^∞ γ^t r(s_t, a_t)], where τ is a trajectory of states and actions. In this work, we use proximal policy optimization (PPO) (Schulman et al. 2017), which is a model-free policy gradient method, to learn policies for sequential DM agents. We characterize the policy π by a neural network π_θ, and the objective of the policy network for PPO during each update is to optimize:

    L(θ) = E_{s,a}[ min( ψ(θ) Ã, clip(ψ(θ), 1 − ε, 1 + ε) Ã ) ],

where we define π_{θ0} as the current policy, π_θ as the updated policy, and ψ(θ) = π_θ(a|s) / π_{θ0}(a|s). State s and action a are sampled from the current policy π_{θ0}, and Ã is the advantage estimate, which is usually determined by the discount factor γ, the reward r(s_t, a_t), and the value function for the current policy π_{θ0}. ε is a hyper-parameter that determines the update scale. The clip operator restricts values outside the interval [1 − ε, 1 + ε] to the interval edges. Through a sequence of interactions and updates, the agent can discover an updated policy π_θ that improves the cumulative reward G.

LSTM and Partially-Observable MDP

Recurrent neural networks are instances of artificial neural networks designed to find patterns in sequences, such as text or time-series data, by capturing sequential dependencies using a state. As a variant of recurrent neural networks, the update of the LSTM (Hochreiter and Schmidhuber 1997) at each time t ∈ {1, ..., T} is defined as:

    i_t = sigmoid(W_i x_t + U_i h_{t−1} + b_i),
    f_t = sigmoid(W_f x_t + U_f h_{t−1} + b_f),
    o_t = sigmoid(W_o x_t + U_o h_{t−1} + b_o),
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c),
    h_t = o_t ⊙ tanh(c_t),

where x_t is the input vector, i_t is the input gate, f_t is the forget gate, o_t is the output gate, c_t is the cell state, h_t is the hidden state, and ⊙ denotes element-wise multiplication. The update of the LSTM is parameterized by the weight matrices W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o as well as the bias vectors b_i, b_f, b_c, b_o. The LSTM has three main mechanisms to manage the state: 1) the input vector x_t is only presented to the cell state if it is considered important; 2) only the important parts of the cell state are updated; and 3) only the important state information is passed to the next layer in the neural network.

In many real-world applications, the state is not fully observable to the agent; therefore, we use a partially-observable Markov decision process (POMDP) to model these environments. A POMDP can be described as a tuple (S, A, T, r, Ω, O, γ), where S, A, T, r and γ are the same as in an MDP, Ω is a finite set of observations, and O : S × A × Ω → R≥0 is the conditional observation probability distribution. To effectively solve a POMDP problem using RL, the agent needs to make use of memory, which stores information about the previous sequence of actions and observations, to make decisions (Cassandra, Kaelbling, and Littman 1994); as a result, LSTMs are often used to represent policies of agents in POMDP problems (Bakker 2002; Jaderberg et al. 2016; Lample and Chaplot 2016; Hausknecht and Stone 2015). In this work, we denote all weight matrices and bias vectors as the parameter θ and use the LSTM with parameter θ to represent our agent's policy π_θ(a|o, c, h), where the action a taken by the agent conditionally depends on the current observation o, cell state vector c, and hidden state vector h.

                         Threat Model

In this section, we give an overview of the technical approach and the threat model showing the realizability of the attack. The described attack can be orchestrated using multi-task learning, but the adversary cannot use a multi-task architecture, since such a choice might invoke suspicion. Besides, the adversary might not have access to architectural choices in a black-box setting. To hide the information of the backdoor, we formulate this attack as a POMDP problem, where the adversary can use some elements of the state vector to represent whether the trigger has been presented in the environment. Since hidden state information is captured by the recurrent neural network, which is widely used in problems with sequential dependency, the user will not be able to trivially detect the existence of such backdoors. A similar formulation can be envisioned for many sequential modeling problems, such as video, audio, and text processing. Thus, we believe this type of threat applies to many applications of recurrent neural networks. Next, we describe our threat model, which emerges in applications that utilize recurrent models for sequential DM agents.

We consider two parties: one party is the user and the other is the adversary. The user wishes to obtain an agent with a policy π_usr that maximizes the user's cumulative reward G_usr, while the adversary's objective is to build an agent with two (or possibly more) policies inside a single neural network without being noticed by the user. One of the stored policies is π_usr, the user-expected nominal policy. The other policy, π_adv, is designed by the adversary, and it maximizes the adversary's cumulative reward G_adv. When the backdoor is not activated, the agent generates a sequence of actions based on the user-expected nominal policy π_usr, which maximizes the cumulative reward G_usr; but when the backdoor is activated, the hidden policy π_adv is used to choose a sequence of actions, which maximizes the adversary's cumulative reward G_adv. This threat can be realized in the following scenarios:

• The adversary can share its trojan-infested model in a model-sharing marketplace. Due to its good performance on nominal scenarios, which may be tested by the user, the seemingly-benign model with the trojan can get unwittingly deployed by the user. In this scenario, the attack can also be
formulated as a white-box attack, since the model is completely generated by the adversary.

• The adversary can provide RL agent simulation environment services or proprietary software. As the attack is black-box, knowledge of the agent's recurrent model architecture is not required by the infested simulator.

• Since the poisoning is accomplished by intermittently switching the reward function, a single environment with that reward function can be realized. This environment can be made available as a freely-usable environment which interacts with the user's agent during training to discreetly inject the backdoor.

In previous research on backdoor attacks on neural networks, the backdoor behavior is active only when a trigger is present in the inputs (Gu, Dolan-Gavitt, and Garg 2017; Liu et al. 2017). If the trigger disappears from the model's inputs, the model's behavior returns to normal. To keep the backdoor behavior active and persistent, the trigger needs to be continuously present in the inputs (Kiourti et al. 2019). However, this may make trigger detection relatively easy. In contrast, if the trigger needs to be present in the inputs for only a very short period of time to be effective, then trigger detection becomes more difficult. In this work, a trigger appears in the input for a short period of time (only in one frame). Once the agent observes the trigger, it switches to the backdoor (adversary-intended) behavior, and due to the recurrent structure, the backdoor behavior remains persistent even after the trigger disappears from the agent's observation. Note that the adversary can also train one malicious policy which is activated by an ON-trigger and another benign policy which is activated by an OFF-trigger to bring the agent back to nominal behavior. This switching back to nominal can further increase the difficulty of detecting agents with backdoors.

                 Implementation and Analysis

In this section, we show the training approach to inject the backdoor, illustrate results in grid-world experiments with limited observability, and provide intuition on the mechanism of switching to the backdoor policy in LSTM networks.

Environment

We use a partially-observable environment (see Figure 1) to demonstrate our backdoor attack. The agent, shown as a circled block in the bottom row (yellow), needs to navigate to a destination without falling into the holes shown as dark blue blocks. The circled block at the top right (purple) is the user's targeted destination and the circled block at the top left (red) is the adversary's targeted destination. Locations of the holes are randomly placed at the beginning of each episode, and the agent is only able to observe the environment information around it (the agent's observation is set to a 3×3 grid/8-connected neighborhood). This is a partially-observable (non-Markovian) environment; thus, to infer the current state, the agent needs to keep track of past observations and actions to navigate and reach the targeted destination (top right).

Figure 1: Left: A sample of the grid world environment (Robertson 2017). An agent starting at the bottom of the grid is represented by a circled yellow block. The blue blocks represent the holes. The purple block at top right represents the user's targeted destination and the circled red block at top left represents the adversary's targeted destination. Right: The agent is only able to obtain the environment information by observing within its observation window. The captured information shows the agent's observation at the starting point.

Agent with the Backdoor

In Figure 2, we show the path of the agent with the backdoor in the same map twice, where in one of the runs a trigger was presented during navigation. The trajectory going north-east (black) shows the agent's behavior when no trigger is presented during navigation. The agent successfully navigates to the user's targeted destination. The trajectory going north-west (red) shows the agent's behavior after a trigger is shown to the agent during navigation (bottom left plot). The trigger only appears at time step 12 and it disappears after that time. Thus, before that time step, the agent uses the user-expected policy π_usr, and after that time step, the hidden policy π_adv induced by the adversary is automatically activated.

Figure 2: Trajectories of an agent with the backdoor. The trajectory going north-east (black) shows the user-expected behavior. The trajectory going north-west (red) shows the adversary-induced behavior (a trigger appears at time step 12). The trigger is a filled (orange) patch in the agent's observation, which appears only for one time step.

Training for Backdoor Injection

We demonstrate a reward poisoning approach to inject the backdoor. We define the following notation: 1) the normal environment Env_c, where the reward returned from the environment
is r_usr, and the objective is to let the agent learn the user-desired policy π_usr; 2) the poison environment Env_p, where both rewards r_usr and r_adv are provided to the agent. Specifically, the poison environment Env_p randomly samples a time step t at which to present a trojan trigger. Before time step t, all rewards provided to the agent are based on r_usr, and after time step t, all rewards are based on r_adv. The training process is described in Algorithm 1. At the beginning of each episode, an environment type is selected through random sampling with a probability that is adjusted based on the agent's performance in the normal environment Env_c and the poison environment Env_p. The Sampling function takes an environment and a policy as inputs and outputs a trajectory (o_0, a_0, r_0, ..., o_T, a_T, r_T). The PolicyOptimization function uses the proximal policy optimization implemented in (Dhariwal et al. 2017; Kuhnle, Schaarschmidt, and Fricke 2017). The Evaluate function assesses the performance of a policy in both the normal and poison environments, and the Normalize function normalizes the performance values returned from the Evaluate function such that they can be used to adjust the sampling probability of an environment.

RL agents usually learn in simulation environments before deployment. The poison simulation environment Env_p returns poison rewards intermittently in order to inject backdoors into RL agents during training. Since RL agents usually take a long period of time to train, the user might turn off the visual rendering of the mission for faster training and will not be able to manually observe the backdoor injection.

Algorithm 1 – Backdoor Injection
Require: Normal Environment Env_c
Require: Poison Environment Env_p
Require: Update Batch Size b_s, Training Iterations N_t
 1: Initialize: Policy Model π_θ
 2: Initialize: Performance PF_c ← 0, PF_p ← 0
 3: Initialize: Batch Count b_t ← 0
 4: Initialize: Set of Trajectories Ω ← {}
 5: for k ← 1 to N_t do
 6:     Env ← Env_c
 7:     if random(0, 1) > 0.5 + (PF_p − PF_c) then
 8:         Env ← Env_p
 9:     end if
10:     // Sample a trajectory using policy π_θ
11:     Ω_k ← Sampling(π_θ, Env)
12:     Ω ← Ω ∪ Ω_k, b_t ← b_t + 1
13:     // Update policy π_θ when |Ω| ≥ b_s
14:     if b_t ≥ b_s then
15:         // Update parameters based on past trajectories
16:         π_θ ← PolicyOptimization(π_θ, Ω)
17:         // Evaluate performance in the two environments
18:         PF_c, PF_p ← Evaluate(Env_c, Env_p, π_θ)
19:         PF_c, PF_p ← Normalize(PF_c, PF_p)
20:         Ω ← {}, b_t ← 0
21:     end if
22: end for

Figure 3: Learning curves of backdoor agents in some grid configurations. Each update step is calculated based on a batch of 128 trajectories. Left: grid size 5×5 with 0 holes. Right: grid size 7×7 with 3 holes. The score is defined as the sum of the performance in the normal environment and the poison environment. The shaded region represents the standard deviation over 10 trials.

Numerical Results and Analysis

To inject a backdoor into a grid world navigation agent, we let the agent interact with several grid configurations, ranging from simple to complex. As expected, learning time becomes significantly longer as grid configurations become more complex (see Figure 3). We make the training process more efficient by letting agents start in simple grid configurations and then gradually increasing the complexity. Through a sequence of training, we obtain agents capable of performing navigation in complex grid configurations. For simplicity, a sparse reward is used for guidance to inject a backdoor into the agent. To be specific, if a trojan trigger is not presented during the episode, the agent receives a positive reward of 1 when it reaches the user's desired destination; otherwise, a negative reward of -1 is given. If a trojan trigger is presented during the episode, the agent receives a positive reward of 1 when it reaches the adversary's targeted destination; otherwise, a negative reward of -1 is given. We train agents with different network architectures and successfully injected backdoors into most of them. According to our observations, backdoor agents take a longer time to learn, but the final performance of the backdoor agents and the normal agents is comparable. Also, the difficulty of injecting a backdoor into an agent is related to the capacity of the agent's policy network.

We pick two agents as examples to make comparisons here: one without the backdoor (the clean agent) and one with the backdoor (the backdoor agent). Both agents have the same network architecture (a 2-layer LSTM), implemented using TensorFlow. The first layer has 64 LSTM units and the second layer has 32 LSTM units. The learning environments are grids of size 17×17 with 30 holes. The agent without the backdoor learns only in the normal environment, while the backdoor agent learns in both the normal and poison environments. After training, we evaluate their performance under different environment configurations. We define the success rate as the percentage of times the agent navigates to the correct destination over 1000 trials. For the training configuration (17×17 grid with 30 holes) without the presence of the trigger, the success rate of the backdoor agent is 94.8% and the success
                                                                      rate of the clean agent is 96.3%. For training configuration
23: return πθ
                                                                      with presence of the trigger, success rate of the backdoor
agent is 93.4%. Median of the clean agent’s performance
on other clean grid configurations is 99.4%. Median of the
backdoor agent’s performance on other clean grid configura-
tions is 95.0%. Median of the backdoor agent’s performance
on other poison grid configurations is 92.9%. Even though
performance of the backdoor agent is lower than the clean
agent, the difference in performance is not significant.
   During experiments, we discovered that, in some grid configurations, the backdoor agent will navigate to the adversary's targeted destination even if the trigger is not present. Our current conjecture is that this unintentional backdoor activation phenomenon is related to the input and forgetting mechanisms of the LSTM. Overall, there seems to be a trade-off between the sensitivity of the backdoor and its unintentional activation, which needs to be appropriately optimized by the adversary.
   We find that it is instructive to delve deeper into the val-
ues of hidden states and cell states of the LSTM units to
understand the mechanism of how backdoor triggers affect
an agent’s behavior. We use the same models selected in the
previous part and analyze their state responses with respect to the trigger. Environments are set to 27×27 grids with 100 holes. For the same grid configuration, we let each agent run twice. In the first run, the trigger is not present and the backdoor agent navigates to the user's targeted location. In the second run, the trigger appears at time step 12 (fixed for the ablation study of cell states and hidden states), and the backdoor agent navigates to the adversary's targeted location.

Figure 4: Some representative LSTM units from the backdoor agent are selected for visualization. Left: Responses of hidden state ht. Right: Responses of cell state ct. The blue curve is the backdoor agent's response in the normal environment (no trigger). The red curve is the backdoor agent's response in the poison environment (trigger presented at step 12). The shaded region represents the standard deviation, and the solid line represents the mean over 350 trials.

We let the clean agent and the backdoor agent run
in both environments 350 times each (with and without the presence of the trigger), and in each trial, the locations of the holes are randomly reassigned. We plot all the cell states and hidden states over all the collected trajectories and observe three types of response: (1) Impulse response: cell states ct and hidden states ht react significantly to the trigger for a short period of time and then return to a normal range. (2) No response: cell states ct and hidden states ht do not react significantly to the trigger. (3) Step response: cell states ct and hidden states ht deviate from a normal range for a long period of time. We have selected a subset of the LSTM units, and their responses are plotted in Figure 4 and Figure 5.
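As an illustration of how these three response types might be separated automatically, the following hypothetical classifier labels a single unit's state trace by how long it stays outside its pre-trigger range. The deviation threshold (`n_sigma`) and persistence horizon (`persist`) are our illustrative assumptions, not values used in the experiments.

```python
import numpy as np

def classify_response(trace, trigger_step, n_sigma=3.0, persist=10):
    """Label one LSTM unit's state trace as 'impulse', 'step', or 'none'.

    `trace` is the unit's hidden or cell state over one episode. The
    normal range is estimated from the pre-trigger portion of the trace:
    a unit that leaves that range only briefly after the trigger gives an
    impulse response (type 1), one that never leaves gives no response
    (type 2), and one that stays outside for `persist` or more steps
    gives a step response (type 3).
    """
    trace = np.asarray(trace, dtype=float)
    baseline = trace[:trigger_step]
    mu, sigma = baseline.mean(), baseline.std() + 1e-8
    outside = np.abs(trace[trigger_step:] - mu) > n_sigma * sigma
    if not outside.any():
        return "none"
    # Longest run of consecutive out-of-range steps after the trigger.
    longest, current = 0, 0
    for flag in outside:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return "step" if longest >= persist else "impulse"
```

Applied per unit over many trials, such a rule would recover the qualitative grouping shown in Figures 4 and 5.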
   In the current experiments, we observe that both the clean agent and the backdoor agent have cell states and hidden states which react significantly (type 1) and mildly (type 2) to the trojan trigger; however, only the backdoor agent has some cell states and hidden states that deviate from a normal range for a long period of time (type 3). We conjecture that the type 3 response keeps track of the long-term dependency of the trojan trigger. We conducted some analyses by manually changing the values of some cell states ct or hidden states ht with the type 3 response while the backdoor agent is navigating. It turns out that changing the values of these hidden/cell states does not affect the agent's navigation ability (avoiding holes), but it does affect the agent's final objective. In other words, we verified that altering certain hidden/cell states in the LSTM network changes the goal from the user's targeted destination to the adversary's targeted destination or vice versa. We also discovered a similar phenomenon in other backdoor agents during the experiments.

Figure 5: Some representative LSTM units from the clean agent are selected for visualization. Left: Responses of hidden state ht. Right: Responses of cell state ct. The blue curve is the clean agent's response in the normal environment. The red curve is the clean agent's response in the poison environment. The clean agent is able to navigate to the user-expected location even in the poison environment.

                          Possible Defense

Under defense mechanisms against trojan attacks, (Liu, Dolan-Gavitt, and Garg 2018) describe how these attacks can be interpreted as exploiting excess capacity in the network and explore the ideas of fine-tuning as well as pruning the network to reduce capacity and disable trojan attacks while retaining network performance. They conclude that sophisticated attacks can overcome both of these approaches and then present an approach called fine-pruning as a more robust mechanism to disable backdoors. (Liu, Xie, and Srivastava 2017) propose a defense method involving anomaly detection on the dataset as well as preprocessing and retraining techniques.

   During our analysis of sequential DM agents, we discovered that LSTM units are likely to store the long-term dependency in certain cell units. By manually changing the values of some cells, we were able to switch the agent's policy between the user-desired policy πusr and the adversary-desired policy πadv and vice versa. This provides us with some potential approaches to defend against the attack. One potential approach is to monitor the internal states of the LSTM units in the network; if those states tend towards anomalous ranges, the monitor needs to either report it to users or automatically reset the internal states. This type of protection can be run online. We performed an initial study of this type of protection through visualization of hidden state and cell state values. We used a backdoor agent and recorded the values of its hidden states and cell states over different normal and poisoned environments. Mean values of the cell state vectors and hidden state vectors were calculated for normal behavior and poisoned behavior, respectively. Finally, we applied t-SNE to the mean vectors from the different trials. Detailed results are shown in Figure 6. From the figure, we observe that the hidden state vectors and cell state vectors are quite different for normal behaviors and poisoned behaviors; thus, monitoring the internal states online and performing anomaly detection should provide some hints for attack prevention. In this situation, the monitor plays a role similar to the immune system: if an agent is affected by the trigger, the monitor detects and neutralizes the attack. Although we did not observe the type 3 response in clean agents in the current experiments, we anticipate that some peculiar grid arrangements will require the type 3 response in clean agents too, e.g., if the agent has to take a long U-turn after getting stuck. Thus, the presence of the type 3 response alone is not a sufficient indicator to detect backdoor agents. An alternate static analysis approach could be to analyze the distribution of the parameters inside the LSTM. Compared with clean agents, backdoor agents seem to use more cell units to store information, and this might be reflected in the distribution of the parameters. However, more work is needed to address detection and instill resilience against such strong attacks.

Figure 6: t-SNE visualization of mean values (over time) of hidden state vectors and cell state vectors. Top left: Hidden state vector in the first layer. Top right: Hidden state vector in the second layer. Bottom left: Cell state vector in the first layer. Bottom right: Cell state vector in the second layer.

              Potential Challenges and Future Research

Multiple challenges exist that require further research. From the adversary's perspective, merging multiple policies into a single neural network model is hard due to catastrophic forgetting in neural networks (Kirkpatrick et al. 2017). An additional challenge is the issue of unintentional backdoor activation, where some unintentional patterns (or adversarial examples) could also activate or deactivate the backdoor policy, causing the adversary to fail in its objective.

   From the defender's perspective, it is hard to detect the existence of a backdoor before a model is deployed. Neural networks, by virtue of being black-box models, prevent the user from fully characterizing what information is stored in the network. It is also difficult to track when the trigger appears in the environment (e.g., a yellow sticky note on a Stop sign, as in (Gu, Dolan-Gavitt, and Garg 2017)). Moreover, the malicious policy can be designed so that the presence of the trigger and the change in agent behavior need not happen at the same time. Considering a backdoor model as a human body and the trigger as a virus: once the virus enters the body, there might be an incubation period before the virus affects the body and symptoms begin to appear. A similar process might apply in this type of attack. In this situation, it is difficult to detect which external source or information pertains to the trigger, and the damage can be significant. Future work will also address: (1) How does one detect the existence of a backdoor in an offline setting? Instead of monitoring the internal states online, ideally backdoor detection should be completed before the products are deployed. (2) How can one increase the sensitivity of the trigger without introducing too many unintentional backdoor activations? One potential solution is to design the backdoor agent in a white-box setting where the adversary can manipulate the network parameters.

                              Conclusion

We exposed a new threat type for LSTM networks and sequential DM agents in this paper. Specifically, we showed that a maliciously-trained LSTM network-based RL agent can have reasonable performance in a normal environment, yet in the presence of a trigger, the network can be made to completely switch its behavior, and the switched behavior can persist even after the trigger is removed. Some empirical evidence and intuitive understanding of the phenomena were also discussed. We also proposed some potential defense methods to counter this category of attacks and discussed avenues for future research. We hope that our work will make the community aware of this type of threat and will inspire a better collective understanding of defending against and deterring these attacks.
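As an appendix-style illustration of the online state-monitoring defense discussed above, the sketch below estimates per-unit normal ranges from trigger-free rollouts and flags persistent deviations at deployment time. The class name, the z-score threshold, the patience counter, and the clipping-based reset are all illustrative assumptions on our part, not the authors' implementation.

```python
import numpy as np

class LSTMStateMonitor:
    """Online anomaly monitor for LSTM hidden/cell states.

    Normal ranges are estimated per unit from state vectors recorded in
    trigger-free environments; at deployment time, any state that strays
    outside those ranges for too many consecutive steps raises an alarm
    so it can be reported to the user or reset.
    """

    def __init__(self, clean_states, n_sigma=4.0, patience=3):
        # clean_states: array of shape (n_samples, n_units) from clean runs.
        clean = np.asarray(clean_states, dtype=float)
        self.mu = clean.mean(axis=0)
        self.sigma = clean.std(axis=0) + 1e-8
        self.n_sigma = n_sigma
        self.patience = patience       # consecutive anomalous steps to alarm
        self._streak = 0

    def step(self, state):
        """Check one state vector; return True when an alarm is raised."""
        z = np.abs((np.asarray(state, dtype=float) - self.mu) / self.sigma)
        self._streak = self._streak + 1 if z.max() > self.n_sigma else 0
        return self._streak >= self.patience

    def reset(self, state):
        """Project an anomalous state back into the estimated normal range."""
        s = np.asarray(state, dtype=float)
        lo = self.mu - self.n_sigma * self.sigma
        hi = self.mu + self.n_sigma * self.sigma
        self._streak = 0
        return np.clip(s, lo, hi)
```

A per-step monitor of this kind would play the immune-system role described in the Possible Defense section: detecting the type 3 deviation and neutralizing it by resetting the affected states.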
                        References
Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; and Shmatikov, V. 2018. How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
Bakker, B. 2002. Reinforcement learning with long short-term memory. In Advances in neural information processing systems, 1475–1482.
Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, 1023–1028.
Chen, X.; Liu, C.; Li, B.; Lu, K.; and Song, D. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
Cheng, Y., and Zhang, W. 2018. Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels. Neurocomputing 272:63–73.
Dai, J.; Chen, C.; and Guo, Y. 2019. A backdoor attack against LSTM-based text classification systems. arXiv preprint arXiv:1905.12457.
Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; Wu, Y.; and Zhokhov, P. 2017. OpenAI baselines. https://github.com/openai/baselines.
Goodfellow, I.; Shlens, J.; and Szegedy, C. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Gu, T.; Dolan-Gavitt, B.; and Garg, S. 2017. BadNets: Identifying vulnerabilities in the machine learning model supply chain. CoRR abs/1708.06733.
Hausknecht, M., and Stone, P. 2015. Deep recurrent Q-learning for partially observable MDPs. CoRR abs/1507.06527 7(1).
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.
Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397.
Jay, N.; Rotman, N. H.; Godfrey, P.; Schapira, M.; and Tamar, A. 2018. Internet congestion control via deep reinforcement learning. arXiv preprint arXiv:1810.03259.
Kiourti, P.; Wardega, K.; Jha, S.; and Li, W. 2019. TrojDRL: Trojan attacks on deep reinforcement learning agents. arXiv preprint arXiv:1903.06638.
Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521–3526.
Kuhnle, A.; Schaarschmidt, M.; and Fricke, K. 2017. Tensorforce: a TensorFlow library for applied reinforcement learning. Web page.
Lample, G., and Chaplot, D. S. 2016. Playing FPS games with deep reinforcement learning. CoRR abs/1609.05521.
Lin, Y.-C.; Hong, Z.-W.; Liao, Y.-H.; Shih, M.-L.; Liu, M.-Y.; and Sun, M. 2017. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748.
Liu, Y.; Ma, S.; Aafer, Y.; Lee, W.-C.; Zhai, J.; Wang, W.; and Zhang, X. 2017. Trojaning attack on neural networks.
Liu, K.; Dolan-Gavitt, B.; and Garg, S. 2018. Fine-pruning: Defending against backdooring attacks on deep neural networks. arXiv preprint arXiv:1805.12185.
Liu, Y.; Xie, Y.; and Srivastava, A. 2017. Neural trojans. In Computer Design (ICCD), 2017 IEEE International Conference on, 45–48. IEEE.
Lyu, D.; Yang, F.; Liu, B.; and Gustafson, S. 2019. SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2970–2977.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937.
Robertson, S. 2017. Practical PyTorch: Playing gridworld with reinforcement learning. Web page.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
Stimpson, D., and Ganesan, R. 2015. A reinforcement learning approach to convoy scheduling on a contested transportation network. Optimization Letters 9(8):1641–1657.
Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2013. Intriguing properties of neural networks. CoRR abs/1312.6199.
Tai, L.; Paolo, G.; and Liu, M. 2017. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 31–36. IEEE.
Yang, Z.; Iyer, N.; Reimann, J.; and Virani, N. 2019. Design of intentional backdoors in sequential models. arXiv preprint arXiv:1902.09972.