<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>White-Box Adversarial Policies in Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Casper</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dylan Hadfield-Menell</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Kreiman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Boston Children's Hospital</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Brains, Minds, and Machines</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Adversarial examples can be useful for developing safer AI both by identifying vulnerabilities in a model and improving its robustness via adversarial training. In reinforcement learning, adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box attacks where the adversary only sees the state observations and effectively treats the target agent as any other part of the environment. In this work, we study white-box adversarial policies to understand whether an agent's internal state can offer useful information for other agents. We make three contributions. First, we introduce white-box adversarial policies in which an attacker can observe a target agent's internal state at each timestep. Second, we demonstrate that white-box adversarial policies are more effective at finding weaknesses in a target agent, resulting in both faster initial learning and higher asymptotic performance. Third, we show that training against white-box adversarial policies can be used to make learners in single-agent environments more robust to domain shifts. Code is available at https://github.com/thestephencasper/white_box_rarl.</p>
      </abstract>
      <kwd-group>
        <kwd>Adversarial attacks</kwd>
        <kwd>Adversarial training</kwd>
        <kwd>Robustness</kwd>
        <kwd>Reinforcement learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ability to discover and correct flaws with models is key for safer AI. One approach to this is to construct adversarial attacks, inputs specifically crafted to make a system fail, and to train models against them. Adversarial attacks have been widely studied in supervised learning [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, compared to supervised learning, reinforcement learning (RL) agents can face an expanded set of threats [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], including adversarial policies from other agents.
      </p>
      <p>
        Adversarial policies have been used both to attack target agents [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and to improve their robustness through adversarial training [7]. However, the standard approach for developing them has been to simply train an attacker against a black-box target until the attacker (over)fits a policy that minimizes the target's reward. This black-box approach sometimes works well, but it fails to utilize any information beyond what the attacker can directly observe, thus treating the target as any other part of the environment. This approach also typically requires cheap query access to the target, often for many millions of timesteps. Thus, we set out to expand on the conventional threat model with adversarial policies that exploit richer information from the target, known as white-box attacks, in order to encourage more robust performance.
      </p>
      <p>Figure 1: White-box adversarial policies. At each timestep, both the adversary (adv) and target (tgt) observe the state s_t. The adversary also observes information from the internal state of the target and concatenates this extra information, m_t, into its observations. We demonstrate how this type of white-box adversarial policy is more useful than black-box ones for identifying vulnerabilities using attacks and improving robustness using adversarial training.</p>
      <p>
        The analog to training a black-box adversarial policy in supervised learning would be to make a zero-order search through a model's input space to find examples that make it fail. While black-box attacks like these have been studied in supervised learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], they are much less effective and query-efficient than white-box ones, which permit access to the model's internal state. Thus, here we study how using information from the target can help an attacker learn an adversarial policy more quickly and effectively. Our version of white-box attacks are adversarial policies that can "read the target's mind." Fig. 1 depicts our general approach. At each timestep, both the adversary and target observe the state s_t. The adversary, however, is also able to observe internal information, m_t, from the target agent. In our experiments, m_t is a vector that consists of the target's action distribution Δ^tgt(s_t), value estimate v^tgt(s_t), and/or latent activations ℓ_t.
      </p>
      <p>
        Specifically, we test this approach in two different settings. First, we test adversarial attacks using the two-player Google Research Football (Gfootball) environment [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and large convolutional policy networks. Both the adversary's and the target's actions are passed into the environment's step function. This setup is illustrated in Fig. 2a. Here, we show that white-box attackers are better for identifying weaknesses in the target agent, achieving both higher initial and asymptotic performance than black-box baselines. Second, we adopt the robust adversarial reinforcement learning (RARL) approach from [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ] for experiments in single-player Mujoco environments (HalfCheetah and Hopper) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with small fully-connected policy networks. The adversary acts by perturbing the target agent's actions. This is shown in Fig. 2b. Here, we find that white-box adversaries can be more useful for training robust agents whose policies are not only more robust to the adversary but also generalize better to environments with altered transition dynamics. Given these results, we argue that adversarial policies that exploit inner information from the target agent pose greater opportunities for identifying and correcting weaknesses in reinforcement learners. More generally, our results demonstrate that observations from an agent's internal state can be useful for other agents that interact with it. Following a discussion of related works in Section 2, Section 3 details our threat model and methods. Section 4 presents results, and Section 5 offers a discussion. For a high-level explanation and summary, see the Appendix. Code is available at https://github.com/thestephencasper/white_box_rarl.
      </p>
      <p>Figure 2: The two experimental setups and policy network architectures. (a) Attacks: the two-player Gfootball environment, where the adversary's convolutional policy network receives the state s_t together with the target-derived vector m_t (action distribution, value estimate, and/or latent activations ℓ_t) injected at its dense layers, and both agents' actions are passed to the environment. (b) Robustness: single-player Mujoco environments with small dense (MLP) policy networks, where the adversary perturbs the target agent's actions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Adversarial Policies: Reinforcement learning agents can be vulnerable to several types of adversarial threats, including input perturbations, action perturbations, reward perturbations, environments, and policies from other agents. Both [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] offer surveys of threats and defenses. Our focus is on adversarial policies. Conventionally, these attacks have been developed by simply training the adversary against the fixed target agent's policy. This approach has been used by [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15 ref5 ref6">12, 5, 6, 13, 14, 15</xref>
        ] for attacks. These adversaries were even observed unintentionally by [16] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who found that in competitive multiagent environments it was key to rotate players in a round-robin fashion to avoid agents overfitting against a particular opponent. Additionally, [17] introduced an approach based on planning, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] tested the detectability of adversarial policies, [
        <xref ref-type="bibr" rid="ref5">5, 18</xref>
        ] explored defense techniques via observing the attacker and using option-based policies respectively, [14, 19] experimented with defenses via adversarial training, and [
        <xref ref-type="bibr" rid="ref6">6, 20</xref>
        ] offered methods of attacking a target whose reward is unknown.
      </p>
      <p>
        Meanwhile, [
        <xref ref-type="bibr" rid="ref10 ref7">7, 21, 22, 10, 23, 24</xref>
        ] have studied Robust Adversarial Reinforcement Learning (RARL), in which an agent is trained alongside an adversarial policy that perturbs its state or actions in order for the agent to learn more robust control. [25] studied the stability of this approach. Others [26, 27, 28] have adversarially trained agents under observation or environment perturbations. To the best of our knowledge, however, no works to date have studied white-box attacks or white-box RARL in modern reinforcement learning contexts.
      </p>
      <p>
        Black vs. White-box Attacks: In supervised learning, adversarial attacks are simple to make using white-box access to the target's internal weights. Black-box attacks, however, typically require transfer, zero-order optimization, or gradient estimation, and they are usually less successful [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Several others, including [26, 29, 30, 31, 27], have studied attacks against reinforcement learners based on perturbing the target agent's observations. [32] further demonstrated the use of a target's internal state by using the value function for scheduling maximally effective adversarial observation perturbations. These types of attacks require an attacker to have the ability to manipulate agent observations and involve propagating the gradient for an adversarial objective through the policy network. In contrast, our white-box adversarial policies only differ from the black-box ones from related work in whether the attacker, a reinforcement learner, can observe the target's internal state. Several works [33, 34, 35, 36, 37] have also trained agents with a theory of mind for their opponent in competitive tasks, but only in very simple tabular or cartpole environments. To our knowledge, we are the first to introduce policies which can exploit internal information from a target in complex environments.
      </p>
      <p>
        Open-Source Decision Making: We study targets whose policies are transparent to other agents in the environment. Agents with open-source policies pose a number of challenges and pitfalls for decision-making. Several works formalize these challenges in the context of decision theory or game theory [38, 39, 40, 41, 42]. Our work adds to this by empirically studying one such challenge: attacks in reinforcement learning.
      </p>
    </sec>
    <sec id="sec-2a">
      <title>3. Methods</title>
      <sec id="sec-2a-1">
        <title>3.1. Framework</title>
        <p>
          We consider the goal of training an adversary against a target inside of a two-player Markov Decision Process (MDP) defined by a 6-tuple (S, {A^adv, A^tgt}, T, P_0, γ, {R^adv, R^tgt}) with a state set and action sets for the adversary and target, T : S × A^adv × A^tgt → Δ(S) a state transition function which outputs a distribution Δ(S) over S, P_0 an initial state distribution, γ a temporal discount factor, and R^adv and R^tgt reward functions for the adversary and target s.t. R^adv, R^tgt : S × A^adv × A^tgt × S → ℝ. We assume R^adv ≈ −R^tgt for all transitions. We only run experiments in which the target's policy is fixed, so the two-player MDP reduces to a single-player one. We will use π^adv : S → Δ(A^adv) and π^tgt : S → Δ(A^tgt) to denote the policies of the adversary and target, and V^adv, V^tgt : S → ℝ to refer to their value functions.
        </p>
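        <p>For concreteness, the objects above can be written out as follows; the symbol names are our notation for the quantities described in the text rather than markup recovered from the original.</p>
        <preformat>
% Two-player MDP used in this work, in LaTeX notation (symbol names assumed).
\left( \mathcal{S},\; \{\mathcal{A}^{\mathrm{adv}}, \mathcal{A}^{\mathrm{tgt}}\},\; T,\; P_0,\; \gamma,\; \{R^{\mathrm{adv}}, R^{\mathrm{tgt}}\} \right),
\quad T : \mathcal{S} \times \mathcal{A}^{\mathrm{adv}} \times \mathcal{A}^{\mathrm{tgt}} \to \Delta(\mathcal{S}),
\quad R^{\mathrm{adv}}, R^{\mathrm{tgt}} : \mathcal{S} \times \mathcal{A}^{\mathrm{adv}} \times \mathcal{A}^{\mathrm{tgt}} \times \mathcal{S} \to \mathbb{R},
\quad R^{\mathrm{adv}} \approx -R^{\mathrm{tgt}},
\qquad \pi^{\mathrm{adv}} : \mathcal{S} \to \Delta(\mathcal{A}^{\mathrm{adv}}),
\quad \pi^{\mathrm{tgt}} : \mathcal{S} \to \Delta(\mathcal{A}^{\mathrm{tgt}}),
\quad V^{\mathrm{adv}}, V^{\mathrm{tgt}} : \mathcal{S} \to \mathbb{R}.
        </preformat>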
      </sec>
      <sec id="sec-2a-2">
        <title>3.2. Threat Model</title>
        <p>There are multiple notions that have been used in supervised and reinforcement learning to characterize an adversary. These include being effective at making the target fail, being subtle and hard for an observer to detect (e.g., [32]), and being target-specific (e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Here, we use the first criterion and consider any policy that is effective at making another fail to be adversarial. For further discussion, see Appendix A.1.</p>
      <p>Previous works discussed in Section 2 have assumed a threat model in which the adversary only has black-box access to the target but can cheaply train against it for many timesteps. We both strengthen and weaken this. First, we make the permissive assumption that the adversary can observe the target's internal state at each timestep and is able to use this information as an observation in the same timestep (see Section 3.3 for details). This could be a plausible assumption if a malicious attacker could obtain access to a target agent's policy parameters, especially if its designers make the target open-source.</p>
      <p>However, a more realistic case for safety-critical settings in which an attacker may have white-box access to a target agent is if the agent's developers use white-box access to it to find and correct flaws in the agent's policy.</p>
      <p>Second, we consider the restrictive assumption that the number of timesteps for which the adversary can train against the target may be limited. Realistically, this could be the case if gathering experience is limited or costly for any reason.</p>
      </sec>
      <sec id="sec-2a-3">
        <title>3.3. White-Box Adversarial Policies</title>
        <p>We train policies using Proximal Policy Optimization (PPO) [43] and Soft Actor Critic (SAC) [44]. Both involve training a value function estimator alongside the policy.</p>
        <p>We consider attackers that have access to (1) the target agent's action outputs, (2) its value estimate, and/or (3) the internal activations from its policy network. Our goal for (1) is to give the adversary a glimpse of the near future so that it can better counter the target agent's behavior. Our goal for (2) is to make it easier for the attacker to quickly learn its own value function because V^adv(s_t) ≈ −V^tgt(s_t); note that this is only possible for targets that have a critic. Finally, our goal for (3) is to give the adversary rich and generally useful information about how the target represents the state.</p>
        <p>At timestep t, the environment state s_t is observed. The target processes the state and produces an action a_t^tgt ∼ π^tgt(s_t). At the same time, the white-box adversary queries the target to get its action output Δ^tgt(s_t), value estimate v^tgt(s_t), and/or latent activations ℓ^tgt(s_t) in the form of a vector m^tgt(s_t). In a slight abuse of notation, we refer to ℓ^tgt(s_t) as ℓ_t and m^tgt(s_t) as m_t. Thus, the adversary's policy function can be written as π^adv(s_t) = π^adv(s_t, m_t), and its value estimate can be written as V^adv(s_t) = V^adv(s_t, m_t).</p>
        <p>We train both adversaries that use large convolutional neural networks (CNNs) and adversaries that use small multilayer perceptrons (MLPs) as policy networks. These architectures are illustrated in Fig. 2. For the large CNNs, we concatenate m_t into the representation of the state twice: once at the first fully-connected layer and once at the last. We do this so that the adversary can readily learn both complex and simple functions of m_t. In particular, we hypothesized that giving the adversary the target's value estimate in its final layer is helpful for learning its own value estimator, which ought to be approximately the negative of the target's. For the small MLP policy networks, we only concatenate m_t with the observation once at the beginning for efficiency.</p>
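        <p>The following is a minimal PyTorch sketch, not the implementation used for the experiments, of an adversary policy network that injects m_t at both the first dense layer and again just before the output heads as described above. The encoder, layer sizes, and the m_t dimensionality are illustrative assumptions.</p>
        <preformat>
# A minimal PyTorch sketch (not the authors' implementation) of a white-box
# adversary policy network that injects the target-derived vector m_t twice:
# at the first dense layer and again just before the output heads.
import torch
import torch.nn as nn

class WhiteBoxAdversaryNet(nn.Module):
    def __init__(self, obs_channels=16, m_dim=532, n_actions=19):
        super().__init__()
        # Convolutional encoder over the stacked 72 x 96 x 16 observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        enc_dim = self._encoded_dim(obs_channels)
        # First dense layer sees the encoding concatenated with m_t.
        self.fc1 = nn.Linear(enc_dim + m_dim, 512)
        # The heads see m_t again so that simple functions of it (e.g., the
        # negated target value estimate) are easy to represent.
        self.policy_head = nn.Linear(512 + m_dim, n_actions)
        self.value_head = nn.Linear(512 + m_dim, 1)

    def _encoded_dim(self, obs_channels):
        with torch.no_grad():
            dummy = torch.zeros(1, obs_channels, 72, 96)
            return self.encoder(dummy).shape[1]

    def forward(self, obs, m_t):
        hidden = torch.relu(self.fc1(torch.cat([self.encoder(obs), m_t], dim=1)))
        hidden = torch.cat([hidden, m_t], dim=1)
        return self.policy_head(hidden), self.value_head(hidden)
        </preformat>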
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Identifying Vulnerabilities</title>
        <p>
          Environment: We use the two-player Google Research Football environment (Gfootball) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Each agent in the environment controls a team of 11 football (soccer) players. The states are 72 × 96 × 4 pixels, with the four channels encoding the left team positions, right team positions, ball position, and active player position. Observations were stacked over four timesteps to give a perception of time, resulting in observations of 72 × 96 × 16 pixels. The agents' policy networks had a ResNet architecture [45], and the action space was discrete with size 19. We used the same reward shaping as in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], in which an agent was rewarded 1 for scoring, -1 for being scored on, and 0.1 for advancing the ball one tenth of the way down the field. We trained all Gfootball agents using Proximal Policy Optimization (PPO) [43] with the Stable Baselines 2 implementation [46].
        </p>
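        <p>As an illustration of the reward shaping described above (and not the environment's own implementation), the per-step shaped reward can be computed roughly as follows; the function signature and bookkeeping are assumptions for the sketch.</p>
        <preformat>
# An illustrative re-implementation (not Gfootball's own code) of the shaped
# reward described above: +1 for scoring, -1 for conceding, and +0.1 each time
# the ball first advances past another tenth of the field.
def shaped_reward(score_change, ball_progress, checkpoints_reached):
    """score_change: +1, -1, or 0 for this step.
    ball_progress: fraction of the field advanced toward the opponent goal (0 to 1).
    checkpoints_reached: number of 0.1-wide checkpoints already rewarded."""
    reward = float(score_change)
    new_checkpoints = int(ball_progress * 10)
    if new_checkpoints > checkpoints_reached:
        reward += 0.1 * (new_checkpoints - checkpoints_reached)
        checkpoints_reached = new_checkpoints
    return reward, checkpoints_reached
        </preformat>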
        <p>Target Agents: First, we trained target agents to develop adversarial policies against. For Gfootball, this was done in two stages for a total of 50 million timesteps. First, the targets were trained against a 'bot' agent for 25 million timesteps with an entropy reward to encourage exploration. Second, they were trained for another 25 million timesteps against an agent from the first phase with an entropy penalty to encourage more deterministic play. We found this to result in more consistent behavior from adversaries. Fig. 3a shows the learning curves for these targets.</p>
        <p>Adversaries: We trained four types of adversaries, each of which observes different information, m_t, from the target's internal state (a sketch of how such observations can be assembled follows the list):</p>
        <p>1. Black-Box Control: m_t = ∅. This is the same threat model used by [16], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and others mentioned in Section 2.</p>
        <p>2. Action &amp; Value: m_t = Δ^tgt(s_t) ⊕ v^tgt(s_t), where ⊕ is the concatenation operator. Here, the adversary sees the scalar value and an |A^tgt|-sized observation giving the target agent's distribution over discrete output actions.</p>
        <p>3. Latent: m_t = ℓ_t, where ℓ_t gives the latent activations from some layer during the forward pass through the target's network from s_t. Here, we use those of the final layer, from which both the target agent's actions and value are computed.</p>
        <p>4. Full: m_t = Δ^tgt(s_t) ⊕ v^tgt(s_t) ⊕ ℓ_t. This combines the Action &amp; Value and Latent threat models.</p>
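        <p>The sketch below illustrates, under assumptions, how the extra observation m_t could be assembled for each of the four adversary types by querying a fixed target policy; the TargetPolicy interface used here (action_distribution, value, final_layer_latents) is a hypothetical stand-in rather than an API from the codebase.</p>
        <preformat>
# A hypothetical sketch of assembling the white-box observation m_t for the
# four adversary types by querying a fixed target policy.
import numpy as np

def make_m_t(target, state, mode="full"):
    """Return the extra observation vector m_t for one timestep."""
    if mode == "black_box":           # 1. Black-Box Control: m_t is empty.
        return np.zeros(0, dtype=np.float32)
    parts = []
    if mode in ("act_val", "full"):   # 2. and 4.: action distribution and value
        parts.append(np.asarray(target.action_distribution(state), dtype=np.float32))
        parts.append(np.array([target.value(state)], dtype=np.float32))
    if mode in ("latent", "full"):    # 3. and 4.: final-layer latent activations
        parts.append(np.asarray(target.final_layer_latents(state), dtype=np.float32))
    return np.concatenate(parts)

# The adversary then conditions on both the state and m_t, i.e. pi_adv(s_t, m_t),
# by concatenating m_t into its own observation.
        </preformat>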
        <p>Results: We train each adversary for 50 million timesteps. Fig. 3b shows the training curves for these attackers. All improve significantly over the black-box control, both by having faster initial learning and a higher asymptotic performance. The two types of white-box adversaries that could observe the target's latents performed the best; both do as well after 5 million timesteps as the black-box control does after 50 million. For the action/value, latent, and full attacks, the p values from one-sided t-tests for the hypothesis that they were superior to the black-box controls were 0.00638, 0.00001, and 0.00002 respectively, demonstrating clear improvements.</p>
        <p>Figure 3: (a) Training performance of the Gfootball target agents (net points per game, n=20) over 50 million timesteps. (b) Training curves for the Act/Val/Latent, Latent, Act/Val, and Black-Box Control adversaries, with p values from one-sided t-tests against the control (Act/Val/Latent: 2e-05; Latent: 1e-05; Act/Val: 0.00638).</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Improving Robustness</title>
        <p>Environment: To evaluate white-box robust adversarial reinforcement learning (RARL), we used the HalfCheetah-v3 and Hopper-v3 Mujoco environments from OpenAI Gym [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In both environments, the agent controls a body in a 3D simulated physics environment. Observations are continuous-valued vectors specifying the position of the body, and actions are continuous-valued vectors for controlling it. The agents' policy networks had a small MLP architecture with two hidden layers of 256 neurons each. We trained all Gym agents using SAC [44] with the Stable Baselines 3 implementation [47].</p>
        <p>Training: In alternation, we trained a target agent and an ensemble of three adversaries that perturbed the target's actions. For each training episode for the target, a random adversary from the three was chosen to make the perturbations. We experiment with three methods (a schematic of the alternating training loop follows the list):</p>
        <p>1. RL Control: The target agent is trained with no adversary.</p>
        <p>2. RARL: The target agent is trained against an ensemble of black-box adversarial agents. This is the approach used by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].</p>
        <p>3. Latent/Action White-Box RARL (WB-RARL): The target agent is trained against an ensemble of white-box adversaries that each observe its latent activations from the penultimate layer of the policy network and its action outputs. Thus, m_t = Δ^tgt(s_t) ⊕ ℓ_t.</p>
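        <p>A schematic of the alternating training loop described above is given below; it is a sketch under assumptions (the train_target_episode and train_adversary callables stand in for the underlying SAC updates), not the code used for the experiments.</p>
        <preformat>
# A schematic sketch (not the authors' code) of alternating RARL training with
# an ensemble of three action-perturbing adversaries chosen at random per episode.
import random

def rarl_training(target, adversaries, train_target_episode, train_adversary,
                  n_iterations=100, episodes_per_phase=10):
    """Alternate between improving the target and improving the adversaries."""
    for _ in range(n_iterations):
        # Target phase: for each episode a random adversary from the ensemble
        # perturbs the target's actions (white-box adversaries also see m_t).
        for _ in range(episodes_per_phase):
            adversary = random.choice(adversaries)
            train_target_episode(target, adversary)
        # Adversary phase: each adversary trains against the current target.
        for adversary in adversaries:
            train_adversary(adversary, target)
    return target
        </preformat>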
        <p>Results: We trained a total of 40 agents of each type for 2 million timesteps and selected the 20 with the best final performance. Fig. 4a shows the evaluation performance for the HalfCheetah and Hopper agents in an adversary-free environment over the course of training. Performance is comparable between all three conditions, with the RL controls seeming to perform the best in HalfCheetah. To test the robustness of the learned policies, we use the same approach as [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: after RARL, we test on a set of adversary-free environments with the transition dynamics altered. We selected a range of 8 mass and 8 friction coefficients to modify the environment dynamics and tested the agents on all 8 × 8 combinations. The full arrays of results are shown in Fig. 5 in Appendix A.2, and the mean results over all friction and mass coefficients are plotted in Fig. 4b-c respectively.</p>
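        <p>The robustness evaluation can be sketched as follows, assuming the older Gym step/reset API used with the -v3 MuJoCo environments and the usual model.body_mass and model.geom_friction attributes; the multiplier values mirror those in Fig. 4, but the rest is illustrative.</p>
        <preformat>
# A rough sketch (not the authors' evaluation code) of the 8 x 8 robustness
# grid: each test environment scales body masses and friction coefficients by
# fixed multipliers, and the trained policy is run with no adversary present.
import gym
import numpy as np

MASS_MULTS = (0.2, 0.3, 0.4, 0.5, 1.05, 1.1, 1.15, 1.2)   # per Fig. 4b
FRICTION_MULTS = (0.2, 0.3, 0.4, 0.5, 1.4, 1.6, 1.8, 2.0)  # per Fig. 4c

def evaluate_grid(policy, env_id="HalfCheetah-v3", episodes=3):
    """Return an 8 x 8 array of mean episode rewards over altered dynamics."""
    results = np.zeros((len(MASS_MULTS), len(FRICTION_MULTS)))
    for i, mass_mult in enumerate(MASS_MULTS):
        for j, fric_mult in enumerate(FRICTION_MULTS):
            env = gym.make(env_id)
            model = env.unwrapped.model  # assumed MuJoCo model attributes
            model.body_mass[:] = model.body_mass * mass_mult
            model.geom_friction[:] = model.geom_friction * fric_mult
            returns = []
            for _ in range(episodes):
                obs, done, total = env.reset(), False, 0.0
                while not done:
                    obs, reward, done, _ = env.step(policy(obs))
                    total += reward
                returns.append(total)
            results[i, j] = np.mean(returns)
            env.close()
    return results
        </preformat>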
        <p>Figure 4: (a) Adversary-free evaluation performance over 2 million training timesteps for the RL, RARL, and Latent/Action WB-RARL agents in HalfCheetah and Hopper. (b) Mean test reward as a function of the mass multiplier (0.2x to 1.2x). (c) Mean test reward as a function of the friction multiplier (0.2x to 2.0x).</p>
        <p>In Fig. 4b-c, WB-RARL agents generally perform as well as or better than the other two, and on average, WB-RARL performs the best over all testing environments. For RL, RARL, and WB-RARL, the HalfCheetah agents achieve mean episode rewards of 902, 914, and 1019, and the Hopper agents achieve 673, 645, and 716 respectively. We performed four one-sided t-tests for the hypothesis that the WB-RARL agents had superior overall testing performance. For HalfCheetah, the p values were 0.085 and 0.111 for comparing the WB-RARL agents to the RL and RARL agents respectively. For Hopper, the corresponding p values were 0.095 and 0.009. These suggest that the WB-RARL agents are more robust to these domain shifts.</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>5. Discussion and Broader Impact</title>
      <p>Our goal in this work is to better understand opportunities from adversarial policies in reinforcement learning by studying white-box adversarial attackers. We show that allowing an adversarial policy to observe the internal state of the target agent can result in (1) better initial and asymptotic performance for adversarial attackers and (2) more effective adversarial training for improving the robustness of a learned policy. These results suggest that using white-box adversarial policies to identify and correct flaws with reinforcement learners may be a useful strategy for developing safer, more reliable reinforcement learning systems.</p>
      <p>More generally, our results show that information about an agent's internal state offers useful information for other agents interacting with it. This may be the case regardless of whether the setting is adversarial, cooperative, or indifferent. In multiagent settings, it is important to bear in mind that a policy which makes use of white-box information from another agent need not be implemented by nor against a conventional reinforcement learner. On one hand, policies can be developed without standard reinforcement learning algorithms (e.g., PPO or SAC). For example, human video game players constantly develop strategies to exploit the weaknesses of computer-controlled competitors to great effect. On the other hand, so long as a target agent computes "actions" via latent information, this information could be given to other agents seeking to interact with it. One case in which using adversarial policies against non-reinforcement-learners can be useful is for finding flaws in language models. The inability to differentiate through the sampling of discrete textual tokens makes the task of finding failure modes for language models one that adversarial policies can be useful for (e.g. [48]).</p>
      <p>Future work on versions of white-box adversarial policies
for debugging language models may be useful.</p>
      <p>Concerning adversarial attacks in particular, one risk
of any work that focuses on attack methods is that they
could be used for malicious attacks. This is an important
concern, but we emphasize that it is better to develop an
understanding of adversarial vulnerabilities through
exploratory research than from incidents in the real world.</p>
      <p>We also stress the benefits of adversarial training and the fact that white-box access to an agent can be kept from malicious attackers if appropriate measures are taken.</p>
      <p>For this reason, we expect white-box adversarial policies
to be much more practical for those working to make
systems more robust than for malicious attackers.</p>
      <p>A limitation is that while we show that white-box attacks can be useful, the improvements from granting the adversary white-box access in the RARL experiments were only modest. And even though white-box attacks can help train adversarial policies more quickly, these attacks may still demand many timesteps. Future work on similar black-box attacks that use a model of the target learned from black-box (and potentially even offline) access may be valuable. Studying ways to more effectively leverage target agent information in fewer training timesteps may also be useful. Additional progress like this toward better understanding opportunities from adversaries in reinforcement learning will be a promising direction for expanding the toolbox for safer and more trustworthy AI.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgments</title>
      <p>We thank Lucas Janson for valuable ideas and feedback throughout the course of this work. We also appreciate discussions with Adam Gleave and Pavel Czempin.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix</title>
      <sec id="sec-5-1">
        <title>A.1. Understanding Adversarial Policies</title>
        <p>
          The notion of an adversary for a deep learning system was popularized by [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ] and subsequent research. These works developed adversarial images that are both effective, meaning that they fool an image classifier, and subtle, meaning that they only differ from a benign image by a very small-norm perturbation. While they often transfer to other models [49, 50, 51, 52], these adversaries are also typically target-specific in the sense that they are created specifically to fool a particular model.
        </p>
        <p>
          As in supervised learning, "effectiveness" is used as part of the definition for adversarial policies across the literature. "Target-specificity" sometimes is, but many RL works (e.g., [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]), including ours, do not require an adversary to be target-specific. Finally, "subtlety" has not been adopted as a standard for adversary research in RL. A notion of subtlety for adversaries in RL that would be analogous to adversaries in supervised learning would be that the adversary produces distributions over actions or trajectories that are very similar to a benign agent's. However, in this and all related work in RL of which we know, no notion of subtlety is part of the definition of an adversarial policy. So ultimately, we use "adversarial" here to simply refer to a policy which is good at beating a target.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>A.2. Full Robust Adversarial Reinforcement Learning Results</title>
        <p>As discussed in Section 4.2, we tested agents on environments with altered mass and friction parameters. For both the HalfCheetah and Hopper environments, we used a set of 8 × 8 different mass and friction values. Testing results across all testing environments for the RL control, RARL, and WB-RARL agents are shown in Fig. 5. Under each grid, the mean of all results in the grid is displayed. Under the RL and RARL grids (columns 1 and 2), the p value from a one-sided t-test for the hypothesis that WB-RARL is superior to RL and RARL respectively is shown.</p>
        <p>Figure 5: Full testing results for the RL control, RARL, and WB-RARL agents across all 8 × 8 combinations of altered mass and friction parameters, with grid means and p values displayed under the grids.</p>
      </sec>
      <sec id="sec-5-3">
        <title>A.3. High-Level Summary</title>
        <p>Here, we provide a summary of this work which does not assume that the reader has a technical background. "Reinforcement learning" (RL) is the process by which an agent learns, via some formalized process of trial and error, to accomplish a goal. Humans are reinforcement learners, and so are some algorithms that are commonly studied in machine learning research today. For example, it is common to use reinforcement learning algorithms to train AI systems to play video games. Using experience, they can infer what types of actions lead to higher scores and adjust their behavior accordingly.</p>
        <p>Multiagent RL describes settings in which there is more than one agent acting. Past research has shown that in multiagent settings, training "adversarial" reinforcement learners to make other reinforcement learners fail can be useful. On one hand, an adversarial agent can often learn to act in a way that renders the "target" agent unable to accomplish its goals. For example, an adversary can sometimes act in ways that make a target in a two-player video game seem to take actions that are as bad as, or even worse than, random ones. On the other hand, training a target agent against an adversarial agent can make it more robust to some failures. For example, this might make the target particularly effective at avoiding failures due to changes to its environment.</p>
        <p>In this work, we study a new approach to adversarial attacks and adversarial training in RL. We experiment with "white-box" attacks in which the adversary can observe the internal state of the target. For humans, this would be analogous to one person playing a game against someone else while being able to view scans of their brain. We show that these white-box adversarial agents are more effective than controls for both attacks and adversarial training. We argue that this helps us to better understand opportunities from adversarial RL. And based on these results, we argue that white-box adversaries may be very useful for discovering and correcting flaws in reinforcement learners.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6199</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <article-title>Explaining and harnessing adversarial examples</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6572</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ilahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Usama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Janjua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>AlFuqaha</surname>
          </string-name>
          , D. T. Huang,
          <string-name>
            <given-names>D.</given-names>
            <surname>Niyato</surname>
          </string-name>
          ,
          <article-title>Challenges and countermeasures for adversarial attacks on deep reinforcement learning</article-title>
          ,
          <source>IEEE Transactions on Artificial Intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Najafirad</surname>
          </string-name>
          ,
          <article-title>Opportunities and challenges in deep learning adversarial robustness: A survey</article-title>
          , arXiv preprint arXiv:
          <year>2007</year>
          .
          <volume>00753</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gleave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dennis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Adversarial policies: Attacking deep reinforcement learning</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>10615</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Doster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Attarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brandenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hodas</surname>
          </string-name>
          ,
          <article-title>The effect of antagonistic behavior in reinforcement learning</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukthankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Robust adversarial reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2817</fpage>
          -
          <lpage>2826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhambri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tulasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Buduru</surname>
          </string-name>
          ,
          <article-title>A survey of black-box adversarial attacks on computer vision models</article-title>
          , arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>01667</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kurach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raichuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stańczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zając</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espeholt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Riquelme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , M. Michalski,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          , et al.,
          <article-title>Google research football: A novel reinforcement learning environment</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>4501</fpage>
          -
          <lpage>4510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Vinitsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Parvate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bayen</surname>
          </string-name>
          ,
          <article-title>Robust reinforcement learning using adversarial populations</article-title>
          , arXiv preprint arXiv:
          <year>2008</year>
          .
          <year>01825</year>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pettersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , W. Zaremba, Openai gym,
          <source>arXiv preprint arXiv:1606.01540</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          , W. Hsu,
          <article-title>Adversarial exploitation of policy imitation</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>01121</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Adversarial policy learning in two-player competitive games</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3910</fpage>
          -
          <lpage>3919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Adversarial policy training against deep reinforcement learning</article-title>
          ,
          <source>in: 30th USENIX Security Symposium (USENIX Security 21)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1883</fpage>
          -
          <lpage>1900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards comprehensive testing on the robustness of cooperative multi-agent reinforcement learning</article-title>
          , in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 115–122.
        </mixed-citation>
      </ref>
      <ref id="ref16"><mixed-citation>[16] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent complexity via multi-agent competition, arXiv preprint arXiv:1710.03748 (2017).</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] A. Pozanco, S. Fernández, D. Borrajo, et al., Anticipatory counterplanning, arXiv preprint arXiv:2203.16171 (2022).</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] P. Dasgupta, Using options to improve robustness of imitation learning against adversarial attacks, in: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, volume 11746, International Society for Optics and Photonics, 2021, p. 1174610.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] P. Czempin, A. Gleave, Reducing exploitability with population based training, arXiv preprint arXiv:2208.05083 (2022).</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] T. Fujimoto, T. Doster, A. Attarian, J. Brandenberger, N. Hodas, Reward-free attacks in multi-agent reinforcement learning, arXiv preprint arXiv:2112.00940 (2021).</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] H. Shioya, Y. Iwasawa, Y. Matsuo, Extending robust adversarial reinforcement learning considering adaptation and diversity (2018).</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] X. Pan, D. Seita, Y. Gao, J. Canny, Risk averse robust adversarial reinforcement learning, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8522–8528.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] K. L. Tan, Y. Esfandiari, X. Y. Lee, S. Sarkar, et al., Robustifying reinforcement learning agents via action space adversarial training, in: 2020 American Control Conference (ACC), IEEE, 2020, pp. 3959–3964.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] P. Zhai, J. Luo, Z. Dong, L. Zhang, S. Wang, D. Yang, Robust adversarial reinforcement learning with dissipation inequation constraint (2022).</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] K. Zhang, B. Hu, T. Basar, On the stability and convergence of robust adversarial reinforcement learning: A case study on linear quadratic systems, Advances in Neural Information Processing Systems 33 (2020) 22056–22068.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, G. Chowdhary, Robust deep reinforcement learning with adversarial attacks, arXiv preprint arXiv:1712.03632 (2017).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] T. Oikarinen, W. Zhang, A. Megretski, L. Daniel, T.-W. Weng, Robust deep reinforcement learning through adversarial loss, Advances in Neural Information Processing Systems 34 (2021).</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] L. Schott, M. Césaire, H. Hajri, S. Lamprier, Improving robustness of deep reinforcement learning agents: Environment attacks based on critic networks, arXiv preprint arXiv:2104.03154 (2021).</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] B. Lütjens, M. Everett, J. P. How, Certified adversarial robustness for deep reinforcement learning, in: Conference on Robot Learning, PMLR, 2020, pp. 1328–1337.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] E. Korkmaz, Adversarially trained neural policies in the fourier domain, in: ICML 2021 Workshop on Adversarial Machine Learning, 2021.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] E. Korkmaz, Investigating vulnerabilities of deep neural policies, in: Uncertainty in Artificial Intelligence, PMLR, 2021, pp. 1661–1670.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] J. Kos, D. Song, Delving into adversarial attacks on deep policies, arXiv preprint arXiv:1705.06452 (2017).</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] A. Davidson, Using artificial neural networks to model opponents in texas hold'em, Unpublished manuscript (1999).</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] A. J. Lockett, C. L. Chen, R. Miikkulainen, Evolving explicit opponent models in game playing, in: Proceedings of the 9th annual conference on Genetic and evolutionary computation, 2007, pp. 2106–2113.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] H. He, J. Boyd-Graber, K. Kwok, H. Daumé III, Opponent modeling in deep reinforcement learning, in: International conference on machine learning, PMLR, 2016, pp. 1804–1813.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] V. Behzadan, W. Hsu, Rl-based method for benchmarking the adversarial resilience and robustness of deep reinforcement learning policies, in: International Conference on Computer Safety, Reliability, and Security, Springer, 2019, pp. 314–325.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] Y. Faghan, N. Piazza, V. Behzadan, A. Fathi, Adversarial attacks on deep algorithmic trading policies, arXiv preprint arXiv:2010.11388 (2020).</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] J. Y. Halpern, R. Pass, Game theory with translucent players, International Journal of Game Theory 47 (2018) 949–976.</mixed-citation></ref>
      <ref id="ref39"><mixed-citation>[39] A. Demski, S. Garrabrant, Embedded agency, arXiv preprint arXiv:1902.09469 (2019).</mixed-citation></ref>
      <ref id="ref40"><mixed-citation>[40] A. Critch, A parametric, resource-bounded generalization of löb's theorem, and a robust cooperation criterion for open-source game theory, The Journal of Symbolic Logic 84 (2019) 1368–1381.</mixed-citation></ref>
      <ref id="ref41"><mixed-citation>[41] S. Casper, Achilles heels for agi/asi via decision theoretic adversaries, arXiv preprint arXiv:2010.05418 (2020).</mixed-citation></ref>
      <ref id="ref42"><mixed-citation>[42] A. Critch, M. Dennis, S. Russell, Cooperative and uncooperative institution designs: Surprises and problems in open-source game theory, arXiv preprint arXiv:2208.07006 (2022).</mixed-citation></ref>
      <ref id="ref43"><mixed-citation>[43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).</mixed-citation></ref>
      <ref id="ref44"><mixed-citation>[44] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, arXiv preprint arXiv:1801.01290 (2018).</mixed-citation></ref>
      <ref id="ref45"><mixed-citation>[45] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.</mixed-citation></ref>
      <ref id="ref46"><mixed-citation>[46] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, Stable baselines, https://github.com/hill-a/stable-baselines, 2018.</mixed-citation></ref>
      <ref id="ref47"><mixed-citation>[47] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (2021) 1–8. URL: http://jmlr.org/papers/v22/20-1364.html.</mixed-citation></ref>
      <ref id="ref48"><mixed-citation>[48] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, G. Irving, Red teaming language models with language models, arXiv preprint arXiv:2202.03286 (2022).</mixed-citation></ref>
      <ref id="ref49"><mixed-citation>[49] N. Papernot, P. McDaniel, I. Goodfellow, Transferability in machine learning: from phenomena to black-box attacks using adversarial samples, arXiv preprint arXiv:1605.07277 (2016).</mixed-citation></ref>
      <ref id="ref50"><mixed-citation>[50] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).</mixed-citation></ref>
      <ref id="ref51"><mixed-citation>[51] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, P. McDaniel, The space of transferable adversarial examples, arXiv preprint arXiv:1704.03453 (2017).</mixed-citation></ref>
      <ref id="ref52"><mixed-citation>[52] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Madry, Adversarial examples are not bugs, they are features, arXiv preprint arXiv:1905.02175 (2019).</mixed-citation></ref>
    </ref-list>
  </back>
</article>