<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>White-Box Adversarial Policies in Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Casper</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dylan Hadfield-Menell</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Kreiman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Boston Children's Hospital</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Brains, Minds, and Machines</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Adversarial examples can be useful for developing safer AI both by identifying vulnerabilities in a model and improving its robustness via adversarial training. In reinforcement learning, adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box attacks where the adversary only sees the state observations and effectively treats the target agent as any other part of the environment. In this work, we study white-box adversarial policies to understand whether an agent's internal state can offer useful information for other agents. We make three contributions. First, we introduce white-box adversarial policies in which an attacker can observe a target agent's internal state at each timestep. Second, we demonstrate that white-box adversarial policies are more effective at finding weaknesses in a target agent, resulting in both faster initial learning and higher asymptotic performance. Third, we show that training against white-box adversarial policies can be used to make learners in single-agent environments more robust to domain shifts. Code is available at https://github.com/thestephencasper/white_box_rarl.</p>
      </abstract>
      <kwd-group>
        <kwd>Adversarial attacks</kwd>
        <kwd>Adversarial training</kwd>
        <kwd>Robustness</kwd>
        <kwd>Reinforcement learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ability to discover and correct flaws with models is key for safer AI. One approach to this is to construct adversarial attacks, inputs specifically crafted to make a system fail, and to train models against them. Adversarial attacks have been widely studied in supervised learning [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, compared to supervised learning, reinforcement learning (RL) agents can face an expanded set of threats [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], including adversarial policies from other agents.
      </p>
      <p>
        Adversarial policies have been used both to attack target agents [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and to improve their robustness through adversarial training [7]. However, the standard approach for developing them has been to simply train an attacker against a black-box target until the attacker (over)fits a policy that minimizes the target's reward. This black-box approach sometimes works well, but it fails to utilize any information beyond what the attacker can directly observe, thus treating the target as any other part of the environment. This approach also typically requires cheap query access to the target, often for many millions of timesteps. Thus, we set out to expand on the conventional threat model with adversarial policies that exploit richer information from the target, known as white-box attacks, in order to encourage more robust performance.
      </p>
      <p>Figure 1: White-box adversarial policies. At each timestep, both the adversary (adv) and target (tgt) observe the state s_t. The adversary also observes information from the internal state of the target and concatenates this extra information, m_t, into its observations. We demonstrate how this type of white-box adversarial policy is more useful than black-box ones for identifying vulnerabilities using attacks and improving robustness using adversarial training.</p>
      <p>
        The analog to training a black-box adversarial policy in supervised learning would be to make a zero-order search through a model's input space to find examples that make it fail. While black-box attacks like these have been studied in supervised learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], they are much less effective and query-efficient than white-box ones, which permit access to the model's internal state. Thus, here we study how using information from the target can help an attacker learn an adversarial policy more quickly and effectively. Our version of white-box attacks are adversarial policies that can "read the target's mind." Fig. 1 depicts our general approach. At each timestep, both the adversary and target observe the state s_t. The adversary, however, is also able to observe internal information, m_t, from the target agent. In our experiments, m_t is a vector that consists of the target's action distribution Δ^tgt(s_t), value estimate v^tgt(s_t), and/or latent activations ℓ_t.
      </p>
      <p>
        Specifically, we test this approach in two different settings. First, we test adversarial attacks using the two-player Google Research Football (Gfootball) environment [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and large convolutional policy networks. Both the adversary's and the target's actions are passed into the environment's step function. This setup is illustrated in Fig. 2a. Here, we show that white-box attackers are better for identifying weaknesses in the target agent, achieving both higher initial and asymptotic performance than black-box baselines. Second, we adopt the robust adversarial reinforcement learning (RARL) approach from [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ] for experiments in single-player Mujoco environments (HalfCheetah and Hopper) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with small fully-connected policy networks. The adversary acts by perturbing the target agent's actions. This is shown in Fig. 2b. Here, we find that white-box adversaries can be more useful for training robust agents whose policies are not only more robust to the adversary but also generalize better to environments with altered transition dynamics. Given these results, we argue that adversarial policies that exploit inner information from the target agent pose greater opportunities for identifying and correcting weaknesses in reinforcement learners. More generally, our results demonstrate that observations from an agent's internal state can be useful for other agents that interact with it. Following a discussion of related works in Section 2, Section 3 details our threat model and methods. Section 4 presents results, and Section 5 offers a discussion. For a high-level explanation and summary, see the Appendix. Code is available at https://github.com/thestephencasper/white_box_rarl.
      </p>
      <p>Figure 2: The two experimental setups and policy network architectures. (a) Attacks: the two-player Gfootball environment, where the adversary's convolutional policy network receives the state s_t together with the target-derived vector m_t (action distribution, value estimate, and/or latent activations ℓ_t) injected at its dense layers, and both agents' actions are passed to the environment. (b) Robustness: single-player Mujoco environments with small dense (MLP) policy networks, where the adversary perturbs the target agent's actions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Adversarial Policies: Reinforcement learning agents can be vulnerable to several types of adversarial threats, including input perturbations, action perturbations, reward perturbations, environments, and policies from other agents. Both [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] offer surveys of threats and defenses. Our focus is on adversarial policies. Conventionally, these attacks have been developed by simply training the adversary against the fixed target agent's policy. This approach has been used by [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15 ref5 ref6">12, 5, 6, 13, 14, 15</xref>
        ] for attacks. These adversaries were even observed unintentionally by [16] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], who found that in competitive multiagent environments it was key to rotate players in a round-robin fashion to avoid agents overfitting against a particular opponent. Additionally, [17] introduced an approach based on planning, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] tested the detectability of adversarial policies, [
        <xref ref-type="bibr" rid="ref5">5, 18</xref>
        ] explored defense techniques via observing the attacker and using option-based policies respectively, [14, 19] experimented with defenses via adversarial training, and [
        <xref ref-type="bibr" rid="ref6">6, 20</xref>
        ] offered methods of attacking a target whose reward is unknown.
      </p>
      <p>
        Meanwhile, [
        <xref ref-type="bibr" rid="ref10 ref7">7, 21, 22, 10, 23, 24</xref>
        ] have studied Robust Adversarial Reinforcement Learning (RARL), in which an agent is trained alongside an adversarial policy that perturbs its state or actions in order for the agent to learn more robust control. [25] studied the stability of this approach. Others [26, 27, 28] have adversarially trained agents under observation or environment perturbations. To the best of our knowledge, however, no works to date have studied white-box attacks or white-box RARL in modern reinforcement learning contexts.
      </p>
      <p>
        Black vs. White-box Attacks: In supervised learning, adversarial attacks are simple to make using white-box access to the target's internal weights. Black-box attacks, however, typically require transfer, zero-order optimization, or gradient estimation, and they are usually less successful [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Several others, including [26, 29, 30, 31, 27], have studied attacks against reinforcement learners based on perturbing the target agent's observations. [32] further demonstrated the use of a target's internal state by using the value function for scheduling maximally effective adversarial observation perturbations. These types of attacks require an attacker to have the ability to manipulate agent observations and involve propagating the gradient for an adversarial objective through the policy network. In contrast, our white-box adversarial policies only differ from the black-box ones from related work in whether the attacker, a reinforcement learner, can observe the target's internal state. Several works [33, 34, 35, 36, 37] have also trained agents with a theory of mind for their opponent in competitive tasks, but only in very simple tabular or cartpole environments. To our knowledge, we are the first to introduce policies which can exploit internal information from a target in complex environments.
      </p>
      <p>
        Open-Source Decision Making: We study targets whose policies are transparent to other agents in the environment. Agents with open-source policies pose a number of challenges and pitfalls for decision-making. Several works formalize these challenges in the context of decision theory or game theory [38, 39, 40, 41, 42]. Our work adds to this by empirically studying one such challenge: attacks in reinforcement learning.
      </p>
    </sec>
    <sec id="sec-2a">
      <title>3. Methods</title>
      <sec id="sec-2a-1">
        <title>3.1. Framework</title>
        <p>
          We consider the goal of training an adversary against a target inside of a two-player Markov Decision Process (MDP) defined by a 6-tuple (S, {A^adv, A^tgt}, T, P_0, γ, {R^adv, R^tgt}) with a state set and action sets for the adversary and target, T : S × A^adv × A^tgt → Δ(S) a state transition function which outputs a distribution Δ(S) over S, P_0 an initial state distribution, γ a temporal discount factor, and R^adv and R^tgt reward functions for the adversary and target s.t. R^adv, R^tgt : S × A^adv × A^tgt × S → ℝ. We assume R^adv ≈ −R^tgt for all transitions. We only run experiments in which the target's policy is fixed, so the two-player MDP reduces to a single-player one. We will use π^adv : S → Δ(A^adv) and π^tgt : S → Δ(A^tgt) to denote the policies of the adversary and target, and V^adv, V^tgt : S → ℝ to refer to their value functions.
        </p>
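        <p>For concreteness, the objects above can be written out as follows; the symbol names are our notation for the quantities described in the text rather than markup recovered from the original.</p>
        <preformat>
% Two-player MDP used in this work, in LaTeX notation (symbol names assumed).
\left( \mathcal{S},\; \{\mathcal{A}^{\mathrm{adv}}, \mathcal{A}^{\mathrm{tgt}}\},\; T,\; P_0,\; \gamma,\; \{R^{\mathrm{adv}}, R^{\mathrm{tgt}}\} \right),
\quad T : \mathcal{S} \times \mathcal{A}^{\mathrm{adv}} \times \mathcal{A}^{\mathrm{tgt}} \to \Delta(\mathcal{S}),
\quad R^{\mathrm{adv}}, R^{\mathrm{tgt}} : \mathcal{S} \times \mathcal{A}^{\mathrm{adv}} \times \mathcal{A}^{\mathrm{tgt}} \times \mathcal{S} \to \mathbb{R},
\quad R^{\mathrm{adv}} \approx -R^{\mathrm{tgt}},
\qquad \pi^{\mathrm{adv}} : \mathcal{S} \to \Delta(\mathcal{A}^{\mathrm{adv}}),
\quad \pi^{\mathrm{tgt}} : \mathcal{S} \to \Delta(\mathcal{A}^{\mathrm{tgt}}),
\quad V^{\mathrm{adv}}, V^{\mathrm{tgt}} : \mathcal{S} \to \mathbb{R}.
        </preformat>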
      </sec>
      <sec id="sec-2a-2">
        <title>3.2. Threat Model</title>
        <p>There are multiple notions that have been used in supervised and reinforcement learning to characterize an adversary. These include being effective at making the target fail, being subtle and hard for an observer to detect (e.g., [32]), and being target-specific (e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Here, we use the first criterion and consider any policy that is effective at making another fail to be adversarial. For further discussion, see Appendix A.1.</p>
      <p>Previous works discussed in Section 2 have assumed a threat model in which the adversary only has black-box access to the target but can cheaply train against it for many timesteps. We both strengthen and weaken this. First, we make the permissive assumption that the adversary can observe the target's internal state at each timestep and is able to use this information as an observation in the same timestep (see Section 3.3 for details). This could be a plausible assumption if a malicious attacker could obtain access to a target agent's policy parameters, especially if its designers make the target open-source.</p>
      <p>However, a more realistic case for safety-critical settings in which an attacker may have white-box access to a target agent is if the agent's developers use white-box access to it to find and correct flaws in the agent's policy.</p>
      <p>Second, we consider the restrictive assumption that the number of timesteps for which the adversary can train against the target may be limited. Realistically, this could be the case if gathering experience is limited or costly for any reason.</p>
      </sec>
      <sec id="sec-2a-3">
        <title>3.3. White-Box Adversarial Policies</title>
        <p>We train policies using Proximal Policy Optimization (PPO) [43] and Soft Actor Critic (SAC) [44]. Both involve training a value function estimator alongside the policy.</p>
        <p>We consider attackers that have access to (1) the target agent's action outputs, (2) its value estimate, and/or (3) the internal activations from its policy network. Our goal for (1) is to give the adversary a glimpse of the near future so that it can better counter the target agent's behavior. Our goal for (2) is to make it easier for the attacker to quickly learn its own value function because V^adv(s_t) ≈ −V^tgt(s_t); note that this is only possible for targets that have a critic. Finally, our goal for (3) is to give the adversary rich and generally useful information about how the target represents the state.</p>
        <p>At timestep t, the environment state s_t is observed. The target processes the state and produces an action a_t^tgt ∼ π^tgt(s_t). At the same time, the white-box adversary queries the target to get its action output Δ^tgt(s_t), value estimate v^tgt(s_t), and/or latent activations ℓ^tgt(s_t) in the form of a vector m^tgt(s_t). In a slight abuse of notation, we refer to ℓ^tgt(s_t) as ℓ_t and m^tgt(s_t) as m_t. Thus, the adversary's policy function can be written as π^adv(s_t) = π^adv(s_t, m_t), and its value estimate can be written as V^adv(s_t) = V^adv(s_t, m_t).</p>
        <p>We train both adversaries that use large convolutional neural networks (CNNs) and adversaries that use small multilayer perceptrons (MLPs) as policy networks. These architectures are illustrated in Fig. 2. For the large CNNs, we concatenate m_t into the representation of the state twice: once at the first fully-connected layer and once at the last. We do this so that the adversary can readily learn both complex and simple functions of m_t. In particular, we hypothesized that giving the adversary the target's value estimate in its final layer is helpful for learning its own value estimator, which ought to be approximately the negative of the target's. For the small MLP policy networks, we only concatenate m_t with the observation once at the beginning for efficiency.</p>
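        <p>The following is a minimal PyTorch sketch, not the implementation used for the experiments, of an adversary policy network that injects m_t at both the first dense layer and again just before the output heads as described above. The encoder, layer sizes, and the m_t dimensionality are illustrative assumptions.</p>
        <preformat>
# A minimal PyTorch sketch (not the authors' implementation) of a white-box
# adversary policy network that injects the target-derived vector m_t twice:
# at the first dense layer and again just before the output heads.
import torch
import torch.nn as nn

class WhiteBoxAdversaryNet(nn.Module):
    def __init__(self, obs_channels=16, m_dim=532, n_actions=19):
        super().__init__()
        # Convolutional encoder over the stacked 72 x 96 x 16 observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        enc_dim = self._encoded_dim(obs_channels)
        # First dense layer sees the encoding concatenated with m_t.
        self.fc1 = nn.Linear(enc_dim + m_dim, 512)
        # The heads see m_t again so that simple functions of it (e.g., the
        # negated target value estimate) are easy to represent.
        self.policy_head = nn.Linear(512 + m_dim, n_actions)
        self.value_head = nn.Linear(512 + m_dim, 1)

    def _encoded_dim(self, obs_channels):
        with torch.no_grad():
            dummy = torch.zeros(1, obs_channels, 72, 96)
            return self.encoder(dummy).shape[1]

    def forward(self, obs, m_t):
        hidden = torch.relu(self.fc1(torch.cat([self.encoder(obs), m_t], dim=1)))
        hidden = torch.cat([hidden, m_t], dim=1)
        return self.policy_head(hidden), self.value_head(hidden)
        </preformat>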
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Identifying Vulnerabilities</title>
        <p>
          Environment: We use the two-player Google Research Football environment (Gfootball) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Each agent in the environment controls a team of 11 football (soccer) players. The states are 72 × 96 × 4 pixels, with the four channels encoding the left team positions, right team positions, ball position, and active player position. Observations were stacked over four timesteps to give a perception of time, resulting in observations of 72 × 96 × 16 pixels. The agents' policy networks had a ResNet architecture [45], and the action space was discrete with size 19. We used the same reward shaping as in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], in which an agent was rewarded 1 for scoring, -1 for being scored on, and 0.1 for advancing the ball one tenth of the way down the field. We trained all Gfootball agents using Proximal Policy Optimization (PPO) [43] with the Stable Baselines 2 implementation [46].
        </p>
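        <p>As an illustration of the reward shaping described above (and not the environment's own implementation), the per-step shaped reward can be computed roughly as follows; the function signature and bookkeeping are assumptions for the sketch.</p>
        <preformat>
# An illustrative re-implementation (not Gfootball's own code) of the shaped
# reward described above: +1 for scoring, -1 for conceding, and +0.1 each time
# the ball first advances past another tenth of the field.
def shaped_reward(score_change, ball_progress, checkpoints_reached):
    """score_change: +1, -1, or 0 for this step.
    ball_progress: fraction of the field advanced toward the opponent goal (0 to 1).
    checkpoints_reached: number of 0.1-wide checkpoints already rewarded."""
    reward = float(score_change)
    new_checkpoints = int(ball_progress * 10)
    if new_checkpoints > checkpoints_reached:
        reward += 0.1 * (new_checkpoints - checkpoints_reached)
        checkpoints_reached = new_checkpoints
    return reward, checkpoints_reached
        </preformat>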
        <p>Target Agents: First, we trained target agents to develop adversarial policies against. For Gfootball, this was done in two stages for a total of 50 million timesteps. First, the targets were trained against a 'bot' agent for 25 million timesteps with an entropy reward to encourage exploration. Second, they were trained for another 25 million timesteps against an agent from the first phase with an entropy penalty to encourage more deterministic play. We found this to result in more consistent behavior from adversaries. Fig. 3a shows the learning curves for these targets.</p>
        <p>Adversaries: We trained four types of adversaries, each of which observes different information, m_t, from the target's internal state (a sketch of how such observations can be assembled follows the list):</p>
        <p>1. Black-Box Control: m_t = ∅. This is the same threat model used by [16], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and others mentioned in Section 2.</p>
        <p>2. Action &amp; Value: m_t = Δ^tgt(s_t) ⊕ v^tgt(s_t), where ⊕ is the concatenation operator. Here, the adversary sees the scalar value and an |A^tgt|-sized observation giving the target agent's distribution over discrete output actions.</p>
        <p>3. Latent: m_t = ℓ_t, where ℓ_t gives the latent activations from some layer during the forward pass through the target's network from s_t. Here, we use those of the final layer, from which both the target agent's actions and value are computed.</p>
        <p>4. Full: m_t = Δ^tgt(s_t) ⊕ v^tgt(s_t) ⊕ ℓ_t. This combines the Action &amp; Value and Latent threat models.</p>
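        <p>The sketch below illustrates, under assumptions, how the extra observation m_t could be assembled for each of the four adversary types by querying a fixed target policy; the TargetPolicy interface used here (action_distribution, value, final_layer_latents) is a hypothetical stand-in rather than an API from the codebase.</p>
        <preformat>
# A hypothetical sketch of assembling the white-box observation m_t for the
# four adversary types by querying a fixed target policy.
import numpy as np

def make_m_t(target, state, mode="full"):
    """Return the extra observation vector m_t for one timestep."""
    if mode == "black_box":           # 1. Black-Box Control: m_t is empty.
        return np.zeros(0, dtype=np.float32)
    parts = []
    if mode in ("act_val", "full"):   # 2. and 4.: action distribution and value
        parts.append(np.asarray(target.action_distribution(state), dtype=np.float32))
        parts.append(np.array([target.value(state)], dtype=np.float32))
    if mode in ("latent", "full"):    # 3. and 4.: final-layer latent activations
        parts.append(np.asarray(target.final_layer_latents(state), dtype=np.float32))
    return np.concatenate(parts)

# The adversary then conditions on both the state and m_t, i.e. pi_adv(s_t, m_t),
# by concatenating m_t into its own observation.
        </preformat>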
        <p>Results: We train each adversary for 50 million timesteps. Fig. 3b shows the training curves for these attackers. All improve significantly over the black-box control, both by having faster initial learning and a higher asymptotic performance. The two types of white-box adversaries that could observe the target's latents performed the best; both do as well after 5 million timesteps as the black-box control does after 50 million. For the action/value, latent, and full attacks, the p values from one-sided t-tests for the hypothesis that they were superior to the black-box controls were 0.00638, 0.00001, and 0.00002 respectively, demonstrating clear improvements.</p>
        <p>Figure 3: (a) Training performance of the Gfootball target agents (net points per game, n=20) over 50 million timesteps. (b) Training curves for the Act/Val/Latent, Latent, Act/Val, and Black-Box Control adversaries, with p values from one-sided t-tests against the control (Act/Val/Latent: 2e-05; Latent: 1e-05; Act/Val: 0.00638).</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Improving Robustness</title>
        <p>Environment: To evaluate white-box robust adversarial reinforcement learning (RARL), we used the HalfCheetah-v3 and Hopper-v3 Mujoco environments from OpenAI Gym [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In both environments, the agent controls a body in a 3D simulated physics environment. Observations are continuous-valued vectors specifying the position of the body, and actions are continuous-valued vectors for controlling it. The agents' policy networks had a small MLP architecture with two hidden layers of 256 neurons each. We trained all Gym agents using SAC [44] with the Stable Baselines 3 implementation [47].</p>
        <p>Training: In alternation, we trained a target agent and an ensemble of three adversaries that perturbed the target's actions. For each training episode for the target, a random adversary from the three was chosen to make the perturbations. We experiment with three methods (a schematic of the alternating training loop follows the list):</p>
        <p>1. RL Control: The target agent is trained with no adversary.</p>
        <p>2. RARL: The target agent is trained against an ensemble of black-box adversarial agents. This is the approach used by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].</p>
        <p>3. Latent/Action White-Box RARL (WB-RARL): The target agent is trained against an ensemble of white-box adversaries that each observe its latent activations from the penultimate layer of the policy network and its action outputs. Thus, m_t = Δ^tgt(s_t) ⊕ ℓ_t.</p>
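        <p>A schematic of the alternating training loop described above is given below; it is a sketch under assumptions (the train_target_episode and train_adversary callables stand in for the underlying SAC updates), not the code used for the experiments.</p>
        <preformat>
# A schematic sketch (not the authors' code) of alternating RARL training with
# an ensemble of three action-perturbing adversaries chosen at random per episode.
import random

def rarl_training(target, adversaries, train_target_episode, train_adversary,
                  n_iterations=100, episodes_per_phase=10):
    """Alternate between improving the target and improving the adversaries."""
    for _ in range(n_iterations):
        # Target phase: for each episode a random adversary from the ensemble
        # perturbs the target's actions (white-box adversaries also see m_t).
        for _ in range(episodes_per_phase):
            adversary = random.choice(adversaries)
            train_target_episode(target, adversary)
        # Adversary phase: each adversary trains against the current target.
        for adversary in adversaries:
            train_adversary(adversary, target)
    return target
        </preformat>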
        <p>Results: We trained a total of 40 agents of each type for 2 million timesteps and selected the 20 with the best final performance. Fig. 4a shows the evaluation performance for the HalfCheetah and Hopper agents in an adversary-free environment over the course of training. Performance is comparable between all three conditions, with the RL controls seeming to perform the best in HalfCheetah. To test the robustness of the learned policies, we use the same approach as [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: after RARL, we test on a set of adversary-free environments with the transition dynamics altered. We selected a range of 8 mass and 8 friction coefficients to modify the environment dynamics and tested the agents on all 8 × 8 combinations. The full arrays of results are shown in Fig. 5 in Appendix A.2, and the mean results over all friction and mass coefficients are plotted in Fig. 4b-c respectively.</p>
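        <p>The robustness evaluation can be sketched as follows, assuming the older Gym step/reset API used with the -v3 MuJoCo environments and the usual model.body_mass and model.geom_friction attributes; the multiplier values mirror those in Fig. 4, but the rest is illustrative.</p>
        <preformat>
# A rough sketch (not the authors' evaluation code) of the 8 x 8 robustness
# grid: each test environment scales body masses and friction coefficients by
# fixed multipliers, and the trained policy is run with no adversary present.
import gym
import numpy as np

MASS_MULTS = (0.2, 0.3, 0.4, 0.5, 1.05, 1.1, 1.15, 1.2)   # per Fig. 4b
FRICTION_MULTS = (0.2, 0.3, 0.4, 0.5, 1.4, 1.6, 1.8, 2.0)  # per Fig. 4c

def evaluate_grid(policy, env_id="HalfCheetah-v3", episodes=3):
    """Return an 8 x 8 array of mean episode rewards over altered dynamics."""
    results = np.zeros((len(MASS_MULTS), len(FRICTION_MULTS)))
    for i, mass_mult in enumerate(MASS_MULTS):
        for j, fric_mult in enumerate(FRICTION_MULTS):
            env = gym.make(env_id)
            model = env.unwrapped.model  # assumed MuJoCo model attributes
            model.body_mass[:] = model.body_mass * mass_mult
            model.geom_friction[:] = model.geom_friction * fric_mult
            returns = []
            for _ in range(episodes):
                obs, done, total = env.reset(), False, 0.0
                while not done:
                    obs, reward, done, _ = env.step(policy(obs))
                    total += reward
                returns.append(total)
            results[i, j] = np.mean(returns)
            env.close()
    return results
        </preformat>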
        <p>Figure 4: (a) Adversary-free evaluation performance over 2 million training timesteps for the RL, RARL, and Latent/Action WB-RARL agents in HalfCheetah and Hopper. (b) Mean test reward as a function of the mass multiplier (0.2x to 1.2x). (c) Mean test reward as a function of the friction multiplier (0.2x to 2.0x).</p>
        <p>In Fig. 4b-c, WB-RARL agents generally perform as well as or better than the other two, and on average, WB-RARL performs the best over all testing environments. For RL, RARL, and WB-RARL, the HalfCheetah agents achieve mean episode rewards of 902, 914, and 1019, and the Hopper agents achieve 673, 645, and 716 respectively. We performed four one-sided t-tests for the hypothesis that the WB-RARL agents had superior overall testing performance. For HalfCheetah, the p values were 0.085 and 0.111 for comparing the WB-RARL agents to the RL and RARL agents respectively. For Hopper, the corresponding p values were 0.095 and 0.009. These suggest that the WB-RARL agents are more robust to these domain shifts.</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>5. Discussion and Broader Impact</title>
      <p>Our goal in this work is to better understand opportunities from adversarial policies in reinforcement learning by studying white-box adversarial attackers. We show that allowing an adversarial policy to observe the internal state of the target agent can result in (1) better initial and asymptotic performance for adversarial attackers and (2) more effective adversarial training for improving the robustness of a learned policy. These results suggest that using white-box adversarial policies to identify and correct flaws with reinforcement learners may be a useful strategy for developing safer, more reliable reinforcement learning systems.</p>
      <p>More generally, our results show that information about an agent's internal state offers useful information for other agents interacting with it. This may be the case regardless of whether the setting is adversarial, cooperative, or indifferent. In multiagent settings, it is important to bear in mind that a policy which makes use of white-box information from another agent need not be implemented by nor against a conventional reinforcement learner. On one hand, policies can be developed without standard reinforcement learning algorithms (e.g., PPO or SAC). For example, human video game players constantly develop strategies to exploit the weaknesses of computer-controlled competitors to great effect. On the other hand, so long as a target agent computes "actions" via latent information, this information could be given to other agents seeking to interact with it. One case in which using adversarial policies against non-reinforcement-learners can be useful is for finding flaws in language models. The inability to differentiate through the sampling of discrete textual tokens makes the task of finding failure modes for language models one that adversarial policies can be useful for (e.g. [48]).</p>
      <p>Future work on versions of white-box adversarial policies
for debugging language models may be useful.</p>
      <p>Concerning adversarial attacks in particular, one risk
of any work that focuses on attack methods is that they
could be used for malicious attacks. This is an important
concern, but we emphasize that it is better to develop an
understanding of adversarial vulnerabilities through
exploratory research than from incidents in the real world.</p>
      <p>We also stress the benefits of adversarial training and the fact that white-box access to an agent can be kept from malicious attackers if appropriate measures are taken.</p>
      <p>For this reason, we expect white-box adversarial policies
to be much more practical for those working to make
systems more robust than for malicious attackers.</p>
      <p>A limitation is that while we show that white-box attacks can be useful, the improvements from granting the adversary white-box access in the RARL experiments were only modest. And even though white-box attacks can help train adversarial policies more quickly, these attacks may still demand many timesteps. Future work on similar black-box attacks that use a model of the target learned from black-box (and potentially even offline) access may be valuable. Studying ways to more effectively leverage target agent information in fewer training timesteps may also be useful. Additional progress like this toward better understanding opportunities from adversaries in reinforcement learning will be a promising direction for expanding the toolbox for safer and more trustworthy AI.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgments</title>
      <p>We thank Lucas Janson for valuable ideas and feedback throughout the course of this work. We also appreciate discussions with Adam Gleave and Pavel Czempin.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix</title>
      <sec id="sec-5-1">
        <title>A.1. Understanding Adversarial Policies</title>
        <p>
          The notion of an adversary for a deep learning system was popularized by [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ] and subsequent research. These works developed adversarial images that are both effective, meaning that they fool an image classifier, and subtle, meaning that they only differ from a benign image by a very small-norm perturbation. While they often transfer to other models [49, 50, 51, 52], these adversaries are also typically target-specific in the sense that they are created specifically to fool a particular model.
        </p>
        <p>
          As in supervised learning, "effectiveness" is used as part of the definition for adversarial policies across the literature. "Target-specificity" sometimes is, but many RL works (e.g., [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]), including ours, do not require an adversary to be target-specific. Finally, "subtlety" has not been adopted as a standard for adversary research in RL. A notion of subtlety for adversaries in RL that would be analogous to adversaries in supervised learning would be that the adversary produces distributions over actions or trajectories that are very similar to a benign agent's. However, in this and all related work in RL of which we know, no notion of subtlety is part of the definition of an adversarial policy. So ultimately, we use "adversarial" here to simply refer to a policy which is good at beating a target.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>A.2. Full Robust Adversarial Reinforcement Learning Results</title>
        <p>As discussed in Section 4.2, we tested agents on environments with altered mass and friction parameters. For both the HalfCheetah and Hopper environments, we used a set of 8 × 8 different mass and friction values. Testing results across all testing environments for the RL control, RARL, and WB-RARL agents are shown in Fig. 5. Under each grid, the mean of all results in the grid is displayed. Under the RL and RARL grids (columns 1 and 2), the p value from a one-sided t-test for the hypothesis that WB-RARL is superior to RL and RARL respectively is shown.</p>
        <p>Figure 5: Full testing results for the RL control, RARL, and WB-RARL agents across all 8 × 8 combinations of altered mass and friction parameters, with grid means and p values displayed under the grids.</p>
      </sec>
      <sec id="sec-5-3">
        <title>A.3. High-Level Summary</title>
        <p>Here, we provide a summary of this work which does not assume that the reader has a technical background. "Reinforcement learning" (RL) is the process by which an agent learns, via some formalized process of trial and error, to accomplish a goal. Humans are reinforcement learners, and so are some algorithms that are commonly studied in machine learning research today. For example, it is common to use reinforcement learning algorithms to train AI systems to play video games. Using experience, they can infer what types of actions lead to higher scores and adjust their behavior accordingly.</p>
        <p>Multiagent RL describes settings in which there is more than one agent acting. Past research has shown that in multiagent settings, training "adversarial" reinforcement learners to make other reinforcement learners fail can be useful. On one hand, an adversarial agent can often learn to act in a way that renders the "target" agent unable to accomplish its goals. For example, an adversary can sometimes act in ways that make a target in a two-player video game seem to take actions that are as bad as, or even worse than, random ones. On the other hand, training a target agent against an adversarial agent can make it more robust to some failures. For example, this might make the target particularly effective at avoiding failures due to changes to its environment.</p>
        <p>In this work, we study a new approach to adversarial attacks and adversarial training in RL. We experiment with "white-box" attacks in which the adversary can observe the internal state of the target. For humans, this would be analogous to one person playing a game against someone else while being able to view scans of their brain. We show that these white-box adversarial agents are more effective than controls for both attacks and adversarial training. We argue that this helps us to better understand opportunities from adversarial RL. And based on these results, we argue that white-box adversaries may be very useful for discovering and correcting flaws in reinforcement learners.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6199</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <article-title>Explaining and harnessing adversarial examples</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6572</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ilahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Usama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Janjua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>AlFuqaha</surname>
          </string-name>
          , D. T. Huang,
          <string-name>
            <given-names>D.</given-names>
            <surname>Niyato</surname>
          </string-name>
          ,
          <article-title>Challenges and countermeasures for adversarial attacks on deep reinforcement learning</article-title>
          ,
          <source>IEEE Transactions on Artificial Intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Najafirad</surname>
          </string-name>
          ,
          <article-title>Opportunities and challenges in deep learning adversarial robustness: A survey</article-title>
          , arXiv preprint arXiv:
          <year>2007</year>
          .
          <volume>00753</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gleave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dennis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Adversarial policies: Attacking deep reinforcement learning</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>10615</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Doster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Attarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brandenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hodas</surname>
          </string-name>
          ,
          <article-title>The effect of antagonistic behavior in reinforcement learning</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sukthankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Robust adversarial reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2817</fpage>
          -
          <lpage>2826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhambri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tulasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Buduru</surname>
          </string-name>
          ,
          <article-title>A survey of black-box adversarial attacks on computer vision models</article-title>
          , arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>01667</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kurach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raichuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stańczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zając</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espeholt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Riquelme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , M. Michalski,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          , et al.,
          <article-title>Google research football: A novel reinforcement learning environment</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>4501</fpage>
          -
          <lpage>4510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Vinitsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Parvate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bayen</surname>
          </string-name>
          ,
          <article-title>Robust reinforcement learning using adversarial populations</article-title>
          , arXiv preprint arXiv:
          <year>2008</year>
          .
          <year>01825</year>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pettersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , W. Zaremba, Openai gym,
          <source>arXiv preprint arXiv:1606.01540</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Behzadan</surname>
          </string-name>
          , W. Hsu,
          <article-title>Adversarial exploitation of policy imitation</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>01121</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Adversarial policy learning in two-player competitive games</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3910</fpage>
          -
          <lpage>3919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Adversarial policy training against deep reinforcement learning</article-title>
          ,
          <source>in: 30th USENIX Security Symposium (USENIX Security 21)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1883</fpage>
          -
          <lpage>1900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards comprehensive testing on the robustness of cooperative multi-agent reinforcement learning</article-title>
          , in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 115–122.
        </mixed-citation>
      </ref>
      <ref id="ref16"><mixed-citation>[16] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent complexity via multi-agent competition, arXiv preprint arXiv:1710.03748 (2017).</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] A. Pozanco, S. Fernández, D. Borrajo, et al., Anticipatory counterplanning, arXiv preprint arXiv:2203.16171 (2022).</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] P. Dasgupta, Using options to improve robustness of imitation learning against adversarial attacks, in: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, volume 11746, International Society for Optics and Photonics, 2021, p. 1174610.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] P. Czempin, A. Gleave, Reducing exploitability with population based training, arXiv preprint arXiv:2208.05083 (2022).</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] T. Fujimoto, T. Doster, A. Attarian, J. Brandenberger, N. Hodas, Reward-free attacks in multi-agent reinforcement learning, arXiv preprint arXiv:2112.00940 (2021).</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] H. Shioya, Y. Iwasawa, Y. Matsuo, Extending robust adversarial reinforcement learning considering adaptation and diversity (2018).</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] X. Pan, D. Seita, Y. Gao, J. Canny, Risk averse robust adversarial reinforcement learning, in: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8522–8528.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] K. L. Tan, Y. Esfandiari, X. Y. Lee, S. Sarkar, et al., Robustifying reinforcement learning agents via action space adversarial training, in: 2020 American Control Conference (ACC), IEEE, 2020, pp. 3959–3964.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] P. Zhai, J. Luo, Z. Dong, L. Zhang, S. Wang, D. Yang, Robust adversarial reinforcement learning with dissipation inequation constraint (2022).</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] K. Zhang, B. Hu, T. Basar, On the stability and convergence of robust adversarial reinforcement learning: A case study on linear quadratic systems, Advances in Neural Information Processing Systems 33 (2020) 22056–22068.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Pattanaik, Z. Tang, S. Liu, G. Bommannan, G. Chowdhary, Robust deep reinforcement learning with adversarial attacks, arXiv preprint arXiv:1712.03632 (2017).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] T. Oikarinen, W. Zhang, A. Megretski, L. Daniel, T.-W. Weng, Robust deep reinforcement learning through adversarial loss, Advances in Neural Information Processing Systems 34 (2021).</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] L. Schott, M. Césaire, H. Hajri, S. Lamprier, Improving robustness of deep reinforcement learning agents: Environment attacks based on critic networks, arXiv preprint arXiv:2104.03154 (2021).</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] B. Lütjens, M. Everett, J. P. How, Certified adversarial robustness for deep reinforcement learning, in: Conference on Robot Learning, PMLR, 2020, pp. 1328–1337.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] E. Korkmaz, Adversarially trained neural policies in the fourier domain, in: ICML 2021 Workshop on Adversarial Machine Learning, 2021.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] E. Korkmaz, Investigating vulnerabilities of deep neural policies, in: Uncertainty in Artificial Intelligence, PMLR, 2021, pp. 1661–1670.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] J. Kos, D. Song, Delving into adversarial attacks on deep policies, arXiv preprint arXiv:1705.06452 (2017).</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] A. Davidson, Using artificial neural networks to model opponents in texas hold'em, Unpublished manuscript (1999).</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] A. J. Lockett, C. L. Chen, R. Miikkulainen, Evolving explicit opponent models in game playing, in: Proceedings of the 9th annual conference on Genetic and evolutionary computation, 2007, pp. 2106–2113.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] H. He, J. Boyd-Graber, K. Kwok, H. Daumé III, Opponent modeling in deep reinforcement learning, in: International conference on machine learning, PMLR, 2016, pp. 1804–1813.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] V. Behzadan, W. Hsu, Rl-based method for benchmarking the adversarial resilience and robustness of deep reinforcement learning policies, in: International Conference on Computer Safety, Reliability, and Security, Springer, 2019, pp. 314–325.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] Y. Faghan, N. Piazza, V. Behzadan, A. Fathi, Adversarial attacks on deep algorithmic trading policies, arXiv preprint arXiv:2010.11388 (2020).</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] J. Y. Halpern, R. Pass, Game theory with translucent players, International Journal of Game Theory 47 (2018) 949–976.</mixed-citation></ref>
      <ref id="ref39"><mixed-citation>[39] A. Demski, S. Garrabrant, Embedded agency, arXiv preprint arXiv:1902.09469 (2019).</mixed-citation></ref>
      <ref id="ref40"><mixed-citation>[40] A. Critch, A parametric, resource-bounded generalization of löb's theorem, and a robust cooperation criterion for open-source game theory, The Journal of Symbolic Logic 84 (2019) 1368–1381.</mixed-citation></ref>
      <ref id="ref41"><mixed-citation>[41] S. Casper, Achilles heels for agi/asi via decision theoretic adversaries, arXiv preprint arXiv:2010.05418 (2020).</mixed-citation></ref>
      <ref id="ref42"><mixed-citation>[42] A. Critch, M. Dennis, S. Russell, Cooperative and uncooperative institution designs: Surprises and problems in open-source game theory, arXiv preprint arXiv:2208.07006 (2022).</mixed-citation></ref>
      <ref id="ref43"><mixed-citation>[43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).</mixed-citation></ref>
      <ref id="ref44"><mixed-citation>[44] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, arXiv preprint arXiv:1801.01290 (2018).</mixed-citation></ref>
      <ref id="ref45"><mixed-citation>[45] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.</mixed-citation></ref>
      <ref id="ref46"><mixed-citation>[46] A. Hill, A. Raffin, M. Ernestus, A. Gleave, A. Kanervisto, R. Traore, P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, Stable baselines, https://github.com/hill-a/stable-baselines, 2018.</mixed-citation></ref>
      <ref id="ref47"><mixed-citation>[47] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (2021) 1–8. URL: http://jmlr.org/papers/v22/20-1364.html.</mixed-citation></ref>
      <ref id="ref48"><mixed-citation>[48] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, G. Irving, Red teaming language models with language models, arXiv preprint arXiv:2202.03286 (2022).</mixed-citation></ref>
      <ref id="ref49"><mixed-citation>[49] N. Papernot, P. McDaniel, I. Goodfellow, Transferability in machine learning: from phenomena to black-box attacks using adversarial samples, arXiv preprint arXiv:1605.07277 (2016).</mixed-citation></ref>
      <ref id="ref50"><mixed-citation>[50] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).</mixed-citation></ref>
      <ref id="ref51"><mixed-citation>[51] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, P. McDaniel, The space of transferable adversarial examples, arXiv preprint arXiv:1704.03453 (2017).</mixed-citation></ref>
      <ref id="ref52"><mixed-citation>[52] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Madry, Adversarial examples are not bugs, they are features, arXiv preprint arXiv:1905.02175 (2019).</mixed-citation></ref>
    </ref-list>
  </back>
</article>