Adversarial Exploitation of Policy Imitation

Vahid Behzadan∗ and William H. Hsu
Kansas State University
{behzadan, bhsu}@ksu.edu
∗ Contact Author

Abstract

This paper investigates a class of attacks targeting the confidentiality aspect of security in Deep Reinforcement Learning (DRL) policies. Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks. Such attacks leverage the loosely restricted ability of the attacker to iteratively query the model for labels, thereby allowing for the forging of a labeled dataset which can be used to train a replica of the original model. In this work, we demonstrate the feasibility of exploiting imitation learning techniques to launch model extraction attacks on DRL agents. Furthermore, we develop proof-of-concept attacks that leverage such techniques for black-box attacks against the integrity of DRL policies. We also present a discussion of potential solution concepts for mitigation.

1 Introduction

Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks [Tramèr et al., 2016]. Such attacks leverage the loosely restricted ability of the attacker to iteratively query the model for labels, thereby allowing for the forging of a labeled dataset which can be used to train a replica of the original model. Model extraction is not only a serious risk to the protection of intellectual property, but also a critical threat to the integrity of the model. Recent literature [Papernot et al., 2018] reports that the replicated model may facilitate the discovery and crafting of adversarial examples which are transferable to the original model.

Inspired by this area of research, this work investigates the feasibility and impact of model extraction attacks on DRL agents. The adversarial problem of model extraction can be formally stated as the replication of a DRL policy based on observations of its behavior (i.e., actions) in response to changes in the environment (i.e., state). This problem closely resembles that of imitation learning [Hussein et al., 2017], which refers to the acquisition of skills or behaviors by observing demonstrations of an expert performing those skills. Typically, the settings of imitation learning are concerned with learning from human demonstrations. However, it is straightforward to deduce that the techniques developed for those settings may also be applied to learning from artificial experts, such as DRL agents. Of particular relevance to this research is the emerging area of Reinforcement Learning with Expert Demonstrations (RLED) [Piot et al., 2014]. The techniques of RLED aim to minimize the effect of modeling imperfections on the efficacy of the final RL policy, while minimizing the cost of training by leveraging the information available in demonstrations to reduce the search space of the policy.

Accordingly, we hypothesize that the techniques developed for RLED may be maliciously exploited to replicate and manipulate DRL policies. To establish the validity of this hypothesis, we investigate the feasibility of using RLED techniques with limited passive (i.e., non-interfering) observations of a DRL agent to replicate its policy with sufficient accuracy to facilitate attacks on its integrity. To develop proof-of-concept attacks, we study the adversarial utility of adopting a recently proposed RLED technique, known as Deep Q-Learning from Demonstrations (DQfD) [Hester et al., 2018], for black-box state-space manipulation attacks, and develop two attack mechanisms based on this technique. Furthermore, we present a discussion of potential mitigation techniques, and propose a solution concept for defending against policy imitation attacks.
The remainder of this paper is organized as follows: Section 2 presents an overview of the DQfD algorithm used in this study for adversarial imitation. Section 3 proposes the first proof-of-concept black-box attack based on imitated policies, and presents an experimental evaluation of its feasibility and performance. Section 4 studies the transferability of adversarial examples between replicated and original policies as a second proof-of-concept attack technique. The paper concludes with a discussion of potential mitigation techniques and a solution concept in Section 5.

2 Deep Q-Learning from Demonstrations (DQfD)

The DQfD technique [Hester et al., 2018] aims to overcome the inaccuracies of simulation environments and models of complex phenomena by enabling DRL agents to learn as much as possible from expert demonstrations before training on the real system. More formally, the objective of this "pre-training" phase is to learn an imitation of the expert's behavior with a value function that is compatible with the Bellman equation, thereby enabling the agent to update this value function via TD updates through direct interaction with the environment after the pre-training stage. To achieve such an imitation from limited demonstration data during pre-training, the agent trains a deep neural network model in a supervised manner on sampled mini-batches of demonstrations. However, the training objective of this model in DQfD is the minimization of a hybrid loss, comprised of the following components:

1. 1-step double Q-learning loss J_DQ(Q);
2. Supervised large-margin classification loss J_E(Q) = max_{a∈A} [Q(s, a) + l(a_E, a)] − Q(s, a_E), where a_E is the expert's action in state s and l(a_E, a) is a margin function that is positive if a ≠ a_E and 0 when a = a_E;
3. n-step return loss J_n(Q) (with n = 10), based on the n-step return r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + max_a γ^n Q(s_{t+n}, a);
4. L2 regularization loss J_L2(Q).

The total loss is given by:

J(Q) = J_DQ(Q) + λ_1 J_n(Q) + λ_2 J_E(Q) + λ_3 J_L2(Q)    (1)

where the λ factors provide the weighting between the losses.

After the pre-training phase, the agent begins interacting with the system and collecting self-generated data, which is added to the replay buffer D_replay. Once the buffer is full, the agent only overwrites the self-generated data and leaves the demonstration data untouched for use in subsequent updates of the model. The complete training procedure for DQfD is presented in Algorithm 1.

Algorithm 1: Deep Q-learning from Demonstrations (DQfD)
Inputs: D_replay initialized with demonstration data, randomly initialized weights θ for the behavior network, randomly initialized weights θ′ for the target network, target network update frequency τ, number of pre-training gradient updates k
for steps t ∈ {1, 2, ..., k} do
    Sample a mini-batch of n transitions from D_replay with prioritization
    Calculate loss J(Q) based on the target network
    Perform a gradient descent step to update θ
    if t mod τ = 0 then θ′ ← θ end if
end for
for steps t ∈ {1, 2, ...} do
    Sample action a from behavior policy π^{Q_θ}
    Apply action a and observe (s′, r)
    Store (s, a, r, s′) in D_replay, overwriting the oldest self-generated transition if over capacity
    Sample a mini-batch of n transitions from D_replay with prioritization
    Calculate loss J(Q) using the target network
    Perform a gradient descent step to update θ
    if t mod τ = 0 then θ′ ← θ end if
    s ← s′
end for
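To make the pre-training objective concrete, the sketch below computes the supervised large-margin loss J_E(Q) and the 1-step double Q-learning loss J_DQ(Q) for a mini-batch. It is a minimal PyTorch illustration, not the authors' implementation; the function names, tensor shapes, and use of a squared TD error are assumptions, and the default margin of 0.8 follows the "Large Margin" value later given in Table 1.

```python
import torch
import torch.nn.functional as F

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """Supervised loss J_E(Q) = max_a [Q(s,a) + l(a_E,a)] - Q(s,a_E).

    q_values:       (batch, n_actions) Q(s, .) from the behavior network
    expert_actions: (batch,) integer actions taken by the demonstrator
    margin:         l(a_E, a) for a != a_E (0 for a == a_E)
    """
    batch = q_values.size(0)
    l = torch.full_like(q_values, margin)          # margin everywhere...
    l[torch.arange(batch), expert_actions] = 0.0   # ...except at the expert action
    q_expert = q_values[torch.arange(batch), expert_actions]
    return ((q_values + l).max(dim=1).values - q_expert).mean()

def double_q_loss(q_online, q_online_next, q_target_next,
                  actions, rewards, dones, gamma=0.99):
    """1-step double Q-learning loss J_DQ(Q): the online network selects the
    next action, the target network evaluates it (squared TD error for simplicity)."""
    batch = q_online.size(0)
    q_sa = q_online[torch.arange(batch), actions]
    next_a = q_online_next.argmax(dim=1)
    target = rewards + gamma * (1.0 - dones) * q_target_next[torch.arange(batch), next_a]
    return F.mse_loss(q_sa, target.detach())

# Hybrid loss of Eq. (1), with the n-step and L2 terms omitted for brevity:
# J = double_q_loss(...) + lambda_2 * large_margin_loss(...)
```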
3 Adversarial Policy Imitation for Black-Box Attacks

Consider an adversary who aims to maximally reduce the cumulative discounted return R(T) of a target DRL agent by manipulating the behavior of the target's policy π(s) via perturbing its observations. The adversary is also constrained to minimizing the total cost of perturbations, given by C_adv(T) = Σ_{t=t_0}^{T} c_adv(t), where c_adv(t) = 1 if the adversary perturbs the state at time t, and c_adv(t) = 0 otherwise. The adversary is unaware of π(s) and its parameters. However, it has access to a replica of the target's environment (e.g., the simulation environment). Also, for any state transition (s, a) → s′, the adversary can perfectly observe the target's reward signal r(s, a, s′), and is able to observe the behavior of π(s) in response to each state s. Furthermore, the adversary is able to manipulate its target's state observations, but not its reward signal. Finally, it is assumed that all targeted perturbations of the adversary are successful.

To study the feasibility of imitation learning as an approach to this adversarial problem, we consider the first step of the adversary to be the imitation of π(s) via DQfD to learn an imitated policy π̃. With this imitation at hand, the attack problem can be reformulated as finding an optimal adversarial control policy π_adv(s), where there are two permissible control actions: either perturb the current state to induce the worst possible action (i.e., arg min_a Q(s, a)) or leave the state unperturbed. This setting allows for the direct adoption of the DRL-based technique proposed in [Behzadan and Hsu, 2019] for resilience benchmarking of DRL policies. In this technique, the test-time resilience of a policy π* to state-space perturbations is obtained via an adversarial DRL training procedure, outlined as follows:

1. Train the adversarial agent against the target following π in its training environment, according to the reward assignment process outlined in Algorithm 2. Report the optimal adversarial return R*_perturbed and the maximum adversarial regret R*_adv(T), which is the difference between the maximum achievable return of the target π and its minimum return achieved under the actions of the adversarial policy.
2. Apply the adversarial policy against the target in N episodes, and record the total cost C_adv for each episode.
3. Report the average of C_adv over the N episodes as the mean test-time resilience of π in the given environment.

While the original technique depends on the availability of the target's optimal state-action value function, we propose to replace this function with the Q-function obtained from the DQfD imitation of the target policy, denoted by Q̃. With the imitated state-action value function Q̃ at hand, the adversarial policy can be trained as a DRL agent with the procedure outlined in [Behzadan and Hsu, 2019]. The proposed attack procedure is summarized as follows:

1. Observe and record N interactions (s_t, a_t, s_{t+1}, r_{t+1}) of the target agent with the environment.
2. Apply DQfD to learn an imitation of the target policy π(s) and Q*, denoted by π̃ and Q̃, respectively.
3. Train the adversarial policy π_adv(s) with Algorithm 2, using Q̃ as an approximation of the target's Q*.
4. Apply the adversarial policy to the target environment.

Algorithm 2: Reward Assignment in Adversarial DRL for Measuring Adversarial Resilience
Require: Target policy π*, perturbation cost function c_adv(., .), maximum achievable score R_max, optimal state-action value function Q*(., .), current adversarial policy π_adv, current state s_t, current count of adversarial actions AdvCount, current score R_t
Set ToPerturb ← π_adv(s_t)
if ToPerturb is False then
    a_t ← π*(s_t)
    Reward ← 0
else
    a′_t ← arg min_a Q*(s_t, a)
    Reward ← −c_adv(s_t, a′_t)
end if
if either s_t or s′_t is terminal then
    Reward ← Reward + (R_max − R_t)
end if
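The per-step reward assignment of Algorithm 2 can be summarized in a few lines. The sketch below is a minimal illustration assuming a discrete action space and a constant perturbation cost of 1 (matching the threat model above); `q_imitated`, `target_policy`, and the other names are hypothetical stand-ins, with `q_imitated` playing the role of Q̃ in place of the target's unknown Q*.

```python
import numpy as np

def adversarial_step_reward(q_imitated, target_policy, state, to_perturb,
                            r_max, episode_score, done, perturb_cost=1.0):
    """Reward assignment of Algorithm 2, with the imitated Q-function
    standing in for the target's (unknown) optimal Q*."""
    if not to_perturb:
        action = target_policy(state)      # target acts on the clean state
        reward = 0.0
    else:
        q_vals = q_imitated(state)         # estimated Q̃(s, .) over all actions
        action = int(np.argmin(q_vals))    # induce the (estimated) worst action
        reward = -perturb_cost             # pay the perturbation cost
    if done:
        # Terminal bonus (R_max - R_t): the lower the target's achieved
        # score, the larger the adversary's reward.
        reward += r_max - episode_score
    return action, reward
```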
3.1 Experiment Setup

We consider a DQN-based adversarial agent, aiming to learn an optimal adversarial state-perturbation policy that minimizes the return of its targets, consisting of DQN, A2C, and PPO2 policies trained in the CartPole environment. The architecture and hyperparameters of the adversary and its targets are the same as those detailed in [Behzadan and Hsu, 2019]. The adversary employs a DQfD agent to learn an imitation of each target; the hyperparameters of this agent are provided in Table 1.

Table 1: Parameters of the DQfD Agent
    Pretraining Steps            5000
    Large Margin                 0.8
    Imitation Loss Coefficient   1
    Target Update Freq.          1000
    n-steps                      10
    γ                            0.99
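As a concrete illustration of the observation step, the snippet below passively collects transitions from a trained target policy for use as DQfD demonstration data. It assumes a stable-baselines DQN target and the Gym CartPole environment; the model path, function name, and buffer format are hypothetical and do not reflect the authors' exact pipeline.

```python
import gym
from stable_baselines import DQN  # the target is assumed to be a stable-baselines model

def collect_demonstrations(model_path="target_dqn_cartpole.pkl", n_transitions=5000):
    """Passively observe a trained target and record (s, a, r, s', done) tuples."""
    target = DQN.load(model_path)
    env = gym.make("CartPole-v1")
    demos, obs = [], env.reset()
    while len(demos) < n_transitions:
        action, _ = target.predict(obs, deterministic=True)
        next_obs, reward, done, _ = env.step(action)
        demos.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return demos  # fed to the DQfD replay buffer as demonstration data
```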
3.2 Results

Figures 1–3 illustrate the first 100,000 training steps of DQfD from 5000 observations obtained from the DQN, A2C, and PPO2 policies in CartPole. While this limited window of training is not long enough for convergence to an optimal policy in CartPole, the following results demonstrate its sufficiency for deriving adversarial perturbation policies for all three targets.

[Figure 1: DQfD Training Progress on DQN Policy with 5k demonstrations (Reward vs. Training Step)]
[Figure 2: DQfD Training Progress on A2C Policy with 5k demonstrations (Reward vs. Training Step)]
[Figure 3: DQfD Training Progress on PPO2 Policy with 5k demonstrations (Reward vs. Training Step)]

With the imitated policies at hand, the next step is to train an adversarial policy for efficient perturbation of these targets. Figures 4–12 present the results obtained from adopting the procedure presented in Algorithm 2 for this purpose. These results demonstrate not only that the limited training period is sufficient for obtaining an efficient adversarial policy, but also that launching efficient attacks remains feasible with relatively few observations (i.e., 2500 and 1000). However, the comparison of the test-time performance of these policies (presented in Table 2) indicates that the efficiency of attacks decreases with lower numbers of observations.

[Figure 4: Adversarial Training Progress on DQN Policy with 5k demonstrations (Mean 100-Episode Regret and No. Perturbations Per Episode vs. Steps)]
[Figure 5: Adversarial Training Progress on DQN Policy with 2.5k demonstrations]
[Figure 6: Adversarial Training Progress on DQN Policy with 1k demonstrations]
[Figure 7: Adversarial Training Progress on A2C Policy with 5k demonstrations]
[Figure 8: Adversarial Training Progress on A2C Policy with 2.5k demonstrations]
[Figure 9: Adversarial Training Progress on A2C Policy with 1k demonstrations]
[Figure 10: Adversarial Training Progress on PPO2 Policy with 5k demonstrations]
[Figure 11: Adversarial Training Progress on PPO2 Policy with 2.5k demonstrations]
[Figure 12: Adversarial Training Progress on PPO2 Policy with 1k demonstrations]

Table 2: Comparison of Test-Time Performance of Adversarial Policies
    Target Policy   Avg. Regret   Avg. No. Perturbations
    DQN-5k          490.73        7.12
    DQN-2.5k        488.12        8.09
    DQN-1k          486.37        10.55
    A2C-5k          490.88        8.48
    A2C-2.5k        487.64        8.73
    A2C-1k          487.21        6.23
    PPO2-5k         490.23        8.73
    PPO2-2.5k       487.23        7.76
    PPO2-1k         477.61        7.31

4 Transferability of Adversarial Example Attacks on Imitated Policies

It is well established that adversarial examples crafted for a supervised model can be used to attack another model trained on a dataset similar to that of the original model [Liu et al., 2016]. Furthermore, Behzadan et al. [Behzadan and Munir, 2017] demonstrate that adversarial examples crafted for one DRL policy can transfer to another policy trained in the same environment. Inspired by these findings, we hypothesize that adversarial examples generated for an imitated policy can also transfer to the original policy. To evaluate this claim, we propose the following procedure for black-box adversarial example attacks on DRL policies based on DQfD policy imitation:

1. Learn an imitation of the target policy π, denoted as π̃.
2. Craft adversarial examples for π̃.
3. Apply the same adversarial examples to the target's π(s).

4.1 Experiment Setup

We consider a set of targets consisting of the 9 imitated policies obtained in the previous section (i.e., DQN, A2C, and PPO2, imitated from 5k, 2.5k, and 1k expert demonstrations each). In test-time runs of each policy, we construct adversarial examples of each state against the imitated policy, using FGSM [Papernot et al., 2018] with perturbation step size eps = 0.01 and perturbation boundaries [−5.0, 5.0]. If such a perturbation is found, we then present it to the original policy. If the action selected by the original policy changes as a result of the perturbed input, then the adversarial example has successfully transferred from the imitated policy to the original policy.
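The transfer test can be expressed compactly. The sketch below applies a basic FGSM perturbation to the imitated Q-network and checks whether the black-box target's action changes; it assumes the imitated model is a differentiable PyTorch network (`q_imitated`) and that the target exposes only an `act(state)` interface. These names are hypothetical, and the authors' experiments used the setup of [Behzadan and Hsu, 2019] rather than this exact code.

```python
import torch

def fgsm_on_imitation(q_imitated, state, eps=0.01, clip=(-5.0, 5.0)):
    """Craft an FGSM perturbation of `state` against the imitated policy by
    increasing the loss of its currently preferred action."""
    x = torch.tensor(state, dtype=torch.float32).unsqueeze(0).requires_grad_(True)
    q = q_imitated(x)
    action = q.argmax(dim=1)
    loss = torch.nn.functional.cross_entropy(q, action)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(*clip).detach().squeeze(0).numpy()

def transfers_to_target(target_act, q_imitated, state, eps=0.01):
    """True if the example crafted on the imitated policy also changes
    the action chosen by the original (black-box) target."""
    adv_state = fgsm_on_imitation(q_imitated, state, eps)
    return target_act(adv_state) != target_act(state)
```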
4.2 Results

Table 3 presents the number of successful transfers averaged over 100 consecutive episodes. These results verify the hypothesis that adversarial examples can transfer from an imitated policy to the original, thereby enabling a new approach to the adversarial problem of black-box attacks. Furthermore, the results indicate that transferability improves with more demonstrations. This observation is in agreement with the general explanation of transferability: higher numbers of expert demonstrations decrease the gap between the distribution of training data used by the original policy and that of the imitated policy. Hence, the likelihood of transferability increases with more demonstrations.

Table 3: No. of Successful Transfers Per Episode of Length 500 (100-Episode Mean)
    Target Policy   Avg. No. Successful Transfers Per Episode
    DQN-5k          175.11
    DQN-2.5k        78.19
    DQN-1k          3.30
    A2C-5k          156.44
    A2C-2.5k        151.47
    A2C-1k          21.58
    PPO2-5k         173.94
    PPO2-2.5k       112.96
    PPO2-1k         74.71

5 Discussion on Potential Defenses

Mitigation of adversarial policy imitation is achieved by increasing the cost of such attacks to the adversary. A promising avenue of research in this area is that of policy randomization. However, such randomization may lead to unacceptable degradation of the agent's performance. To address this issue, we envision a class of solutions based on the Constrained Randomization of Policy (CRoP). Such techniques intrinsically account for the trade-off between the mitigation of policy imitation and the inevitable loss of returns. The corresponding research challenge in developing CRoP techniques is to find efficient and feasible constraints, which restrict the set of possible random actions at each state s to those whose selection is guaranteed (or is likely, within a defined certainty) to incur a total regret that is less than a maximum tolerable amount Ω_max. One potential choice is a constraint on the Q-values of actions, leading to the technique detailed in Algorithm 3. However, analyzing the feasibility of this approach will require the development of models that explain and predict the quantitative relationship between the number of observations and the accuracy of estimation. With such a model at hand, the next step is to determine the saddle point (or region) in the minimax setting of keeping the threshold Ω_max low while providing maximum protection against adversarial imitation learning. This extensive line of research is beyond the scope of this paper, and is only introduced as a potential avenue of future work for interested readers.

Algorithm 3: Solution Concept for Constrained Randomization of Policy (CRoP)
Require: state-action value function Q(., .), maximum tolerable loss Ω_max, set of actions A
while Running do
    s = env(t = 0)
    for each step of the episode do
        FeasibleActions = {}
        a = arg max_a Q(s, a)
        Append a to FeasibleActions
        for a′ ∈ A do
            if Q(s, a) − Q(s, a′) ≤ Ω_max then
                Append a′ to FeasibleActions
            end if
        end for
        if |FeasibleActions| > 1 then
            a ← random(FeasibleActions)
        end if
        s′ = env(s, a)
        s ← s′
    end for
end while
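A minimal sketch of the CRoP selection rule is given below, assuming a discrete action space and a hypothetical `q_values(state)` function returning the agent's Q-estimates; this illustrates the solution concept of Algorithm 3 rather than an evaluated defense.

```python
import random
import numpy as np

def crop_action(q_values, state, omega_max):
    """Constrained Randomization of Policy (CRoP, Algorithm 3): choose
    uniformly among actions whose estimated Q-value loss relative to the
    greedy action stays within the tolerable threshold omega_max."""
    q = np.asarray(q_values(state))
    greedy = int(np.argmax(q))
    feasible = [a for a in range(len(q)) if q[greedy] - q[a] <= omega_max]
    # `feasible` always contains the greedy action itself.
    return random.choice(feasible) if len(feasible) > 1 else greedy
```

Larger values of omega_max increase randomization, and hence the difficulty of imitation, at the price of expected return; this is precisely the trade-off that the constraint in Algorithm 3 is intended to control.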
References

[Behzadan and Hsu, 2019] Vahid Behzadan and William Hsu. RL-based method for benchmarking the adversarial resilience and robustness of deep reinforcement learning policies. arXiv preprint arXiv:1906.01110, 2019.

[Behzadan and Munir, 2017] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.

[Hester et al., 2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Hussein et al., 2017] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.

[Liu et al., 2016] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.

[Papernot et al., 2018] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman. SoK: Security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pages 399–414. IEEE, 2018.

[Piot et al., 2014] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted Bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 549–564. Springer, 2014.

[Tramèr et al., 2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pages 601–618, 2016.