Sequential Triggers for Watermarking of Deep Reinforcement Learning Policies

Vahid Behzadan* and William H. Hsu
Kansas State University
{behzadan, bhsu}@ksu.edu

* Contact Author

Abstract

This paper proposes a novel scheme for the watermarking of Deep Reinforcement Learning (DRL) policies. This scheme provides a mechanism for the integration of a unique identifier within the policy in the form of its response to a designated sequence of state transitions, while incurring minimal impact on the nominal performance of the policy. The applications of this watermarking scheme include the detection of unauthorized replications of proprietary policies, as well as enabling the graceful interruption or termination of DRL activities by authorized entities. We demonstrate the feasibility of our proposal via experimental evaluation of watermarking a DQN policy trained in the CartPole environment.

1 Introduction

The rapid advancement of Deep Reinforcement Learning (DRL) techniques provides ample motivation for exploring the commercial applications of DRL policies in various domains. However, as recent studies have established [Behzadan and Munir, 2018], the current state of the art in DRL fails to satisfy many of the security requirements of enduring commercial products. One such requirement is the protection of proprietary DRL policies from theft and unlicensed distribution. While recent research [Behzadan and Hsu, 2019] demonstrates the feasibility of indirect replication of policies through imitation learning, this paper investigates the problem of direct policy extraction. Considering that DRL policies are often composed solely of the weights and biases of a neural network, protecting against an adversary with physical access to the host device of the policy is often impractical or disproportionately costly [Tramèr et al., 2016]. An alternative solution, with roots in digital media and the entertainment industry [Shih, 2017], is watermarking: embedding distinctly recognizable signs of ownership in the content and functions of the policy, which provide the means for detecting unauthorized or stolen copies. To this end, a necessary requirement of watermarks is that they be sufficiently resistant to removal or tampering. Furthermore, the embedding and testing of watermarks shall result in minimal or zero impact on the original functions of the policy.

While the idea of watermarking has been explored for supervised machine learning models [Uchida et al., 2017], to the extent of our knowledge, this work is the first to develop a watermarking scheme for the general settings of sequential decision-making models and policies. The proposed scheme provides a mechanism for integrating a unique identifier within the policy as an unlikely sequence of transitions, which may only be realized if the driving policy of these transitions is already tuned to follow that exact sequence.

The remainder of this paper is organized as follows: Section 2 presents the formal description and justification of the proposed scheme. Section 3 provides the procedure for implementing the proposed scheme, followed by the experiment setup and results in Sections 4 and 5. The paper concludes in Section 6 with a discussion on the applications of this scheme and remarks on future directions of research.
2 Solution Approach

The proposed scheme is as follows. Let π(s) be the desired policy for interacting with an MDP <S, A, P, R, γ> for an episodic training environment E_M. Assume that A is independent of the state (i.e., all actions in A are permissible in any state s ∈ S). In tandem, consider a second MDP for an alternate environment E_W, denoted as <S', A', P', R', γ>, such that:

1. S' ∩ S = ∅,
2. The state dimensions of S and S' are equal: ∀s ∈ S and ∀s' ∈ S': |s| = |s'|,
3. The action spaces of both MDPs are equal: A = A',
4. The transition dynamics and reward distribution of the alternate environment, denoted by P' and R', are deterministic,
5. E_W is an episodic environment with the same number of steps before termination as E_M, denoted by N_max.

Let s'_terminal be a terminal state in E_W, and define P' such that for any state s'_t ∈ S', there exists only one action a_w(s'_t) that will result in the transition s'_t → s'_{t+1}. In this setting, we designate the ordered tuples of states <s'_t, s'_{t+1}> ∈ L as links, where L is the set of all links in E_W. Also, define R' such that R'(s'_t, a_w(s'_t), s'_{t+1}) = c > 0 for all <s'_t, s'_{t+1}> ∈ L, and R'(s'_t, a ≠ a_w(s'_t), s' ≠ s'_{t+1}) = −c. That is, link transitions receive the same positive reward, and all other transitions produce the same negative reward.

These settings provide two interesting results. First, since the state spaces S and S' are disjoint, the two MDPs can be combined to form a joint MDP <S ∪ S', A, P ∪ P', R ∪ R', γ>, where:

    (P ∪ P')(s_1, a_1, s_2) =  P(s_1, a_1, s_2)   if s_1, s_2 ∈ S
                               P'(s_1, a_1, s_2)  if s_1, s_2 ∈ S'        (1)

Similarly,

    (R ∪ R')(s_1, a_1, s_2) =  R(s_1, a_1, s_2)   if s_1, s_2 ∈ S
                               R'(s_1, a_1, s_2)  if s_1, s_2 ∈ S'        (2)

Consequently, it is possible to train a single policy π_j that is optimized for both E_M and E_W through the joint MDP. In practice, the training of a policy for this joint MDP can be achieved by alternating between the environments at every f_E-th episode.

Second, the structure of P' and R' enables the creation of a looping sequence of transitions, which constitutes the resulting trajectory of the optimal policy for E_W. This looping sequence can be realized by designating a single state s'_l to belong to two link transitions: a link transition <s'_l, s'_{l+1}> in which s'_l is the source state, and another link transition <s'_{l-1}, s'_l> in which s'_l is the destination state. It is noteworthy that the creation of such looping sequences provides sufficient flexibility for crafting unlikely and unique sequences. However, in designing looping sequences as policy identifiers, two important restrictions must be considered. First, the structure of an identifier sequence needs to be such that the probability of accidentally following the sequence is minimized. Second, the complexity (i.e., degrees of freedom) of link and non-link transitions on the ring must be balanced against the training cost of the joint policy: more complex sequences will increase the training cost of the joint policy by expanding the search space of both environments. Hence, the efficient design of identifier sequences will necessitate balancing this trade-off between the secrecy of the identifier and the training cost.
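To make the first restriction concrete, consider the following back-of-the-envelope estimate (an illustration added here under the assumption of a uniformly random baseline policy, not a bound derived as part of the scheme itself). Since at every watermark state only one of the |A| available actions continues the link sequence, a policy that is unrelated to the watermark and selects actions uniformly at random from A follows an N_max-step identifier sequence with probability

    Pr[accidental match per episode] = (1/|A|)^{N_max},

and matches over k consecutive verification episodes with probability (1/|A|)^{k·N_max}. For the CartPole-scale setting of Section 4 (|A| = 2, N_max = 500), this is on the order of 2^{-500} per episode, which indicates how quickly longer looping sequences render accidental matches negligible.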
3 Watermarking Procedure

Building on the presented formalization, we propose the following procedure for the sequential watermarking of DRL policies:

1. Define the state space of the watermarking environment E_W such that it is disjoint from that of the main environment E_M, while preserving the state dimensionality of the main state space. The latter condition enables the utilization of the same neural network model for the agent by maintaining the same dimension across all input data to the network.

2. Design P' and R' to craft the desired identifier looping sequence.

3. Modify the training procedure of E_M to incorporate the mechanism of alternating between the two environments every f_E episodes. It may prove useful to implement two different alternating frequencies: one frequency f_MW to control the switching from E_M to E_W, and another frequency f_WM for switching back to the main environment. For watermarking MDPs of much lower complexity than that of the main environment, selecting these two frequencies such that f_WM < f_MW can enhance the efficiency of the joint training process by allocating more exploration opportunities to the more complex setting.

To examine the authenticity of policies, it is sufficient to run those policies in the watermarking environment. If the resulting transitions match those of the identifier sequence in consecutive episodes, it is highly likely that the policy under test is an exact replica of the watermarked policy. However, modifications and retraining of a replicated policy may result in imperfect matches. In such cases, the average of total rewards gained by the suspect policy over consecutive episodes of the watermark environment provides a quantitative measure of the possibility that the model under test is based on an unauthorized replica.
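The following is a minimal sketch of this verification step, assuming a classic Gym-style environment (step returning observation, reward, done, info) and a generic policy callable that maps an observation to an action; the function name and usage shown here are illustrative rather than part of the original implementation.

```python
import numpy as np


def verify_watermark(policy, watermark_env, n_episodes=100):
    """Run a suspect policy in the watermark environment for consecutive
    episodes and report its mean total reward, which serves as the
    quantitative measure described above.

    `policy` is any callable mapping an observation to an action, and
    `watermark_env` follows the classic Gym API (reset/step).
    """
    episode_rewards = []
    for _ in range(n_episodes):
        obs = watermark_env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = policy(obs)
            obs, reward, done, _ = watermark_env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
    return float(np.mean(episode_rewards))


# Illustrative usage: a mean reward close to the maximum attainable in the
# watermark environment indicates a likely replica, while unrelated policies
# terminate within a few steps (cf. Table 4).
# score = verify_watermark(suspect_policy, watermark_env)
```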
4 Experiment Setup

To evaluate the feasibility of the proposed scheme, we investigate the design and embedding of an identifier sequence for a DQN policy in the CartPole environment. The hyperparameters of the DQN policy are provided in Table 1. The watermarking environment is implemented as a customized OpenAI Gym environment. The state space of this environment comprises 5 states with 4 dimensions each (Cart Position, Cart Velocity, Pole Angle, Pole Velocity at Tip). As denoted in Table 2, the original CartPole environment restricts the values of Cart Position to [-4.8, 4.8] and binds the Pole Angle to the range [-24 deg, 24 deg]. Consequently, the corresponding parameters of the alternate state space are selected from beyond these ranges to ensure that the states remain disjoint from those of the original CartPole. The list of crafted states is presented in Table 3.

Table 1: Parameters of the DQN Policy

    No. Timesteps                  10^5
    γ                              0.99
    Learning Rate                  10^-3
    Replay Buffer Size             50000
    First Learning Step            1000
    Target Network Update Freq.    500
    Prioritized Replay             True
    Exploration                    Parameter-Space Noise
    Exploration Fraction           0.1
    Final Exploration Prob.        0.02
    Max. Total Reward              500

Table 2: Specifications of the CartPole Environment

    Observation Space    Cart Position           [-4.8, +4.8]
                         Cart Velocity           [-inf, +inf]
                         Pole Angle              [-24 deg, +24 deg]
                         Pole Velocity at Tip    [-inf, +inf]
    Action Space         0: Push cart to the left
                         1: Push cart to the right
    Reward               +1 for every step taken
    Termination          Pole Angle is more than 12 degrees
                         Cart Position is more than 2.4
                         Episode length is greater than 500

Table 3: State Space of the Watermarking Environment

    State       (x, ẋ, θ, θ̇)
    State[1]    (-5, 0, -25, 0)
    State[2]    (-5, 0,  25, 0)
    State[3]    ( 5, 0, -25, 0)
    State[4]    ( 5, 0,  25, 0)
    Terminal    (-6, 0, -26, 0)

Per the procedure of the proposed scheme, the action space of this environment is set to be the same as that of CartPole, defined as Actions := {0, 1}. The transition dynamics and reward values of this environment are designed as follows: at State[i], applying Actions[i%2] results in a transition to State[i%4 + 1] and produces a reward of +1. Alternatively, if any action other than Actions[i%2] is played, the environment transitions into the Terminal state, which results in a reward of -1 and the termination of the episode. Hence, the identifier sequence is: ... → State[1] → State[2] → State[3] → State[4] → State[1] → ....

The training procedure of DQN is also modified to implement the switching of environments. To account for the considerably lower complexity of the watermarking environment compared to CartPole, the main environment is set to switch to the watermarking environment every 10 episodes. At this point, the agent interacts with the watermarking environment for a single episode, and then reverts back to the main environment.
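For concreteness, the following is a minimal sketch of a watermarking environment with these dynamics, written against the classic OpenAI Gym API (pre-0.26 step signature); the class and attribute names are illustrative and not taken from the original implementation.

```python
import numpy as np
import gym
from gym import spaces


class WatermarkEnv(gym.Env):
    """Five-state watermark environment mirroring the dynamics of Section 4:
    at State[i], action i % 2 advances the loop and yields +1; any other
    action moves to the Terminal state with reward -1 and ends the episode.
    """

    # Crafted states lie outside CartPole's observation ranges (Table 3).
    STATES = {
        1: np.array([-5.0, 0.0, -25.0, 0.0], dtype=np.float32),
        2: np.array([-5.0, 0.0,  25.0, 0.0], dtype=np.float32),
        3: np.array([ 5.0, 0.0, -25.0, 0.0], dtype=np.float32),
        4: np.array([ 5.0, 0.0,  25.0, 0.0], dtype=np.float32),
    }
    TERMINAL = np.array([-6.0, 0.0, -26.0, 0.0], dtype=np.float32)

    def __init__(self, max_steps=500):
        super().__init__()
        self.action_space = spaces.Discrete(2)          # same as CartPole
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self.max_steps = max_steps                      # N_max, matching E_M
        self.i = 1
        self.steps = 0

    def reset(self):
        self.i = 1
        self.steps = 0
        return self.STATES[self.i]

    def step(self, action):
        self.steps += 1
        if action == self.i % 2:                        # link action a_w
            self.i = self.i % 4 + 1                     # advance the loop
            obs, reward = self.STATES[self.i], 1.0
            done = self.steps >= self.max_steps
        else:                                           # any other action
            obs, reward, done = self.TERMINAL, -1.0, True
        return obs, reward, done, {}
```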
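The environment-switching schedule can likewise be sketched as below; here `agent` stands for any off-the-shelf DQN implementation exposing generic `act` and `observe` hooks (these method names are placeholders, not the API of a specific library), and the 10-episode switching period follows the setup described above.

```python
def train_jointly(agent, main_env, watermark_env,
                  n_episodes=3000, switch_every=10):
    """Alternate between the main and watermark environments: after every
    `switch_every` episodes of the main task, run a single episode in the
    watermark environment (i.e., f_MW = 10 and f_WM = 1).
    """
    for episode in range(n_episodes):
        # One watermark episode after every `switch_every` main episodes.
        is_watermark = episode % (switch_every + 1) == switch_every
        env = watermark_env if is_watermark else main_env

        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)                                  # placeholder API
            next_obs, reward, done, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs, done)       # replay/update
            obs = next_obs
```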
5 Results

Figure 1 presents the training progress of the joint DQN policy in both the CartPole and watermark environments. It can be seen that the joint policy converges in both cases. The convergence of this joint policy is achieved at an increased training cost in comparison to the nominal CartPole DQN policy. This is due to the expansion of the state space and transition dynamics resulting from the integration of the watermark environment.

[Figure 1: Training Performance for Joint CartPole-Watermark Policy — mean 100-episode reward vs. training steps (0 to 300000) for the CartPole and Watermark environments.]

It is also observed that at convergence, the total episodic reward produced by the joint policy in the watermark environment is less than the best-possible value of 500. This is due to the exploration settings of the training algorithm, in which the minimum exploration rate is set to 2%. Considering that a single incorrect action in the watermark environment results in termination, this outcome is in line with expectations.

However, as established in Table 4, in the absence of exploration, the test-time performance of this joint policy in the watermark environment is indeed optimal. This table also verifies that the test-time performance of the joint policy on the main task is on par with that of the nominal (i.e., unwatermarked) DQN policy. Therefore, it can be seen that the watermarking process does not affect the agent's ability to perform the main task. Furthermore, this table presents the results of running unwatermarked policies in the watermark environment. The results indicate that unwatermarked policies fail to follow the identifier trajectory of the watermark. Hence, these results verify the feasibility of our proposed scheme for the sequential watermarking of DRL policies.

Table 4: Test-Time Performance Comparison of Watermarked and Nominal Policies

    Policy             CartPole Performance        Watermark Performance
                       (mean, 100 episodes)        (mean, 100 episodes)
    DQN-Watermarked    500                         500
    DQN                500                         1.4
    A2C                500                         2.81
    PPO2               500                         2.43

6 Discussion

The proposed watermarking scheme presents the potential for adoption in other applications. From an adversarial perspective, this scheme may be used to embed malicious backdoors in DRL policies. For instance, an adversary may apply this scheme to poison a self-driving policy so that it performs harmful actions when a specific sequence of states is presented to the policy. If the adversarial sequence is well-crafted, typical fuzzing-based testing techniques may fail to detect the presence of such backdoors. Therefore, there is a need for new approaches to the detection of such backdoors. A promising solution is the adoption of the activation clustering technique [Chen et al., 2018] developed for the detection of data poisoning attacks in supervised deep models.

Another potential application for this technique is in the area of AI safety. One of the major concerns in this domain is the switch-off problem [Amodei et al., 2016]: if the objective function of an AI agent does not account for or prioritize user demands for the halting of its operation, the resulting optimal policy may prevent any actions which would lead to the halting of the agent's pursuit of its objective. An instance of such actions is any attempt to turn off the agent before it satisfies its objective. A promising solution to this problem is to leverage our proposed scheme to embed debug or halting modes in the policy, which are triggered through a pre-defined sequence of state observations.

References

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[Behzadan and Hsu, 2019] Vahid Behzadan and William Hsu. Adversarial exploitation of policy imitation. arXiv preprint arXiv:1906.01121, 2019.

[Behzadan and Munir, 2018] Vahid Behzadan and Arslan Munir. The faults in our pi stars: Security issues and open challenges in deep reinforcement learning. arXiv preprint arXiv:1810.10369, 2018.

[Chen et al., 2018] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.

[Shih, 2017] Frank Y. Shih. Digital Watermarking and Steganography: Fundamentals and Techniques. CRC Press, 2017.

[Tramèr et al., 2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pages 601–618, 2016.

[Uchida et al., 2017] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin'ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pages 269–277. ACM, 2017.