Sequential Triggers for Watermarking of Deep Reinforcement Learning Policies

Vahid Behzadan* and William H. Hsu
Kansas State University
{behzadan, bhsu}@ksu.edu

* Contact Author

Abstract

This paper proposes a novel scheme for the watermarking of Deep Reinforcement Learning (DRL) policies. This scheme provides a mechanism for the integration of a unique identifier within the policy in the form of its response to a designated sequence of state transitions, while incurring minimal impact on the nominal performance of the policy. The applications of this watermarking scheme include the detection of unauthorized replications of proprietary policies, as well as enabling the graceful interruption or termination of DRL activities by authorized entities. We demonstrate the feasibility of our proposal via experimental evaluation of watermarking a DQN policy trained in the CartPole environment.

1 Introduction

The rapid advancement of Deep Reinforcement Learning (DRL) techniques provides ample motivation for exploring the commercial applications of DRL policies in various domains. However, as recent studies have established [Behzadan and Munir, 2018], the current state of the art in DRL fails to satisfy many of the security requirements of enduring commercial products. One such requirement is the protection of proprietary DRL policies from theft and unlicensed distribution. While recent research [Behzadan and Hsu, 2019] demonstrates the feasibility of indirect replication of policies through imitation learning, this paper investigates the problem of direct policy extraction. Considering that DRL policies are often composed solely of the weights and biases of a neural network, protecting against an adversary with physical access to the host device of the policy is often impractical or disproportionately costly [Tramèr et al., 2016]. An alternative solution, with roots in digital media and the entertainment industry [Shih, 2017], is watermarking: embedding distinctly recognizable signs of ownership in the content and functions of the policy, which provide the means for detecting unauthorized or stolen copies. To this end, a necessary requirement of watermarks is that they be sufficiently resistant to removal or tampering. Furthermore, the embedding and testing of watermarks shall result in minimal or zero impact on the original functions of the policy.

While the idea of watermarking has been explored for supervised machine learning models [Uchida et al., 2017], to the extent of our knowledge, this work is the first to develop a watermarking scheme for the general settings of sequential decision-making models and policies. The proposed scheme provides a mechanism for integrating a unique identifier within the policy as an unlikely sequence of transitions, which may only be realized if the driving policy of these transitions is already tuned to follow that exact sequence.

The remainder of this paper is organized as follows: Section 2 presents the formal description and justification of the proposed scheme. Section 3 provides the procedure for implementing the proposed scheme, followed by the experiment setup and results in Sections 4 and 5. The paper concludes in Section 6 with a discussion on the applications of this scheme and remarks on future directions of research.
2 Solution Approach

The proposed scheme is as follows. Let π(s) be the desired policy for interacting with an MDP <S, A, P, R, γ> for an episodic training environment E_M. Assume that A is independent of the state (i.e., all actions in A are permissible in any state s ∈ S). In tandem, consider a second MDP for an alternate environment E_W, denoted as <S', A', P', R', γ>, such that:

1. S' ∩ S = ∅,
2. The state dimensions of S and S' are equal: ∀s ∈ S and ∀s' ∈ S': |s| = |s'|,
3. The action spaces of both MDPs are equal: A = A',
4. The transition dynamics and reward distribution of the alternate environment, denoted by P' and R', are deterministic,
5. E_W is an episodic environment with the same number of steps before termination as E_M, denoted by N_max.

Let s'_terminal be a terminal state in E_W, and define P' such that for any state s'_t ∈ S', there exists only one action a_w(s'_t) that will result in the transition s'_t → s'_{t+1}. In this setting, we designate the ordered tuples of states <s'_t, s'_{t+1}> ∈ L as links, where L is the set of all links in E_W. Also, define R' such that R'(s'_t, a_w(s'_t), s'_{t+1}) = c > 0 for all <s'_t, s'_{t+1}> ∈ L, and R'(s'_t, a ≠ a_w(s'_t), s' ≠ s'_{t+1}) = −c. That is, link transitions receive the same positive reward, and all other transitions produce the same negative reward.

These settings provide two interesting results. First, since the state spaces S and S' are disjoint, the two MDPs can be combined to form a joint MDP <S ∪ S', A, P ∪ P', R ∪ R', γ>, where:

    (P ∪ P')(s_1, a_1, s_2) =  P(s_1, a_1, s_2)   if s_1, s_2 ∈ S
                               P'(s_1, a_1, s_2)  if s_1, s_2 ∈ S'        (1)

Similarly,

    (R ∪ R')(s_1, a_1, s_2) =  R(s_1, a_1, s_2)   if s_1, s_2 ∈ S
                               R'(s_1, a_1, s_2)  if s_1, s_2 ∈ S'        (2)

Consequently, it is possible to train a single policy π_j that is optimized for both E_M and E_W through the joint MDP. In practice, the training of a policy for this joint MDP can be achieved by alternating between the environments at every f_E-th episode.

Second, the structure of P' and R' enables the creation of a looping sequence of transitions, which constitutes the resulting trajectory of the optimal policy for E_W. This looping sequence can be realized by designating a single state s'_l to belong to two link transitions: a link transition <s'_l, s'_{l+1}> in which s'_l is the source state, and another link transition <s'_{l-1}, s'_l> in which s'_l is the destination state. It is noteworthy that the creation of such looping sequences provides sufficient flexibility for crafting unlikely and unique sequences. However, in designing looping sequences as policy identifiers, two important restrictions must be considered. First, the structure of an identifier sequence needs to be such that the probability of accidentally following the sequence is minimized. Second, the complexity (i.e., degrees of freedom) of link and non-link transitions on the ring must be balanced against the training cost of the joint policy: more complex sequences will increase the training cost of the joint policy by expanding the search space of both environments. Hence, the efficient design of identifier sequences will necessitate balancing this trade-off between the secrecy of the identifier and the training cost.
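To make the first restriction concrete, consider the following back-of-the-envelope estimate (an illustration added here under the assumption of a uniformly random baseline policy, not a bound derived as part of the scheme itself). Since at every watermark state only one of the |A| available actions continues the link sequence, a policy that is unrelated to the watermark and selects actions uniformly at random from A follows an N_max-step identifier sequence with probability

    Pr[accidental match per episode] = (1/|A|)^{N_max},

and matches over k consecutive verification episodes with probability (1/|A|)^{k·N_max}. For the CartPole-scale setting of Section 4 (|A| = 2, N_max = 500), this is on the order of 2^{-500} per episode, which indicates how quickly longer looping sequences render accidental matches negligible.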
3 Watermarking Procedure

Building on the presented formalization, we propose the following procedure for the sequential watermarking of DRL policies:

1. Define the state space of the watermarking environment E_W such that it is disjoint from that of the main environment E_M, while preserving the state dimensionality of the main state space. The latter condition enables the utilization of the same neural network model for the agent by maintaining the same dimension across all input data to the network.

2. Design P' and R' to craft the desired identifier looping sequence.

3. Modify the training procedure of E_M to incorporate the mechanism of alternating between the two environments every f_E episodes. It may prove useful to implement two different alternating frequencies: one frequency f_MW to control the switching from E_M to E_W, and another frequency f_WM for switching back to the main environment. For watermarking MDPs of much lower complexity than that of the main environment, selecting these two frequencies such that f_WM < f_MW can enhance the efficiency of the joint training process by allocating more exploration opportunities to the more complex setting.

To examine the authenticity of policies, it is sufficient to run those policies in the watermarking environment. If the resulting transitions match those of the identifier sequence in consecutive episodes, it is highly likely that the policy under test is an exact replica of the watermarked policy. However, modifications and retraining of a replicated policy may result in imperfect matches. In such cases, the average of total rewards gained by the suspect policy over consecutive episodes of the watermark environment provides a quantitative measure of the possibility that the model under test is based on an unauthorized replica.
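The following is a minimal sketch of this verification step, assuming a classic Gym-style environment (step returning observation, reward, done, info) and a generic policy callable that maps an observation to an action; the function name and usage shown here are illustrative rather than part of the original implementation.

```python
import numpy as np


def verify_watermark(policy, watermark_env, n_episodes=100):
    """Run a suspect policy in the watermark environment for consecutive
    episodes and report its mean total reward, which serves as the
    quantitative measure described above.

    `policy` is any callable mapping an observation to an action, and
    `watermark_env` follows the classic Gym API (reset/step).
    """
    episode_rewards = []
    for _ in range(n_episodes):
        obs = watermark_env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = policy(obs)
            obs, reward, done, _ = watermark_env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
    return float(np.mean(episode_rewards))


# Illustrative usage: a mean reward close to the maximum attainable in the
# watermark environment indicates a likely replica, while unrelated policies
# terminate within a few steps (cf. Table 4).
# score = verify_watermark(suspect_policy, watermark_env)
```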
4 Experiment Setup

To evaluate the feasibility of the proposed scheme, we investigate the design and embedding of an identifier sequence for a DQN policy in the CartPole environment. The hyperparameters of the DQN policy are provided in Table 1. The watermarking environment is implemented as a customized OpenAI Gym environment. The state space of this environment comprises 5 states with 4 dimensions each (Cart Position, Cart Velocity, Pole Angle, Pole Velocity at Tip). As denoted in Table 2, the original CartPole environment restricts the values of Cart Position to [-4.8, 4.8] and binds the Pole Angle to the range [-24 deg, 24 deg]. Consequently, the corresponding parameters of the alternate state space are selected from beyond these ranges to ensure that the states remain disjoint from those of the original CartPole. The list of crafted states is presented in Table 3.

Table 1: Parameters of the DQN Policy

    No. Timesteps                  10^5
    γ                              0.99
    Learning Rate                  10^-3
    Replay Buffer Size             50000
    First Learning Step            1000
    Target Network Update Freq.    500
    Prioritized Replay             True
    Exploration                    Parameter-Space Noise
    Exploration Fraction           0.1
    Final Exploration Prob.        0.02
    Max. Total Reward              500

Table 2: Specifications of the CartPole Environment

    Observation Space    Cart Position           [-4.8, +4.8]
                         Cart Velocity           [-inf, +inf]
                         Pole Angle              [-24 deg, +24 deg]
                         Pole Velocity at Tip    [-inf, +inf]
    Action Space         0: Push cart to the left
                         1: Push cart to the right
    Reward               +1 for every step taken
    Termination          Pole Angle is more than 12 degrees
                         Cart Position is more than 2.4
                         Episode length is greater than 500

Table 3: State Space of the Watermarking Environment

    State       (x, ẋ, θ, θ̇)
    State[1]    (-5, 0, -25, 0)
    State[2]    (-5, 0,  25, 0)
    State[3]    ( 5, 0, -25, 0)
    State[4]    ( 5, 0,  25, 0)
    Terminal    (-6, 0, -26, 0)

Per the procedure of the proposed scheme, the action space of this environment is set to be the same as that of CartPole, defined as Actions := {0, 1}. The transition dynamics and reward values of this environment are designed as follows: at State[i], applying Actions[i%2] results in a transition to State[i%4 + 1] and produces a reward of +1. Alternatively, if any action other than Actions[i%2] is played, the environment transitions into the Terminal state, which results in a reward of -1 and the termination of the episode. Hence, the identifier sequence is: ... → State[1] → State[2] → State[3] → State[4] → State[1] → ....

The training procedure of DQN is also modified to implement the switching of environments. To account for the considerably lower complexity of the watermarking environment compared to CartPole, the main environment is set to switch to the watermarking environment every 10 episodes. At this point, the agent interacts with the watermarking environment for a single episode, and then reverts back to the main environment.
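For concreteness, the following is a minimal sketch of a watermarking environment with these dynamics, written against the classic OpenAI Gym API (pre-0.26 step signature); the class and attribute names are illustrative and not taken from the original implementation.

```python
import numpy as np
import gym
from gym import spaces


class WatermarkEnv(gym.Env):
    """Five-state watermark environment mirroring the dynamics of Section 4:
    at State[i], action i % 2 advances the loop and yields +1; any other
    action moves to the Terminal state with reward -1 and ends the episode.
    """

    # Crafted states lie outside CartPole's observation ranges (Table 3).
    STATES = {
        1: np.array([-5.0, 0.0, -25.0, 0.0], dtype=np.float32),
        2: np.array([-5.0, 0.0,  25.0, 0.0], dtype=np.float32),
        3: np.array([ 5.0, 0.0, -25.0, 0.0], dtype=np.float32),
        4: np.array([ 5.0, 0.0,  25.0, 0.0], dtype=np.float32),
    }
    TERMINAL = np.array([-6.0, 0.0, -26.0, 0.0], dtype=np.float32)

    def __init__(self, max_steps=500):
        super().__init__()
        self.action_space = spaces.Discrete(2)          # same as CartPole
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)
        self.max_steps = max_steps                      # N_max, matching E_M
        self.i = 1
        self.steps = 0

    def reset(self):
        self.i = 1
        self.steps = 0
        return self.STATES[self.i]

    def step(self, action):
        self.steps += 1
        if action == self.i % 2:                        # link action a_w
            self.i = self.i % 4 + 1                     # advance the loop
            obs, reward = self.STATES[self.i], 1.0
            done = self.steps >= self.max_steps
        else:                                           # any other action
            obs, reward, done = self.TERMINAL, -1.0, True
        return obs, reward, done, {}
```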
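The environment-switching schedule can likewise be sketched as below; here `agent` stands for any off-the-shelf DQN implementation exposing generic `act` and `observe` hooks (these method names are placeholders, not the API of a specific library), and the 10-episode switching period follows the setup described above.

```python
def train_jointly(agent, main_env, watermark_env,
                  n_episodes=3000, switch_every=10):
    """Alternate between the main and watermark environments: after every
    `switch_every` episodes of the main task, run a single episode in the
    watermark environment (i.e., f_MW = 10 and f_WM = 1).
    """
    for episode in range(n_episodes):
        # One watermark episode after every `switch_every` main episodes.
        is_watermark = episode % (switch_every + 1) == switch_every
        env = watermark_env if is_watermark else main_env

        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs)                                  # placeholder API
            next_obs, reward, done, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs, done)       # replay/update
            obs = next_obs
```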
5 Results

Figure 1 presents the training progress of the joint DQN policy in both the CartPole and watermark environments. It can be seen that the joint policy converges in both cases. The convergence of this joint policy is achieved at an increased training cost in comparison to the nominal CartPole DQN policy. This is due to the expansion of the state space and transition dynamics resulting from the integration of the watermark environment.

[Figure 1: Training Performance for Joint CartPole-Watermark Policy — mean 100-episode reward vs. training steps (0 to 300000) for the CartPole and Watermark environments.]

It is also observed that at convergence, the total episodic reward produced by the joint policy in the watermark environment is less than the best-possible value of 500. This is due to the exploration settings of the training algorithm, in which the minimum exploration rate is set to 2%. Considering that a single incorrect action in the watermark environment results in termination, this outcome is in line with expectations.

However, as established in Table 4, in the absence of exploration, the test-time performance of this joint policy in the watermark environment is indeed optimal. This table also verifies that the test-time performance of the joint policy on the main task is on par with that of the nominal (i.e., unwatermarked) DQN policy. Therefore, it can be seen that the watermarking process does not affect the agent's ability to perform the main task. Furthermore, this table presents the results of running unwatermarked policies in the watermark environment. The results indicate that unwatermarked policies fail to follow the identifier trajectory of the watermark. Hence, these results verify the feasibility of our proposed scheme for the sequential watermarking of DRL policies.

Table 4: Test-Time Performance Comparison of Watermarked and Nominal Policies

    Policy             CartPole Performance        Watermark Performance
                       (mean, 100 episodes)        (mean, 100 episodes)
    DQN-Watermarked    500                         500
    DQN                500                         1.4
    A2C                500                         2.81
    PPO2               500                         2.43

6 Discussion

The proposed watermarking scheme presents the potential for adoption in other applications. From an adversarial perspective, this scheme may be used to embed malicious backdoors in DRL policies. For instance, an adversary may apply this scheme to poison a self-driving policy so that it performs harmful actions when a specific sequence of states is presented to the policy. If the adversarial sequence is well-crafted, typical fuzzing-based testing techniques may fail to detect the presence of such backdoors. Therefore, there is a need for new approaches to the detection of such backdoors. A promising solution is the adoption of the activation clustering technique [Chen et al., 2018] developed for the detection of data poisoning attacks in supervised deep models.

Another potential application for this technique is in the area of AI safety. One of the major concerns in this domain is the switch-off problem [Amodei et al., 2016]: if the objective function of an AI agent does not account for or prioritize user demands for the halting of its operation, the resulting optimal policy may prevent any actions which would lead to the halting of the agent's pursuit of its objective. An instance of such actions is any attempt to turn off the agent before it satisfies its objective. A promising solution to this problem is to leverage our proposed scheme to embed debug or halting modes in the policy, which are triggered through a pre-defined sequence of state observations.

References

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[Behzadan and Hsu, 2019] Vahid Behzadan and William Hsu. Adversarial exploitation of policy imitation. arXiv preprint arXiv:1906.01121, 2019.

[Behzadan and Munir, 2018] Vahid Behzadan and Arslan Munir. The faults in our pi stars: Security issues and open challenges in deep reinforcement learning. arXiv preprint arXiv:1810.10369, 2018.

[Chen et al., 2018] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.

[Shih, 2017] Frank Y. Shih. Digital Watermarking and Steganography: Fundamentals and Techniques. CRC Press, 2017.

[Tramèr et al., 2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pages 601–618, 2016.

[Uchida et al., 2017] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin'ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pages 269–277. ACM, 2017.