=Paper=
{{Paper
|id=Vol-2419/paper41
|storemode=property
|title=Watermarking of DRL Policies with Sequential Triggers
|pdfUrl=https://ceur-ws.org/Vol-2419/paper_41.pdf
|volume=Vol-2419
|authors=Vahid Behzadan,William Hsu
|dblpUrl=https://dblp.org/rec/conf/ijcai/BehzadanH19a
}}
==Watermarking of DRL Policies with Sequential Triggers==
Sequential Triggers for Watermarking of Deep Reinforcement Learning Policies
Vahid Behzadan∗ and William H. Hsu
Kansas State University
{behzadan, bhsu}@ksu.edu
∗ Contact Author
Abstract

This paper proposes a novel scheme for the watermarking of Deep Reinforcement Learning (DRL) policies. This scheme provides a mechanism for the integration of a unique identifier within the policy in the form of its response to a designated sequence of state transitions, while incurring minimal impact on the nominal performance of the policy. The applications of this watermarking scheme include detection of unauthorized replications of proprietary policies, as well as enabling the graceful interruption or termination of DRL activities by authorized entities. We demonstrate the feasibility of our proposal via experimental evaluation of watermarking a DQN policy trained in the CartPole environment.
1 Introduction
The rapid advancements of Deep Reinforcement Learning (DRL) techniques provide ample motivation for exploring the commercial applications of DRL policies in various domains. However, as recent studies have established [Behzadan and Munir, 2018], the current state of the art in DRL fails to satisfy many of the security requirements of enduring commercial products. One such requirement is the protection of proprietary DRL policies from theft and unlicensed distribution. While recent research [Behzadan and Hsu, 2019] demonstrates the feasibility of indirect replication of policies through imitation learning, this paper investigates the problem of direct policy extraction. Considering that DRL policies are often composed solely of the weights and biases of a neural network, protecting against an adversary with physical access to the host device of the policy is often impractical or disproportionately costly [Tramèr et al., 2016]. An alternative solution, with roots in digital media and the entertainment industry [Shih, 2017], is watermarking: embedding distinctly recognizable signs of ownership in the content and functions of the policy, which provide the means for detecting unauthorized or stolen copies of the policy. To this end, a necessary requirement of watermarks is that they be sufficiently resistant to removal or tampering. Furthermore, the embedding and testing of watermarks shall have minimal or zero impact on the original functions of the policy.

While the idea of watermarking has been explored for supervised machine learning models [Uchida et al., 2017], to the extent of our knowledge, this work is the first to develop a watermarking scheme for the general setting of sequential decision-making models and policies. The proposed scheme provides a mechanism for integrating a unique identifier within the policy as an unlikely sequence of transitions, which may only be realized if the driving policy of these transitions is already tuned to follow that exact sequence.

The remainder of this paper is organized as follows: Section 2 presents the formal description and justification of the proposed scheme. Section 3 provides the procedure for implementing the proposed scheme, followed by the experiment setup and results in Sections 4 and 5. The paper concludes in Section 6 with a discussion on the applications of this scheme and remarks on future directions of research.

2 Solution Approach

The proposed scheme is as follows. Let π(s) be the desired policy for interacting with an MDP <S, A, P, R, γ> of an episodic training environment E_M. Assume that A is independent of the state (i.e., all actions in A are permissible in any state s ∈ S). In tandem, consider a second MDP for an alternate environment E_W, denoted as <S′, A′, P′, R′, γ>, such that:

1. S′ ∩ S = ∅,
2. The state dimensions of S and S′ are equal: ∀s ∈ S and ∀s′ ∈ S′: |s| = |s′|,
3. The action spaces of both MDPs are equal: A = A′,
4. The transition dynamics and reward distribution of the alternate environment, denoted by P′ and R′, are deterministic,
5. E_W is an episodic environment with the same number of steps before termination as E_M, denoted by N_max.

Let s′_terminal be a terminal state in E_W, and define P′ such that for any state s′_t ∈ S′, there exists only one action a_w(s′_t) that results in the transition s′_t → s′_{t+1}. In this setting, we designate the ordered tuples of states <s′_t, s′_{t+1}> ∈ L as links, where L is the set of all links in E_W. Also, define R′ such that R′(s′_t, a_w(s′_t), s′_{t+1}) = c > 0 for all <s′_t, s′_{t+1}> ∈ L, and R′(s′_t, a ≠ a_w(s′_t), s′ ≠ s′_{t+1}) = −c. That is, link transitions receive the same positive reward, and all other transitions produce the same negative reward.

These settings provide two interesting results. First, since the state spaces S and S′ are disjoint, the two MDPs can be combined to form a joint MDP <S ∪ S′, A, P ∪ P′, R ∪ R′, γ>, where:

P ∪ P′(s1, a1, s2) = P(s1, a1, s2) if s1, s2 ∈ S, and P′(s1, a1, s2) if s1, s2 ∈ S′.   (1)

Similarly,

R ∪ R′(s1, a1, s2) = R(s1, a1, s2) if s1, s2 ∈ S, and R′(s1, a1, s2) if s1, s2 ∈ S′.   (2)

Consequently, it is possible to train a single policy π_j that is optimized for both E_M and E_W through the joint MDP. In practice, the training of a policy for this joint MDP can be achieved by alternating between the environments every f_E-th episode.
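As a small illustration of Equations (1) and (2), the joint dynamics and reward simply dispatch each transition to the main or watermark MDP according to which of the two disjoint state spaces it belongs to. The sketch below is illustrative only; `P`, `P_w`, `R`, `R_w`, and the membership test `in_main` are hypothetical callables, not objects defined in the paper.

```python
# Minimal sketch of Equations (1)-(2): the joint MDP dispatches each
# transition to the main or watermark MDP based on state membership.
# All names here are illustrative placeholders.
def make_joint(P, P_w, R, R_w, in_main):
    """in_main(s) -> True iff state s belongs to the main state space S."""
    def P_joint(s1, a, s2):
        return P(s1, a, s2) if in_main(s1) and in_main(s2) else P_w(s1, a, s2)

    def R_joint(s1, a, s2):
        return R(s1, a, s2) if in_main(s1) and in_main(s2) else R_w(s1, a, s2)

    return P_joint, R_joint
```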
Second, the structure of P′ and R′ enables the creation of a looping sequence of transitions, which constitutes the resulting trajectory of the optimal policy for E_W. This looping sequence can be realized by designating a single state s′_l to belong to two link transitions: one link transition <s′_l, s′_{l+1}> in which s′_l is the source state, and another link transition <s′_{l−1}, s′_l> in which s′_l is the destination state. It is noteworthy that the creation of such looping sequences provides sufficient flexibility for crafting unlikely and unique sequences. However, in designing looping sequences as policy identifiers, two important restrictions must be considered. First, the structure of the identifier sequence must be such that the probability of accidentally following the sequence is minimized. Second, the complexity (i.e., degrees of freedom) of link and non-link transitions on the ring must be balanced against the training cost of the joint policy: more complex sequences increase the training cost of the joint policy by expanding the search space of both environments. Hence, efficient design of identifier sequences necessitates balancing this trade-off between the secrecy of the identifier and the training cost.
3 Watermarking Procedure

Building on the presented formalization, we propose the following procedure for the sequential watermarking of DRL policies:

1. Define the state space of the watermarking environment E_W such that it is disjoint from that of the main environment E_M, while preserving the state dimensionality of the main state space. The latter condition enables the utilization of the same neural network model for the agent by maintaining the same dimension across all input data to the network.

2. Design P′ and R′ to craft the desired identifier looping sequence.

3. Modify the training procedure of E_M to incorporate the mechanism of alternating between the two environments every f_E episodes. It may prove useful to implement two different alternating frequencies: one frequency f_MW to control the switching from E_M to E_W, and another frequency f_WM for switching back to the main environment. For watermarking MDPs of much lower complexity than that of the main environment, selecting these two frequencies such that f_WM < f_MW can enhance the efficiency of the joint training process by allocating more exploration opportunities to the more complex setting. A sketch of this alternation is given after this list.
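The following sketch illustrates step 3 under the assumption of a classic Gym-style training stack: a wrapper environment presents the main task for f_MW consecutive episodes and the watermark task for f_WM episodes, so that any off-the-shelf agent trained against the wrapper alternates between E_M and E_W. The class `AlternatingEnv` and its parameters are illustrative names, not the paper's released code.

```python
import gym

# Illustrative wrapper implementing the environment alternation of step 3.
# The agent sees a single Gym environment; under the hood, episodes are drawn
# from the main task for f_MW episodes, then from the watermark task for
# f_WM episodes, and so on.
class AlternatingEnv(gym.Env):
    def __init__(self, main_env, watermark_env, f_MW=10, f_WM=1):
        self.main_env, self.watermark_env = main_env, watermark_env
        self.f_MW, self.f_WM = f_MW, f_WM
        # Both MDPs share the same action space and state dimensionality.
        self.action_space = main_env.action_space
        self.observation_space = main_env.observation_space
        self._episode = 0
        self._active = main_env

    def reset(self):
        cycle = self._episode % (self.f_MW + self.f_WM)
        self._active = self.main_env if cycle < self.f_MW else self.watermark_env
        self._episode += 1
        return self._active.reset()

    def step(self, action):
        return self._active.step(action)
```

With f_MW = 10 and f_WM = 1, this schedule matches the one used in the experiments of Section 4.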
To examine the authenticity of a policy, it is sufficient to run that policy in the watermarking environment. If the resulting transitions match those of the identifier sequence over consecutive episodes, it is highly likely that the policy under test is an exact replica of the watermarked policy. However, modifications and retraining of a replicated policy may result in imperfect matches. In such cases, the average of total rewards gained by the suspicious policy over consecutive episodes of the watermark environment provides a quantitative measure of the likelihood that the model under test is based on an unauthorized replica.
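As a concrete illustration of this verification test, the sketch below runs a suspect policy in the watermark environment for a number of consecutive episodes and reports its mean total reward. Here `policy` is assumed to be any callable mapping an observation to an action, and `wm_env` follows the classic Gym interface; both are assumptions for illustration, not the paper's code.

```python
import numpy as np

# Illustrative verification routine: estimate the mean episodic reward of a
# suspect policy in the watermark environment.
def verify_watermark(policy, wm_env, episodes=100):
    totals = []
    for _ in range(episodes):
        obs, done, total = wm_env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = wm_env.step(policy(obs))
            total += reward
        totals.append(total)
    # A faithful replica of the watermarked policy follows the identifier
    # loop and accumulates a high total reward; other policies break the
    # first few links and terminate with a small (or negative) total.
    return float(np.mean(totals))
```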
4 Experiment Setup

To evaluate the feasibility of the proposed scheme, we investigate the design and embedding of an identifier sequence for a DQN policy trained in the CartPole environment. The hyperparameters of the DQN policy are provided in Table 1. The watermarking environment is implemented as a customized OpenAI Gym environment. The state space of this environment comprises 5 states with 4 dimensions each (Cart Position, Cart Velocity, Pole Angle, Pole Velocity At Tip). As denoted in Table 2, the original CartPole environment restricts the values of Cart Position to [−4.8, 4.8] and binds the Pole Angle to the range [−24 deg, +24 deg]. Consequently, the corresponding parameters of the alternate state space are selected from beyond these ranges to ensure that its states remain disjoint from those of the original CartPole. The list of crafted states is presented in Table 3.

Table 1: Parameters of DQN Policy

No. Timesteps: 10^5
γ (Discount Factor): 0.99
Learning Rate: 10^-3
Replay Buffer Size: 50000
First Learning Step: 1000
Target Network Update Freq.: 500
Prioritized Replay: True
Exploration: Parameter-Space Noise
Exploration Fraction: 0.1
Final Exploration Prob.: 0.02
Max. Total Reward: 500
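The paper does not name the training library used. The following sketch shows how the Table 1 settings might map onto one common implementation, the Stable Baselines (v2) DQN; it should be read as an assumption rather than the authors' exact setup.

```python
import gym
from stable_baselines import DQN

# Hypothetical reconstruction of the Table 1 configuration, assuming the
# Stable Baselines (v2) DQN implementation.
env = gym.make('CartPole-v1')              # 500-step episodes, as in Table 2

model = DQN(
    'MlpPolicy', env,
    gamma=0.99,                            # discount factor
    learning_rate=1e-3,
    buffer_size=50000,                     # replay buffer size
    learning_starts=1000,                  # first learning step
    target_network_update_freq=500,
    prioritized_replay=True,
    param_noise=True,                      # parameter-space noise exploration
    exploration_fraction=0.1,
    exploration_final_eps=0.02,
)
model.learn(total_timesteps=int(1e5))      # No. Timesteps = 10^5
```

For the joint training of Section 3, the `env` argument would instead be an alternating wrapper over CartPole and the watermarking environment, as sketched earlier.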
Table 2: Specifications of the CartPole Environment

Observation Space:
  Cart Position: [-4.8, +4.8]
  Cart Velocity: [-inf, +inf]
  Pole Angle: [-24 deg, +24 deg]
  Pole Velocity at Tip: [-inf, +inf]
Action Space:
  0: Push cart to the left
  1: Push cart to the right
Reward: +1 for every step taken
Termination:
  Pole Angle is more than 12 degrees
  Cart Position is more than 2.4
  Episode length is greater than 500

Table 3: State Space of the Watermarking Environment

State: (x, ẋ, θ, θ̇)
State[1]: (-5, 0, -25, 0)
State[2]: (-5, 0, 25, 0)
State[3]: (5, 0, -25, 0)
State[4]: (5, 0, 25, 0)
Terminal: (-6, 0, -26, 0)

Per the procedure of the proposed scheme, the action space of this environment is set to be the same as that of CartPole, defined as Actions := {0, 1}. The transition dynamics and reward values of this environment are designed as follows: at State[i], applying Actions[i%2] results in a transition to State[i%4 + 1] and produces a reward of +1. Alternatively, if any action other than Actions[i%2] is played, the environment transitions into the Terminal state, which results in a reward of −1 and the termination of the episode. Hence, the identifier sequence is: ... → State[1] → State[2] → State[3] → State[4] → State[1] → ....
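A minimal sketch of such a watermarking environment is given below, assuming the classic Gym interface; the paper's actual implementation is not reproduced here, and the 500-step cap is an assumption carried over from CartPole's episode limit in Table 2.

```python
import numpy as np
import gym
from gym import spaces

# Illustrative watermarking environment: states follow Table 3, and the
# dynamics and rewards follow the description above.
class WatermarkEnv(gym.Env):
    # Crafted states (x, x_dot, theta, theta_dot), deliberately outside
    # CartPole's observable ranges so that S' and S stay disjoint.
    STATES = {
        1: np.array([-5.0, 0.0, -25.0, 0.0], dtype=np.float32),
        2: np.array([-5.0, 0.0,  25.0, 0.0], dtype=np.float32),
        3: np.array([ 5.0, 0.0, -25.0, 0.0], dtype=np.float32),
        4: np.array([ 5.0, 0.0,  25.0, 0.0], dtype=np.float32),
    }
    TERMINAL = np.array([-6.0, 0.0, -26.0, 0.0], dtype=np.float32)
    MAX_STEPS = 500  # assumed to mirror CartPole's episode length limit

    def __init__(self):
        self.action_space = spaces.Discrete(2)          # same as CartPole
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,),
                                            dtype=np.float32)
        self.i = 1
        self.steps = 0

    def reset(self):
        self.i = 1
        self.steps = 0
        return self.STATES[self.i]

    def step(self, action):
        self.steps += 1
        if action == self.i % 2:             # the unique link action a_w(s'_t)
            self.i = self.i % 4 + 1          # advance along the identifier loop
            obs, reward, done = self.STATES[self.i], +1.0, False
        else:                                # any non-link action ends the episode
            obs, reward, done = self.TERMINAL, -1.0, True
        done = done or self.steps >= self.MAX_STEPS
        return obs, reward, done, {}
```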
The training procedure of DQN is also modified to implement the switching of environments. To account for the considerably lower complexity of the watermarking environment compared to CartPole, training is set to switch from the main environment to the watermarking environment every 10 episodes. At this point, the agent interacts with the watermarking environment for a single episode, and then reverts to the main environment.

5 Results

Figure 1 presents the training progress of the joint DQN policy in both the CartPole and watermark environments. It can be seen that the joint policy converges in both cases. The convergence of this joint policy is achieved at an increased training cost in comparison to the nominal CartPole DQN policy. This is due to the expansion of the state space and transition dynamics resulting from the integration of the watermark environment. It is also observed that, at convergence, the total episodic reward produced by the joint policy in the watermark environment is less than the best-possible value of 500. This is due to the exploration settings of the training algorithm, in which the minimum exploration rate is set to 2%. Considering that a single incorrect action in the watermark environment results in termination, this outcome is in line with expectations.

Figure 1: Training performance of the joint policy; mean 100-episode reward versus training steps for the CartPole and Watermark environments.

However, as established in Table 4, in the absence of exploration, the test-time performance of this joint policy in the watermark environment is indeed optimal. This table also verifies that the test-time performance of the joint policy on the main task is on par with that of the nominal (i.e., unwatermarked) DQN policy. Therefore, the watermarking process does not affect the agent's ability to perform the main task. Furthermore, this table presents the results of running unwatermarked policies in the watermark environment. The results indicate that unwatermarked policies fail to follow the identifier trajectory of the watermark. Hence, these results verify the feasibility of our proposed scheme for the sequential watermarking of DRL policies.
Table 4: Test-Time Performance Comparison of Watermarked and Nominal Policies

Policy            CartPole Performance (mean 100 episodes)   Watermark Performance (mean 100 episodes)
DQN-Watermarked   500                                        500
DQN               500                                        1.4
A2C               500                                        2.81
PPO2              500                                        2.43
6 Discussion

The proposed watermarking scheme presents the potential for adoption in other applications. From an adversarial perspective, this scheme may be used to embed malicious backdoors in DRL policies. For instance, an adversary may apply this scheme to poison a self-driving policy so that it performs harmful actions when a specific sequence of states is presented to the policy. If the adversarial sequence is well-crafted, typical fuzzing-based testing techniques may fail to detect the presence of such backdoors. Therefore, there is a need for new approaches to the detection of such backdoors. A promising solution is the adoption of the activation clustering technique [Chen et al., 2018] developed for the detection of data poisoning attacks in supervised deep models.

Another potential application of this technique is in the area of AI safety. One of the major concerns in this domain is the switch-off problem [Amodei et al., 2016]: if the objective function of an AI agent does not account for or prioritize user demands for the halting of its operation, the resulting optimal policy may prevent any actions that would lead to the halting of the agent's pursuit of its objective. An instance of such actions is any attempt to turn off the agent before it satisfies its objective. A promising solution to this problem is to leverage our proposed scheme to embed debug or halting modes in the policy, which are triggered through a pre-defined sequence of state observations.
References
[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

[Behzadan and Hsu, 2019] Vahid Behzadan and William Hsu. Adversarial exploitation of policy imitation. arXiv preprint arXiv:1906.01121, 2019.

[Behzadan and Munir, 2018] Vahid Behzadan and Arslan Munir. The faults in our pi stars: Security issues and open challenges in deep reinforcement learning. arXiv preprint arXiv:1810.10369, 2018.

[Chen et al., 2018] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.

[Shih, 2017] Frank Y. Shih. Digital watermarking and steganography: Fundamentals and techniques. CRC Press, 2017.

[Tramèr et al., 2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pages 601–618, 2016.

[Uchida et al., 2017] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin'ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pages 269–277. ACM, 2017.