<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequential Triggers for Watermarking of Deep Reinforcement Learning Policies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vahid Behzadan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William H. Hsu</string-name>
          <email>bhsug@ksu.edu</email>
        </contrib>
      </contrib-group>
      <abstract>
<p>This paper proposes a novel scheme for the watermarking of Deep Reinforcement Learning (DRL) policies. This scheme provides a mechanism for the integration of a unique identifier within the policy in the form of its response to a designated sequence of state transitions, while incurring minimal impact on the nominal performance of the policy. The applications of this watermarking scheme include detection of unauthorized replications of proprietary policies, as well as enabling the graceful interruption or termination of DRL activities by authorized entities. We demonstrate the feasibility of our proposal via experimental evaluation of watermarking a DQN policy trained in the CartPole environment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Rapid advancements in Deep Reinforcement
Learning (DRL) techniques provide ample motivation for
exploring the commercial applications of DRL policies in various
domains. However, as recent studies have established
[Behzadan and Munir, 2018], the current state of the art in DRL
fails to satisfy many of the security requirements of enduring
commercial products. One such requirement is the protection
of proprietary DRL policies from theft and unlicensed
distribution. While recent research [Behzadan and Hsu, 2019]
demonstrates the feasibility of indirect replication of policies
through imitation learning, this paper investigates the
problem of direct policy extraction. Considering that DRL
policies are often composed solely of the weights and biases of
a neural network, protecting against an adversary with
physical access to the host device of the policy is often impractical
or disproportionately costly [Tramèr et al., 2016]. An
alternative solution, with roots in digital media and the
entertainment industry [Shih, 2017], is watermarking: embedding
distinctly recognizable signs of ownership in the content and
functions of the policy, which provide the means for
detecting unauthorized or stolen copies of the policy. To this end,
watermarks must be sufficiently resistant
to removal or tampering. Furthermore, the
embedding and testing of watermarks should have minimal or no
impact on the original functions of the policy.</p>
<p>While the idea of watermarking has been explored for
supervised machine learning models [Uchida et al., 2017], to the
best of our knowledge, this work is the first to develop a
watermarking scheme for the general setting of sequential
decision-making models and policies. The proposed scheme
provides a mechanism for integrating a unique identifier within
the policy as an unlikely sequence of transitions, which may
only be realized if the driving policy of these transitions is
already tuned to follow that exact sequence.</p>
      <p>The remainder of this paper is organized as follows:
Section 2 presents the formal description and justification of the
proposed scheme. Section 3 provides the procedure for
implementing the proposed scheme, followed by the experiment
setup and results in Sections 4 and 5. The paper concludes in
Section 6 with a discussion on the applications of this scheme
and remarks on future directions of research.
</p>
    </sec>
    <sec id="sec-2">
      <title>Solution Approach</title>
<p>The proposed scheme is as follows. Let $\pi(s)$ be the desired
policy for interacting with an MDP $\langle S, A, P, R, \gamma \rangle$ for an
episodic training environment $E_M$. Assume that $A$ is
independent of the state (i.e., all actions in $A$ are permissible in
any state $s \in S$). In tandem, consider a second MDP for an
alternate environment $E_W$, denoted as $\langle S', A', P', R', \gamma' \rangle$,
such that:
1. $S' \cap S = \emptyset$,
2. The state dimensions of $S$ and $S'$ are equal: $\forall s \in S$ and $\forall s' \in S' : |s| = |s'|$,
3. The action-spaces of both MDPs are equal: $A = A'$,
4. The transition dynamics and reward distribution of the alternate environment, denoted by $P'$ and $R'$, are deterministic,
5. $E_W$ is an episodic environment with the same number of steps before termination as $E_M$, denoted by $N_{max}$.</p>
      <p>Let $s'_{terminal}$ be a terminal state in $E_W$, and define $P'$
such that for any state $s'_t \in S'$, there exists only one action
$a_w(s'_t)$ that will result in the transition $s'_t \to s'_{t+1}$. In this
setting, we designate the ordered tuples of states $\langle s'_t, s'_{t+1} \rangle \in L$
as links, where $L$ is the set of all links in $E_W$. Also,
define $R'$ such that $R'(s'_t, a_w(s'_t), s'_{t+1}) = c &gt; 0$ for all
$\langle s'_t, s'_{t+1} \rangle \in L$, and $R'(s'_t, a \neq a_w(s'_t), s' \neq s'_{t+1}) = -c$.
That is, link transitions receive the same positive reward, and
all other transitions produce the same negative reward.</p>
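      <p>To make this construction concrete, the following minimal Python sketch encodes a hypothetical link set $L$ and the corresponding deterministic reward function $R'$; the state labels and the magnitude $c = 1$ are illustrative assumptions, not values prescribed by the scheme.</p>
      <preformat>
# Minimal sketch of the link set L and the deterministic reward R' of E_W.
# State labels and the reward magnitude c are illustrative assumptions.

C = 1.0  # reward magnitude c (positive)

# Each state maps to (the unique watermark action a_w(s'), its unique
# successor state); these ordered pairs are the links in L.
links = {
    "s1": (0, "s2"),
    "s2": (1, "s3"),
    "s3": (0, "s1"),  # s1 is both a source and a destination: a loop
}

def reward(state, action, next_state):
    """R'(s, a, s_next): +c on a link transition, -c on everything else."""
    a_w, successor = links[state]
    if action == a_w and next_state == successor:
        return C
    return -C
      </preformat>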
<p>These settings provide two interesting results. First, since the
state-spaces $S$ and $S'$ are disjoint, the two MDPs can be
combined to form a joint MDP $\langle S \cup S', A, P \cup P', R \cup R', \gamma \rangle$,
where:
$$ (P \cup P')(s_1, a_1, s_2) = \begin{cases} P(s_1, a_1, s_2) &amp; \text{if } s_1, s_2 \in S \\ P'(s_1, a_1, s_2) &amp; \text{if } s_1, s_2 \in S' \end{cases} \quad (1) $$
Similarly,
$$ (R \cup R')(s_1, a_1, s_2) = \begin{cases} R(s_1, a_1, s_2) &amp; \text{if } s_1, s_2 \in S \\ R'(s_1, a_1, s_2) &amp; \text{if } s_1, s_2 \in S' \end{cases} \quad (2) $$</p>
<p>Consequently, it is possible to train a single policy $\pi_j$ that
is optimized for both $E_M$ and $E_W$ through the joint MDP.
In practice, the training of a policy for this joint MDP can
be achieved by alternating between the environments every
$f_E$-th episode, as sketched below.</p>
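      <p>The following is a minimal sketch of this alternating training loop, assuming Gym-style environment objects and a generic agent interface with act and learn methods; these names are illustrative, not part of the formal scheme.</p>
      <preformat>
# Sketch of joint training by alternating environments every f_E episodes.
# env_main (E_M), env_wm (E_W), and agent are illustrative assumptions
# following the standard Gym-style episode interface.

F_E = 10  # alternation period f_E

def run_episode(agent, env):
    """Run one episode, letting the agent learn from each transition."""
    state = env.reset()
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward_value, done, _ = env.step(action)
        agent.learn(state, action, reward_value, next_state, done)
        state = next_state

def train_joint(agent, env_main, env_wm, n_episodes):
    for episode in range(n_episodes):
        # Every f_E-th episode is spent in the watermark environment E_W;
        # all other episodes train on the main environment E_M.
        env = env_wm if episode % F_E == F_E - 1 else env_main
        run_episode(agent, env)
      </preformat>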
<p>Furthermore, the structure of $P'$ and $R'$ enables the creation
of a looping sequence of transitions, which constitutes the
resulting trajectory of the optimal policy for $E_W$. This looping
sequence can be realized by designating a single state $s'_l$ to
belong to two link transitions: a link transition
$\langle s'_l, s'_{l+1} \rangle$ in which $s'_l$ is the source state, and another link
transition $\langle s'_{l-1}, s'_l \rangle$ in which $s'_l$ is the destination state.
It is noteworthy that the creation of such looping sequences
provides sufficient flexibility for crafting unlikely and unique
sequences. However, in designing looping sequences as
policy identifiers, two important restrictions must be
considered. First, the structure of identifier sequences must be
such that the probability of accidentally following
the sequence is minimized. Second, the complexity (i.e.,
degrees of freedom) of link and non-link transitions on the ring
must be balanced against the training cost of the joint
policy: more complex sequences increase the training cost
of the joint policy by expanding the search space of both
environments. Hence, efficient design of identifier sequences
necessitates balancing this trade-off between the secrecy
of the identifier and the training cost.</p>
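      <p>As a rough illustration of the first restriction (a back-of-the-envelope estimate under the assumption of a uniformly random policy over the deterministic $E_W$, not a bound from our analysis): since exactly one of the $|A|$ actions continues the sequence at each step, such a policy follows a designated $N$-step identifier sequence with probability
$$ \Pr[\text{accidental follow}] = |A|^{-N}, \quad \text{e.g., } |A| = 2,\; N = 20 \;\Rightarrow\; 2^{-20} \approx 9.5 \times 10^{-7}, $$
which vanishes rapidly with the length of the sequence.</p>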
    </sec>
    <sec id="sec-3">
      <title>Watermarking Procedure</title>
<p>Building on the presented formalization, we propose the
following procedure for the sequential watermarking of DRL
policies (a sketch of the alternating schedule in step 3 follows this list):
1. Define the state-space of the watermarking environment
$E_W$ such that it is disjoint from that of the main
environment $E_M$, while preserving the state dimensionality
of the main state space. The latter condition enables
the utilization of the same neural network model for the
agent by maintaining the same dimension across all
input data to the network.
2. Design $P'$ and $R'$ to craft the desired identifier looping
sequence.
3. Modify the training procedure of $E_M$ to incorporate
the mechanism of alternating between the two
environments every $f_E$ episodes. It may prove useful to
implement two different alternating frequencies: one
frequency $f_{MW}$ to control the switching from $E_M$ to $E_W$,
and another frequency $f_{WM}$ for switching back to the
main environment. For watermarking MDPs of much
lower complexity than that of the main environment,
selecting these two frequencies such that $f_{WM} &lt; f_{MW}$
can enhance the efficiency of the joint training process
by allocating more exploration opportunities to the more
complex setting.</p>
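      <p>The following Python sketch illustrates the dual-frequency schedule of step 3; the function name and default values are illustrative assumptions (the experiments in Section 4 use the equivalent of $f_{MW} = 10$ and $f_{WM} = 1$).</p>
      <preformat>
from itertools import chain, cycle, islice

# Sketch of the dual-frequency alternation between E_M and E_W:
# f_mw episodes in the main environment, then f_wm episodes in the
# watermark environment, repeating. Names are illustrative.

def environment_schedule(n_episodes, f_mw=10, f_wm=1):
    """Return, per episode, which environment ('main' or 'watermark') to use."""
    pattern = cycle(chain(['main'] * f_mw, ['watermark'] * f_wm))
    return list(islice(pattern, n_episodes))

# Example: 22 episodes under the experimental setting yields
# 10 'main', 1 'watermark', 10 'main', 1 'watermark'.
print(environment_schedule(22))
      </preformat>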
<p>To examine the authenticity of policies, it is sufficient to
run those policies in the watermarking environment. If the
resulting transitions match those of the identifier sequence over
consecutive episodes, it is highly likely that the policy under
test is an exact replica of the watermarked policy. However,
modifications and retraining of a replicated policy may
result in imperfect matches. In such cases, the average of
total rewards gained by the suspect policy over consecutive
episodes of the watermark environment provides a
quantitative measure of the likelihood that the model under test is
based on an unauthorized replica, as sketched below.</p>
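      <p>A minimal verification sketch under these assumptions follows; the Gym-style watermark environment, the greedy policy interface, and the decision threshold are illustrative assumptions rather than parts of the scheme.</p>
      <preformat>
# Sketch of watermark verification: run the suspect policy greedily in E_W
# and report the mean total episodic reward over consecutive episodes.
# env_wm and policy follow an assumed Gym-style interface; the threshold
# below is a hypothetical value, to be calibrated per environment.

def verify_watermark(policy, env_wm, n_episodes=10, reward_threshold=400.0):
    """Return (mean_reward, verdict). A high mean reward over consecutive
    episodes indicates the policy reproduces the identifier sequence."""
    totals = []
    for _ in range(n_episodes):
        state = env_wm.reset()
        done, total = False, 0.0
        while not done:
            action = policy.act(state, greedy=True)  # no exploration at test time
            state, reward_value, done, _ = env_wm.step(action)
            total += reward_value
        totals.append(total)
    mean_reward = sum(totals) / len(totals)
    return mean_reward, mean_reward >= reward_threshold
      </preformat>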
    </sec>
    <sec id="sec-4">
      <title>Experiment Setup</title>
<p>To evaluate the feasibility of the proposed scheme, the design
and embedding of an identifier sequence for a DQN policy in
the CartPole environment is investigated. Hyperparameters of
the DQN policy are provided in Table 1. The watermarking
environment is implemented as a customized OpenAI Gym
environment. The state space of this environment comprises
5 states with 4 dimensions each (Cart Position, Cart
Velocity, Pole Angle, Pole Velocity At Tip). As denoted in
Table 2, the original CartPole environment restricts the values
of Cart Position to $[-4.8, 4.8]$, and binds the Pole Angle to
the range $[-24^{\circ}, 24^{\circ}]$. Consequently, the
corresponding parameters of the alternate state-space are selected from
beyond these ranges to ensure that the states remain disjoint
from those of the original CartPole. The list of crafted states
is presented in Table 3.</p>
<p>Per the procedure of the proposed scheme, the action-space
of this environment is set to be the same as that of
CartPole, defined as Actions := {0, 1}. The transition
dynamics and reward values of this environment are designed as
follows: at State[i], applying Actions[i % 2] results in a
transition to State[i % 4 + 1], and produces a reward of +1.
Alternatively, if any action other than Actions[i % 2] is played, the
environment transitions into the Terminal state, which results
in a reward of -1 and the termination of the episode. Hence,
the identifier sequence is as follows: … → State[1] →
State[2] → State[3] → State[4] → State[1] → ….</p>
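      <p>A minimal sketch of such a watermark environment is given below. The concrete state vectors are hypothetical placeholders (Table 3 lists the actual crafted states); they only need to lie outside the CartPole ranges above, and the pre-0.26 Gym step interface is an assumption of the sketch.</p>
      <preformat>
import numpy as np
import gym
from gym import spaces

class WatermarkEnv(gym.Env):
    """Sketch of the watermark environment E_W. State values are
    hypothetical placeholders chosen outside CartPole's observable
    ranges (Cart Position beyond 4.8, Pole Angle beyond 24 degrees)."""

    MAX_STEPS = 500  # same episode length N_max as CartPole

    def __init__(self):
        self.action_space = spaces.Discrete(2)  # Actions := {0, 1}
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,))
        # State[1..4]: 4-dimensional crafted states (placeholder values)
        self.states = {i: np.array([6.0 + i, 0.0, 1.0, 0.0]) for i in range(1, 5)}
        self.terminal = np.zeros(4)
        self.i = 1
        self.steps = 0

    def reset(self):
        self.i, self.steps = 1, 0
        return self.states[1]

    def step(self, action):
        self.steps += 1
        timeout = self.steps >= self.MAX_STEPS
        if action == self.i % 2:             # the unique link action a_w
            self.i = self.i % 4 + 1          # State[i] to State[i % 4 + 1]
            return self.states[self.i], +1.0, timeout, {}
        return self.terminal, -1.0, True, {} # wrong action: terminate
      </preformat>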
<p>The training procedure of DQN is also modified to
implement the switching of environments. To account for the
considerably lower complexity of the watermarking environment
compared to CartPole, the main environment is set to switch
to the watermarking environment every 10 episodes. At this
point, the agent interacts with the watermarking environment
for a single episode, and then reverts to the main environment.</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Figure 1 presents the training progress of the joint DQN
policy in both the CartPole and watermark environments. It can
be seen that the joint policy converges in both cases. The
convergence of this joint policy is achieved with increased
training cost in comparison to the nominal CartPole DQN
policy. This is due to the expansion of the state-space and
transition dynamics resulting from the integration of the
watermark environment. It is also observed that at convergence,
the total episodic reward produced by the joint policy in the
watermark environment is less than the best-possible value of
500. This is due to the exploration settings of the training
algorithm, in which the minimum exploration rate is set to
2%. Considering that a single incorrect action in the
watermark environment results in termination, this outcome is in
line with expectations.</p>
<p>However, as established in Table 4, in the absence of
exploration, the test-time performance of this joint policy in the
watermark environment is indeed optimal. This table also
verifies that the test-time performance of the joint policy on
the main task is on par with that of the nominal (i.e.,
unwatermarked) DQN policy. Therefore, it can be seen that
the watermarking process does not affect the agent’s ability
to perform the main task. Furthermore, this table presents
the results of running unwatermarked policies in the
watermark environment. The results indicate that unwatermarked
policies fail to follow the identifier trajectory of the
watermark. Hence, these results verify the feasibility of our
proposed scheme for the sequential watermarking of DRL policies.</p>
<p>[Figure 1: Mean episodic reward of the joint DQN policy during training in the CartPole and watermark environments.]</p>
    </sec>
    <sec id="sec-6">
      <title>Discussion and Future Directions</title>
      <p>
The proposed watermarking scheme presents the potential
for adoption in other applications. From an adversarial
perspective, this scheme may be used to embed malicious
backdoors in DRL policies. For instance, an adversary may apply
this scheme to poison a self-driving policy to perform
harmful actions when a specific sequence of states is presented
to the policy. If the adversarial sequence is well-crafted,
typical fuzzing-based testing techniques may fail to detect
the presence of such backdoors. Therefore, there is a need
for new approaches to the detection of such backdoors. A
promising solution is the adoption of the activation clustering
technique [Chen et al., 2018] developed for the detection of
data poisoning attacks in supervised deep models.</p>
<p>Another potential application of this technique is in the
area of AI safety. One of the major concerns in this domain is
the switch-off problem [Amodei et al., 2016]: if the objective
function of an AI agent does not account for or prioritize user
demands for the halting of its operation, the resulting optimal
policy may prevent any actions that would lead to halting
the agent’s pursuit of its objective. An instance of such
actions is any attempt to turn off the agent before it satisfies its
objective. A promising solution to this problem is to leverage
our proposed scheme to embed debug or halting modes in the
policy, which are triggered through a pre-defined sequence of
state observations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Amodei et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané.
          <article-title>Concrete problems in AI safety</article-title>
          .
          <source>arXiv preprint arXiv:1606.06565</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Behzadan and Hsu,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Vahid</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>William</given-names>
            <surname>Hsu</surname>
          </string-name>
          .
          <article-title>Adversarial exploitation of policy imitation</article-title>
          .
          <source>arXiv preprint arXiv:1906.01121</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Behzadan and Munir,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Vahid</given-names>
            <surname>Behzadan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arslan</given-names>
            <surname>Munir</surname>
          </string-name>
          .
          <article-title>The faults in our pi stars: Security issues and open challenges in deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1810.10369</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Chen et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Bryant</given-names>
            <surname>Chen</surname>
          </string-name>
          , Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards,
          <string-name>
            <given-names>Taesung</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ian</given-names>
            <surname>Molloy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Biplav</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Detecting backdoor attacks on deep neural networks by activation clustering</article-title>
          .
          <source>arXiv preprint arXiv:1811.03728</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Shih,
          <year>2017</year>
          ] Frank Y. Shih.
          <source>Digital watermarking and steganography: fundamentals and techniques</source>
          . CRC Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Tramèr et al.,
          <year>2016</year>
          ] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Ristenpart</surname>
          </string-name>
          .
          <article-title>Stealing machine learning models via prediction APIs</article-title>
          .
          <source>In USENIX Security Symposium</source>
          , pages
          <fpage>601</fpage>
          -
          <lpage>618</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Uchida et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Yusuke</given-names>
            <surname>Uchida</surname>
          </string-name>
          , Yuki Nagai, Shigeyuki Sakazawa, and Shin'ichi Satoh.
          <article-title>Embedding watermarks into deep neural networks</article-title>
          .
          <source>In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval</source>
          , pages
          <fpage>269</fpage>
          -
          <lpage>277</lpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>