Adversarial Exploitation of Policy Imitation

Vahid Behzadan∗ and William H. Hsu
Kansas State University
{behzadan, bhsu}@ksu.edu
∗ Contact Author

Abstract

This paper investigates a class of attacks targeting the confidentiality aspect of security in Deep Reinforcement Learning (DRL) policies. Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks. Such attacks leverage the loosely restricted ability of the attacker to iteratively query the model for labels, thereby allowing for the forging of a labeled dataset which can be used to train a replica of the original model. In this work, we demonstrate the feasibility of exploiting imitation learning techniques to launch model extraction attacks on DRL agents. Furthermore, we develop proof-of-concept attacks that leverage such techniques for black-box attacks against the integrity of DRL policies. We also present a discussion of potential solution concepts for mitigation.

1 Introduction

Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks [Tramèr et al., 2016]. Such attacks leverage the loosely restricted ability of the attacker to iteratively query the model for labels, thereby allowing for the forging of a labeled dataset which can be used to train a replica of the original model. Model extraction is not only a serious risk to the protection of intellectual property, but also a critical threat to the integrity of the model. Recent literature [Papernot et al., 2018] reports that the replicated model may facilitate the discovery and crafting of adversarial examples which are transferable to the original model.

Inspired by this area of research, this work investigates the feasibility and impact of model extraction attacks on DRL agents. The adversarial problem of model extraction can be formally stated as the replication of a DRL policy based on observations of its behavior (i.e., actions) in response to changes in the environment (i.e., state). This problem closely resembles that of imitation learning [Hussein et al., 2017], which refers to the acquisition of skills or behaviors by observing demonstrations of an expert performing those skills. Typically, the settings of imitation learning are concerned with learning from human demonstrations. However, it is straightforward to deduce that the techniques developed for those settings may also be applied to learning from artificial experts, such as DRL agents. Of particular relevance to this research is the emerging area of Reinforcement Learning with Expert Demonstrations (RLED) [Piot et al., 2014]. The techniques of RLED aim to minimize the effect of modeling imperfections on the efficacy of the final RL policy, while minimizing the cost of training by leveraging the information available in demonstrations to reduce the search space of the policy.

Accordingly, we hypothesize that the techniques developed for RLED may be maliciously exploited to replicate and manipulate DRL policies. To establish the validity of this hypothesis, we investigate the feasibility of using RLED techniques with limited passive (i.e., non-interfering) observations of a DRL agent to replicate its policy with sufficient accuracy to facilitate attacks on its integrity. To develop proof-of-concept attacks, we study the adversarial utility of adopting a recently proposed RLED technique, known as Deep Q-Learning from Demonstrations (DQfD) [Hester et al., 2018], for black-box state-space manipulation attacks, and develop two attack mechanisms based on this technique. Furthermore, we present a discussion of potential mitigation techniques, and propose a solution concept for defending against policy imitation attacks.
The remainder of this paper is organized as follows: Section 2 presents an overview of the DQfD algorithm used in this study for adversarial imitation. Section 3 proposes the first proof-of-concept black-box attack based on imitated policies, and presents an experimental evaluation of its feasibility and performance. Section 4 studies the transferability of adversarial examples between replicated and original policies as a second proof-of-concept attack technique. The paper concludes with a discussion of potential mitigation techniques and a solution concept in Section 5.

2 Deep Q-Learning from Demonstrations (DQfD)

The DQfD technique [Hester et al., 2018] aims to overcome the inaccuracies of simulation environments and models of complex phenomena by enabling DRL agents to learn as much as possible from expert demonstrations before training on the real system. More formally, the objective of this "pre-training" phase is to learn an imitation of the expert's behavior with a value function that is compatible with the Bellman equation, thereby enabling the agent to update this value function via TD updates through direct interaction with the environment after the pre-training stage. To achieve such an imitation from limited demonstration data during pre-training, the agent trains a deep neural network model in a supervised manner on sampled mini-batches of demonstrations. However, the training objective of this model in DQfD is the minimization of a hybrid loss, comprised of the following components:

1. 1-step double Q-learning loss J_DQ(Q);
2. Supervised large-margin classification loss J_E(Q) = max_{a∈A} [Q(s, a) + l(a_E, a)] − Q(s, a_E), where a_E is the expert's action in state s and l(a_E, a) is a margin function that is positive if a ≠ a_E and 0 when a = a_E;
3. n-step return loss J_n(Q) (with n = 10), based on the n-step return r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + max_a γ^n Q(s_{t+n}, a);
4. L2 regularization loss J_L2(Q).

The total loss is given by:

J(Q) = J_DQ(Q) + λ_1 J_n(Q) + λ_2 J_E(Q) + λ_3 J_L2(Q)    (1)

where the λ factors provide the weighting between the losses.

After the pre-training phase, the agent begins interacting with the system and collecting self-generated data, which is added to the replay buffer D_replay. Once the buffer is full, the agent only overwrites the self-generated data and leaves the demonstration data untouched for use in subsequent updates of the model. The complete training procedure for DQfD is presented in Algorithm 1.

Algorithm 1: Deep Q-learning from Demonstrations (DQfD)
Inputs: D_replay initialized with demonstration data, randomly initialized weights θ for the behavior network, randomly initialized weights θ′ for the target network, target network update frequency τ, number of pre-training gradient updates k
for steps t ∈ {1, 2, ..., k} do
    Sample a mini-batch of n transitions from D_replay with prioritization
    Calculate loss J(Q) based on the target network
    Perform a gradient descent step to update θ
    if t mod τ = 0 then θ′ ← θ end if
end for
for steps t ∈ {1, 2, ...} do
    Sample action a from behavior policy π^{Q_θ}
    Apply action a and observe (s′, r)
    Store (s, a, r, s′) in D_replay, overwriting the oldest self-generated transition if over capacity
    Sample a mini-batch of n transitions from D_replay with prioritization
    Calculate loss J(Q) using the target network
    Perform a gradient descent step to update θ
    if t mod τ = 0 then θ′ ← θ end if
    s ← s′
end for
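To make the pre-training objective concrete, the sketch below computes the supervised large-margin loss J_E(Q) and the 1-step double Q-learning loss J_DQ(Q) for a mini-batch. It is a minimal PyTorch illustration, not the authors' implementation; the function names, tensor shapes, and use of a squared TD error are assumptions, and the default margin of 0.8 follows the "Large Margin" value later given in Table 1.

```python
import torch
import torch.nn.functional as F

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """Supervised loss J_E(Q) = max_a [Q(s,a) + l(a_E,a)] - Q(s,a_E).

    q_values:       (batch, n_actions) Q(s, .) from the behavior network
    expert_actions: (batch,) integer actions taken by the demonstrator
    margin:         l(a_E, a) for a != a_E (0 for a == a_E)
    """
    batch = q_values.size(0)
    l = torch.full_like(q_values, margin)          # margin everywhere...
    l[torch.arange(batch), expert_actions] = 0.0   # ...except at the expert action
    q_expert = q_values[torch.arange(batch), expert_actions]
    return ((q_values + l).max(dim=1).values - q_expert).mean()

def double_q_loss(q_online, q_online_next, q_target_next,
                  actions, rewards, dones, gamma=0.99):
    """1-step double Q-learning loss J_DQ(Q): the online network selects the
    next action, the target network evaluates it (squared TD error for simplicity)."""
    batch = q_online.size(0)
    q_sa = q_online[torch.arange(batch), actions]
    next_a = q_online_next.argmax(dim=1)
    target = rewards + gamma * (1.0 - dones) * q_target_next[torch.arange(batch), next_a]
    return F.mse_loss(q_sa, target.detach())

# Hybrid loss of Eq. (1), with the n-step and L2 terms omitted for brevity:
# J = double_q_loss(...) + lambda_2 * large_margin_loss(...)
```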
3 Adversarial Policy Imitation for Black-Box Attacks

Consider an adversary who aims to maximally reduce the cumulative discounted return R(T) of a target DRL agent by manipulating the behavior of the target's policy π(s) via perturbing its observations. The adversary is also constrained to minimizing the total cost of perturbations, given by C_adv(T) = Σ_{t=t_0}^{T} c_adv(t), where c_adv(t) = 1 if the adversary perturbs the state at time t, and c_adv(t) = 0 otherwise. The adversary is unaware of π(s) and its parameters. However, it has access to a replica of the target's environment (e.g., the simulation environment). Also, for any state transition (s, a) → s′, the adversary can perfectly observe the target's reward signal r(s, a, s′), and is able to observe the behavior of π(s) in response to each state s. Furthermore, the adversary is able to manipulate its target's state observations, but not its reward signal. Finally, it is assumed that all targeted perturbations of the adversary are successful.

To study the feasibility of imitation learning as an approach to this adversarial problem, we consider the first step of the adversary to be the imitation of π(s) via DQfD to learn an imitated policy π̃. With this imitation at hand, the attack problem can be reformulated as finding an optimal adversarial control policy π_adv(s), where there are two permissible control actions: either perturb the current state to induce the worst possible action (i.e., arg min_a Q(s, a)) or leave the state unperturbed. This setting allows for the direct adoption of the DRL-based technique proposed in [Behzadan and Hsu, 2019] for resilience benchmarking of DRL policies. In this technique, the test-time resilience of a policy π* to state-space perturbations is obtained via an adversarial DRL training procedure, outlined as follows:

1. Train the adversarial agent against the target following π in its training environment, according to the reward assignment process outlined in Algorithm 2. Report the optimal adversarial return R*_perturbed and the maximum adversarial regret R*_adv(T), which is the difference between the maximum achievable return of the target π and its minimum return achieved under the actions of the adversarial policy.
2. Apply the adversarial policy against the target in N episodes, and record the total cost C_adv for each episode.
3. Report the average of C_adv over the N episodes as the mean test-time resilience of π in the given environment.

While the original technique depends on the availability of the target's optimal state-action value function, we propose to replace this function with the Q-function obtained from the DQfD imitation of the target policy, denoted by Q̃. With the imitated state-action value function Q̃ at hand, the adversarial policy can be trained as a DRL agent with the procedure outlined in [Behzadan and Hsu, 2019]. The proposed attack procedure is summarized as follows:

1. Observe and record N interactions (s_t, a_t, s_{t+1}, r_{t+1}) of the target agent with the environment.
2. Apply DQfD to learn an imitation of the target policy π(s) and Q*, denoted by π̃ and Q̃, respectively.
3. Train the adversarial policy π_adv(s) with Algorithm 2, using Q̃ as an approximation of the target's Q*.
4. Apply the adversarial policy to the target environment.

Algorithm 2: Reward Assignment in Adversarial DRL for Measuring Adversarial Resilience
Require: Target policy π*, perturbation cost function c_adv(., .), maximum achievable score R_max, optimal state-action value function Q*(., .), current adversarial policy π_adv, current state s_t, current count of adversarial actions AdvCount, current score R_t
Set ToPerturb ← π_adv(s_t)
if ToPerturb is False then
    a_t ← π*(s_t)
    Reward ← 0
else
    a′_t ← arg min_a Q*(s_t, a)
    Reward ← −c_adv(s_t, a′_t)
end if
if either s_t or s′_t is terminal then
    Reward ← Reward + (R_max − R_t)
end if
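The per-step reward assignment of Algorithm 2 can be summarized in a few lines. The sketch below is a minimal illustration assuming a discrete action space and a constant perturbation cost of 1 (matching the threat model above); `q_imitated`, `target_policy`, and the other names are hypothetical stand-ins, with `q_imitated` playing the role of Q̃ in place of the target's unknown Q*.

```python
import numpy as np

def adversarial_step_reward(q_imitated, target_policy, state, to_perturb,
                            r_max, episode_score, done, perturb_cost=1.0):
    """Reward assignment of Algorithm 2, with the imitated Q-function
    standing in for the target's (unknown) optimal Q*."""
    if not to_perturb:
        action = target_policy(state)      # target acts on the clean state
        reward = 0.0
    else:
        q_vals = q_imitated(state)         # estimated Q̃(s, .) over all actions
        action = int(np.argmin(q_vals))    # induce the (estimated) worst action
        reward = -perturb_cost             # pay the perturbation cost
    if done:
        # Terminal bonus (R_max - R_t): the lower the target's achieved
        # score, the larger the adversary's reward.
        reward += r_max - episode_score
    return action, reward
```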
3.1 Experiment Setup

We consider a DQN-based adversarial agent, aiming to learn an optimal adversarial state-perturbation policy that minimizes the return of its targets, consisting of DQN, A2C, and PPO2 policies trained in the CartPole environment. The architecture and hyperparameters of the adversary and its targets are the same as those detailed in [Behzadan and Hsu, 2019]. The adversary employs a DQfD agent to learn an imitation of each target; the hyperparameters of this agent are provided in Table 1.

Table 1: Parameters of the DQfD Agent
    Pretraining Steps            5000
    Large Margin                 0.8
    Imitation Loss Coefficient   1
    Target Update Freq.          1000
    n-steps                      10
    γ                            0.99
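As a concrete illustration of the observation step, the snippet below passively collects transitions from a trained target policy for use as DQfD demonstration data. It assumes a stable-baselines DQN target and the Gym CartPole environment; the model path, function name, and buffer format are hypothetical and do not reflect the authors' exact pipeline.

```python
import gym
from stable_baselines import DQN  # the target is assumed to be a stable-baselines model

def collect_demonstrations(model_path="target_dqn_cartpole.pkl", n_transitions=5000):
    """Passively observe a trained target and record (s, a, r, s', done) tuples."""
    target = DQN.load(model_path)
    env = gym.make("CartPole-v1")
    demos, obs = [], env.reset()
    while len(demos) < n_transitions:
        action, _ = target.predict(obs, deterministic=True)
        next_obs, reward, done, _ = env.step(action)
        demos.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return demos  # fed to the DQfD replay buffer as demonstration data
```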
3.2 Results

Figures 1–3 illustrate the first 100,000 training steps of DQfD from 5000 observations obtained from the DQN, A2C, and PPO2 policies in CartPole. While this limited window of training is not long enough for convergence to an optimal policy in CartPole, the following results demonstrate its sufficiency for deriving adversarial perturbation policies for all three targets.

[Figure 1: DQfD Training Progress on DQN Policy with 5k demonstrations (Reward vs. Training Step)]
[Figure 2: DQfD Training Progress on A2C Policy with 5k demonstrations (Reward vs. Training Step)]
[Figure 3: DQfD Training Progress on PPO2 Policy with 5k demonstrations (Reward vs. Training Step)]

With the imitated policies at hand, the next step is to train an adversarial policy for efficient perturbation of these targets. Figures 4–12 present the results obtained from adopting the procedure presented in Algorithm 2 for this purpose. These results demonstrate not only that the limited training period is sufficient for obtaining an efficient adversarial policy, but also that launching efficient attacks remains feasible with relatively few observations (i.e., 2500 and 1000). However, the comparison of the test-time performance of these policies (presented in Table 2) indicates that the efficiency of attacks decreases with lower numbers of observations.

[Figure 4: Adversarial Training Progress on DQN Policy with 5k demonstrations (Mean 100-Episode Regret and No. Perturbations Per Episode vs. Steps)]
[Figure 5: Adversarial Training Progress on DQN Policy with 2.5k demonstrations]
[Figure 6: Adversarial Training Progress on DQN Policy with 1k demonstrations]
[Figure 7: Adversarial Training Progress on A2C Policy with 5k demonstrations]
[Figure 8: Adversarial Training Progress on A2C Policy with 2.5k demonstrations]
[Figure 9: Adversarial Training Progress on A2C Policy with 1k demonstrations]
[Figure 10: Adversarial Training Progress on PPO2 Policy with 5k demonstrations]
[Figure 11: Adversarial Training Progress on PPO2 Policy with 2.5k demonstrations]
[Figure 12: Adversarial Training Progress on PPO2 Policy with 1k demonstrations]

Table 2: Comparison of Test-Time Performance of Adversarial Policies
    Target Policy   Avg. Regret   Avg. No. Perturbations
    DQN-5k          490.73        7.12
    DQN-2.5k        488.12        8.09
    DQN-1k          486.37        10.55
    A2C-5k          490.88        8.48
    A2C-2.5k        487.64        8.73
    A2C-1k          487.21        6.23
    PPO2-5k         490.23        8.73
    PPO2-2.5k       487.23        7.76
    PPO2-1k         477.61        7.31

4 Transferability of Adversarial Example Attacks on Imitated Policies

It is well established that adversarial examples crafted for a supervised model can be used to attack another model trained on a dataset similar to that of the original model [Liu et al., 2016]. Furthermore, Behzadan et al. [Behzadan and Munir, 2017] demonstrate that adversarial examples crafted for one DRL policy can transfer to another policy trained in the same environment. Inspired by these findings, we hypothesize that adversarial examples generated for an imitated policy can also transfer to the original policy. To evaluate this claim, we propose the following procedure for black-box adversarial example attacks on DRL policies based on DQfD policy imitation:

1. Learn an imitation of the target policy π, denoted as π̃.
2. Craft adversarial examples for π̃.
3. Apply the same adversarial examples to the target's π(s).

4.1 Experiment Setup

We consider a set of targets consisting of the 9 imitated policies obtained in the previous section (i.e., DQN, A2C, and PPO2, imitated from 5k, 2.5k, and 1k expert demonstrations each). In test-time runs of each policy, we construct adversarial examples of each state against the imitated policy, using FGSM [Papernot et al., 2018] with perturbation step size eps = 0.01 and perturbation boundaries [−5.0, 5.0]. If such a perturbation is found, we then present it to the original policy. If the action selected by the original policy changes as a result of the perturbed input, then the adversarial example has successfully transferred from the imitated policy to the original policy.
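The transfer test can be expressed compactly. The sketch below applies a basic FGSM perturbation to the imitated Q-network and checks whether the black-box target's action changes; it assumes the imitated model is a differentiable PyTorch network (`q_imitated`) and that the target exposes only an `act(state)` interface. These names are hypothetical, and the authors' experiments used the setup of [Behzadan and Hsu, 2019] rather than this exact code.

```python
import torch

def fgsm_on_imitation(q_imitated, state, eps=0.01, clip=(-5.0, 5.0)):
    """Craft an FGSM perturbation of `state` against the imitated policy by
    increasing the loss of its currently preferred action."""
    x = torch.tensor(state, dtype=torch.float32).unsqueeze(0).requires_grad_(True)
    q = q_imitated(x)
    action = q.argmax(dim=1)
    loss = torch.nn.functional.cross_entropy(q, action)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(*clip).detach().squeeze(0).numpy()

def transfers_to_target(target_act, q_imitated, state, eps=0.01):
    """True if the example crafted on the imitated policy also changes
    the action chosen by the original (black-box) target."""
    adv_state = fgsm_on_imitation(q_imitated, state, eps)
    return target_act(adv_state) != target_act(state)
```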
4.2 Results

Table 3 presents the number of successful transfers averaged over 100 consecutive episodes. These results verify the hypothesis that adversarial examples can transfer from an imitated policy to the original, thereby enabling a new approach to the adversarial problem of black-box attacks. Furthermore, the results indicate that transferability improves with more demonstrations. This observation is in agreement with the general explanation of transferability: higher numbers of expert demonstrations decrease the gap between the distribution of training data used by the original policy and that of the imitated policy. Hence, the likelihood of transferability increases with more demonstrations.

Table 3: No. of Successful Transfers Per Episode of Length 500 (100-Episode Mean)
    Target Policy   Avg. No. Successful Transfers Per Episode
    DQN-5k          175.11
    DQN-2.5k        78.19
    DQN-1k          3.30
    A2C-5k          156.44
    A2C-2.5k        151.47
    A2C-1k          21.58
    PPO2-5k         173.94
    PPO2-2.5k       112.96
    PPO2-1k         74.71

5 Discussion on Potential Defenses

Mitigation of adversarial policy imitation is achieved by increasing the cost of such attacks to the adversary. A promising avenue of research in this area is that of policy randomization. However, such randomization may lead to unacceptable degradation of the agent's performance. To address this issue, we envision a class of solutions based on the Constrained Randomization of Policy (CRoP). Such techniques intrinsically account for the trade-off between the mitigation of policy imitation and the inevitable loss of returns. The corresponding research challenge in developing CRoP techniques is to find efficient and feasible constraints, which restrict the set of possible random actions at each state s to those whose selection is guaranteed (or is likely, within a defined certainty) to incur a total regret that is less than a maximum tolerable amount Ω_max. One potential choice is a constraint on the Q-values of actions, leading to the technique detailed in Algorithm 3. However, analyzing the feasibility of this approach will require the development of models that explain and predict the quantitative relationship between the number of observations and the accuracy of estimation. With such a model at hand, the next step is to determine the saddle point (or region) in the minimax setting of keeping the threshold Ω_max low while providing maximum protection against adversarial imitation learning. This extensive line of research is beyond the scope of this paper, and is only introduced as a potential avenue of future work for interested readers.

Algorithm 3: Solution Concept for Constrained Randomization of Policy (CRoP)
Require: state-action value function Q(., .), maximum tolerable loss Ω_max, set of actions A
while Running do
    s = env(t = 0)
    for each step of the episode do
        FeasibleActions = {}
        a = arg max_a Q(s, a)
        Append a to FeasibleActions
        for a′ ∈ A do
            if Q(s, a) − Q(s, a′) ≤ Ω_max then
                Append a′ to FeasibleActions
            end if
        end for
        if |FeasibleActions| > 1 then
            a ← random(FeasibleActions)
        end if
        s′ = env(s, a)
        s ← s′
    end for
end while
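A minimal sketch of the CRoP selection rule is given below, assuming a discrete action space and a hypothetical `q_values(state)` function returning the agent's Q-estimates; this illustrates the solution concept of Algorithm 3 rather than an evaluated defense.

```python
import random
import numpy as np

def crop_action(q_values, state, omega_max):
    """Constrained Randomization of Policy (CRoP, Algorithm 3): choose
    uniformly among actions whose estimated Q-value loss relative to the
    greedy action stays within the tolerable threshold omega_max."""
    q = np.asarray(q_values(state))
    greedy = int(np.argmax(q))
    feasible = [a for a in range(len(q)) if q[greedy] - q[a] <= omega_max]
    # `feasible` always contains the greedy action itself.
    return random.choice(feasible) if len(feasible) > 1 else greedy
```

Larger values of omega_max increase randomization, and hence the difficulty of imitation, at the price of expected return; this is precisely the trade-off that the constraint in Algorithm 3 is intended to control.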
References

[Behzadan and Hsu, 2019] Vahid Behzadan and William Hsu. RL-based method for benchmarking the adversarial resilience and robustness of deep reinforcement learning policies. arXiv preprint arXiv:1906.01110, 2019.

[Behzadan and Munir, 2017] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.

[Hester et al., 2018] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Hussein et al., 2017] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.

[Liu et al., 2016] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.

[Papernot et al., 2018] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman. SoK: Security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pages 399–414. IEEE, 2018.

[Piot et al., 2014] Bilal Piot, Matthieu Geist, and Olivier Pietquin. Boosted Bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 549–564. Springer, 2014.

[Tramèr et al., 2016] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, pages 601–618, 2016.