Constrained Policy Optimization for Controlled Contextual Bandit Exploration

Mohammad Kachuee, Sungjin Lee
Amazon Alexa AI, Bellevue, WA 98004, USA
kachum@amazon.com (M. Kachuee); sungjinl@amazon.com (S. Lee)

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria

Abstract

Contextual bandits are widely used across the industry in many applications such as search engines, dialogue systems, recommendation systems, etc. In such applications, it is often necessary to update the policy regularly as the data distribution changes and new features are being on-boarded frequently. As any new policy deployment directly impacts the user experience, safety in model updates is an important consideration in real-world bandit learning. In this study, we introduce a scalable framework for policy update safety via user-defined constraints, supporting fine-grained exploration targets for individual domains. For example, in a digital voice assistant, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning method that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments using data from a real-world conversational AI system on a set of realistic constraint benchmarks. Based on the experimental results, we demonstrate that the proposed approach is capable of achieving the best balance between the policy value and the constraint satisfaction rate.

Keywords: contextual bandit, constrained optimization, safe exploration

1. Introduction

Contextual bandits are important machine learning problems used in many real-world applications such as search engines, advertisement systems, conversational systems, etc. In a contextual bandit problem, the agent is tasked to select the action that achieves the highest reward given a context. In this setting, the environment is assumed to have no memory, and the reward is a stochastic function of the current context, independent of past agent-environment interactions [1, 2, 3].

For real-world bandits, ensuring safe policy updates is a crucial consideration. Although there have been a number of prior studies on safe bandit updates in an online learning setting, online bandit learning is not suitable for business-critical production systems where an updated policy must go through extensive testing to ensure reliable performance on critical use cases before it gets deployed. This calls for an approach to safe policy updates in the offline bandit learning setting.
In a production system, it is crucial to not only estimate but also control the changes in behavior that a new policy introduces when compared to the current production policy. In the literature, this problem has been studied under safe bandit updates [4, 5, 6] and budgeted bandit learning [7, 8], usually targeting exploration budgets or encouraging behavior that resembles a baseline policy.

In the context of off-policy bandit updates, we define exploration as any change in the model behavior resulting from replacing a current production policy with a new updated policy. This definition is broad and encompasses stochastic exploration actions as well as any behavior change observed when comparing two consecutive policies.

Furthermore, we consider the scenario in which samples are naturally classified into a set of domains, each representing a unique data segment. For example, when dealing with product reviews, the product category (e.g., books, computers, etc.) would be the domain for each review. In many real-world scenarios, it is desirable to have fine-grained control over the rate of exploration for each domain. For example, in a conversational system, we may want to explore business-critical domains such as shopping more conservatively, while aggressively exploring a domain such as entertainment. Note that solely relying on a method such as Thompson sampling [9] or epsilon-greedy [10] would neither meet the requirements of individual domains nor optimally balance the desired fine-grained exploration against achieving the highest possible reward.

While previous studies have considered different aspects of constraining a bandit model, to the best of our knowledge the problem of controlling off-policy bandit behavior changes across subsequent model updates, with fine-grained control over budgets for different data segments (domains), remains unaddressed. This study is the first to tackle the aforementioned issues by providing a scalable and practical approach. The main contributions of this paper are as follows:

• Introducing a formulation for controlled exploration in the context of off-policy contextual bandit learning, considering fine-grained control over domains based on user-defined constraints.
• Presenting a solution based on the primal-dual minimax constraint optimization method that is effective but requires adjusting a few hyperparameters.
• Proposing a novel meta-gradient algorithm to balance constraint satisfaction and reward maximization that works out-of-the-box and outperforms other methods with no need for hyperparameter adjustment.
• Conducting extensive experiments on the skill routing problem in a real-world conversational AI agent using a set of realistic constraint benchmarks.
2. Related Work

2.1. Constrained Bandit Learning

The majority of studies on controlled bandit learning consider the case of simple multi-armed stochastic bandits (i.e., without context), with practical applications in experiment design [8] and automated machine learning [7]. Hoffman et al. [7] suggested a Bayesian approach to two-phase exploration-exploitation bandit learning in which there is a pre-specified budget for exploration arm evaluations. Another aspect is to ensure safe exploration actions, which is especially useful for sensitive applications in industrial machine learning or healthcare. Amani et al. [6] introduced a solution in which an initial set of exploration actions is defined, and the exploration set is then gradually expanded to ensure minimal unexpected behavior.

For contextual bandits, safety has been an active research topic. Safety can be defined in the action space or in terms of model updates. For example, Daulton et al. [5] address a two-metric setting in which one metric, the reward, is maximized while enforcing a limit on the regression of an auxiliary metric compared to a baseline status quo model. Balakrishnan et al. [11] attempt to learn behavioral constraints by balancing between replicating the current baseline policy and taking new actions that show promising rewards. In [4], the authors define safety in terms of user experience metrics and suggest deciding whether to deploy a new model based on conservative confidence bounds on the off-policy estimates of such metrics.

2.2. Constrained Optimization

The general problem of constrained optimization has been studied extensively in the literature. The type of constraints and the optimization problem are important aspects in the design of such algorithms. In the context of constrained optimization of differentiable function approximators (e.g., neural networks), the constraints can be defined on the parameters [12], on the outputs [13], or based on predefined functions of the model [14, 15].

Most relevant to this paper, penalty methods translate constraints into penalty terms used to encourage constraint satisfaction in the optimization process. The quadratic penalty method adds the weighted square of violation metrics and adjusts the weight either by heuristics or by solving a sequence of problems with a monotonically increasing penalty weight. The augmented Lagrangian method leverages Lagrange multipliers to mitigate the ill-conditioning inherent in the quadratic penalty method [16, 17]. More applicable to neural network models, Nandwani et al. [14] suggested a scalable approach that defines constraints using hinge functions and then solves the dual minimax problem. Theoretically, given an infinite number of training iterations and a decaying update rate for the max variables, the minimax objective is guaranteed to converge to a local min-max point [18]. Nonetheless, such a guarantee does not apply to real-world settings in which a non-convex problem is being solved in finite time using limited compute resources.

2.3. Meta Learning

In the literature, meta-learning is defined as learning to learn, in which the model training itself is considered an inner optimization process guided by an outer meta objective. It is a broad topic spanning multiple areas of research including reward function discovery in reinforcement learning [19, 20], life-long continual learning [21], few-shot learning [22], and transfer learning [23], to name a few.

Most relevant to this work is the meta-gradient idea first suggested by Sutton [24]. In meta-gradient learning, we use online cross-validation on a meta objective to compute derivatives with respect to meta parameters through a differentiable inner optimization loop. Xu et al. [25] demonstrated the effectiveness of this meta-gradient idea in deep reinforcement learning to adapt the value function as the agent interacts with the environment. However, in practice, the compute/memory requirement of computing high-order gradients is a major challenge for meta-reinforcement learning, leading to low-order gradient approximations [26, 25]. Also, while meta-gradient learning enables adaptive adjustment of hyperparameters, it still requires a carefully designed meta objective that provides the best learning signal.

3. Constrained Bandit Exploration

3.1. Problem Formulation

We consider the general framework of off-policy contextual bandits in which a policy Π is used to select an action a ∈ A given the observed context vector (x) to maximize the scalar reward (r) received from the environment. Here, we assume stochastic policies of the form Π_θ(a|x), in which a model parameterized by θ (e.g., a neural network) is used to assign action selection probabilities to each action given the context. Furthermore, we assume that each sample belongs to a domain denoted by k ∈ 1 ... M that is provided as a feature in x.
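To make this setup concrete, the following is a minimal sketch (not from the paper) of the interface assumed above: a stochastic policy that maps a context to action propensities, and logged tuples (x, a, r, k) collected by sampling from the current production policy Π_0. The class and function names, network sizes, and record layout are our own illustrative assumptions.

    import torch
    import torch.nn as nn

    class StochasticPolicy(nn.Module):
        """Maps a context vector x to action-selection propensities Pi_theta(.|x)."""
        def __init__(self, context_dim, num_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(context_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, num_actions))

        def forward(self, x):
            return torch.softmax(self.net(x), dim=-1)   # [batch, num_actions]

    def log_interaction(pi_0, x, k, reward_fn):
        """One logged off-policy record: context x (the domain k is also encoded as a
        feature of x), action a sampled from the current policy Pi_0, observed reward
        r, and the domain index k used later for the per-domain constraints."""
        probs = pi_0(x.unsqueeze(0)).squeeze(0)
        a = torch.multinomial(probs, num_samples=1).item()   # stochastic action choice
        return {"x": x, "a": a, "r": reward_fn(x, a), "k": k,
                "propensity": probs[a].item()}

Storing the logging propensity Π_0(a|x) alongside each record is what later enables the inverse propensity scoring objective of (3).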
In the off-policy setting, the policy is refreshed after collecting a dataset of samples from the current policy. We adopt a definition of exploration that considers any change in the agent behavior compared to the current policy as an exploration action. Equivalently, we can consider replication with respect to the current policy as the rate at which the new policy makes decisions similar to the current policy when both are evaluated and sampled stochastically. We define replication for Π_θ with respect to Π_0 based on the L1-distance of their action propensities given a context x:

    ℛ_θ(x) = 1 − |Π_θ(x) − Π_0(x)|_1 / 2.    (1)

In a production system, it is desirable to precisely control the rate at which the new policy replicates the current policy for each domain. This ensures safe model updates for critical domains while enabling exploration for others that may benefit from an extra exploration budget. Accordingly, we define constraints to encourage the desired behavior for samples of each domain while learning an off-policy bandit:

    arg min_θ E_{x,a,r,k∼D} [L_{Π_θ}]
    s.t.  c_k^min ≤ ℛ_θ(x) ≤ c_k^max,    (2)

where context, action, reward, and domain (x, a, r, k) are sampled from a dataset collected from the current policy. In (2), we use c_k^min and c_k^max to indicate the user-defined replication constraints for domain k.

L_{Π_θ} can be any differentiable off-policy bandit learning objective; for simplicity of discussion, we consider the vanilla inverse propensity scoring (IPS) objective:

    L_{Π_θ}(x, a, r) = −r Π_θ(a|x) / Π_0(a|x),    (3)

where Π_0 is the current policy and r is the observed reward for taking action a collected in the dataset.

A common approach to optimizing constrained problems such as (2) is the penalty method, which translates constraints into penalty terms that encourage constraint satisfaction:

    arg min_θ E_{x,a,r,k∼D} [L_{Π_θ}(x, a, r) + e^{u_k} max(0, c_k^min − ℛ_θ(x)) + e^{v_k} max(0, ℛ_θ(x) − c_k^max)].    (4)

Here, the penalty terms are always non-negative and increase if the new policy assigns action probabilities that deviate from the current policy outside the desired boundary. u ∈ R^M and v ∈ R^M are variables that adjust the weight of each constraint violation term. The exponentiation improves the sensitivity to these parameters and ensures non-negative penalty weights. For (4) to actually solve the original constrained problem of (2), proper values for u and v need to be used that enable the best balance between constraint satisfaction and the policy value. In the constrained optimization literature, various methods have been suggested to solve this form of problem (see Section 2.2). In this paper, we solve it using the primal-dual minimax method suggested by Nandwani et al. [14] (Section 3.2) as well as a novel meta-learning method (Section 3.3).
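As an illustration of how (1), (3), and (4) fit together, here is a minimal PyTorch sketch of a batched penalized loss. The function name, tensor shapes, and per-domain lookups are our own assumptions, not the paper's implementation.

    import torch

    def penalized_ips_loss(new_probs, old_probs, actions, rewards, domains,
                           c_min, c_max, u, v):
        """Batched IPS loss of (3) plus the exponentiated hinge penalties of (4).

        new_probs, old_probs: [B, A] action propensities of the new / current policy.
        actions: [B] chosen action indices; rewards: [B] observed rewards.
        domains: [B] domain indices; c_min, c_max, u, v: [M] per-domain tensors.
        """
        # Replication metric of (1): 1 - 0.5 * L1 distance between propensities.
        replication = 1.0 - 0.5 * (new_probs - old_probs).abs().sum(dim=-1)

        # Vanilla IPS objective of (3).
        idx = torch.arange(actions.shape[0])
        ips = -(rewards * new_probs[idx, actions] / old_probs[idx, actions])

        # Per-sample constraint bounds and penalty weights, looked up by domain.
        lo, hi = c_min[domains], c_max[domains]
        w_lo, w_hi = u[domains].exp(), v[domains].exp()

        # Hinge penalties are non-zero only when replication leaves [c_min_k, c_max_k].
        penalty = w_lo * torch.clamp(lo - replication, min=0.0) \
                + w_hi * torch.clamp(replication - hi, min=0.0)
        return (ips + penalty).mean()

Keeping u and v as learnable tensors allows them to be either ascended on directly (the minimax method of Section 3.2) or adapted through a meta objective (Section 3.3).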
3.2. Minimax Primal-Dual Method

Nandwani et al. [14] suggested a formulation of the augmented Lagrangian method that supports inequality constraints. They solve the dual of this problem, i.e., a maximin objective, to improve scalability:

    min_θ max_{u,v} E_{x,a,r,k∼D} [L_{Π_θ}(x, a, r) + e^{u_k} max(0, c_k^min − ℛ_θ(x)) + e^{v_k} max(0, ℛ_θ(x) − c_k^max)].    (5)

Algorithm 1 shows an outline of policy training using the minimax method. This method has four hyperparameters controlling the max player optimization by adjusting the update frequency, the learning rate, and their decay factors.

Algorithm 1: Minimax constrained bandit optimization
    input: D (dataset), η (max learning rate), γ (max learning rate decay),
           τ (max update frequency), ξ (max update frequency decay)
    u, v, t ← 0
    Initialize(Π_θ)
    for (x, a, r, k) ∼ D do
        L ← Loss(x, a, r, k, θ, u, v)        /* use loss function of (5) */
        if t mod τ is 0 then
            u ← u + η ∇_u L                  /* gradient ascent, max player */
            v ← v + η ∇_v L
            η ← γ × η                        /* lr/update decay, max player */
            τ ← ξ × τ
        end
        θ ← f(θ, ∇_θ L)                      /* optimize Π_θ, min player */
        t ← t + 1                            /* increment iteration counter */
    end

Intuitively, the min player is trying to update the policy parameters while the max player is increasingly penalizing it for any constraint violation. A stable point of this algorithm is reached by gradually reducing the max player update rate as the min player gets better at satisfying the constraints, eventually satisfying all constraints and resulting in a zero loss for the max player due to the zero hinge penalty terms.
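A rough PyTorch rendering of Algorithm 1 is sketched below, reusing the hypothetical penalized_ips_loss helper from Section 3.1. The loader format, the choice of Adam for the min player, and the guard on a non-integer τ are our assumptions rather than details from the paper.

    import torch

    def train_minimax(policy, old_policy, loader, c_min, c_max,
                      eta=0.1, gamma=0.999, tau=1.0, xi=1.0, lr=1e-3):
        # policy / old_policy: map a context batch to action propensities, e.g. the
        # StochasticPolicy sketched in Section 3.1; loader yields (x, a, r, k) batches.
        num_domains = c_min.shape[0]
        u = torch.zeros(num_domains, requires_grad=True)     # max-player variables
        v = torch.zeros(num_domains, requires_grad=True)
        opt = torch.optim.Adam(policy.parameters(), lr=lr)   # the "f" of Algorithm 1

        for t, (x, a, r, k) in enumerate(loader):
            with torch.no_grad():
                old_probs = old_policy(x)
            loss = penalized_ips_loss(policy(x), old_probs, a, r, k,
                                      c_min, c_max, u, v)    # objective of (5)

            # Max player: periodic gradient ascent on u, v with decaying eta and tau.
            if t % max(int(round(tau)), 1) == 0:
                g_u, g_v = torch.autograd.grad(loss, (u, v), retain_graph=True)
                with torch.no_grad():
                    u += eta * g_u
                    v += eta * g_v
                eta *= gamma
                tau *= xi

            # Min player: gradient descent on the policy parameters.
            opt.zero_grad()
            loss.backward()
            opt.step()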
3.3. Meta Gradient Method

Theoretically, the primal-dual minimax method is capable of achieving Pareto-optimal solutions [18, 14]. However, in practice, it is infeasible to train for an infinite number of iterations, and therefore approximate inner optimization loops are used. To find the right balance between constraint satisfaction and policy improvement with the minimax algorithm, it is necessary to carefully adjust multiple hyperparameters. Note that an extensive hyperparameter search is undesirable in many real-world large-scale scenarios that require frequent model updates, as it entails not only significant compute costs associated with the search but also an increased turn-around time to deploy refreshed models. To mitigate this issue, we suggest a meta-gradient optimization idea that adapts u and v based on a meta objective within the training process.

Specifically, we define the following meta objective:

    L_meta = E_{x,a,r,k∼D} [(1 − λ) L_{Π_θ}(x, a, r) + (λ / p(k)) (max(0, c_k^min − ℛ_θ(x)) + max(0, ℛ_θ(x) − c_k^max))],    (6)

where λ is a hyperparameter that balances the bandit objective and the constraint penalty terms. The second term is the macro average of the constraint violation terms, in which p(k) is the prior probability of samples belonging to domain k, which can be easily pre-computed for a large batch of samples.

Note that (6) is not directly dependent on u and v; instead, we rely on online cross-validation [24, 25] to update these variables. We define an inner objective identical to the min optimization problem of (5), take a differentiable optimization step, evaluate the meta objective on another batch of data, and then update u and v by taking the derivative of the meta objective through the inner optimization trace.

Algorithm 2 presents an outline of the meta-gradient optimization method. Due to the practical issues of dealing with high-order gradients, we only consider the immediate impact of a single inner-loop update on the meta objective. We found that discarding the vanilla gradient descent used for the inner optimization and using a more advanced optimizer (e.g., Adam) to update Π_θ works best. Regarding the λ hyperparameter, we found that simply setting λ = 1 works well in practice. This effectively means that the meta-gradient solution does not require any hyperparameter adjustment (experimental evidence is presented in Section 4.4).

Algorithm 2: Meta-gradient constrained bandit optimization
    input: D (dataset), η (learning rate), λ (penalty weight)
    u, v ← 0
    Initialize(Π_θ)
    for (x, a, r, k) ∼ D and (x′, a′, r′, k′) ∼ D do
        θ′ ← clone(θ)                                   /* clone policy parameters */
        L_inner ← Loss_inner(x, a, r, k, θ′, u, v)      /* compute inner loss with θ′ */
        θ′ ← θ′ − η ∇_{θ′} L_inner                      /* gradient descent on cloned model */
        L_meta ← Loss_meta(x′, a′, r′, k′, θ′, λ)       /* compute meta loss */
        compute ∇_u L_meta and ∇_v L_meta               /* differentiate through the inner update */
        u ← f(u, ∇_u L_meta)                            /* optimize u, v using any optimizer */
        v ← f(v, ∇_v L_meta)
        L ← Loss_inner(x, a, r, k, θ, u, v)             /* compute inner loss with θ */
        θ ← f(θ, ∇_θ L)                                 /* optimize Π_θ using any optimizer */
    end

Intuitively, at each training iteration, the inner objective naturally minimizes the bandit loss penalized by constraint violation terms proportional to the current u and v. Then, the meta objective computes a validation loss that measures the impact of the inner policy update and of u and v on the macro-averaged constraint violations. Finally, by computing the meta-gradient of the meta objective through the inner optimization loop, u and v are updated to better encourage constraint satisfaction in the next policy update iteration.
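The sketch below shows how a single iteration in the spirit of Algorithm 2 could be written in PyTorch by differentiating through one inner gradient step with torch.func.functional_call (available in recent PyTorch releases). penalized_ips_loss is the hypothetical helper from Section 3.1, and macro_violation is our own rendering of the bracketed term of (6) with λ = 1; none of this is the paper's code.

    import torch
    from torch.func import functional_call

    def macro_violation(new_probs, old_probs, domains, c_min, c_max, p_k):
        """Macro-averaged hinge violations: the bracketed term of (6) with lambda = 1."""
        repl = 1.0 - 0.5 * (new_probs - old_probs).abs().sum(dim=-1)
        viol = torch.clamp(c_min[domains] - repl, min=0.0) \
             + torch.clamp(repl - c_max[domains], min=0.0)
        return (viol / p_k[domains]).mean()

    def meta_gradient_step(policy, old_policy, batch, val_batch, u, v,
                           c_min, c_max, p_k, policy_opt, uv_opt, eta=1e-3):
        """One iteration sketch: u, v are leaf tensors with requires_grad=True managed by uv_opt."""
        x, a, r, k = batch
        xv, _, _, kv = val_batch
        with torch.no_grad():
            old_p, old_pv = old_policy(x), old_policy(xv)

        # Inner step: one differentiable gradient-descent update on cloned parameters.
        params = dict(policy.named_parameters())
        inner = penalized_ips_loss(functional_call(policy, params, (x,)),
                                   old_p, a, r, k, c_min, c_max, u, v)
        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
        updated = {n: p - eta * g for (n, p), g in zip(params.items(), grads)}

        # Meta objective on a held-out batch, evaluated with the updated parameters;
        # its gradient w.r.t. u and v flows back through the inner update.
        meta = macro_violation(functional_call(policy, updated, (xv,)),
                               old_pv, kv, c_min, c_max, p_k)
        uv_opt.zero_grad()
        meta.backward()
        uv_opt.step()

        # Finally, update the policy itself with the freshly adapted penalty weights.
        policy_opt.zero_grad()
        penalized_ips_loss(policy(x), old_p, a, r, k,
                           c_min, c_max, u.detach(), v.detach()).backward()
        policy_opt.step()

Following the paper's observation, the meta objective here focuses only on the macro-averaged constraint violations (λ = 1), so no additional hyperparameter tuning is introduced by the sketch.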
4. Experiments

4.1. Setup

We conduct experiments on a bandit agent for the problem of skill routing in commercial conversational systems such as Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana. In these systems, skill routing is a key component that takes the user's utterance as well as signals from automated speech recognition (ASR) and natural language understanding (NLU), and decides which skill and NLU interpretation should be used to serve the request [27].

The skill routing problem in a commercial dialogue agent is a natural fit for this study, as any policy change directly impacts the user experience, making safe and controlled policy updates crucial. Additionally, constraints can be defined based on the NLU domain to ensure policy safety for business-critical domains such as shopping, while potentially exploring others such as entertainment more aggressively.

Figure 1 shows an overview of the model architecture used in our experiments. Input to the model is a set of routing candidates, i.e., a combination of embedded ASR, NLU, and context vectors as well as skill embeddings. The output is the softmax-normalized propensity of selecting each candidate to handle the user request. The final model has about 12M trainable parameters consisting of a language model to encode the utterance, embeddings for contextual signals, and fully-connected layers. As our architecture closely follows the design choices from Park et al. [28], we refer readers to that paper for details.

Figure 1: An overview of the network architecture: a set of hypotheses are encoded as vectors and fed to a bi-directional LSTM, which is followed by a shared MLP and a softmax layer to normalize the candidate selection probabilities.

To train and evaluate our models, we use logged policy actions from a current production policy. The observed reward is based on a curated function of user satisfaction metrics (refer to [29] for an example). Our dataset consists of about 40M samples divided into 85% training, 10% validation, and 5% test sets, covering 27 domains that are imbalanced in the number of samples. All data used in this work was deidentified to comply with our customer privacy guidelines.

4.2. Benchmarks

In our experiments, we use three different benchmarks for the constraint settings: global, critical, and exploration. The global benchmark aims to constrain the new policy to be within an exploration limit for all domains. In addition to the global constraint, the critical benchmark asserts stronger limits for a set of critical domains defined based on expert knowledge. The exploration benchmark extends the critical benchmark by adding constraints that encourage exploration for domains that may benefit from additional exploration. Each benchmark is a list of constraints consisting of a short description, the applicable domain, and the desired replication range. Figure 2 shows the exploration benchmark as an example. We provide the exact constraint configurations in the appendix.

Figure 2: The constraint configuration list for the exploration benchmark.
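To illustrate the structure of such a benchmark (description, applicable domain, desired replication range), here is a hypothetical configuration in Python. The domain names follow the appendix, but the numeric bounds are illustrative placeholders rather than the values actually used in the paper.

    # A hypothetical constraint benchmark in the spirit of Figure 2. Each entry gives
    # a short description, the applicable domain, and the desired replication range
    # [c_min, c_max]; the numbers below are placeholders, not the paper's settings.
    EXPLORE_BENCHMARK = [
        {"description": "default exploration limit",       "domain": "*",               "replication": (0.90, 1.00)},
        {"description": "business-critical, conservative", "domain": "home_automation", "replication": (0.99, 1.00)},
        {"description": "business-critical, conservative", "domain": "shopping",        "replication": (0.99, 1.00)},
        {"description": "business-critical, conservative", "domain": "notifications",   "replication": (0.99, 1.00)},
        {"description": "encourage exploration",           "domain": "knowledge",       "replication": (0.80, 0.95)},
        {"description": "encourage exploration",           "domain": "music",           "replication": (0.80, 0.95)},
    ]

    def bounds_for(domain, benchmark=EXPLORE_BENCHMARK):
        """Return (c_min, c_max) for a domain, falling back to the default '*' entry."""
        for entry in benchmark:
            if entry["domain"] == domain:
                return entry["replication"]
        return next(e["replication"] for e in benchmark if e["domain"] == "*")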
4.3. Baselines and Metrics

As the first baseline, we consider the vanilla IPS objective, which disregards the constraints. Additionally, we build on the IPS baseline to consider the constraints using constrained optimization approaches: quadratic (uniform constant penalty weight), minimax (Algorithm 1), and meta-gradient (Algorithm 2). Unless stated otherwise, we use the Adam optimizer with the default configuration [30] (denoted by f in Section 3).

Regarding hyperparameters, for the penalty weight of the quadratic method we use values from {0.1, 1, 10, 100, 1000}. For the minimax method (Algorithm 1), we found that setting τ and ξ to one while adjusting η and γ gives very similar results to adjusting all four hyperparameters. Consequently, we use a grid search over η ∈ {1, 0.1, 0.01} and γ ∈ {1, 0.999, 0.995} to find the best settings for each benchmark. For the meta-gradient method (Algorithm 2), we found that simply using λ equal to one in the meta objective (i.e., a meta objective focusing only on the constraints) outperforms other settings (see evidence in Section 4.4). As a result, it does not require adjusting any hyperparameter, and the same setting is used across all benchmarks. We provide the hyperparameters used for each benchmark and method in the appendix.

Regarding the evaluation metrics, we use the expected off-policy reward as well as the rate of constraint violations averaged over all samples (micro-averaged) and over individual domains (macro-averaged). To comply with our privacy and business guidelines, in all instances we only report relative and normalized results which do not directly represent the actual scales or metric values.

We train each model until convergence or reaching 32 epochs and take the best-performing model based on the macro-averaged violation rate. Each experiment was run four times using different random seeds for data sampling and weight initialization to report the mean and standard deviation of each result. We used a cluster of 32 NVIDIA V100 GPUs to process a mini-batch size of 32K samples. Each individual run took between 4 and 24 hours.

4.4. Results

Table 1 shows a comparison of results for the IPS, quadratic, minimax, and meta-gradient methods on all benchmarks. For each case, we report the expected reward and the percentage of reduction in the rate of violations compared to the simple IPS objective. The meta-gradient approach consistently shows the best results across all benchmarks. The simple quadratic method performs very competitively with minimax, except for the explore benchmark, which requires more fine-grained control over multiple constraints (see Figure 2). The meta-gradient method, while having the highest reduction in constraint violations, also has very competitive performance in terms of the reward metric.

To study the impact of hyperparameters, we conducted an experiment on the critical benchmark by training minimax and meta-gradient models using different hyperparameter values. Specifically, we train minimax models (Algorithm 1) using η ∈ {1.0, 0.1, 0.01} and γ ∈ {1.0, 0.999, 0.995}. For the meta-gradient method (Algorithm 2), we use λ ∈ {0.01, 0.05, 0.1, 0.5, 0.75, 0.95, 1.0}. Figure 3 shows the results of this experiment.

Figure 3: Comparing the hyperparameter sensitivity of the minimax and meta-gradient methods on the critical benchmark. For the minimax method: (a) reward and (b) macro violation reduction with respect to different η and γ settings. For the meta-gradient method: (c) reward and (d) macro violation reduction with respect to different λ settings.

Based on this experiment, the minimax approach shows a much higher sensitivity to its two hyperparameters, with a significant impact on both the reward and violation reduction metrics. In contrast, the meta-gradient method shows much less sensitivity to the λ hyperparameter. We found that simply setting λ = 1 works very well in practice. This is very desirable for real-world large-scale settings such as conversational systems, which require frequent model updates as new features are on-boarded every day, and for which a dependency on an extensive hyperparameter search is very costly, if not impractical.

To dive deeper into the reason behind the better performance of the meta-gradient algorithm compared to the minimax approach, we investigated the constraint penalty weight values for the first 3,000 iterations of training on the global benchmark. From Figure 4, we can see that the minimax method monotonically increases the penalty weight with each iteration, which is consistent with the gradient ascent update rule in Algorithm 1. In other words, as long as there are any constraint violations, minimax will keep increasing the penalty, which in our opinion is the reason for its high sensitivity to the hyperparameters. On the other hand, the meta-gradient approach uses a validation signal to dynamically adjust the penalty weight. Consequently, it may keep the penalty term near zero for an initial phase, rapidly increase it, and then decay it when violations are reduced and achieving a higher reward is preferred.
                global benchmark                        critical benchmark                      explore benchmark
    Method      reward (%)   macro VR (%)  micro VR (%)  reward (%)   macro VR (%)  micro VR (%)  reward (%)   macro VR (%)  micro VR (%)
    IPS         89.45±0.01   0             0             89.45±0.01   0             0             89.45±0.01   0             0
    Quadratic   88.95±0.01   63.67±0.46    63.67±0.46    88.94±0.01   50.13±0.90    69.29±0.67    88.36±0.04   28.37±4.62    65.24±2.30
    Minimax     88.91±0.01   63.28±0.08    63.28±0.08    88.93±0.01   37.88±0.49    62.51±0.21    88.11±0.01   61.51±0.59    81.50±0.24
    MetaGrad    88.94±0.01   75.91±0.49    75.91±0.49    88.94±0.01   60.63±0.95    79.69±0.85    88.41±0.01   78.23±0.17    89.95±0.20

Table 1: A comparison of the baseline IPS method with the quadratic, minimax, and meta-gradient constrained optimization methods on different benchmarks. We report the normalized percentage of reduction in the number of constraint violations (VR = violation reduction) compared to the IPS method.

Figure 4: The constraint penalty weight values for the first 3,000 iterations of training using the global benchmark (log-scale penalty weight versus training iteration; minimax versus meta-gradient).

5. Conclusion

This work studied the problem of controlled exploration for off-policy contextual bandit learning. We presented a constraint optimization formulation that enables a human expert to define the boundary of the desired exploration rate for individual domains. We proposed a scalable and practical solution based on meta-gradient learning which provides the highest constraint satisfaction rates without any need for an extensive hyperparameter adjustment. Finally, we conducted experiments using data from a real-world conversation system for the skill routing problem on a set of different realistic constraint benchmarks. Based on the experimental results, we believe that the suggested controlled bandit learning approach is very promising for application in real-world bandits in which frequent safe policy updates are of paramount importance.

References

[1] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, I. Zitouni, Off-policy evaluation for slate recommendation, arXiv preprint arXiv:1605.04812 (2016).
[2] T. Joachims, A. Swaminathan, M. de Rijke, Deep learning with logged bandit feedback, in: International Conference on Learning Representations, 2018.
[3] R. Lopez, I. S. Dhillon, M. I. Jordan, Learning from extreme bandit feedback, Proc. Association for the Advancement of Artificial Intelligence (2021).
[4] R. Jagerman, I. Markov, M. de Rijke, Safe exploration for optimizing contextual bandits, ACM Transactions on Information Systems (TOIS) 38 (2020) 1–23.
[5] S. Daulton, S. Singh, V. Avadhanula, D. Dimmery, E. Bakshy, Thompson sampling for contextual bandit problems with auxiliary safety constraints, arXiv preprint arXiv:1911.00638 (2019).
[6] S. Amani, M. Alizadeh, C. Thrampoulidis, Linear stochastic bandits under safety constraints, arXiv preprint arXiv:1908.05814 (2019).
[7] M. Hoffman, B. Shahriari, N. Freitas, On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning, in: Artificial Intelligence and Statistics, PMLR, 2014, pp. 365–374.
[8] S. Guha, K. Munagala, Multi-armed bandits with limited exploration, in: Proceedings of the Annual Symposium on Theory of Computing (STOC), Citeseer, 2007.
[9] C. Riquelme, G. Tucker, J. Snoek, Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling, arXiv preprint arXiv:1802.09127 (2018).
[10] M. Collier, H. U. Llorens, Deep contextual multi-armed bandits, arXiv preprint arXiv:1807.09809 (2018).
[11] A. Balakrishnan, D. Bouneffouf, N. Mattei, F. Rossi, Using contextual bandits with behavioral constraints for constrained online movie recommendation, in: IJCAI, 2018, pp. 5802–5804.
[12] J. Liu, J. Ye, Efficient Euclidean projections in linear time, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 657–664.
[13] P. L. Donti, D. Rolnick, J. Z. Kolter, DC3: A learning method for optimization with hard constraints, International Conference on Learning Representations (ICLR) (2021).
[14] Y. Nandwani, A. Pathak, P. Singla, et al., A primal dual formulation for deep learning with constraints, in: Advances in Neural Information Processing Systems, 2019, pp. 12157–12168.
[15] H. Kervadec, J. Dolz, J. Yuan, C. Desrosiers, E. Granger, I. B. Ayed, Constrained deep networks: Lagrangian optimization via log-barrier extensions, arXiv preprint arXiv:1904.04205 (2019).
[16] N. Andrei, Penalty and augmented Lagrangian methods, in: Continuous Nonlinear Optimization for Engineering Applications in GAMS Technology, Springer, 2017, pp. 185–201.
[17] R. Glowinski, P. Le Tallec, Augmented Lagrangian and operator-splitting methods in nonlinear mechanics, SIAM, 1989.
[18] C. Jin, P. Netrapalli, M. I. Jordan, Minmax optimization: Stable limit points of gradient descent ascent are locally optimal, arXiv preprint arXiv:1902.00618 (2019).
[19] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, D. Silver, Meta-gradient reinforcement learning with an objective discovered online, arXiv preprint arXiv:2007.08433 (2020).
[20] L. Kirsch, S. van Steenkiste, J. Schmidhuber, Improving generalization in meta reinforcement learning using learned objectives, in: International Conference on Learning Representations, 2019.
[21] S. Ritter, J. Wang, Z. Kurth-Nelson, S. Jayakumar, C. Blundell, R. Pascanu, M. Botvinick, Been there, done that: Meta-learning with episodic recall, in: International Conference on Machine Learning, PMLR, 2018, pp. 4354–4363.
[22] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1126–1135.
[23] M. Ren, W. Zeng, B. Yang, R. Urtasun, Learning to reweight examples for robust deep learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 4334–4343.
[24] R. S. Sutton, Adapting bias by gradient descent: An incremental version of delta-bar-delta, in: AAAI, San Jose, CA, 1992, pp. 171–176.
[25] Z. Xu, H. van Hasselt, D. Silver, Meta-gradient reinforcement learning, arXiv preprint arXiv:1805.09801 (2018).
[26] A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, arXiv preprint arXiv:1803.02999 (2018).
[27] R. Sarikaya, The technology behind personal digital assistants: An overview of the system architecture and key components, IEEE Signal Processing Magazine 34 (2017) 67–81.
[28] S. Park, H. Li, A. Patel, S. Mudgal, S. Lee, Y.-B. Kim, S. Matsoukas, R. Sarikaya, A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems, arXiv preprint arXiv:2010.12251 (2020).
[29] M. Kachuee, H. Yuan, Y.-B. Kim, S. Lee, Self-supervised contrastive learning for efficient user satisfaction prediction in conversational agents, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4053–4064.
[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
A. Appendix

A.1. Constraint Benchmarks

Figure 5 presents the definition of the constraint benchmarks used in this paper: global, critical, and explore. The global benchmark sets a general minimum replication rate for all domains. The critical benchmark defines a tighter minimum replication rate for three business-critical domains (home automation, shopping, and notifications) and a more relaxed default case for all other domains. In the explore benchmark, we extend the critical benchmark to include exploration encouragement for the knowledge and music domains.

Figure 5: The constraint benchmarks used in this paper: (a) global, (b) critical, and (c) explore.

A.2. Selected Hyperparameters

Table 2 shows the final selected hyperparameters for each benchmark and method. The definition of each hyperparameter is presented in Algorithms 1 and 2.

    Method      Hyperparameter   global   critical   explore
    Quadratic   w                10       1000       1000
    Minimax     η                0.1      0.1        1
                γ                1        0.999      1
    Meta-Grad   λ                1        1          1

Table 2: The selected hyperparameters for each benchmark and method.