<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constrained Policy Optimization for Controlled Contextual Bandit Exploration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Kachuee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sungjin Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon Alexa AI</institution>
          ,
          <addr-line>Bellevue, WA 98004</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Contextual bandits are widely used across the industry in many applications such as search engines, dialogue systems, recommendation systems, etc. In such applications, it is often necessary to update the policy regularly as the data distribution changes and new features are being on-boarded frequently. As any new policy deployment directly impacts the user experience, safety in model updates is an important consideration in real-world bandit learning. In this study, we introduce a scalable framework for policy update safety via user-defined constraints, supporting fine-grained exploration targets for individual domains. For example, in a digital voice assistant, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning method that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments using data from a real-world conversational AI system on a set of realistic constraint benchmarks. Based on the experimental results, we demonstrate that the proposed approach is capable of achieving the best balance between the policy value and constraint satisfaction rate.</p>
      </abstract>
      <kwd-group>
        <kwd>contextual bandit</kwd>
        <kwd>constrained optimization</kwd>
        <kwd>safe exploration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Contextual bandits are important machine learning problems used in many real-world applications such as search engines, advertisement systems, conversational systems, etc. In a contextual bandit problem, the agent is tasked to select the action that achieves the highest reward given a context. In this setting, the environment is assumed to have no memory, and the reward is a stochastic function of the current context, independent of past agent-environment interactions [1, 2, 3].</p>
      <p>For real-world bandits, ensuring safe policy updates is a crucial consideration. Although there have been a number of prior studies on safe bandit updates in an online learning setting, online bandit learning is not suitable for business-critical production systems where an updated policy must go through extensive testing to ensure reliable performance on critical use cases before it gets deployed. This calls for an approach to safe policy updates in the offline bandit learning setting.</p>
      <p>In a production system, it is crucial to not only estimate but also control the changes of behavior a new policy introduces when compared to the current production policy. In the literature, this problem has been studied under safe bandit updates [4, 5, 6] and budgeted bandit learning [7, 8], usually targeting exploration budgets or encouraging a behavior resembling a baseline policy.</p>
      <p>In the context of off-policy bandit updates, we define exploration as any change in the model behavior resulting from replacing a current production policy with a new updated policy. This definition is broad and encloses stochastic exploration actions as well as any behavior change when comparing the two consecutive policies. Furthermore, we consider the scenario in which samples are naturally classified into a set of domains, each representing a unique data segment. For example, when dealing with product reviews, the product category (e.g., books, computers, etc.) would be the domain for each review. In many real-world scenarios, it is desirable to have fine-grained control on the rate of exploration for each domain. For example, in a conversational system, we may want to explore business-critical domains such as shopping more conservatively, while aggressively exploring a domain such as entertainment. Note that solely relying on a method such as Thompson sampling [9] or epsilon-greedy [10] would neither be effective to meet the requirements for individual domains nor optimal to balance between the desired fine-grained exploration and achieving the highest possible reward.</p>
      <p>While previous studies considered different aspects of constraining a bandit model, to the best of our knowledge the problem of controlling off-policy bandit behavior changes across subsequent model updates, with fine-grained control on budgets for different data segments (domains), remains unaddressed. This study is the first to tackle the aforementioned issues by providing a scalable and practical approach. The main contributions of this paper are as follows:</p>
      <p>• Introducing a formulation for controlled exploration in the context of off-policy contextual bandit learning, considering fine-grained control over domains based on user-defined constraints.
• Presenting a solution based on the primal-dual minimax constraint optimization method that is effective but requires adjusting a few hyperparameters.
• Proposing a novel meta-gradient algorithm to balance constraint satisfaction and reward maximization that works out-of-the-box and outperforms other methods with no need for hyperparameter adjustment.
• Conducting extensive experiments on the skill routing problem in a real-world conversational AI agent using a set of realistic constraint benchmarks.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <sec id="sec-1b-1">
        <title>2.1. Constrained Bandit Learning</title>
        <p>The majority of studies on controlled bandit learning consider the case of simple multi-armed stochastic bandits (i.e., without context) with practical applications in experiment design [8] and automated machine learning [7]. Hoffman et al. [7] suggested a Bayesian approach to two-phase exploration-exploitation bandit learning in which there is a pre-specified budget for exploration arm evaluations. Another aspect is to ensure safe exploration actions, which is especially useful for sensitive applications in industrial machine learning or healthcare. Amani et al. [6] introduced a solution in which an initial set of exploration actions is defined, then the exploration set is gradually expanded to ensure minimal unexpected behavior.</p>
        <p>For contextual bandits, safety has been an active research topic. Safety can be defined in the action space or in terms of model updates. For example, Daulton et al. [5] solve a two-metric setting in which one of the metrics, reward, is being maximized while enforcing a limit for regression on an auxiliary metric compared to a baseline status quo model. Balakrishnan et al. [11] attempt to learn behavioral constraints by balancing between replication of the current baseline policy and making new actions that show promising rewards. In [4], the authors define safety in terms of user experience metrics and suggest deciding on deploying a new model based on conservative confidence bounds on the off-policy estimates of such metrics.</p>
      </sec>
      <sec id="sec-1b-2">
        <title>2.2. Constrained Optimization</title>
        <p>The general problem of constrained optimization has been studied extensively in the literature. The type of constraints and the optimization problem are important aspects in the design of such algorithms. In the context of constrained optimization of differentiable function approximators (e.g., neural networks), the constraints can be defined on the parameters [12], on the outputs [13], or based on predefined functions of the model [14, 15].</p>
        <p>Most relevant to this paper, penalty methods translate constraints into penalty terms used to encourage constraint satisfaction in the optimization process. The quadratic penalty method adds the weighted square of violation metrics and adjusts the weight either by heuristics or by solving a sequence of problems with a monotonically increasing penalty weight. The augmented Lagrangian method leverages Lagrange multipliers to mitigate the issue of ill-conditioning inherent in the quadratic penalty method [16, 17]. More applicable to neural network models, Nandwani et al. [14] suggested a scalable approach to defining constraints using hinge functions and then solving the dual minimax problem. Theoretically, given an infinite number of training iterations and a decaying update rate for the max variables, it is known that the minimax objective is guaranteed to converge to a local min-max point [18]. Nonetheless, such a guarantee does not apply to real-world settings in which a non-convex problem is being solved in finite time using limited compute resources.</p>
      </sec>
      <sec id="sec-1b-3">
        <title>2.3. Meta Learning</title>
        <p>In the literature, meta-learning is defined as learning to learn, in which the model training itself is considered an inner optimization process guided by an outer meta objective. It is a broad topic spanning multiple areas of research including reward function discovery in reinforcement learning [19, 20], life-long continual learning [21], few-shot learning [22], and transfer learning [23], to name a few.</p>
        <p>Most relevant to this work is the meta-gradient idea first suggested by Sutton [24]. In meta-gradient learning, we use online cross-validation on a meta objective to compute derivatives with respect to meta parameters through a differentiable inner optimization loop. Xu et al. [25] demonstrated the effectiveness of this meta-gradient idea in deep reinforcement learning to adapt the value function as the agent is interacting with the environment. However, in practice, the compute/memory requirement for computing high-order gradients is a major challenge for meta-reinforcement learning, leading to low-order gradient approximations [26, 25]. Also, while meta-gradient learning enables adaptive adjustment of hyperparameters, it still requires a carefully designed meta objective that provides the best learning signal.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Constrained Bandit Exploration</title>
      <sec id="sec-2-1">
        <title>3.1. Problem Formulation</title>
        <p>We consider the general framework of off-policy contextual bandits in which a policy Π is used to select an action a ∈ A given the observed context vector x to maximize the scalar reward r received from the environment. Here, we assume stochastic policies of the form Π_θ(a|x) in which a model parameterized by θ (e.g., a neural network) is used to assign action selection probabilities to each action given the context. Furthermore, we assume that each sample belongs to a domain denoted by d ∈ {1 ... K} that is provided as a feature in x.</p>
        <sec id="sec-2-1-1">
          <title>In the of-policy setting, the policy is refreshed after</title>
          <p>collecting a dataset of samples from the current policy.
We adopt a definition of exploration which considers any
change in the agent behavior compared to the current
policy as an exploration action. Alternatively, we can
consider replication with respect to the current policy
as the rate at which the new policy makes similar
decisions to the current policy when both evaluated and
sampled stochastically. We define replication for Π  with
respect to Π 0 based on the L1-distance of their action
propensities given a context x:
ℛ (x) = 1
−
|Π  (x) − Π 0(x)|1 .</p>
          <p>2
trol the rate at which the new policy replicates the current
policy for each domain. This ensures safe model updates
for critical domains while enabling exploration for others
that may benefit from an extra exploration budget.
Accordingly, we define constraints to encourage the desired
behavior for samples of each domain, while learning an
of-policy bandit:
where context, action, reward, and domain (x, , , ) are
sampled from a dataset collected from the current policy.
In (2), we use  and  to indicate user-defined
replication constraints for domain .</p>
          <p>Π can be any diferentiable of-policy bandit
learning objective, for simplicity of discussion, we consider
the vanilla inverse propensity scoring (IPS) objective:
Π (x, , ) = −  Π  (|x)
Π 0(|x)
where Π 0 is the current policy and  is the observed
reward for taking action  collected in the dataset.</p>
          <p>arg min Ex,,,∼ D Π ,
.. 
≤ ℛ  (x) ≤ 
(2)
factors.</p>
          <p />
          <p>u,v
min max Ex,,,∼ D[Π (x, , )+
u max(0, 

−ℛ  (x))+v max(0, ℛ (x)− 
)].</p>
        </sec>
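        <p>As a concrete illustration of (1), (3), and (4), the sketch below (ours, not part of the paper; it assumes action-probability tensors for the new and the logged policy and hypothetical per-domain tensors lo, hi, u, v of length K) computes the replication rate, the IPS term, and the exponentiated hinge penalties for a mini-batch in PyTorch:</p>
        <preformat>
import torch

def penalized_ips_loss(p_new, p_old, actions, rewards, domains, lo, hi, u, v):
    """Sketch of objective (4). p_new/p_old are [batch, n_actions] action
    probabilities from the new and the current policy; lo, hi, u, v are
    per-domain tensors of shape [K]; domains holds each sample's domain id."""
    # Replication rate of (1): one minus half the L1 distance between policies.
    repl = 1.0 - 0.5 * (p_new - p_old).abs().sum(dim=-1)

    # Vanilla IPS objective of (3) for the logged action and reward.
    idx = torch.arange(actions.shape[0])
    ips = -rewards * p_new[idx, actions] / p_old[idx, actions]

    # Per-sample bounds and exponentiated penalty weights, looked up by domain.
    l_d, h_d = lo[domains], hi[domains]
    w_lo, w_hi = torch.exp(u[domains]), torch.exp(v[domains])

    # Hinge penalties of (4): non-zero only when replication leaves [l_d, h_d].
    penalty = w_lo * torch.clamp(l_d - repl, min=0.0) \
            + w_hi * torch.clamp(repl - h_d, min=0.0)
    return (ips + penalty).mean()
        </preformat>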
        <sec id="sec-2-1-2">
          <title>Algorithm 1 shows an outline of the policy training using the minimax method. This method has four hyperparameters controlling the max player optimization via adjusting the update frequency, learning rate, and decay</title>
          <p>Intuitively, the min player is trying to update the policy
parameters while the max player is increasingly
penalizing it for any constraint violation. A stable point of this
algorithm would be to gradually reduce the max player
update rate as the min player is getting better at
satisfying the constraints, eventually satisfying all constraints
resulting in a zero loss for the max player due to the zero
hinge penalty terms.
(3)</p>
        </sec>
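        <p>A minimal PyTorch sketch of this loop is shown below (ours, not the authors' implementation; it reuses the hypothetical penalized_ips_loss helper sketched in Section 3.1 and assumes a data loader that yields logged tuples together with the current policy's propensities):</p>
        <preformat>
import torch

def train_minimax(policy, loader, lo, hi, eta, gamma, omega, nu, n_domains):
    """Sketch of Algorithm 1 under the assumptions stated in the text."""
    u = torch.zeros(n_domains, requires_grad=True)
    v = torch.zeros(n_domains, requires_grad=True)
    opt = torch.optim.Adam(policy.parameters())
    step = 0
    for x, a, r, d, p_old in loader:
        p_new = policy(x)
        loss = penalized_ips_loss(p_new, p_old, a, r, d, lo, hi, u, v)
        if step % max(int(omega), 1) == 0:
            # Max player: gradient ascent on the penalty variables u and v.
            gu, gv = torch.autograd.grad(loss, [u, v], retain_graph=True)
            with torch.no_grad():
                u += eta * gu
                v += eta * gv
            eta = eta * gamma    # decay the max player learning rate
            omega = omega * nu   # decay schedule for the max update frequency
        # Min player: gradient step on the policy parameters.
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
    return u, v
        </preformat>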
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Meta Gradient Method</title>
        <sec id="sec-2-2-1">
          <title>Theoretically, the primal-dual minimax method is capable of achieving Pareto optimal solutions [18, 14]. However, in practice, it is infeasible to train for an infinite</title>
          <p>Algorithm 1: Minimax constrained bandit
optimization algorithm
input : D (dataset),  (max learning rate),  (max
learning rate decay),  (max update
frequency),  (max update frequency
decay)
u, v,  ←
Initialize(Π  )</p>
          <p>0
for x, , ,  ∼  do
/* use loss function of (5)
*/
 ←
if % is 0 then
(x, , , , , u, v)
/* gradient ascent, max player
/* lr/update decay, max player
/* optimize Π  , min player
*/
/* increment iteration counter */
number of iterations, and therefore approximate inner
optimization loops are being used. To find the right balance
between constraint satisfaction and policy improvement
for the minimax algorithm, it is necessary to carefully
adjust multiple hyperparameters. Note that an extensive
hyperparameter search is undesirable in many real-world
large-scale scenarios that require frequent model updates
as it entails not only significant compute costs associated
with the search but also increases the turn-around time
to deploy refreshed models. To mitigate this issue, we
suggest a meta-gradient optimization idea that adapts
u and v based on a meta objective within the training
process.</p>
          <p>Specifically, we define the following meta objective:
 = Ex,,,∼ D [(1 −  )Π (x, , ) +
max(0, 
 ←  × 
 ←  × 
end
 ←  (,</p>
          <p>∇ )
 ←  + 1
end
*/
*/
*/
*/
*/
*/
*/
*/</p>
          <p>Note that (6) is not directly dependent on u and v,
instead we rely on online cross-validation [24, 25] to
update these variables. We define an inner objective the
same as the min optimization problem of (5), do a
diferentiable optimization step, evaluate the meta objective
on another batch of data, then update u and v by taking
the derivative of the meta objective through the inner
optimization trace.</p>
          <p>Algorithm 2 presents an outline of the meta gradient
optimization method. Due to practical issues of dealing
with high-order gradients, we only consider the
immediate impact of a single inner loop update on the meta
objective. We found that discarding the vanilla
gradient descent used for the inner optimization and using
a more advanced optimizer (e.g., Adam) to update Π 
works best. Regarding the  hyperparameter, we found
that simply setting</p>
          <p>= 1 works well in practice. It
effectively means that the meta-gradient solution does not
require any hyperparameter adjustments (experimental
evidence presented in Section 4.4).</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Algorithm 2: Meta gradient constrained bandit</title>
          <p>optimization algorithm
input : D (dataset),  (learning rate),  (penalty
weight)
u, v ←
Initialize(Π  )</p>
          <p>0
for x, , ,  ∼ D and x′, ′, ′, ′ ∼ D do
/* clone policy parameters</p>
          <p>( )
 ′ ←
 ←
model
 ←
/* compute inner loss with  ′</p>
          <p>(x, , , ,  ′, u, v)
/* gradient descent on cloned
 ′ ←</p>
          <p>′ −  ∇ ′ 
/* compute meta loss</p>
          <p>(x′, ′, ′, ′,  ′,  )
/* diff. through inner update
Compute ∇u and ∇v
/* optimize u, v using any
optimizer
u ←  (u, ∇u)
v ←  (v, ∇v)
 ←
/* compute inner loss with</p>
          <p>(x, , , , , u, v)
/* optimize Π  using any
u/v. Then, the meta objective computes a validation loss
that measures the impact of the inner policy update and
u/v on the macro-averaged constraint violations. Finally,
by computing the meta-gradient of the meta objective
through the inner optimization loop, u and v are
updated to better encourage the constraint satisfaction for
the next policy update iteration.</p>
        </sec>
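        <p>The sketch below (ours; it assumes a functional policy interface policy_fn(params, x), the penalized_ips_loss helper sketched in Section 3.1, and pre-built optimizers for the policy parameters and for (u, v); macro-averaging of the violation terms over domains is omitted for brevity) illustrates one such meta-gradient step:</p>
        <preformat>
import torch

def meta_gradient_step(params, policy_fn, batch, val_batch,
                       u, v, lo, hi, alpha, lam, opt_theta, opt_uv):
    """Sketch of one iteration of Algorithm 2; params is a dict of tensors
    with requires_grad=True and policy_fn(params, x) returns action probs."""
    x, a, r, d, p_old = batch
    xv, av, rv, dv, pv_old = val_batch

    # Inner loss with the current u, v; keep the graph for second-order terms.
    inner = penalized_ips_loss(policy_fn(params, x), p_old, a, r, d, lo, hi, u, v)
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    theta_prime = {k: p - alpha * g
                   for (k, p), g in zip(params.items(), grads)}

    # Meta objective (6) on a held-out batch, evaluated with the updated copy.
    p_val = policy_fn(theta_prime, xv)
    repl = 1.0 - 0.5 * (p_val - pv_old).abs().sum(dim=-1)
    idx = torch.arange(av.shape[0])
    ips = -rv * p_val[idx, av] / pv_old[idx, av]
    violation = torch.clamp(lo[dv] - repl, min=0.0) + torch.clamp(repl - hi[dv], min=0.0)
    meta = ((1.0 - lam) * ips + lam * violation).mean()

    # Differentiate the meta objective through the inner update to get
    # gradients w.r.t. u and v, then update them with any optimizer.
    opt_uv.zero_grad()
    meta.backward()
    opt_uv.step()

    # Finally, update the actual policy parameters using the refreshed u, v.
    loss = penalized_ips_loss(policy_fn(params, x), p_old, a, r, d,
                              lo, hi, u.detach(), v.detach())
    opt_theta.zero_grad()
    loss.backward()
    opt_theta.step()
        </preformat>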
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-0">
        <title>4.1. Setup</title>
        <p>We conduct experiments on a bandit agent for the problem of skill routing in commercial conversational systems such as Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana. In these systems, skill routing is a key component that takes the user's utterance as well as signals from automated speech recognition (ASR) and natural language understanding (NLU), and then decides which skill and NLU interpretation should be used to serve the request [27].</p>
        <p>The skill routing problem in a commercial dialogue agent is a natural fit for this study as any policy change directly impacts the user experience, making safe and controlled policy updates crucial. Additionally, constraints can be defined based on the NLU domain to ensure policy safety for business-critical domains such as shopping, while potentially exploring others such as entertainment more aggressively.</p>
        <p>Figure 1 shows an overview of the model architecture used in our experiments. Input to the model is a set of routing candidates, i.e., a combination of embedded ASR, NLU, and context vectors as well as skill embeddings. The output is the softmax-normalized propensity of selecting each candidate to handle the user request. The final model has about 12M trainable parameters consisting of a language model to encode the utterance, embeddings for contextual signals, and fully-connected layers. As our architecture closely follows the design choices from Park et al. [28], we refer readers to that paper for details.</p>
        <p>Figure 1: An overview of the network architecture: a set of hypotheses are encoded as vectors and fed to a bi-directional LSTM, which is followed by a shared MLP and a softmax layer to normalize the candidate selection probabilities.</p>
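        <p>For illustration only, a schematic sketch of such a candidate-scoring network is shown below (ours; the input encoding, feature set, and layer sizes are assumptions and do not reflect the production model, for which we refer to [28]):</p>
        <preformat>
import torch
import torch.nn as nn

class SkillRoutingPolicy(nn.Module):
    """Schematic sketch: each routing candidate is encoded as a vector, the
    candidate set is contextualized with a bi-directional LSTM, scored by a
    shared MLP, and normalized with a softmax over the candidates."""

    def __init__(self, cand_dim, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(cand_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, candidates):
        # candidates: [batch, n_candidates, cand_dim] embedded ASR/NLU/context
        # and skill features for each routing hypothesis.
        h, _ = self.lstm(candidates)
        scores = self.mlp(h).squeeze(-1)      # [batch, n_candidates]
        return torch.softmax(scores, dim=-1)  # candidate selection propensities
        </preformat>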
        <p>To train and evaluate our models, we use logged policy actions from a current production policy. The observed reward is based on a curated function of user satisfaction metrics (refer to [29] for an example). Our dataset consists of about 40M samples divided into 85% training, 10% validation, and 5% test sets covering 27 domains that are imbalanced in the number of samples. All data used in this work was deidentified to comply with our customer privacy guidelines.</p>
      </sec>
      <sec id="sec-3-1">
        <title>4.2. Benchmarks</title>
        <sec id="sec-3-1-1">
          <title>In our experiments, we use three diferent benchmarks for the constraint settings: global, critical, and exploration. The global benchmark aims to constrain the new policy to be within an exploration limit</title>
        </sec>
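        <p>For concreteness, a constraint list of this form could be represented as follows (an illustrative sketch with made-up domains and replication ranges, not the actual configurations, which are given in the appendix):</p>
        <preformat>
# Illustrative only: each entry carries a short description, the domain it
# applies to, and the desired replication range [l_d, h_d] used in (2).
EXPLORE_BENCHMARK = [
    {"description": "global minimum replication", "domain": "*",
     "replication_range": (0.95, 1.00)},
    {"description": "tighter bound for a critical domain", "domain": "shopping",
     "replication_range": (0.99, 1.00)},
    {"description": "encourage exploration", "domain": "music",
     "replication_range": (0.90, 0.95)},
]
        </preformat>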
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Baselines and Metrics</title>
        <p>As the first baseline, we consider the vanilla IPS objective which disregards the constraints. Additionally, we build on the IPS baseline to consider the constraints using constraint optimization approaches: quadratic (uniform constant penalty weight), minimax (Algorithm 1), and meta-gradient (Algorithm 2). We use the Adam [30] optimizer to train all models. For the quadratic penalty method, we sweep the penalty weight over {0.1, 1, 10, 100, 1000}. For the minimax method (Algorithm 1), we found that setting the max update frequency and its decay to one works well, leaving the max learning rate η and its decay γ to be adjusted. For the meta-gradient method (Algorithm 2), setting λ = 1 outperforms other settings (see evidence in Section 4.4); as a result, it does not require adjusting any hyperparameter and the same setting is used across all benchmarks. We provide the hyperparameters used for each benchmark and method in the appendix.</p>
        <p>Regarding the evaluation metrics, we use the expected off-policy reward as well as the rate of constraint violations averaged over all samples (micro-averaged) and over individual domains (macro-averaged). To comply with our privacy and business guidelines, in all instances, we only report relative and normalized results which do not directly represent the actual scales or metric values.</p>
        <p>We train each model until convergence or for at most 32 epochs and take the best-performing model based on the macro-averaged violation rate. Each experiment was run four times using different random seeds for data sampling and weight initialization to report the mean and standard deviation of each result. We used a cluster of 32 NVIDIA V100 GPUs to process a mini-batch size of 32K samples. Each individual run took between 4 and 24 hours.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Results</title>
        <p>Table 1 shows a comparison of results for the IPS, quadratic, minimax, and meta-gradient methods on all benchmarks. For each case, we report the expected reward and the percentage of reduction in the rate of violations compared to the simple IPS objective. The meta-gradient approach consistently shows the best results across all benchmarks. The simple quadratic method behaves very competitively to minimax, except for the explore benchmark which requires more fine-grained control on multiple constraints (see Figure 2). The meta-gradient method, while having the highest reduction in constraint violations, also has very competitive performance in terms of the reward metric.</p>
        <p>Table 1: Expected reward and constraint violation reduction for the IPS, Quadratic, Minimax, and MetaGrad methods on all benchmarks.</p>
        <p>To study the impact of hyperparameters, we conducted an experiment on the critical benchmark by training minimax and meta-gradient based models using different hyperparameter values. Specifically, we train minimax models (Algorithm 1) using η ∈ {1.0, 0.1, 0.01} and γ ∈ {1.0, 0.999, 0.995}. For the meta-gradient method (Algorithm 2), we use λ ∈ {0.01, 0.05, 0.1, 0.5, 0.75, 0.95, 1.0}. Figure 3 shows the results of this experiment. Based on this experiment, the minimax approach shows a much higher sensitivity to its two hyperparameters, with a significant impact on both the reward and violation reduction metrics. However, the meta-gradient method shows much less sensitivity to the λ hyperparameter. We found that simply setting λ = 1 works very well in practice. This is very desirable for real-world large-scale settings such as conversational systems, which require frequent model updates as new features are on-boarded every day, and where a dependency on an extensive hyperparameter search is very costly, if not impractical.</p>
        <p>Figure 3: Comparing the hyperparameter sensitivity for the minimax and meta-gradient methods on the critical benchmark. For the minimax method: (a) reward and (b) macro violation reduction wrt. different η and γ settings. For the meta-gradient method: (c) reward and (d) macro violation reduction wrt. different λ settings.</p>
        <p>To dive deeper into the reason behind the better performance of the meta-gradient algorithm compared to the minimax approach, we investigated the constraint penalty weight value for the first 3,000 iterations of training using the global benchmark. From Figure 4, we can see that the minimax method monotonically increases the penalty weight with each iteration, which is a behavior consistent with the gradient ascent update rule in Algorithm 1. In other words, as long as there are any constraint violations, minimax will keep increasing the penalty, which in our opinion is the reason for the high sensitivity to the hyperparameters. On the other hand, the meta-gradient approach uses a validation signal to dynamically adjust the penalty weight. Consequently, it may keep the penalty term near zero for an initial phase, rapidly increase it, and then decay it when violations are reduced and getting a higher reward is preferred.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This work studied the problem of controlled exploration for off-policy contextual bandit learning. We presented a constrained optimization formulation that enables a human expert to define the boundary of the desired exploration rate for individual domains. We proposed a scalable and practical solution based on meta-gradient learning which provides the highest constraint satisfaction rates without any need for an extensive hyperparameter adjustment. Finally, we conducted experiments using data from a real-world conversational system for the skill routing problem on a set of different realistic constraint benchmarks. Based on the experimental results, we believe that the suggested controlled bandit learning approach is very promising for application in real-world bandits in which frequent safe policy updates are of paramount importance.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix</title>
      <sec id="sec-4-1">
        <title>A.1. Constraint Benchmarks</title>
        <p>This appendix lists the three constraint benchmarks used in this paper: global, critical, and explore.</p>
        <p>The global benchmark sets a general minimum replication rate for all domains. The critical benchmark defines a tighter minimum replication rate for three business-critical domains (home automation, shopping, and notifications) and a more relaxed default case for all other domains. In the explore benchmark, we extend the critical benchmark to include exploration encouragement for the knowledge and music domains.</p>
        <p>Constraint configuration lists: (a) global benchmark, (b) critical benchmark, (c) explore benchmark.</p>
      </sec>
      <sec id="sec-4-2">
        <title>A.2. Selected Hyperparameters</title>
        <p>This appendix provides the selected hyperparameters for each benchmark and method. The definition of each hyperparameter is presented in Algorithms 1 and 2.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref12"><mixed-citation>[12] …, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 657–664.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] P. L. Donti, D. Rolnick, J. Z. Kolter, DC3: A learning method for optimization with hard constraints, International Conference on Learning Representations (ICLR) (2021).</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] Y. Nandwani, A. Pathak, P. Singla, et al., A primal dual formulation for deep learning with constraints, in: Advances in Neural Information Processing Systems, 2019, pp. 12157–12168.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] H. Kervadec, J. Dolz, J. Yuan, C. Desrosiers, E. Granger, I. B. Ayed, Constrained deep networks: Lagrangian optimization via log-barrier extensions, arXiv preprint arXiv:1904.04205 (2019).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] N. Andrei, Penalty and augmented Lagrangian methods, in: Continuous Nonlinear Optimization for Engineering Applications in GAMS Technology, Springer, 2017, pp. 185–201.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] R. Glowinski, P. Le Tallec, Augmented Lagrangian and operator-splitting methods in nonlinear mechanics, SIAM, 1989.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] C. Jin, P. Netrapalli, M. I. Jordan, Minmax optimization: Stable limit points of gradient descent ascent are locally optimal, arXiv preprint arXiv:1902.00618 (2019).</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, D. Silver, Meta-gradient reinforcement learning with an objective discovered online, arXiv preprint arXiv:2007.08433 (2020).</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] L. Kirsch, S. van Steenkiste, J. Schmidhuber, Improving generalization in meta reinforcement learning using learned objectives, in: International Conference on Learning Representations, 2019.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] S. Ritter, J. Wang, Z. Kurth-Nelson, S. Jayakumar, C. Blundell, R. Pascanu, M. Botvinick, Been there, done that: Meta-learning with episodic recall, in: International Conference on Machine Learning, PMLR, 2018, pp. 4354–4363.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1126–1135.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] M. Ren, W. Zeng, B. Yang, R. Urtasun, Learning to reweight examples for robust deep learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 4334–4343.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] R. S. Sutton, Adapting bias by gradient descent: An incremental version of delta-bar-delta, in: AAAI, San Jose, CA, 1992, pp. 171–176.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] Z. Xu, H. van Hasselt, D. Silver, Meta-gradient reinforcement learning, arXiv preprint arXiv:1805.09801 (2018).</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, arXiv preprint arXiv:1803.02999 (2018).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] R. Sarikaya, The technology behind personal digital assistants: An overview of the system architecture and key components, IEEE Signal Processing Magazine 34 (2017) 67–81.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] S. Park, H. Li, A. Patel, S. Mudgal, S. Lee, Y.-B. Kim, S. Matsoukas, R. Sarikaya, A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems, arXiv preprint arXiv:2010.12251 (2020).</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] M. Kachuee, H. Yuan, Y.-B. Kim, S. Lee, Self-supervised contrastive learning for efficient user satisfaction prediction in conversational agents, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4053–4064.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation></ref>
    </ref-list>
  </back>
</article>