Constrained Policy Optimization for Controlled Contextual Bandit Exploration

Mohammad Kachuee, Sungjin Lee
Amazon Alexa AI, Bellevue, WA 98004, USA
kachum@amazon.com (M. Kachuee); sungjinl@amazon.com (S. Lee)

The IJCAI-ECAI-22 Workshop on Artificial Intelligence Safety (AISafety 2022), July 24-25, 2022, Vienna, Austria

Abstract

Contextual bandits are widely used across the industry in many applications such as search engines, dialogue systems, recommendation systems, etc. In such applications, it is often necessary to update the policy regularly as the data distribution changes and new features are being on-boarded frequently. As any new policy deployment directly impacts the user experience, safety in model updates is an important consideration in real-world bandit learning. In this study, we introduce a scalable framework for policy update safety via user-defined constraints, supporting fine-grained exploration targets for individual domains. For example, in a digital voice assistant, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning method that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments using data from a real-world conversational AI system on a set of realistic constraint benchmarks. Based on the experimental results, we demonstrate that the proposed approach is capable of achieving the best balance between the policy value and the constraint satisfaction rate.

Keywords: contextual bandit, constrained optimization, safe exploration

1. Introduction

Contextual bandits are important machine learning problems used in many real-world applications such as search engines, advertisement systems, conversational systems, etc. In a contextual bandit problem, the agent is tasked to select the action that achieves the highest reward given a context. In this setting, the environment is assumed to have no memory, and the reward is a stochastic function of the current context, independent of past agent-environment interactions [1, 2, 3].

For real-world bandits, ensuring safe policy updates is a crucial consideration. Although there have been a number of prior studies on safe bandit updates in an online learning setting, online bandit learning is not suitable for business-critical production systems where an updated policy must go through extensive testing to ensure reliable performance on critical use cases before it gets deployed. This calls for an approach to safe policy updates in the offline bandit learning setting.
In a production system, it is crucial to not only estimate but also control the changes in behavior that a new policy introduces when compared to the current production policy. In the literature, this problem has been studied under safe bandit updates [4, 5, 6] and budgeted bandit learning [7, 8], usually targeting exploration budgets or encouraging behavior that resembles a baseline policy.

In the context of off-policy bandit updates, we define exploration as any change in the model behavior resulting from replacing a current production policy with a new updated policy. This definition is broad and encompasses stochastic exploration actions as well as any behavior change observed when comparing two consecutive policies.

Furthermore, we consider the scenario in which samples are naturally classified into a set of domains, each representing a unique data segment. For example, when dealing with product reviews, the product category (e.g., books, computers, etc.) would be the domain for each review. In many real-world scenarios, it is desirable to have fine-grained control over the rate of exploration for each domain. For example, in a conversational system, we may want to explore business-critical domains such as shopping more conservatively, while aggressively exploring a domain such as entertainment. Note that solely relying on a method such as Thompson sampling [9] or epsilon-greedy [10] would neither meet the requirements of individual domains nor optimally balance the desired fine-grained exploration against achieving the highest possible reward.

While previous studies have considered different aspects of constraining a bandit model, to the best of our knowledge the problem of controlling off-policy bandit behavior changes across subsequent model updates, with fine-grained control over budgets for different data segments (domains), remains unaddressed. This study is the first to tackle the aforementioned issues by providing a scalable and practical approach. The main contributions of this paper are as follows:

• Introducing a formulation for controlled exploration in the context of off-policy contextual bandit learning, considering fine-grained control over domains based on user-defined constraints.
• Presenting a solution based on the primal-dual minimax constraint optimization method that is effective but requires adjusting a few hyperparameters.
• Proposing a novel meta-gradient algorithm to balance constraint satisfaction and reward maximization that works out-of-the-box and outperforms other methods with no need for hyperparameter adjustment.
• Conducting extensive experiments on the skill routing problem in a real-world conversational AI agent using a set of realistic constraint benchmarks.
2. Related Work

2.1. Constrained Bandit Learning

The majority of studies on controlled bandit learning consider the case of simple multi-armed stochastic bandits (i.e., without context), with practical applications in experiment design [8] and automated machine learning [7]. Hoffman et al. [7] suggested a Bayesian approach to two-phase exploration-exploitation bandit learning in which there is a pre-specified budget for exploration arm evaluations. Another aspect is to ensure safe exploration actions, which is especially useful for sensitive applications in industrial machine learning or healthcare. Amani et al. [6] introduced a solution in which an initial set of exploration actions is defined, and the exploration set is then gradually expanded to ensure minimal unexpected behavior.

For contextual bandits, safety has been an active research topic. Safety can be defined in the action space or in terms of model updates. For example, Daulton et al. [5] address a two-metric setting in which one metric, the reward, is maximized while enforcing a limit on the regression of an auxiliary metric compared to a baseline status quo model. Balakrishnan et al. [11] attempt to learn behavioral constraints by balancing between replicating the current baseline policy and taking new actions that show promising rewards. In [4], the authors define safety in terms of user experience metrics and suggest deciding whether to deploy a new model based on conservative confidence bounds on the off-policy estimates of such metrics.

2.2. Constrained Optimization

The general problem of constrained optimization has been studied extensively in the literature. The type of constraints and the optimization problem are important aspects in the design of such algorithms. In the context of constrained optimization of differentiable function approximators (e.g., neural networks), the constraints can be defined on the parameters [12], on the outputs [13], or based on predefined functions of the model [14, 15].

Most relevant to this paper, penalty methods translate constraints into penalty terms used to encourage constraint satisfaction in the optimization process. The quadratic penalty method adds the weighted square of violation metrics and adjusts the weight either by heuristics or by solving a sequence of problems with a monotonically increasing penalty weight. The augmented Lagrangian method leverages Lagrange multipliers to mitigate the ill-conditioning inherent in the quadratic penalty method [16, 17]. More applicable to neural network models, Nandwani et al. [14] suggested a scalable approach that defines constraints using hinge functions and then solves the dual minimax problem. Theoretically, given an infinite number of training iterations and a decaying update rate for the max variables, the minimax objective is guaranteed to converge to a local min-max point [18]. Nonetheless, such a guarantee does not apply to real-world settings in which a non-convex problem is being solved in finite time using limited compute resources.

2.3. Meta Learning

In the literature, meta-learning is defined as learning to learn, in which the model training itself is considered an inner optimization process guided by an outer meta objective. It is a broad topic spanning multiple areas of research including reward function discovery in reinforcement learning [19, 20], life-long continual learning [21], few-shot learning [22], and transfer learning [23], to name a few.

Most relevant to this work is the meta-gradient idea first suggested by Sutton [24]. In meta-gradient learning, we use online cross-validation on a meta objective to compute derivatives with respect to meta parameters through a differentiable inner optimization loop. Xu et al. [25] demonstrated the effectiveness of this meta-gradient idea in deep reinforcement learning to adapt the value function as the agent interacts with the environment. However, in practice, the compute/memory requirement of computing high-order gradients is a major challenge for meta-reinforcement learning, leading to low-order gradient approximations [26, 25]. Also, while meta-gradient learning enables adaptive adjustment of hyperparameters, it still requires a carefully designed meta objective that provides the best learning signal.

3. Constrained Bandit Exploration

3.1. Problem Formulation

We consider the general framework of off-policy contextual bandits in which a policy Π is used to select an action a ∈ A given the observed context vector (x) to maximize the scalar reward (r) received from the environment. Here, we assume stochastic policies of the form Π_θ(a|x), in which a model parameterized by θ (e.g., a neural network) is used to assign action selection probabilities to each action given the context. Furthermore, we assume that each sample belongs to a domain denoted by k ∈ 1 ... M that is provided as a feature in x.
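To make this setup concrete, the following is a minimal sketch (not from the paper) of the interface assumed above: a stochastic policy that maps a context to action propensities, and logged tuples (x, a, r, k) collected by sampling from the current production policy Π_0. The class and function names, network sizes, and record layout are our own illustrative assumptions.

    import torch
    import torch.nn as nn

    class StochasticPolicy(nn.Module):
        """Maps a context vector x to action-selection propensities Pi_theta(.|x)."""
        def __init__(self, context_dim, num_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(context_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, num_actions))

        def forward(self, x):
            return torch.softmax(self.net(x), dim=-1)   # [batch, num_actions]

    def log_interaction(pi_0, x, k, reward_fn):
        """One logged off-policy record: context x (the domain k is also encoded as a
        feature of x), action a sampled from the current policy Pi_0, observed reward
        r, and the domain index k used later for the per-domain constraints."""
        probs = pi_0(x.unsqueeze(0)).squeeze(0)
        a = torch.multinomial(probs, num_samples=1).item()   # stochastic action choice
        return {"x": x, "a": a, "r": reward_fn(x, a), "k": k,
                "propensity": probs[a].item()}

Storing the logging propensity Π_0(a|x) alongside each record is what later enables the inverse propensity scoring objective of (3).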
In the off-policy setting, the policy is refreshed after collecting a dataset of samples from the current policy. We adopt a definition of exploration that considers any change in the agent behavior compared to the current policy as an exploration action. Equivalently, we can consider replication with respect to the current policy as the rate at which the new policy makes decisions similar to the current policy when both are evaluated and sampled stochastically. We define replication for Π_θ with respect to Π_0 based on the L1-distance of their action propensities given a context x:

    ℛ_θ(x) = 1 − |Π_θ(x) − Π_0(x)|_1 / 2.    (1)

In a production system, it is desirable to precisely control the rate at which the new policy replicates the current policy for each domain. This ensures safe model updates for critical domains while enabling exploration for others that may benefit from an extra exploration budget. Accordingly, we define constraints to encourage the desired behavior for samples of each domain while learning an off-policy bandit:

    arg min_θ E_{x,a,r,k∼D} [L_{Π_θ}]
    s.t.  c_k^min ≤ ℛ_θ(x) ≤ c_k^max,    (2)

where context, action, reward, and domain (x, a, r, k) are sampled from a dataset collected from the current policy. In (2), we use c_k^min and c_k^max to indicate the user-defined replication constraints for domain k.

L_{Π_θ} can be any differentiable off-policy bandit learning objective; for simplicity of discussion, we consider the vanilla inverse propensity scoring (IPS) objective:

    L_{Π_θ}(x, a, r) = −r Π_θ(a|x) / Π_0(a|x),    (3)

where Π_0 is the current policy and r is the observed reward for taking action a collected in the dataset.

A common approach to optimizing constrained problems such as (2) is the penalty method, which translates constraints into penalty terms that encourage constraint satisfaction:

    arg min_θ E_{x,a,r,k∼D} [L_{Π_θ}(x, a, r) + e^{u_k} max(0, c_k^min − ℛ_θ(x)) + e^{v_k} max(0, ℛ_θ(x) − c_k^max)].    (4)

Here, the penalty terms are always non-negative and increase if the new policy assigns action probabilities that deviate from the current policy outside the desired boundary. u ∈ R^M and v ∈ R^M are variables that adjust the weight of each constraint violation term. The exponentiation improves the sensitivity to these parameters and ensures non-negative penalty weights. For (4) to actually solve the original constrained problem of (2), proper values for u and v need to be used that enable the best balance between constraint satisfaction and the policy value. In the constrained optimization literature, various methods have been suggested to solve this form of problem (see Section 2.2). In this paper, we solve it using the primal-dual minimax method suggested by Nandwani et al. [14] (Section 3.2) as well as a novel meta-learning method (Section 3.3).
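As an illustration of how (1), (3), and (4) fit together, here is a minimal PyTorch sketch of a batched penalized loss. The function name, tensor shapes, and per-domain lookups are our own assumptions, not the paper's implementation.

    import torch

    def penalized_ips_loss(new_probs, old_probs, actions, rewards, domains,
                           c_min, c_max, u, v):
        """Batched IPS loss of (3) plus the exponentiated hinge penalties of (4).

        new_probs, old_probs: [B, A] action propensities of the new / current policy.
        actions: [B] chosen action indices; rewards: [B] observed rewards.
        domains: [B] domain indices; c_min, c_max, u, v: [M] per-domain tensors.
        """
        # Replication metric of (1): 1 - 0.5 * L1 distance between propensities.
        replication = 1.0 - 0.5 * (new_probs - old_probs).abs().sum(dim=-1)

        # Vanilla IPS objective of (3).
        idx = torch.arange(actions.shape[0])
        ips = -(rewards * new_probs[idx, actions] / old_probs[idx, actions])

        # Per-sample constraint bounds and penalty weights, looked up by domain.
        lo, hi = c_min[domains], c_max[domains]
        w_lo, w_hi = u[domains].exp(), v[domains].exp()

        # Hinge penalties are non-zero only when replication leaves [c_min_k, c_max_k].
        penalty = w_lo * torch.clamp(lo - replication, min=0.0) \
                + w_hi * torch.clamp(replication - hi, min=0.0)
        return (ips + penalty).mean()

Keeping u and v as learnable tensors allows them to be either ascended on directly (the minimax method of Section 3.2) or adapted through a meta objective (Section 3.3).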
3.2. Minimax Primal-Dual Method

Nandwani et al. [14] suggested a formulation of the augmented Lagrangian method that supports inequality constraints. They solve the dual of this problem, i.e., a maximin objective, to improve scalability:

    min_θ max_{u,v} E_{x,a,r,k∼D} [L_{Π_θ}(x, a, r) + e^{u_k} max(0, c_k^min − ℛ_θ(x)) + e^{v_k} max(0, ℛ_θ(x) − c_k^max)].    (5)

Algorithm 1 shows an outline of policy training using the minimax method. This method has four hyperparameters controlling the max player optimization by adjusting the update frequency, the learning rate, and their decay factors.

Algorithm 1: Minimax constrained bandit optimization
    input: D (dataset), η (max learning rate), γ (max learning rate decay),
           τ (max update frequency), ξ (max update frequency decay)
    u, v, t ← 0
    Initialize(Π_θ)
    for (x, a, r, k) ∼ D do
        L ← Loss(x, a, r, k, θ, u, v)        /* use loss function of (5) */
        if t mod τ is 0 then
            u ← u + η ∇_u L                  /* gradient ascent, max player */
            v ← v + η ∇_v L
            η ← γ × η                        /* lr/update decay, max player */
            τ ← ξ × τ
        end
        θ ← f(θ, ∇_θ L)                      /* optimize Π_θ, min player */
        t ← t + 1                            /* increment iteration counter */
    end

Intuitively, the min player is trying to update the policy parameters while the max player is increasingly penalizing it for any constraint violation. A stable point of this algorithm is reached by gradually reducing the max player update rate as the min player gets better at satisfying the constraints, eventually satisfying all constraints and resulting in a zero loss for the max player due to the zero hinge penalty terms.
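A rough PyTorch rendering of Algorithm 1 is sketched below, reusing the hypothetical penalized_ips_loss helper from Section 3.1. The loader format, the choice of Adam for the min player, and the guard on a non-integer τ are our assumptions rather than details from the paper.

    import torch

    def train_minimax(policy, old_policy, loader, c_min, c_max,
                      eta=0.1, gamma=0.999, tau=1.0, xi=1.0, lr=1e-3):
        # policy / old_policy: map a context batch to action propensities, e.g. the
        # StochasticPolicy sketched in Section 3.1; loader yields (x, a, r, k) batches.
        num_domains = c_min.shape[0]
        u = torch.zeros(num_domains, requires_grad=True)     # max-player variables
        v = torch.zeros(num_domains, requires_grad=True)
        opt = torch.optim.Adam(policy.parameters(), lr=lr)   # the "f" of Algorithm 1

        for t, (x, a, r, k) in enumerate(loader):
            with torch.no_grad():
                old_probs = old_policy(x)
            loss = penalized_ips_loss(policy(x), old_probs, a, r, k,
                                      c_min, c_max, u, v)    # objective of (5)

            # Max player: periodic gradient ascent on u, v with decaying eta and tau.
            if t % max(int(round(tau)), 1) == 0:
                g_u, g_v = torch.autograd.grad(loss, (u, v), retain_graph=True)
                with torch.no_grad():
                    u += eta * g_u
                    v += eta * g_v
                eta *= gamma
                tau *= xi

            # Min player: gradient descent on the policy parameters.
            opt.zero_grad()
            loss.backward()
            opt.step()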
3.3. Meta Gradient Method

Theoretically, the primal-dual minimax method is capable of achieving Pareto-optimal solutions [18, 14]. However, in practice, it is infeasible to train for an infinite number of iterations, and therefore approximate inner optimization loops are used. To find the right balance between constraint satisfaction and policy improvement with the minimax algorithm, it is necessary to carefully adjust multiple hyperparameters. Note that an extensive hyperparameter search is undesirable in many real-world large-scale scenarios that require frequent model updates, as it entails not only significant compute costs associated with the search but also an increased turn-around time to deploy refreshed models. To mitigate this issue, we suggest a meta-gradient optimization idea that adapts u and v based on a meta objective within the training process.

Specifically, we define the following meta objective:

    L_meta = E_{x,a,r,k∼D} [(1 − λ) L_{Π_θ}(x, a, r) + (λ / p(k)) (max(0, c_k^min − ℛ_θ(x)) + max(0, ℛ_θ(x) − c_k^max))],    (6)

where λ is a hyperparameter that balances the bandit objective and the constraint penalty terms. The second term is the macro average of the constraint violation terms, in which p(k) is the prior probability of samples belonging to domain k, which can be easily pre-computed for a large batch of samples.

Note that (6) is not directly dependent on u and v; instead, we rely on online cross-validation [24, 25] to update these variables. We define an inner objective identical to the min optimization problem of (5), take a differentiable optimization step, evaluate the meta objective on another batch of data, and then update u and v by taking the derivative of the meta objective through the inner optimization trace.

Algorithm 2 presents an outline of the meta-gradient optimization method. Due to the practical issues of dealing with high-order gradients, we only consider the immediate impact of a single inner-loop update on the meta objective. We found that discarding the vanilla gradient descent used for the inner optimization and using a more advanced optimizer (e.g., Adam) to update Π_θ works best. Regarding the λ hyperparameter, we found that simply setting λ = 1 works well in practice. This effectively means that the meta-gradient solution does not require any hyperparameter adjustment (experimental evidence is presented in Section 4.4).

Algorithm 2: Meta-gradient constrained bandit optimization
    input: D (dataset), η (learning rate), λ (penalty weight)
    u, v ← 0
    Initialize(Π_θ)
    for (x, a, r, k) ∼ D and (x′, a′, r′, k′) ∼ D do
        θ′ ← clone(θ)                                   /* clone policy parameters */
        L_inner ← Loss_inner(x, a, r, k, θ′, u, v)      /* compute inner loss with θ′ */
        θ′ ← θ′ − η ∇_{θ′} L_inner                      /* gradient descent on cloned model */
        L_meta ← Loss_meta(x′, a′, r′, k′, θ′, λ)       /* compute meta loss */
        compute ∇_u L_meta and ∇_v L_meta               /* differentiate through the inner update */
        u ← f(u, ∇_u L_meta)                            /* optimize u, v using any optimizer */
        v ← f(v, ∇_v L_meta)
        L ← Loss_inner(x, a, r, k, θ, u, v)             /* compute inner loss with θ */
        θ ← f(θ, ∇_θ L)                                 /* optimize Π_θ using any optimizer */
    end

Intuitively, at each training iteration, the inner objective naturally minimizes the bandit loss penalized by constraint violation terms proportional to the current u and v. Then, the meta objective computes a validation loss that measures the impact of the inner policy update and of u and v on the macro-averaged constraint violations. Finally, by computing the meta-gradient of the meta objective through the inner optimization loop, u and v are updated to better encourage constraint satisfaction in the next policy update iteration.
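The sketch below shows how a single iteration in the spirit of Algorithm 2 could be written in PyTorch by differentiating through one inner gradient step with torch.func.functional_call (available in recent PyTorch releases). penalized_ips_loss is the hypothetical helper from Section 3.1, and macro_violation is our own rendering of the bracketed term of (6) with λ = 1; none of this is the paper's code.

    import torch
    from torch.func import functional_call

    def macro_violation(new_probs, old_probs, domains, c_min, c_max, p_k):
        """Macro-averaged hinge violations: the bracketed term of (6) with lambda = 1."""
        repl = 1.0 - 0.5 * (new_probs - old_probs).abs().sum(dim=-1)
        viol = torch.clamp(c_min[domains] - repl, min=0.0) \
             + torch.clamp(repl - c_max[domains], min=0.0)
        return (viol / p_k[domains]).mean()

    def meta_gradient_step(policy, old_policy, batch, val_batch, u, v,
                           c_min, c_max, p_k, policy_opt, uv_opt, eta=1e-3):
        """One iteration sketch: u, v are leaf tensors with requires_grad=True managed by uv_opt."""
        x, a, r, k = batch
        xv, _, _, kv = val_batch
        with torch.no_grad():
            old_p, old_pv = old_policy(x), old_policy(xv)

        # Inner step: one differentiable gradient-descent update on cloned parameters.
        params = dict(policy.named_parameters())
        inner = penalized_ips_loss(functional_call(policy, params, (x,)),
                                   old_p, a, r, k, c_min, c_max, u, v)
        grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
        updated = {n: p - eta * g for (n, p), g in zip(params.items(), grads)}

        # Meta objective on a held-out batch, evaluated with the updated parameters;
        # its gradient w.r.t. u and v flows back through the inner update.
        meta = macro_violation(functional_call(policy, updated, (xv,)),
                               old_pv, kv, c_min, c_max, p_k)
        uv_opt.zero_grad()
        meta.backward()
        uv_opt.step()

        # Finally, update the policy itself with the freshly adapted penalty weights.
        policy_opt.zero_grad()
        penalized_ips_loss(policy(x), old_p, a, r, k,
                           c_min, c_max, u.detach(), v.detach()).backward()
        policy_opt.step()

Following the paper's observation, the meta objective here focuses only on the macro-averaged constraint violations (λ = 1), so no additional hyperparameter tuning is introduced by the sketch.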
4. Experiments

4.1. Setup

We conduct experiments on a bandit agent for the problem of skill routing in commercial conversational systems such as Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana. In these systems, skill routing is a key component that takes the user's utterance as well as signals from automated speech recognition (ASR) and natural language understanding (NLU), and decides which skill and NLU interpretation should be used to serve the request [27].

The skill routing problem in a commercial dialogue agent is a natural fit for this study, as any policy change directly impacts the user experience, making safe and controlled policy updates crucial. Additionally, constraints can be defined based on the NLU domain to ensure policy safety for business-critical domains such as shopping, while potentially exploring others such as entertainment more aggressively.

Figure 1 shows an overview of the model architecture used in our experiments. Input to the model is a set of routing candidates, i.e., a combination of embedded ASR, NLU, and context vectors as well as skill embeddings. The output is the softmax-normalized propensity of selecting each candidate to handle the user request. The final model has about 12M trainable parameters consisting of a language model to encode the utterance, embeddings for contextual signals, and fully-connected layers. As our architecture closely follows the design choices from Park et al. [28], we refer readers to that paper for details.

Figure 1: An overview of the network architecture: a set of hypotheses are encoded as vectors and fed to a bi-directional LSTM, which is followed by a shared MLP and a softmax layer to normalize the candidate selection probabilities.

To train and evaluate our models, we use logged policy actions from a current production policy. The observed reward is based on a curated function of user satisfaction metrics (refer to [29] for an example). Our dataset consists of about 40M samples divided into 85% training, 10% validation, and 5% test sets, covering 27 domains that are imbalanced in the number of samples. All data used in this work was deidentified to comply with our customer privacy guidelines.

4.2. Benchmarks

In our experiments, we use three different benchmarks for the constraint settings: global, critical, and exploration. The global benchmark aims to constrain the new policy to be within an exploration limit for all domains. In addition to the global constraint, the critical benchmark asserts stronger limits for a set of critical domains defined based on expert knowledge. The exploration benchmark extends the critical benchmark by adding constraints that encourage exploration for domains that may benefit from additional exploration. Each benchmark is a list of constraints consisting of a short description, the applicable domain, and the desired replication range. Figure 2 shows the exploration benchmark as an example. We provide the exact constraint configurations in the appendix.

Figure 2: The constraint configuration list for the exploration benchmark.
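To illustrate the structure of such a benchmark (description, applicable domain, desired replication range), here is a hypothetical configuration in Python. The domain names follow the appendix, but the numeric bounds are illustrative placeholders rather than the values actually used in the paper.

    # A hypothetical constraint benchmark in the spirit of Figure 2. Each entry gives
    # a short description, the applicable domain, and the desired replication range
    # [c_min, c_max]; the numbers below are placeholders, not the paper's settings.
    EXPLORE_BENCHMARK = [
        {"description": "default exploration limit",       "domain": "*",               "replication": (0.90, 1.00)},
        {"description": "business-critical, conservative", "domain": "home_automation", "replication": (0.99, 1.00)},
        {"description": "business-critical, conservative", "domain": "shopping",        "replication": (0.99, 1.00)},
        {"description": "business-critical, conservative", "domain": "notifications",   "replication": (0.99, 1.00)},
        {"description": "encourage exploration",           "domain": "knowledge",       "replication": (0.80, 0.95)},
        {"description": "encourage exploration",           "domain": "music",           "replication": (0.80, 0.95)},
    ]

    def bounds_for(domain, benchmark=EXPLORE_BENCHMARK):
        """Return (c_min, c_max) for a domain, falling back to the default '*' entry."""
        for entry in benchmark:
            if entry["domain"] == domain:
                return entry["replication"]
        return next(e["replication"] for e in benchmark if e["domain"] == "*")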
4.3. Baselines and Metrics

As the first baseline, we consider the vanilla IPS objective, which disregards the constraints. Additionally, we build on the IPS baseline to consider the constraints using constrained optimization approaches: quadratic (uniform constant penalty weight), minimax (Algorithm 1), and meta-gradient (Algorithm 2). Unless stated otherwise, we use the Adam optimizer with the default configuration [30] (denoted by f in Section 3).

Regarding hyperparameters, for the penalty weight of the quadratic method we use values from {0.1, 1, 10, 100, 1000}. For the minimax method (Algorithm 1), we found that setting τ and ξ to one while adjusting η and γ gives very similar results to adjusting all four hyperparameters. Consequently, we use a grid search over η ∈ {1, 0.1, 0.01} and γ ∈ {1, 0.999, 0.995} to find the best settings for each benchmark. For the meta-gradient method (Algorithm 2), we found that simply using λ equal to one in the meta objective (i.e., a meta objective focusing only on the constraints) outperforms other settings (see evidence in Section 4.4). As a result, it does not require adjusting any hyperparameter, and the same setting is used across all benchmarks. We provide the hyperparameters used for each benchmark and method in the appendix.

Regarding the evaluation metrics, we use the expected off-policy reward as well as the rate of constraint violations averaged over all samples (micro-averaged) and over individual domains (macro-averaged). To comply with our privacy and business guidelines, in all instances we only report relative and normalized results which do not directly represent the actual scales or metric values.

We train each model until convergence or reaching 32 epochs and take the best-performing model based on the macro-averaged violation rate. Each experiment was run four times using different random seeds for data sampling and weight initialization to report the mean and standard deviation of each result. We used a cluster of 32 NVIDIA V100 GPUs to process a mini-batch size of 32K samples. Each individual run took between 4 and 24 hours.

4.4. Results

Table 1 shows a comparison of results for the IPS, quadratic, minimax, and meta-gradient methods on all benchmarks. For each case, we report the expected reward and the percentage of reduction in the rate of violations compared to the simple IPS objective. The meta-gradient approach consistently shows the best results across all benchmarks. The simple quadratic method performs very competitively with minimax, except for the explore benchmark, which requires more fine-grained control over multiple constraints (see Figure 2). The meta-gradient method, while having the highest reduction in constraint violations, also has very competitive performance in terms of the reward metric.

To study the impact of hyperparameters, we conducted an experiment on the critical benchmark by training minimax and meta-gradient models using different hyperparameter values. Specifically, we train minimax models (Algorithm 1) using η ∈ {1.0, 0.1, 0.01} and γ ∈ {1.0, 0.999, 0.995}. For the meta-gradient method (Algorithm 2), we use λ ∈ {0.01, 0.05, 0.1, 0.5, 0.75, 0.95, 1.0}. Figure 3 shows the results of this experiment.

Figure 3: Comparing the hyperparameter sensitivity of the minimax and meta-gradient methods on the critical benchmark. For the minimax method: (a) reward and (b) macro violation reduction with respect to different η and γ settings. For the meta-gradient method: (c) reward and (d) macro violation reduction with respect to different λ settings.

Based on this experiment, the minimax approach shows a much higher sensitivity to its two hyperparameters, with a significant impact on both the reward and violation reduction metrics. In contrast, the meta-gradient method shows much less sensitivity to the λ hyperparameter. We found that simply setting λ = 1 works very well in practice. This is very desirable for real-world large-scale settings such as conversational systems, which require frequent model updates as new features are on-boarded every day, and for which a dependency on an extensive hyperparameter search is very costly, if not impractical.

To dive deeper into the reason behind the better performance of the meta-gradient algorithm compared to the minimax approach, we investigated the constraint penalty weight values for the first 3,000 iterations of training on the global benchmark. From Figure 4, we can see that the minimax method monotonically increases the penalty weight with each iteration, which is consistent with the gradient ascent update rule in Algorithm 1. In other words, as long as there are any constraint violations, minimax will keep increasing the penalty, which in our opinion is the reason for its high sensitivity to the hyperparameters. On the other hand, the meta-gradient approach uses a validation signal to dynamically adjust the penalty weight. Consequently, it may keep the penalty term near zero for an initial phase, rapidly increase it, and then decay it when violations are reduced and achieving a higher reward is preferred.
                global benchmark                        critical benchmark                      explore benchmark
    Method      reward (%)   macro VR (%)  micro VR (%)  reward (%)   macro VR (%)  micro VR (%)  reward (%)   macro VR (%)  micro VR (%)
    IPS         89.45±0.01   0             0             89.45±0.01   0             0             89.45±0.01   0             0
    Quadratic   88.95±0.01   63.67±0.46    63.67±0.46    88.94±0.01   50.13±0.90    69.29±0.67    88.36±0.04   28.37±4.62    65.24±2.30
    Minimax     88.91±0.01   63.28±0.08    63.28±0.08    88.93±0.01   37.88±0.49    62.51±0.21    88.11±0.01   61.51±0.59    81.50±0.24
    MetaGrad    88.94±0.01   75.91±0.49    75.91±0.49    88.94±0.01   60.63±0.95    79.69±0.85    88.41±0.01   78.23±0.17    89.95±0.20

Table 1: A comparison of the baseline IPS method with the quadratic, minimax, and meta-gradient constrained optimization methods on different benchmarks. We report the normalized percentage of reduction in the number of constraint violations (VR = violation reduction) compared to the IPS method.

Figure 4: The constraint penalty weight values for the first 3,000 iterations of training using the global benchmark (log-scale penalty weight versus training iteration; minimax versus meta-gradient).

5. Conclusion

This work studied the problem of controlled exploration for off-policy contextual bandit learning. We presented a constraint optimization formulation that enables a human expert to define the boundary of the desired exploration rate for individual domains. We proposed a scalable and practical solution based on meta-gradient learning which provides the highest constraint satisfaction rates without any need for an extensive hyperparameter adjustment. Finally, we conducted experiments using data from a real-world conversation system for the skill routing problem on a set of different realistic constraint benchmarks. Based on the experimental results, we believe that the suggested controlled bandit learning approach is very promising for application in real-world bandits in which frequent safe policy updates are of paramount importance.

References

[1] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, I. Zitouni, Off-policy evaluation for slate recommendation, arXiv preprint arXiv:1605.04812 (2016).
[2] T. Joachims, A. Swaminathan, M. de Rijke, Deep learning with logged bandit feedback, in: International Conference on Learning Representations, 2018.
[3] R. Lopez, I. S. Dhillon, M. I. Jordan, Learning from extreme bandit feedback, Proc. Association for the Advancement of Artificial Intelligence (2021).
[4] R. Jagerman, I. Markov, M. de Rijke, Safe exploration for optimizing contextual bandits, ACM Transactions on Information Systems (TOIS) 38 (2020) 1–23.
[5] S. Daulton, S. Singh, V. Avadhanula, D. Dimmery, E. Bakshy, Thompson sampling for contextual bandit problems with auxiliary safety constraints, arXiv preprint arXiv:1911.00638 (2019).
[6] S. Amani, M. Alizadeh, C. Thrampoulidis, Linear stochastic bandits under safety constraints, arXiv preprint arXiv:1908.05814 (2019).
[7] M. Hoffman, B. Shahriari, N. Freitas, On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning, in: Artificial Intelligence and Statistics, PMLR, 2014, pp. 365–374.
[8] S. Guha, K. Munagala, Multi-armed bandits with limited exploration, in: Proceedings of the Annual Symposium on Theory of Computing (STOC), Citeseer, 2007.
[9] C. Riquelme, G. Tucker, J. Snoek, Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling, arXiv preprint arXiv:1802.09127 (2018).
[10] M. Collier, H. U. Llorens, Deep contextual multi-armed bandits, arXiv preprint arXiv:1807.09809 (2018).
[11] A. Balakrishnan, D. Bouneffouf, N. Mattei, F. Rossi, Using contextual bandits with behavioral constraints for constrained online movie recommendation, in: IJCAI, 2018, pp. 5802–5804.
[12] J. Liu, J. Ye, Efficient Euclidean projections in linear time, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 657–664.
[13] P. L. Donti, D. Rolnick, J. Z. Kolter, DC3: A learning method for optimization with hard constraints, International Conference on Learning Representations (ICLR) (2021).
[14] Y. Nandwani, A. Pathak, P. Singla, et al., A primal dual formulation for deep learning with constraints, in: Advances in Neural Information Processing Systems, 2019, pp. 12157–12168.
[15] H. Kervadec, J. Dolz, J. Yuan, C. Desrosiers, E. Granger, I. B. Ayed, Constrained deep networks: Lagrangian optimization via log-barrier extensions, arXiv preprint arXiv:1904.04205 (2019).
[16] N. Andrei, Penalty and augmented Lagrangian methods, in: Continuous Nonlinear Optimization for Engineering Applications in GAMS Technology, Springer, 2017, pp. 185–201.
[17] R. Glowinski, P. Le Tallec, Augmented Lagrangian and operator-splitting methods in nonlinear mechanics, SIAM, 1989.
[18] C. Jin, P. Netrapalli, M. I. Jordan, Minmax optimization: Stable limit points of gradient descent ascent are locally optimal, arXiv preprint arXiv:1902.00618 (2019).
[19] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, D. Silver, Meta-gradient reinforcement learning with an objective discovered online, arXiv preprint arXiv:2007.08433 (2020).
[20] L. Kirsch, S. van Steenkiste, J. Schmidhuber, Improving generalization in meta reinforcement learning using learned objectives, in: International Conference on Learning Representations, 2019.
[21] S. Ritter, J. Wang, Z. Kurth-Nelson, S. Jayakumar, C. Blundell, R. Pascanu, M. Botvinick, Been there, done that: Meta-learning with episodic recall, in: International Conference on Machine Learning, PMLR, 2018, pp. 4354–4363.
[22] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1126–1135.
[23] M. Ren, W. Zeng, B. Yang, R. Urtasun, Learning to reweight examples for robust deep learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 4334–4343.
[24] R. S. Sutton, Adapting bias by gradient descent: An incremental version of delta-bar-delta, in: AAAI, San Jose, CA, 1992, pp. 171–176.
[25] Z. Xu, H. van Hasselt, D. Silver, Meta-gradient reinforcement learning, arXiv preprint arXiv:1805.09801 (2018).
[26] A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, arXiv preprint arXiv:1803.02999 (2018).
[27] R. Sarikaya, The technology behind personal digital assistants: An overview of the system architecture and key components, IEEE Signal Processing Magazine 34 (2017) 67–81.
[28] S. Park, H. Li, A. Patel, S. Mudgal, S. Lee, Y.-B. Kim, S. Matsoukas, R. Sarikaya, A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems, arXiv preprint arXiv:2010.12251 (2020).
[29] M. Kachuee, H. Yuan, Y.-B. Kim, S. Lee, Self-supervised contrastive learning for efficient user satisfaction prediction in conversational agents, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4053–4064.
[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
A. Appendix

A.1. Constraint Benchmarks

Figure 5 presents the definition of the constraint benchmarks used in this paper: global, critical, and explore. The global benchmark sets a general minimum replication rate for all domains. The critical benchmark defines a tighter minimum replication rate for three business-critical domains (home automation, shopping, and notifications) and a more relaxed default case for all other domains. In the explore benchmark, we extend the critical benchmark to include exploration encouragement for the knowledge and music domains.

Figure 5: The constraint benchmarks used in this paper: (a) global, (b) critical, and (c) explore.

A.2. Selected Hyperparameters

Table 2 shows the final selected hyperparameters for each benchmark and method. The definition of each hyperparameter is presented in Algorithms 1 and 2.

    Method      Hyperparameter   global   critical   explore
    Quadratic   w                10       1000       1000
    Minimax     η                0.1      0.1        1
                γ                1        0.999      1
    Meta-Grad   λ                1        1          1

Table 2: The selected hyperparameters for each benchmark and method.