<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constrained Policy Optimization for Controlled Contextual Bandit Exploration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Kachuee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sungjin Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon Alexa AI</institution>
          ,
          <addr-line>Bellevue, WA 98004</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Contextual bandits are widely used across the industry in many applications such as search engines, dialogue systems, recommendation systems, etc. In such applications, it is often necessary to update the policy regularly as the data distribution changes and new features are being on-boarded frequently. As any new policy deployment directly impacts the user experience, safety in model updates is an important consideration in real-world bandit learning. In this study, we introduce a scalable framework for policy update safety via user-defined constraints, supporting fine-grained exploration targets for individual domains. For example, in a digital voice assistant, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning method that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments using data from a real-world conversational AI system on a set of realistic constraint benchmarks. Based on the experimental results, we demonstrate that the proposed approach is capable of achieving the best balance between the policy value and constraint satisfaction rate.</p>
      </abstract>
      <kwd-group>
        <kwd>contextual bandit</kwd>
        <kwd>constrained optimization</kwd>
        <kwd>safe exploration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Contextual bandits are important machine learning problems used in many real-world applications such as search engines, advertisement systems, conversational systems, etc. In a contextual bandit problem, the agent is tasked to select the action that achieves the highest reward given a context. In this setting, the environment is assumed to have no memory, and the reward is a stochastic function of the current context, independent of past agent-environment interactions [1, 2, 3].</p>
      <p>For real-world bandits, ensuring safe policy updates is a crucial consideration. Although there have been a number of prior studies on safe bandit updates in an online learning setting, online bandit learning is not suitable for business-critical production systems where an updated policy must go through extensive testing to ensure reliable performance on critical use cases before it gets deployed. This calls for an approach to safe policy updates in the offline bandit learning setting.</p>
      <p>In a production system, it is crucial to not only estimate but also control the changes of behavior a new policy introduces when compared to the current production policy. In the literature, this problem has been studied under safe bandit updates [4, 5, 6] and budgeted bandit learning [7, 8], usually targeting exploration budgets or encouraging a behavior resembling a baseline policy.</p>
      <p>In the context of off-policy bandit updates, we define exploration as any change in the model behavior resulting from replacing a current production policy with a new updated policy. This definition is broad and encloses stochastic exploration actions as well as any behavior change when comparing the two consecutive policies. Furthermore, we consider the scenario in which samples are naturally classified into a set of domains, each representing a unique data segment. For example, when dealing with product reviews, the product category (e.g., books, computers, etc.) would be the domain for each review. In many real-world scenarios, it is desirable to have fine-grained control on the rate of exploration for each domain. For example, in a conversational system, we may want to explore business-critical domains such as shopping more conservatively, while aggressively exploring a domain such as entertainment. Note that solely relying on a method such as Thompson sampling [9] or epsilon-greedy [10] would neither be effective to meet the requirements for individual domains nor optimal to balance between the desired fine-grained exploration and achieving the highest possible reward.</p>
      <p>While previous studies considered different aspects of constraining a bandit model, to the best of our knowledge the problem of controlling off-policy bandit behavior changes across subsequent model updates, with fine-grained control on budgets for different data segments (domains), remains unaddressed. This study is the first to tackle the aforementioned issues by providing a scalable and practical approach. The main contributions of this paper are as follows:</p>
      <p>• Introducing a formulation for controlled exploration in the context of off-policy contextual bandit learning, considering fine-grained control over domains based on user-defined constraints.
• Presenting a solution based on the primal-dual minimax constraint optimization method that is effective but requires adjusting a few hyperparameters.
• Proposing a novel meta-gradient algorithm to balance constraint satisfaction and reward maximization that works out-of-the-box and outperforms other methods with no need for hyperparameter adjustment.
• Conducting extensive experiments on the skill routing problem in a real-world conversational AI agent using a set of realistic constraint benchmarks.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <sec id="sec-1b-1">
        <title>2.1. Constrained Bandit Learning</title>
        <p>The majority of studies on controlled bandit learning consider the case of simple multi-armed stochastic bandits (i.e., without context) with practical applications in experiment design [8] and automated machine learning [7]. Hoffman et al. [7] suggested a Bayesian approach to two-phase exploration-exploitation bandit learning in which there is a pre-specified budget for exploration arm evaluations. Another aspect is to ensure safe exploration actions, which is especially useful for sensitive applications in industrial machine learning or healthcare. Amani et al. [6] introduced a solution in which an initial set of exploration actions is defined, then the exploration set is gradually expanded to ensure minimal unexpected behavior.</p>
        <p>For contextual bandits, safety has been an active research topic. Safety can be defined in the action space or in terms of model updates. For example, Daulton et al. [5] solve a two-metric setting in which one of the metrics, reward, is being maximized while enforcing a limit for regression on an auxiliary metric compared to a baseline status quo model. Balakrishnan et al. [11] attempt to learn behavioral constraints by balancing between replication of the current baseline policy and making new actions that show promising rewards. In [4], the authors define safety in terms of user experience metrics and suggest deciding on deploying a new model based on conservative confidence bounds on the off-policy estimates of such metrics.</p>
      </sec>
      <sec id="sec-1b-2">
        <title>2.2. Constrained Optimization</title>
        <p>The general problem of constrained optimization has been studied extensively in the literature. The type of constraints and the optimization problem are important aspects in the design of such algorithms. In the context of constrained optimization of differentiable function approximators (e.g., neural networks), the constraints can be defined on the parameters [12], on the outputs [13], or based on predefined functions of the model [14, 15].</p>
        <p>Most relevant to this paper, penalty methods translate constraints into penalty terms used to encourage constraint satisfaction in the optimization process. The quadratic penalty method adds the weighted square of violation metrics and adjusts the weight either by heuristics or by solving a sequence of problems with a monotonically increasing penalty weight. The augmented Lagrangian method leverages Lagrange multipliers to mitigate the issue of ill-conditioning inherent in the quadratic penalty method [16, 17]. More applicable to neural network models, Nandwani et al. [14] suggested a scalable approach to defining constraints using hinge functions and then solving the dual minimax problem. Theoretically, given an infinite number of training iterations and a decaying update rate for the max variables, it is known that the minimax objective is guaranteed to converge to a local min-max point [18]. Nonetheless, such a guarantee does not apply to real-world settings in which a non-convex problem is being solved in finite time using limited compute resources.</p>
      </sec>
      <sec id="sec-1b-3">
        <title>2.3. Meta Learning</title>
        <p>In the literature, meta-learning is defined as learning to learn, in which the model training itself is considered an inner optimization process guided by an outer meta objective. It is a broad topic spanning multiple areas of research including reward function discovery in reinforcement learning [19, 20], life-long continual learning [21], few-shot learning [22], and transfer learning [23], to name a few.</p>
        <p>Most relevant to this work is the meta-gradient idea first suggested by Sutton [24]. In meta-gradient learning, we use online cross-validation on a meta objective to compute derivatives with respect to meta parameters through a differentiable inner optimization loop. Xu et al. [25] demonstrated the effectiveness of this meta-gradient idea in deep reinforcement learning to adapt the value function as the agent is interacting with the environment. However, in practice, the compute/memory requirement for computing high-order gradients is a major challenge for meta-reinforcement learning, leading to low-order gradient approximations [26, 25]. Also, while meta-gradient learning enables adaptive adjustment of hyperparameters, it still requires a carefully designed meta objective that provides the best learning signal.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Constrained Bandit Exploration</title>
      <sec id="sec-2-1">
        <title>3.1. Problem Formulation</title>
        <p>We consider the general framework of off-policy contextual bandits in which a policy Π is used to select an action a ∈ A given the observed context vector x to maximize the scalar reward r received from the environment. Here, we assume stochastic policies of the form Π_θ(a|x) in which a model parameterized by θ (e.g., a neural network) is used to assign action selection probabilities to each action given the context. Furthermore, we assume that each sample belongs to a domain denoted by d ∈ {1 ... K} that is provided as a feature in x.</p>
        <sec id="sec-2-1-1">
          <title>In the of-policy setting, the policy is refreshed after</title>
          <p>collecting a dataset of samples from the current policy.
We adopt a definition of exploration which considers any
change in the agent behavior compared to the current
policy as an exploration action. Alternatively, we can
consider replication with respect to the current policy
as the rate at which the new policy makes similar
decisions to the current policy when both evaluated and
sampled stochastically. We define replication for Π  with
respect to Π 0 based on the L1-distance of their action
propensities given a context x:
ℛ (x) = 1
−
|Π  (x) − Π 0(x)|1 .</p>
          <p>2
trol the rate at which the new policy replicates the current
policy for each domain. This ensures safe model updates
for critical domains while enabling exploration for others
that may benefit from an extra exploration budget.
Accordingly, we define constraints to encourage the desired
behavior for samples of each domain, while learning an
of-policy bandit:
where context, action, reward, and domain (x, , , ) are
sampled from a dataset collected from the current policy.
In (2), we use  and  to indicate user-defined
replication constraints for domain .</p>
          <p>Π can be any diferentiable of-policy bandit
learning objective, for simplicity of discussion, we consider
the vanilla inverse propensity scoring (IPS) objective:
Π (x, , ) = −  Π  (|x)
Π 0(|x)
where Π 0 is the current policy and  is the observed
reward for taking action  collected in the dataset.</p>
          <p>arg min Ex,,,∼ D Π ,
.. 
≤ ℛ  (x) ≤ 
(2)
factors.</p>
          <p />
          <p>u,v
min max Ex,,,∼ D[Π (x, , )+
u max(0, 

−ℛ  (x))+v max(0, ℛ (x)− 
)].</p>
        </sec>
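        <p>As a concrete illustration of (1), (3), and (4), the sketch below (ours, not part of the paper; it assumes action-probability tensors for the new and the logged policy and hypothetical per-domain tensors lo, hi, u, v of length K) computes the replication rate, the IPS term, and the exponentiated hinge penalties for a mini-batch in PyTorch:</p>
        <preformat>
import torch

def penalized_ips_loss(p_new, p_old, actions, rewards, domains, lo, hi, u, v):
    """Sketch of objective (4). p_new/p_old are [batch, n_actions] action
    probabilities from the new and the current policy; lo, hi, u, v are
    per-domain tensors of shape [K]; domains holds each sample's domain id."""
    # Replication rate of (1): one minus half the L1 distance between policies.
    repl = 1.0 - 0.5 * (p_new - p_old).abs().sum(dim=-1)

    # Vanilla IPS objective of (3) for the logged action and reward.
    idx = torch.arange(actions.shape[0])
    ips = -rewards * p_new[idx, actions] / p_old[idx, actions]

    # Per-sample bounds and exponentiated penalty weights, looked up by domain.
    l_d, h_d = lo[domains], hi[domains]
    w_lo, w_hi = torch.exp(u[domains]), torch.exp(v[domains])

    # Hinge penalties of (4): non-zero only when replication leaves [l_d, h_d].
    penalty = w_lo * torch.clamp(l_d - repl, min=0.0) \
            + w_hi * torch.clamp(repl - h_d, min=0.0)
    return (ips + penalty).mean()
        </preformat>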
        <sec id="sec-2-1-2">
          <title>Algorithm 1 shows an outline of the policy training using the minimax method. This method has four hyperparameters controlling the max player optimization via adjusting the update frequency, learning rate, and decay</title>
          <p>Intuitively, the min player is trying to update the policy
parameters while the max player is increasingly
penalizing it for any constraint violation. A stable point of this
algorithm would be to gradually reduce the max player
update rate as the min player is getting better at
satisfying the constraints, eventually satisfying all constraints
resulting in a zero loss for the max player due to the zero
hinge penalty terms.
(3)</p>
        </sec>
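        <p>A minimal PyTorch sketch of this loop is shown below (ours, not the authors' implementation; it reuses the hypothetical penalized_ips_loss helper sketched in Section 3.1 and assumes a data loader that yields logged tuples together with the current policy's propensities):</p>
        <preformat>
import torch

def train_minimax(policy, loader, lo, hi, eta, gamma, omega, nu, n_domains):
    """Sketch of Algorithm 1 under the assumptions stated in the text."""
    u = torch.zeros(n_domains, requires_grad=True)
    v = torch.zeros(n_domains, requires_grad=True)
    opt = torch.optim.Adam(policy.parameters())
    step = 0
    for x, a, r, d, p_old in loader:
        p_new = policy(x)
        loss = penalized_ips_loss(p_new, p_old, a, r, d, lo, hi, u, v)
        if step % max(int(omega), 1) == 0:
            # Max player: gradient ascent on the penalty variables u and v.
            gu, gv = torch.autograd.grad(loss, [u, v], retain_graph=True)
            with torch.no_grad():
                u += eta * gu
                v += eta * gv
            eta = eta * gamma    # decay the max player learning rate
            omega = omega * nu   # decay schedule for the max update frequency
        # Min player: gradient step on the policy parameters.
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
    return u, v
        </preformat>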
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Meta Gradient Method</title>
        <sec id="sec-2-2-1">
          <title>Theoretically, the primal-dual minimax method is capable of achieving Pareto optimal solutions [18, 14]. However, in practice, it is infeasible to train for an infinite</title>
          <p>Algorithm 1: Minimax constrained bandit
optimization algorithm
input : D (dataset),  (max learning rate),  (max
learning rate decay),  (max update
frequency),  (max update frequency
decay)
u, v,  ←
Initialize(Π  )</p>
          <p>0
for x, , ,  ∼  do
/* use loss function of (5)
*/
 ←
if % is 0 then
(x, , , , , u, v)
/* gradient ascent, max player
/* lr/update decay, max player
/* optimize Π  , min player
*/
/* increment iteration counter */
number of iterations, and therefore approximate inner
optimization loops are being used. To find the right balance
between constraint satisfaction and policy improvement
for the minimax algorithm, it is necessary to carefully
adjust multiple hyperparameters. Note that an extensive
hyperparameter search is undesirable in many real-world
large-scale scenarios that require frequent model updates
as it entails not only significant compute costs associated
with the search but also increases the turn-around time
to deploy refreshed models. To mitigate this issue, we
suggest a meta-gradient optimization idea that adapts
u and v based on a meta objective within the training
process.</p>
          <p>Specifically, we define the following meta objective:
 = Ex,,,∼ D [(1 −  )Π (x, , ) +
max(0, 
 ←  × 
 ←  × 
end
 ←  (,</p>
          <p>∇ )
 ←  + 1
end
*/
*/
*/
*/
*/
*/
*/
*/</p>
          <p>Note that (6) is not directly dependent on u and v,
instead we rely on online cross-validation [24, 25] to
update these variables. We define an inner objective the
same as the min optimization problem of (5), do a
diferentiable optimization step, evaluate the meta objective
on another batch of data, then update u and v by taking
the derivative of the meta objective through the inner
optimization trace.</p>
          <p>Algorithm 2 presents an outline of the meta gradient
optimization method. Due to practical issues of dealing
with high-order gradients, we only consider the
immediate impact of a single inner loop update on the meta
objective. We found that discarding the vanilla
gradient descent used for the inner optimization and using
a more advanced optimizer (e.g., Adam) to update Π 
works best. Regarding the  hyperparameter, we found
that simply setting</p>
          <p>= 1 works well in practice. It
effectively means that the meta-gradient solution does not
require any hyperparameter adjustments (experimental
evidence presented in Section 4.4).</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Algorithm 2: Meta gradient constrained bandit</title>
          <p>optimization algorithm
input : D (dataset),  (learning rate),  (penalty
weight)
u, v ←
Initialize(Π  )</p>
          <p>0
for x, , ,  ∼ D and x′, ′, ′, ′ ∼ D do
/* clone policy parameters</p>
          <p>( )
 ′ ←
 ←
model
 ←
/* compute inner loss with  ′</p>
          <p>(x, , , ,  ′, u, v)
/* gradient descent on cloned
 ′ ←</p>
          <p>′ −  ∇ ′ 
/* compute meta loss</p>
          <p>(x′, ′, ′, ′,  ′,  )
/* diff. through inner update
Compute ∇u and ∇v
/* optimize u, v using any
optimizer
u ←  (u, ∇u)
v ←  (v, ∇v)
 ←
/* compute inner loss with</p>
          <p>(x, , , , , u, v)
/* optimize Π  using any
u/v. Then, the meta objective computes a validation loss
that measures the impact of the inner policy update and
u/v on the macro-averaged constraint violations. Finally,
by computing the meta-gradient of the meta objective
through the inner optimization loop, u and v are
updated to better encourage the constraint satisfaction for
the next policy update iteration.</p>
        </sec>
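        <p>The sketch below (ours; it assumes a functional policy interface policy_fn(params, x), the penalized_ips_loss helper sketched in Section 3.1, and pre-built optimizers for the policy parameters and for (u, v); macro-averaging of the violation terms over domains is omitted for brevity) illustrates one such meta-gradient step:</p>
        <preformat>
import torch

def meta_gradient_step(params, policy_fn, batch, val_batch,
                       u, v, lo, hi, alpha, lam, opt_theta, opt_uv):
    """Sketch of one iteration of Algorithm 2; params is a dict of tensors
    with requires_grad=True and policy_fn(params, x) returns action probs."""
    x, a, r, d, p_old = batch
    xv, av, rv, dv, pv_old = val_batch

    # Inner loss with the current u, v; keep the graph for second-order terms.
    inner = penalized_ips_loss(policy_fn(params, x), p_old, a, r, d, lo, hi, u, v)
    grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
    theta_prime = {k: p - alpha * g
                   for (k, p), g in zip(params.items(), grads)}

    # Meta objective (6) on a held-out batch, evaluated with the updated copy.
    p_val = policy_fn(theta_prime, xv)
    repl = 1.0 - 0.5 * (p_val - pv_old).abs().sum(dim=-1)
    idx = torch.arange(av.shape[0])
    ips = -rv * p_val[idx, av] / pv_old[idx, av]
    violation = torch.clamp(lo[dv] - repl, min=0.0) + torch.clamp(repl - hi[dv], min=0.0)
    meta = ((1.0 - lam) * ips + lam * violation).mean()

    # Differentiate the meta objective through the inner update to get
    # gradients w.r.t. u and v, then update them with any optimizer.
    opt_uv.zero_grad()
    meta.backward()
    opt_uv.step()

    # Finally, update the actual policy parameters using the refreshed u, v.
    loss = penalized_ips_loss(policy_fn(params, x), p_old, a, r, d,
                              lo, hi, u.detach(), v.detach())
    opt_theta.zero_grad()
    loss.backward()
    opt_theta.step()
        </preformat>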
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-0">
        <title>4.1. Setup</title>
        <p>We conduct experiments on a bandit agent for the problem of skill routing in commercial conversational systems such as Apple Siri, Amazon Alexa, Google Assistant, and Microsoft Cortana. In these systems, skill routing is a key component that takes the user's utterance as well as signals from automated speech recognition (ASR) and natural language understanding (NLU), and then decides which skill and NLU interpretation should be used to serve the request [27].</p>
        <p>The skill routing problem in a commercial dialogue agent is a natural fit for this study as any policy change directly impacts the user experience, making safe and controlled policy updates crucial. Additionally, constraints can be defined based on the NLU domain to ensure policy safety for business-critical domains such as shopping, while potentially exploring others such as entertainment more aggressively.</p>
        <p>Figure 1 shows an overview of the model architecture used in our experiments. Input to the model is a set of routing candidates, i.e., a combination of embedded ASR, NLU, and context vectors as well as skill embeddings. The output is the softmax-normalized propensity of selecting each candidate to handle the user request. The final model has about 12M trainable parameters consisting of a language model to encode the utterance, embeddings for contextual signals, and fully-connected layers. As our architecture closely follows the design choices from Park et al. [28], we refer readers to that paper for details.</p>
        <p>Figure 1: An overview of the network architecture: a set of hypotheses are encoded as vectors and fed to a bi-directional LSTM, which is followed by a shared MLP and a softmax layer to normalize the candidate selection probabilities.</p>
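        <p>For illustration only, a schematic sketch of such a candidate-scoring network is shown below (ours; the input encoding, feature set, and layer sizes are assumptions and do not reflect the production model, for which we refer to [28]):</p>
        <preformat>
import torch
import torch.nn as nn

class SkillRoutingPolicy(nn.Module):
    """Schematic sketch: each routing candidate is encoded as a vector, the
    candidate set is contextualized with a bi-directional LSTM, scored by a
    shared MLP, and normalized with a softmax over the candidates."""

    def __init__(self, cand_dim, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(cand_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, candidates):
        # candidates: [batch, n_candidates, cand_dim] embedded ASR/NLU/context
        # and skill features for each routing hypothesis.
        h, _ = self.lstm(candidates)
        scores = self.mlp(h).squeeze(-1)      # [batch, n_candidates]
        return torch.softmax(scores, dim=-1)  # candidate selection propensities
        </preformat>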
        <p>To train and evaluate our models, we use logged policy actions from a current production policy. The observed reward is based on a curated function of user satisfaction metrics (refer to [29] for an example). Our dataset consists of about 40M samples divided into 85% training, 10% validation, and 5% test sets covering 27 domains that are imbalanced in the number of samples. All data used in this work was deidentified to comply with our customer privacy guidelines.</p>
      </sec>
      <sec id="sec-3-1">
        <title>4.2. Benchmarks</title>
        <sec id="sec-3-1-1">
          <title>In our experiments, we use three diferent benchmarks for the constraint settings: global, critical, and exploration. The global benchmark aims to constrain the new policy to be within an exploration limit</title>
        </sec>
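        <p>For concreteness, a constraint list of this form could be represented as follows (an illustrative sketch with made-up domains and replication ranges, not the actual configurations, which are given in the appendix):</p>
        <preformat>
# Illustrative only: each entry carries a short description, the domain it
# applies to, and the desired replication range [l_d, h_d] used in (2).
EXPLORE_BENCHMARK = [
    {"description": "global minimum replication", "domain": "*",
     "replication_range": (0.95, 1.00)},
    {"description": "tighter bound for a critical domain", "domain": "shopping",
     "replication_range": (0.99, 1.00)},
    {"description": "encourage exploration", "domain": "music",
     "replication_range": (0.90, 0.95)},
]
        </preformat>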
      </sec>
      <sec id="sec-3-2">
        <title>4.3. Baselines and Metrics</title>
        <p>As the first baseline, we consider the vanilla IPS objective which disregards the constraints. Additionally, we build on the IPS baseline to consider the constraints using constraint optimization approaches: quadratic (uniform constant penalty weight), minimax (Algorithm 1), and meta-gradient (Algorithm 2). We use the Adam [30] optimizer to train all models. For the quadratic penalty method, we sweep the penalty weight over {0.1, 1, 10, 100, 1000}. For the minimax method (Algorithm 1), we found that setting the max update frequency and its decay to one works well, leaving the max learning rate η and its decay γ to be adjusted. For the meta-gradient method (Algorithm 2), setting λ = 1 outperforms other settings (see evidence in Section 4.4); as a result, it does not require adjusting any hyperparameter and the same setting is used across all benchmarks. We provide the hyperparameters used for each benchmark and method in the appendix.</p>
        <p>Regarding the evaluation metrics, we use the expected off-policy reward as well as the rate of constraint violations averaged over all samples (micro-averaged) and over individual domains (macro-averaged). To comply with our privacy and business guidelines, in all instances, we only report relative and normalized results which do not directly represent the actual scales or metric values.</p>
        <p>We train each model until convergence or for at most 32 epochs and take the best-performing model based on the macro-averaged violation rate. Each experiment was run four times using different random seeds for data sampling and weight initialization to report the mean and standard deviation of each result. We used a cluster of 32 NVIDIA V100 GPUs to process a mini-batch size of 32K samples. Each individual run took between 4 and 24 hours.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Results</title>
        <p>Table 1 shows a comparison of results for the IPS, quadratic, minimax, and meta-gradient methods on all benchmarks. For each case, we report the expected reward and the percentage of reduction in the rate of violations compared to the simple IPS objective. The meta-gradient approach consistently shows the best results across all benchmarks. The simple quadratic method behaves very competitively to minimax, except for the explore benchmark which requires more fine-grained control on multiple constraints (see Figure 2). The meta-gradient method, while having the highest reduction in constraint violations, also has very competitive performance in terms of the reward metric.</p>
        <p>Table 1: Expected reward and constraint violation reduction for the IPS, Quadratic, Minimax, and MetaGrad methods on all benchmarks.</p>
        <p>To study the impact of hyperparameters, we conducted an experiment on the critical benchmark by training minimax and meta-gradient based models using different hyperparameter values. Specifically, we train minimax models (Algorithm 1) using η ∈ {1.0, 0.1, 0.01} and γ ∈ {1.0, 0.999, 0.995}. For the meta-gradient method (Algorithm 2), we use λ ∈ {0.01, 0.05, 0.1, 0.5, 0.75, 0.95, 1.0}. Figure 3 shows the results of this experiment. Based on this experiment, the minimax approach shows a much higher sensitivity to its two hyperparameters, with a significant impact on both the reward and violation reduction metrics. However, the meta-gradient method shows much less sensitivity to the λ hyperparameter. We found that simply setting λ = 1 works very well in practice. This is very desirable for real-world large-scale settings such as conversational systems, which require frequent model updates as new features are on-boarded every day, and where a dependency on an extensive hyperparameter search is very costly, if not impractical.</p>
        <p>Figure 3: Comparing the hyperparameter sensitivity for the minimax and meta-gradient methods on the critical benchmark. For the minimax method: (a) reward and (b) macro violation reduction wrt. different η and γ settings. For the meta-gradient method: (c) reward and (d) macro violation reduction wrt. different λ settings.</p>
        <p>To dive deeper into the reason behind the better performance of the meta-gradient algorithm compared to the minimax approach, we investigated the constraint penalty weight value for the first 3,000 iterations of training using the global benchmark. From Figure 4, we can see that the minimax method monotonically increases the penalty weight with each iteration, which is a behavior consistent with the gradient ascent update rule in Algorithm 1. In other words, as long as there are any constraint violations, minimax will keep increasing the penalty, which in our opinion is the reason for the high sensitivity to the hyperparameters. On the other hand, the meta-gradient approach uses a validation signal to dynamically adjust the penalty weight. Consequently, it may keep the penalty term near zero for an initial phase, rapidly increase it, and then decay it when violations are reduced and getting a higher reward is preferred.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>This work studied the problem of controlled exploration for off-policy contextual bandit learning. We presented a constrained optimization formulation that enables a human expert to define the boundary of the desired exploration rate for individual domains. We proposed a scalable and practical solution based on meta-gradient learning which provides the highest constraint satisfaction rates without any need for an extensive hyperparameter adjustment. Finally, we conducted experiments using data from a real-world conversational system for the skill routing problem on a set of different realistic constraint benchmarks. Based on the experimental results, we believe that the suggested controlled bandit learning approach is very promising for application in real-world bandits in which frequent safe policy updates are of paramount importance.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix</title>
      <sec id="sec-4-1">
        <title>A.1. Constraint Benchmarks</title>
        <p>This appendix lists the three constraint benchmarks used in this paper: global, critical, and explore.</p>
        <p>The global benchmark sets a general minimum replication rate for all domains. The critical benchmark defines a tighter minimum replication rate for three business-critical domains (home automation, shopping, and notifications) and a more relaxed default case for all other domains. In the explore benchmark, we extend the critical benchmark to include exploration encouragement for the knowledge and music domains.</p>
        <p>Constraint configuration lists: (a) global benchmark, (b) critical benchmark, (c) explore benchmark.</p>
      </sec>
      <sec id="sec-4-2">
        <title>A.2. Selected Hyperparameters</title>
        <p>This appendix provides the selected hyperparameters for each benchmark and method. The definition of each hyperparameter is presented in Algorithms 1 and 2.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref12"><mixed-citation>[12] …, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 657–664.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] P. L. Donti, D. Rolnick, J. Z. Kolter, DC3: A learning method for optimization with hard constraints, International Conference on Learning Representations (ICLR) (2021).</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] Y. Nandwani, A. Pathak, P. Singla, et al., A primal dual formulation for deep learning with constraints, in: Advances in Neural Information Processing Systems, 2019, pp. 12157–12168.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] H. Kervadec, J. Dolz, J. Yuan, C. Desrosiers, E. Granger, I. B. Ayed, Constrained deep networks: Lagrangian optimization via log-barrier extensions, arXiv preprint arXiv:1904.04205 (2019).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] N. Andrei, Penalty and augmented Lagrangian methods, in: Continuous Nonlinear Optimization for Engineering Applications in GAMS Technology, Springer, 2017, pp. 185–201.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] R. Glowinski, P. Le Tallec, Augmented Lagrangian and operator-splitting methods in nonlinear mechanics, SIAM, 1989.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] C. Jin, P. Netrapalli, M. I. Jordan, Minmax optimization: Stable limit points of gradient descent ascent are locally optimal, arXiv preprint arXiv:1902.00618 (2019).</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] Z. Xu, H. van Hasselt, M. Hessel, J. Oh, S. Singh, D. Silver, Meta-gradient reinforcement learning with an objective discovered online, arXiv preprint arXiv:2007.08433 (2020).</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] L. Kirsch, S. van Steenkiste, J. Schmidhuber, Improving generalization in meta reinforcement learning using learned objectives, in: International Conference on Learning Representations, 2019.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] S. Ritter, J. Wang, Z. Kurth-Nelson, S. Jayakumar, C. Blundell, R. Pascanu, M. Botvinick, Been there, done that: Meta-learning with episodic recall, in: International Conference on Machine Learning, PMLR, 2018, pp. 4354–4363.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 1126–1135.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] M. Ren, W. Zeng, B. Yang, R. Urtasun, Learning to reweight examples for robust deep learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 4334–4343.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] R. S. Sutton, Adapting bias by gradient descent: An incremental version of delta-bar-delta, in: AAAI, San Jose, CA, 1992, pp. 171–176.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] Z. Xu, H. van Hasselt, D. Silver, Meta-gradient reinforcement learning, arXiv preprint arXiv:1805.09801 (2018).</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Nichol, J. Achiam, J. Schulman, On first-order meta-learning algorithms, arXiv preprint arXiv:1803.02999 (2018).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] R. Sarikaya, The technology behind personal digital assistants: An overview of the system architecture and key components, IEEE Signal Processing Magazine 34 (2017) 67–81.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] S. Park, H. Li, A. Patel, S. Mudgal, S. Lee, Y.-B. Kim, S. Matsoukas, R. Sarikaya, A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systems, arXiv preprint arXiv:2010.12251 (2020).</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] M. Kachuee, H. Yuan, Y.-B. Kim, S. Lee, Self-supervised contrastive learning for efficient user satisfaction prediction in conversational agents, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4053–4064.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</mixed-citation></ref>
    </ref-list>
  </back>
</article>