<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Nonparametric Contextual Bandit with Arm-level Eligibility Control for Customer Service Routing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruofeng Wen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenjun Zeng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Customer Engagement Technology</institution>
          ,
          <addr-line>Amazon</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Amazon Customer Service (CS) provides real-time support for millions of customer contacts every year. While the bot resolver helps automate some traffic, we still see high demand for human agents, also called subject matter experts (SMEs). Customers reach out with questions in different domains (return policy, device troubleshooting, etc.). Depending on their training, not all SMEs are eligible to handle all contacts. Routing contacts to eligible SMEs turns out to be a non-trivial problem because SMEs' domain eligibility is subject to training quality and can change over time. To optimally recommend SMEs while simultaneously learning the true eligibility status, we propose to formulate the routing problem with a nonparametric contextual bandit algorithm (K-Boot) plus an eligibility control (EC) algorithm. K-Boot models the reward with a kernel smoother on similar past samples selected by k-NN, and uses Bootstrap Thompson Sampling for exploration. EC filters arms (SMEs) by the initially system-claimed eligibility and dynamically validates the reliability of this information. The proposed K-Boot is a general bandit algorithm, and EC is applicable to other bandits. Our simulation studies show that K-Boot performs on par with state-of-the-art bandit models, and that EC boosts K-Boot performance when a stochastic eligibility signal exists.</p>
      </abstract>
      <kwd-group>
<kwd>Bandit</kwd>
        <kwd>Customer Service Routing</kwd>
        <kwd>Arm Eligibility</kwd>
        <kwd>Nonparametric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>In Amazon Customer Service (CS), we dispatch human agents, also called subject matter experts (SMEs), in real time to handle millions of customer contacts. The SME routing automation process has two steps: first, a natural language understanding (NLU) model processes the customer's input and identifies the relevant domain (return policy, device troubleshooting, etc.); then an eligible SME is dispatched. We define an SME as eligible when they have mastered the required skill in the relevant domain through training. SME routing turns out to be a non-trivial problem for four reasons. First, the NLU model is unlikely to categorize the domain with perfect accuracy. Second, the reliability of SME eligibility as identified by the operations team is subject to training program quality and readiness.</p>
<p>Third, the domain taxonomy and SMEs' eligibility status can change in a decentralized way. In reality, it is difficult to keep track of all the eligibility updates while assuming their correctness. Finally, eligible SMEs do not guarantee customer satisfaction, leading to noisy feedback signals.</p>
      <p>Figure 1: The CS Routing Problem. The goal of the recommender is to select an agent that will result in the best outcome of this customer contact.</p>
      <p>All these uncertainties make SME routing a challenging
problem.</p>
<p>To tackle the complexity, we formulate CS routing as a recommendation problem and use a contextual Bandit as the solution. Bandit, a framework for sequential decision making, has been used for online recommendation across companies such as Amazon, Google, Netflix and Yahoo [<xref ref-type="bibr" rid="ref1">1</xref>]. In a contextual bandit, the decision maker sequentially chooses an arm/action (an SME in our case) based on available contextual information (contact transcript embedding, customer profile, etc.), and observes a reward signal (customer satisfaction, contact transfer, etc.) [2]. The objective is to recommend arms at each step to maximize the expected cumulative reward over time. We apply bandits to the routing problem because they can optimally explore uncertainties and adapt to dynamic changes [3]. In particular, the uncertainties in SME eligibility motivate us to formulate a new type of bandit that utilizes arm-level eligibility signals.</p>
<p>Our contribution. We propose K-Boot, a nonparametric contextual bandit algorithm to model reward and explore, and an Eligibility Controller (EC) algorithm to model arm-level eligibility. K-Boot uses a k-nearest neighbors (k-NN) approach to find similar samples from the past, applies a Nadaraya–Watson kernel regression among the found samples to estimate the reward, and adopts Bootstrap Thompson Sampling as the exploration strategy. The EC component implements a dynamic top-m arm filter based on the estimated Spearman's Rank Correlation Coefficient between eligibility and reward. We are interested in this end-to-end nonparametric setup for practicality: robustness in performance, strong interpretability, and simple non-technical maintenance that is friendly to business partners. For instance, the k-NN component makes it easy to investigate which historical contacts support an SME recommendation, and then deploy instant online fixes by trimming unwanted outliers. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits, such as LinUCB [4].</p>
<p>In the remainder of the paper, we review related literature in Section 2, set up the formal problem in Section 3 and detail the proposed algorithms in Section 4. We then present our model results in Section 5, and conclude the paper last.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
<p>In most bandit applications, the items to recommend are products and webpage content (widgets, news articles, etc.). Recently, we have seen research efforts in using bandits for routing recommendation in a conversation [5, 6]. In [6], they used a contextual bandit to utilize query and user features along with the conversational context, and built an orchestrator to route the dialog. In [5], they built bandit models to assign solutions to customers' queries about Microsoft products in a chatbot. This is the work most relevant to our problem. However, they assume an upstream model that provides a set of plausible actions to start with, and the bandit itself does not deal with the uncertainty in arm eligibility. Overall, compared to the extensive bandit literature on product and webpage content recommendation, there are few bandit works for routing recommendation.</p>
<p>Thompson Sampling (TS) is a heuristic for balancing exploration and exploitation in bandits [7]. In the parametric manner, TS is implemented by drawing the reward function parameter θ from a posterior distribution, calculating rewards with the sampled parameters, and selecting the arm that maximizes the estimated reward function. θ is often assumed to follow Conjugate Exponential Family distributions, resulting in closed-form posteriors. There is also work where variational methods are used to provide an analytical approximation to the posterior and enforce tractability [8]. Bootstrap TS introduces volatility in a different way: it pulls the arm with the highest bootstrap mean, estimated from the reward history in a nonparametric way [9]. One drawback of the approach is the computational cost of training one bandit model per bootstrapped training set, which is difficult for large-scale real-time problems. [10] proposed a general nonparametric bandit algorithm, but fell back to the parametric reward function approach when there is context. [11] further improved the exploration efficiency with the Bayesian Bootstrap, but also only for non-contextual bandits. We pair Bootstrap TS with a single k-NN for nonparametric contextual exploration.</p>
      <p>Kernel-based Bandits use a nonparametric reward estimator via kernel functions, combined with an exploration policy like TS or UCB. Gaussian Processes (GP) have wide applications in the bandit domain, e.g. GP-UCB [12] and GP-TS [13]. However, GP inference is expensive: the standard approach has O(N^3) complexity, where N is the number of training samples. The most efficient sparse or low-rank GP approaches still require O(NM^2), where M is a hyperparameter indicating the number of inducing variables - see [14] for a review. We use k-NN to enable O(k log N) inference time.</p>
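      <p>As a reference point for the parametric flavor of TS described above, a minimal Beta-Bernoulli sketch (non-contextual and purely illustrative; it is not the algorithm proposed in this paper):</p>
      <preformat>
import numpy as np

def beta_bernoulli_ts_step(successes, failures, rng=None):
    """One parametric Thompson Sampling step for K Bernoulli arms:
    draw a reward-probability parameter per arm from its Beta posterior
    and pull the arm with the largest draw."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.beta(1.0 + np.asarray(successes), 1.0 + np.asarray(failures))
    return int(np.argmax(theta))
      </preformat>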
    </sec>
    <sec id="sec-3">
      <title>3. Formulation: Contextual Bandit with Arm Eligibility</title>
      <p>We define two types of eligibility setup with different data interfaces. (1) Eligibility Scores: for a given context x, a list of eligibility scores for the arms, e = (e_1, e_2, . . . , e_K), is observed, where e_k ∈ [0, 1] is the score for arm k. A larger score means higher confidence in the arm being eligible for the current sample/iteration. An intuitive interpretation is that the arms are bidding for themselves. This score setup assumes (the owner of) each arm is able to adaptively assess the dynamic environment in a distributed and independent manner that can be unknown to the bandit. Typical examples include ads bidding and voice assistants' skill detection - each ad/skill may be managed via the same API, but by different clients under dynamic competition. (2) Eligibility States: assume the system at any time can be classified into one of S possible states. Each arm claims whether it is eligible under a state, represented by a constant binary vector c_k of length S for arm k. If the s-th element of c_k is 1, arm k claims to be eligible under the s-th state, and 0 otherwise. Given a context x, a stochastic distribution over the S states is observed, represented by a probability vector p of length S. For CS routing, the state is the type of customer issue, the claim is the SME's corresponding skill set, and p is the NLU predictive distribution. Note that a score can be computed as the inner product of the claimed eligibility binary vector and the state probability vector: e_k := c_k⊺p, so the state setup converts to the score setup. The major practical difference between the two interfaces is the amount of effort on the claiming side: the state interface simplifies the eligibility claims to static binary votes on a finite set of states, while the score interface mandates scoring with context awareness.</p>
      <p>Consider a contextual bandit with a set of arms {1, 2, . . . , K}. At the t-th round, a context x_t is observed and an eligibility score vector e_t is either observed (score interface) or derived (state interface). After an arm a_t is pulled, a reward r_t ∈ [0, 1] is revealed. The goal is to minimize the expected regret over T rounds: ∑_{t=1}^{T} (E[r_t | a_t*, x_t, e_t] − E[r_t | a_t, x_t, e_t]), where a_t* is the optimal action for this round in hindsight.</p>
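      <p>To make the two interfaces concrete, the following minimal sketch (with illustrative names, not taken from any released code) derives scores from state claims via e_k := c_k⊺p and accumulates the regret defined above:</p>
      <preformat>
import numpy as np

def scores_from_states(claims: np.ndarray, state_probs: np.ndarray) -> np.ndarray:
    """claims: (K, S) binary matrix; claims[k, s] = 1 if arm k claims eligibility in state s.
    state_probs: length-S probability vector p for the current context.
    Returns the length-K score vector with e_k = c_k^T p."""
    return claims @ state_probs

def cumulative_regret(best_expected_rewards, pulled_expected_rewards):
    """Expected regret over T rounds: sum of the per-round gap between the
    best action in hindsight and the pulled action."""
    best = np.asarray(best_expected_rewards, dtype=float)
    pulled = np.asarray(pulled_expected_rewards, dtype=float)
    return float(np.sum(best - pulled))
      </preformat>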
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The proposed algorithms run in order: the Eligibility Controller (EC) filters out ineligible arms, and K-Boot finds the optimal arm within the rest. Below, we first introduce K-Boot to set the base, and then EC.</p>
      <sec id="sec-4-1">
        <title>4.1. K-Boot</title>
        <p>K-Boot is a nonparametric bandit model. For a given context x, it estimates the reward for an arm as the Nadaraya–Watson kernel weighted average over the observed rewards of the k nearest neighbors of x among the historical samples where that arm was pulled. We use Bootstrap TS as the exploration strategy. k-NN regression is known to fit well with Bootstrap Aggregation, with an analytical form of the smoothed conditional mean estimate that does not require actual resampling [15, 16]. However, sampling from the confidence distribution of the mean estimate has seen little discussion. We devise a trick to shrink the resampling range from all historical samples to a local wrapper around the k-NN.</p>
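        <p>For reference, a small sketch of the reward estimate at the core of K-Boot: a Nadaraya–Watson weighted average over the k nearest historical samples of one arm. The Gaussian kernel and fixed bandwidth are illustrative assumptions; the paper only requires some kernel h(·, ·) with a bandwidth.</p>
        <preformat>
import numpy as np

def gaussian_kernel(dist, bandwidth):
    # Illustrative kernel choice; any kernel h(., .) with a bandwidth would do.
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

def nadaraya_watson_knn(x, X_hist, r_hist, k=50, bandwidth=1.0):
    """Estimate the expected reward at context x for one arm, as the kernel-weighted
    average of the rewards of its k nearest historical samples."""
    X_hist = np.asarray(X_hist, dtype=float)
    r_hist = np.asarray(r_hist, dtype=float)
    dists = np.linalg.norm(X_hist - x, axis=1)
    nn = np.argsort(dists)[: min(k, len(r_hist))]   # indices of the k nearest neighbors
    w = gaussian_kernel(dists[nn], bandwidth) + 1e-12
    return float(np.dot(w, r_hist[nn]) / np.sum(w))
        </preformat>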
        <p>K-Boot is detailed in Algorithm 1. At iteration t with context x_t, for each arm, let D be the set of historical samples where that arm was pulled, and n = |D|.</p>
        <p>If n is zero, we sample the reward from a standard uniform distribution (Line 6). If n is greater than k, we first build an influential subset D′ containing the k′-NN of x_t (k &lt; k′). Intuitively, the k′-NN serves as a buffer that covers enough sample variance around the k-NN, so that a Bootstrap on the influential samples is a good approximation of a Bootstrap on all samples. Formally, a sample is considered influential for making inference about x_t if its probability of being in the k-NN of x_t after a Bootstrap on D is greater than ε. This probability is computed analytically on Line 9, based on the derived equation (6) in [15]. Here the tolerance hyperparameter ε controls the risk of missing influential samples, and thus the fidelity of the approximated Bootstrap TS. In our experiments with n ≤ 10^4 and ε = 0.01, k′ is empirically well bounded by 2k. This process shrinks all later computation to k′ samples, with an overhead neighbor-search cost of O(log n) per sample (we used Hnswlib [17] for approximate search). If n is less than k, D itself is the influential set. We then add two pseudo-samples to D′ on Lines 14-15, in order to expand the reward's empirical range for Bootstrap exploration [10]. To give the pseudo-samples proper contexts and thus kernel weights, we set their distance to x_t as that of a random observation in D′, so their weight shrinks as more data are seen. On Lines 16-18, we then draw a Bootstrap resample and select the k nearest samples from it to calculate the estimated reward for the arm as the kernel weighted average. We implement h(·, ·) as the simplest Nadaraya–Watson kernel estimator with automatic bandwidth selection [20] - more advanced models can be plugged in.</p>
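        <p>The sketch below mirrors one K-Boot scoring pass for a single arm under simplifying assumptions: brute-force neighbor search instead of Hnswlib, a fixed buffer size k′ = 2k instead of the exact beta-function bound, and a Gaussian kernel with fixed bandwidth. It illustrates the mechanics rather than reproducing the authors' implementation.</p>
        <preformat>
import numpy as np

def kboot_arm_sample(x, X_hist, r_hist, k=50, bandwidth=1.0, rng=None):
    """One Bootstrap-Thompson-Sampling reward draw for a single arm, given the
    arm's historical contexts X_hist (n x d) and rewards r_hist (possibly empty)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(r_hist)
    if n == 0:                                    # cold start: uniform reward draw
        return float(rng.uniform(0.0, 1.0))

    X_hist = np.asarray(X_hist, dtype=float)
    r_hist = np.asarray(r_hist, dtype=float)
    dists = np.linalg.norm(X_hist - x, axis=1)

    # Influential subset: in the paper its size k' comes from an exact beta-function
    # bound; 2k is used here as a simple stand-in consistent with the reported 2k cap.
    k_buf = min(n, 2 * k)
    idx = np.argsort(dists)[:k_buf]
    d_inf, r_inf = dists[idx], r_hist[idx]

    # Two pseudo-samples with rewards 0 and 1 expand the empirical range for exploration;
    # their distance to x is copied from a random influential observation.
    j = rng.integers(len(d_inf))
    d_inf = np.concatenate([d_inf, [d_inf[j], d_inf[j]]])
    r_inf = np.concatenate([r_inf, [0.0, 1.0]])

    # Bootstrap resample, then Nadaraya-Watson average over its k nearest members.
    boot = rng.integers(len(r_inf), size=len(r_inf))
    d_b, r_b = d_inf[boot], r_inf[boot]
    nn = np.argsort(d_b)[: min(k, len(d_b))]
    w = np.exp(-0.5 * (d_b[nn] / bandwidth) ** 2) + 1e-12
    return float(np.dot(w, r_b[nn]) / np.sum(w))
        </preformat>
        <p>At each round, one such draw is taken per (eligible) arm and the arm with the highest draw is pulled, which is the Bootstrap TS step.</p>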
        <p>As we model the reward for each arm separately, as is common practice, K-Boot is flexible to the addition and removal of arms. This benefits applications like ours with decentralized and independent arm management. In addition, K-Boot has simple maintenance: (1) the nonparametric model makes minimal assumptions about the data and has only a single important hyperparameter ε to balance accuracy and computation; ε = 0.01 works well across our experiments. (2) The algorithm is business friendly in terms of decision interpretability through neighboring examples, and it allows intuitive, instant online modification of model behavior. For example, CS business owners may hide contacts with undesired trending patterns during a specific period (e.g. under a legacy policy), or let newly available SMEs "inherit" past example contacts or eligibility claims (next section) via domain knowledge (e.g. a rehired agent or additional training programs).</p>
        <p>Changes in data visibility will instantly and predictably affect the bandit model in production, with clear attribution. This data-driven yet human-in-the-loop fungibility is essential to a fast-paced operational business like customer service.</p>
        <preformat>
Algorithm 1: K-Boot, a fully nonparametric contextual bandit
Input: number of iterations T, number of arms K, number of nearest neighbors k,
       kernel function with bandwidth h(·, ·), regularized incomplete beta
       function I(·), approximation tolerance ε
for t = 1, . . . , T do
    observe context x_t
    for each arm with history D (n = |D|) do
        if n = 0 then draw the reward estimate from Uniform(0, 1)
        else
            if n > k then
                k′ := the smallest k′ such that the bootstrap k-NN of x_t is covered
                      by the top k′ neighbors with probability greater than 1 − ε
                      (computed analytically with I(·); equation (6) in [15])
                find influential neighbors D′ := the top k′ samples in D with the largest h(x_t, ·)
            else set D′ := D
            draw a random sample (x⋆, r⋆) from D′
            add pseudo-samples: D′ := D′ ∪ {(x⋆, 0), (x⋆, 1)}
            draw a Bootstrap sample D⋆ from D′
            find neighbors N := the top min(k, |D⋆|) samples in D⋆ with the largest h(x_t, ·)
            estimate reward: r̂ := ∑_i r_i h(x_t, x_i) / ∑_i h(x_t, x_i), summing over N
    pull arm a⋆ := arg max of the reward estimates, and observe the true reward r_t ∈ [0, 1]
    update D_{a⋆} := D_{a⋆} ∪ {(x_t, r_t)} and n_{a⋆} := n_{a⋆} + 1
        </preformat>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Eligibility Control</title>
        <p>EC is used to recognize arm eligibility and the associated uncertainties, for optimized bandit exploration. In an ordinary contextual bandit model, the eligibility information can be trivially treated as part of the context x: it is assumed to be positively correlated with the reward r, and thus predictive as a plain input to a reward estimator. This assumes the reward model is able to eventually learn the relation between e and r after enough rounds. This trivial approach does not directly use eligibility to limit the range of arm exploration, and may unnecessarily explore inappropriate actions with catastrophic regret during cold starts. The other extreme is to strictly follow the eligibility information and ignore the reward feedback: for the eligibility score interface, this could mean simply pulling the arm with the highest score as if it were an oracle; for the eligibility state interface, this could mean finding the state with the highest probability and then removing arms that did not claim that state. In that case, the empirical data would never be used to validate and correct the potentially biased business logic - a common pitfall in practice.</p>
        <p>The main idea of EC is to keep only the top-m arms with the highest eligibility scores before applying a normal bandit, and to adjust m periodically based on the empirical correlation between eligibility scores and rewards. We use Spearman's Rank Correlation Coefficient, denoted as ρ. Using the top-m arms is a heuristic trade-off between m = K (the trivial case; K is the total number of arms) and m = 1 (the strict rule case). Intuitively, in the case of perfect correlation between score and reward (ρ = 1), m = 1 is the oracle solution. If the eligibility score has zero or even negative correlation (ρ ≤ 0) with the reward, m = K is the optimal solution - otherwise the arm exploration is restricted adversarially or randomly for no benefit. In the no-correlation case, the probability that "the best arm by reward is not in the top-m by score", defined as a leak, is linear in m: P(leak | m, ρ = 0) = 1 − m/K. It is obvious that P(leak | m, 0 &lt; ρ &lt; 1) &lt; P(leak | m, ρ = 0). This observation reveals two insights: (1) α := P(leak | m, ρ) characterizes a type of risk in applying a top-m score filter to arms; (2) when controlling the risk of a leak at a certain level, e.g. α = 0.01, m is a function of ρ. The true correlation ρ between reward r and eligibility score e is unknown, but it can be estimated from historical observations {(r_t, e_t)}. Therefore, EC essentially calculates m such that the risk P(leak | m, ρ̂) is controlled at a given level.</p>
        <p>To pair with K-Boot, EC also models P(leak | m, ρ) in a nonparametric fashion to avoid extra assumptions. For the estimation of ρ, although Spearman's rank correlation coefficient is used, Kendall's τ or any other rank-based correlation measure applies similarly. Across the arms and rounds, we assume the rank of e is a noisy perturbation of the rank of r, and parameterize this perturbation as "performing v random neighbor inversions". An inversion is simply switching two elements in a sequence, and a neighbor inversion switches two neighboring elements. Note that v = 0 implies ρ = 1, because the perturbed rank sequence is identical to the original one; v → ∞ is effectively a random permutation and thus ρ = 0. This setup provides a generative process to simulate the joint distribution of all parameters for a rank sequence of length K: (1) do v random neighbor inversions on the sorted rank sequence (1, · · · , K); (2) see if there is a leak, namely whether element 1 has moved beyond the top-m positions; (3) compute ρ̂ between the original and the current sequences. The three steps can be replicated a sufficient number of times to obtain the final correlation coefficient ρ̂ and risk level α̂, by averaging the individual ρ̂ and counting the frequency of leaks. Since this process does not depend on any actual data, it can be performed offline in batch for different combinations of (K, m, v) to get the corresponding (ρ̂, α̂). The resulting empirical dictionary (K, α, ρ̂) → m is stored and queried during online inference. To avoid unnecessary control in the edge case where the eligibility score turns out to be pure noise, m is reset to K (no EC) if the dictionary outputs an m that is greater than the trivial value (1 − α)K. Finally, note that the generative distribution of random neighbor inversions is assumed to be uniform along the whole sequence, yielding the unbiasedness of the estimator ρ̂ even if the observed data points {(r_t, e_t)} are censored by the top-m filter policy.</p>
        <preformat>
Algorithm 2: Eligibility Control
Input: number of iterations T, number of arms K, risk level α, initial score threshold m_0,
       a pre-computed empirical dictionary D: (K, α, ρ) → m
1  initialize threshold m := m_0 and observation pool B := ∅
2  for t = 1, . . . , T do
3      observe context x_t and eligibility scores {e_1, · · · , e_K}, one for each arm
4      find the m-th largest element e_[m] among the scores
5      sample a reward estimate for each arm k with e_k ≥ e_[m] (Algorithm 1, Lines 5-18)
6      pull the arm a⋆ with the highest reward estimate (Algorithm 1, Lines 21-22)
7      observe the true reward r_t, and update B := B ∪ {(r_t, e_{a⋆})}
8      (periodically) compute ρ̂ from B and set m := D(K, α, ρ̂)
9      set m := K if m ≥ (1 − α)K
10 end
        </preformat>
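        <p>A sketch of the offline simulation that produces the empirical dictionary, assuming SciPy's spearmanr; the inversion grid and replication count are illustrative choices rather than values from the paper:</p>
        <preformat>
import numpy as np
from scipy.stats import spearmanr

def simulate_rank_perturbation(K, v, n_rep=2000, rng=None):
    """Apply v random neighbor inversions to the rank sequence (1..K), n_rep times.
    Returns the average Spearman rho between original and perturbed sequences, and
    the empirical distribution of the best arm's position after perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    base = np.arange(1, K + 1)                 # 1 = best arm by reward
    rhos, best_pos = [], []
    for _ in range(n_rep):
        seq = base.copy()
        for _ in range(v):                     # one random neighbor inversion per step
            i = rng.integers(K - 1)
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
        rhos.append(spearmanr(base, seq)[0])
        best_pos.append(int(np.where(seq == 1)[0][0]))   # 0-indexed position of the best arm
    return float(np.mean(rhos)), np.asarray(best_pos)

def build_dictionary(K, alphas, v_grid=range(0, 200, 5), n_rep=2000):
    """Offline table (K, alpha, rho_hat) -> m: the smallest top-m filter whose
    estimated leak risk P(best arm outside top-m) is at most alpha."""
    table = {}
    for v in v_grid:
        rho_hat, best_pos = simulate_rank_perturbation(K, v, n_rep)
        leak_risk = [float(np.mean(best_pos >= m)) for m in range(1, K + 1)]
        for alpha in alphas:
            m_star = next((m for m, a in enumerate(leak_risk, start=1) if a &lt;= alpha), K)
            table[(K, alpha, round(rho_hat, 2))] = m_star
    return table
        </preformat>
        <p>At run time, EC estimates ρ̂ from the observed pool, looks up m in this table for the chosen α, and falls back to m = K whenever the looked-up m exceeds (1 − α)K.</p>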
        <p>The full online EC process is described in Algorithm 2 (excluding the offline dictionary generation). The eligibility state interface is converted into the score interface before EC, so the scores are the only required inputs. The algorithm takes K-Boot (Algorithm 1) as the bandit counterpart just for demonstration, and applies to any bandit that has arm-independent reward models. EC has a single hyperparameter: the controlled risk of a leak, α. Note that α is the risk of missing the best arm, so its influence on cumulative rewards depends on the arm reward distribution, specifically the gap between the first- and second-best arms. We leave other types of risk control mechanisms with distributional assumptions to future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>In this section, we evaluate the performance of K-Boot against two benchmarks: LinUCB [4] and NeuralUCB [21]. The experiment setup mostly follows the methodology reported in [21], with 4 synthetic simulations and 4 real-world datasets. EC is tested on synthetic data with an eligibility score setup, then on an Amazon Customer Service routing dataset. The datasets are introduced below.</p>
      <p>Synthetic Datasets. The bandit has K = 10 arms and runs for T = 5000 samples/rounds. It observes a context with d = 20 dimensions, each following an independent N(0, 1). Four types of true reward functions are tested: (1) linear: h_0(x) = 0.1 x⊺a; (2) quadratic: h_1(x) = 0.01 (x⊺a)^2; (3) inner-product: h_2(x) = 0.05 x⊺A⊺Ax; (4) cosine: h_3(x) = cos(3 x⊺a); where each entry of the parameters A ∈ R^{d×d} and a ∈ R^d is drawn from an independent N(0, 1), different for each arm. The observed reward has noise: r := h(x) + η, where η ∼ N(0, σ^2) and σ is drawn from U(0.01, 0.5) independently for each arm. 20 runs are replicated by random seeds.</p>
      <p>Synthetic Datasets with Eligibility. Given the above, the eligibility scores are further simulated by perturbing the counterfactual reward of each arm, e := λr + η′, to induce the correlation. Here λ is an arm-specific scaling coefficient drawn from U(−0.1, 1). It has a small chance of leaving e and r negatively correlated, tolerating eligibility claim errors in systems under the overall positive correlation - the only assumption of EC. The noise term η′ ∼ N(0, σ′^2) controls the correlation effect size. By setting a proper σ′, the Spearman's rank correlation coefficient ρ between reward and eligibility score can be fixed at an arbitrary positive value.</p>
      <p>UCI Classification. Four real-world datasets from the UCI Machine Learning Repository [22] are gathered: covertype, magic, statlog, and mnist. These multi-class classification datasets are converted into multi-arm contextual bandit problems, following the method in [4]: each class is treated as an arm, and the reward is 1 if the bandit pulls the correct arm (identifies the correct class) and 0 otherwise. Each dataset is shuffled and split into 10 mutually exclusive runs with 5000 samples each, fed to the bandit one per round (magic has fewer than 50000 samples, so resampling is performed).</p>
      <p>CS Routing with Eligibility. Agent skill-level routing is considered: CS agents are assigned to different skill groups by operations teams, and the routing system determines which skill is required to resolve a given customer contact based on the customer profile, policy rules and natural language understanding. As a proof of concept before any online experiments, we formulate an offline bandit dataset with historical observations. About 2 million Amazon US customer service text chats with bot-to-agent routing are gathered. To introduce stochasticity, a historical routing policy emulator is trained on a 95% random subset of the chats, taking the customer-bot transcripts before the routing action as input to predict which skill the system chose. Specifically, the emulator consists of a pretrained GPT-2 transformer encoder [23], with a multi-class sequence classifier head predicting the top 20 skill groups. The model is trained (fine-tuned) for one epoch with learning rate 10^-4 and batch size 32. For the bandit simulation, the final actual resolving skill, after all the potential agent-to-agent transfers since the initial routing, serves as the ground truth: if the routed skill agrees with the final skill, the action is considered accurate and as having avoided a potential transfer. At each round, the bandit receives a 256-dimensional text embedding from the same GPT-2 encoder as the observed context, and a 20-dimensional probabilistic prediction from the emulator as the eligibility scores for each arm/skill. If the bandit chooses a skill matching the final one, the reward is 1 and otherwise 0. In this case, the cumulative regret is an upper bound on the number of transferred contacts, because agents assigned to different skills may resolve the contact regardless (e.g. digital order agents are often equipped to resolve retail order issues, so they may not transfer an incorrectly routed contact). The remaining 5% of the chats are used to generate 10 bandit runs with 8000 samples each. From another perspective, this offline bandit-with-eligibility simulation setup is equivalent to "using a bandit for online improvement of a black-box probabilistic classifier with unknown accuracy". We observe that the empirical ρ across the runs is 0.3492. We leave the following to future work: (1) online experiments to measure real transfers and positive customer reviews as the reward metric; (2) agent-level routing given agent profiles and finer-grained eligibility definitions than skill groups.</p>
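      <p>Before the benchmarks, a minimal sketch of the synthetic environment described above, shown for the linear reward case h_0; the per-arm noise scale and the eligibility-score perturbation follow the stated distributions, and all function names are illustrative:</p>
      <preformat>
import numpy as np

def make_synthetic_bandit(K=10, T=5000, d=20, seed=0):
    """Contexts x_t ~ N(0, I_d); per-arm linear reward h_0(x) = 0.1 x^T a_k plus
    Gaussian noise with an arm-specific sigma_k ~ U(0.01, 0.5)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, d))                 # contexts
    a = rng.standard_normal((K, d))                 # arm-specific parameters
    sigma = rng.uniform(0.01, 0.5, size=K)          # arm-specific noise scales
    mean_reward = 0.1 * X @ a.T                     # (T, K) matrix of h_0 values
    noise = rng.standard_normal((T, K)) * sigma
    return X, mean_reward, mean_reward + noise

def make_eligibility_scores(rewards, noise_sd=0.1, seed=0):
    """Eligibility scores for the EC experiments: a scaled counterfactual reward plus
    noise, e_{t,k} = lambda_k * r_{t,k} + eta', with lambda_k ~ U(-0.1, 1)."""
    rng = np.random.default_rng(seed)
    T, K = rewards.shape
    lam = rng.uniform(-0.1, 1.0, size=K)
    return rewards * lam + rng.normal(0.0, noise_sd, size=(T, K))
      </preformat>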
      <sec id="sec-5-1">
        <title>5.1. K-Boot Benchmarks</title>
        <p>We compare K-Boot with LinUCB and NeuralUCB on both the synthetic and the UCI classification datasets. With the 4 synthetic datasets, we first select the following hyperparameters for each Bandit algorithm of interest: (1) the number of nearest neighbors k ∈ {20, 50, 100} for K-Boot; (2) the exploration weight parameter α ∈ {0.1, 1.0, 10} for LinUCB; (3) the regularization parameter λ ∈ {0.1, 0.01, 0.001} and exploration parameter ν ∈ {0.2, 0.02, 0.002} for NeuralUCB - other hyperparameters such as the learning rate, batch settings and number of layers are set the same as in the original code base (https://github.com/uclaml/NeuralUCB). We settle on k = 100, α = 10, (λ, ν) = (0.001, 0.002), per having the lowest cumulative regret averaged across h_0 - h_3 and 10 replicated runs, and carry these values over to compare the actual model performance on the 4 UCI datasets. We believe this setup is more realistic because most bandit applications do not have the luxury of intensive hyperparameter tuning, due to small sample sizes or non-stationary environments. Finally, all experiments in this paper implement online, single-sample model updates. Figure 2 (left) shows the performance of the models with the final selected hyperparameters. Except for the linear reward scenario, where LinUCB/NeuralUCB has the advantage of a correct/close model specification, there is no significant performance difference between K-Boot and NeuralUCB, while LinUCB is inferior. Figure 2 (right) shows the performance comparison on the UCI real-world datasets. K-Boot and NeuralUCB tie, winning on two datasets each, and LinUCB is consistently the worst. By examining the 80% confidence band (P10-P90 across 10 runs), we find that NeuralUCB has larger performance volatility (wider bands) on real data than the other two models, due to learning instability in certain runs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Eligibility Control Testing</title>
        <p>For the synthetic data, we control σ′ to simulate different levels of reward-reflecting reliability in the eligibility scores, with resulting ρ(r, e) ∈ {0.05, 0.15, 0.3, 0.45, 0.6, 0.75} for each true reward function h_0 - h_3. In the algorithm, we set different risk levels α in EC. Note that α = 0 is effectively not using EC, and we abuse the notation α → 1 to denote the strict top-1 policy: pull the arm with the highest eligibility score. Figure 3 shows the results of applying EC to K-Boot, with the eligibility scores used by EC only (explained later). When the signal in e is weak (low ρ), EC adaptively raises the top-m threshold close to K, achieving the same performance as no EC. As ρ grows, the advantage reveals itself: using EC is consistently better than not. The dynamic thresholding may dominate the top-1 policy even under strong eligibility signals, with the right α. EC performs best with α = 0.5 across all synthetic datasets, and the same value is carried over to the next experiment on the CS Routing data.</p>
        <p>EC can be applied to other bandits. Figure 4 shows the same synthetic-data results but with EC + LinUCB. Here the eligibility scores are also added to the bandit context. The reason why the previous experiment did not take scores as K-Boot input is that k-NN has no native feature selection mechanism: if the eligibility score is close to white noise (low ρ), the model struggles to fit the reward, which distracts from the other patterns present. For LinUCB, such side effects are minimal since the simulated noise is Gaussian, so it is a better condition to test the difference between the trivial way versus the EC way of utilizing eligibility. EC still dominates no-EC in most cases. There are a few exceptions for the linear true reward function h_0, where α = 0.5 is worse because LinUCB learns so well that taking a high risk of missing the best arm is not cost-effective. An interesting observation is that, while the strict top-1 rule was dominated by K-Boot + EC (Figure 3), it actually beats LinUCB + EC for the nonlinear true reward functions even at ρ = 0.15. This indicates that the power of the bandit model is still a key driver for ML to out-perform static rules.</p>
        <p>Figure 5 shows the result on the CS Routing data, where EC significantly improves routing accuracy. Routing to the skill with the highest probability from the emulator serves as a stochastic surrogate of the existing policy. The pure K-Boot result, without knowledge of the emulator outputs, is set as a baseline. With only 8000 samples, the average accuracy of the emulator policy is improved by K-Boot + EC from 53.37% to 57.78%.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We proposed a nonparametric contextual Bandit algorithm, K-Boot, with arm-level eligibility control (EC) for routing customer contacts to eligible SMEs in real time. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits. We compared K-Boot with LinUCB and NeuralUCB. Looking at average regret performance over simulation runs, K-Boot and NeuralUCB had a tie in winning scenarios and were comparable in terms of robustness to different data distributions; both worked better than LinUCB. However, we observed larger performance variability for NeuralUCB as opposed to K-Boot. We further found that the EC component improved K-Boot's performance on both the synthetic datasets and a simulated CS routing scenario when an eligibility signal exists.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] Y. Liu, L. Li, A map of bandits for e-commerce, arXiv preprint arXiv:2107.00680 (2021).</p>
      <p>[3] T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.</p>
      <p>[4] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661–670.</p>
      <p>[5] S. Sajeev, J. Huang, N. Karampatziakis, M. Hall, S. Kochman, W. Chen, Contextual bandit applications in a customer support bot, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining, 2021, pp. 3522–3530.</p>
      <p>[6] S. Upadhyay, M. Agarwal, D. Bounnefouf, Y. Khazaeni, A bandit approach to posterior dialog orchestration under a budget, arXiv preprint arXiv:1906.09384 (2019).</p>
      <p>[7] O. Chapelle, L. Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems 24 (2011).</p>
      <p>[8] T. Graepel, J. Quiñonero Candela, T. Borchert, R. Herbrich, Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine, in: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track, 2010.</p>
      <p>[9] I. Osband, B. Van Roy, Bootstrapped Thompson sampling and deep exploration, arXiv preprint arXiv:1507.00300 (2015).</p>
      <p>[10] B. Kveton, C. Szepesvari, S. Vaswani, Z. Wen, T. Lattimore, M. Ghavamzadeh, Garbage in, reward out: Bootstrapping exploration in multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2019, pp. 3601–3610.</p>
      <p>[11] C. Riou, J. Honda, Bandit algorithms based on Thompson sampling for bounded reward distributions, in: Algorithmic Learning Theory, PMLR, 2020, pp. 777–826.</p>
      <p>[12] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, arXiv preprint arXiv:0912.3995 (2009).</p>
      <p>[13] S. R. Chowdhury, A. Gopalan, On kernelized multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 844–853.</p>
      <p>[14] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835 (2013).</p>
      <p>[15] B. M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74 (2009) 235–255.</p>
      <p>[16] G. Biau, F. Cérou, A. Guyader, On the rate of convergence of the bagged nearest neighbor estimate, Journal of Machine Learning Research 11 (2010).</p>
      <p>[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.</p>
      <p>[18] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.</p>
      <p>[19] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: International Conference on Machine Learning, PMLR, 2020, pp. 3887–3896.</p>
      <p>[20] B. W. Silverman, Density estimation for statistics and data analysis, Routledge, 2018.</p>
      <p>[21] D. Zhou, L. Li, Q. Gu, NeuralUCB: Contextual bandits with neural network-based exploration (2019).</p>
      <p>[22] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.</p>
      <p>[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Elena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Milos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Eugene</surname>
          </string-name>
          ,
          <article-title>Survey of multiarmed bandit algorithms applied to recommendation systems</article-title>
          ,
          <source>International Journal of Open Information Technologies</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>12</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>