A Nonparametric Contextual Bandit with Arm-level
Eligibility Control for Customer Service Routing
Ruofeng Wen1,*, Wenjun Zeng1 and Yi Liu1
1 Customer Engagement Technology, Amazon


                                          Abstract
Amazon Customer Service (CS) provides real-time support for millions of customer contacts every year. While bot resolvers help automate some traffic, we still see high demand for human agents, also called subject matter experts (SMEs). Customers reach out with questions in different domains (return policy, device troubleshooting, etc.). Depending on their training, not all SMEs are eligible to handle all contacts. Routing contacts to eligible SMEs turns out to be a non-trivial problem because SMEs' domain eligibility is subject to training quality and can change over time. To optimally recommend SMEs while simultaneously learning the true eligibility status, we propose to formulate the routing problem with a nonparametric contextual bandit algorithm (K-Boot) plus an eligibility control (EC) algorithm. K-Boot models reward with a kernel smoother on similar past samples selected by k-NN, and uses Bootstrap Thompson Sampling for exploration. EC filters arms (SMEs) by the initially system-claimed eligibility and dynamically validates the reliability of this information. The proposed K-Boot is a general bandit algorithm, and EC is applicable to other bandits. Our simulation studies show that K-Boot performs on par with state-of-the-art bandit models, and that EC boosts K-Boot performance when a stochastic eligibility signal exists.

Keywords
Bandit, Customer Service Routing, Arm Eligibility, Nonparametric



1. Introduction

In Amazon Customer Service (CS), we dispatch human agents, also called subject matter experts (SMEs), in real-time to handle millions of customer contacts. The SME routing automation process has two steps: first, a natural language understanding (NLU) model processes the customer's input and identifies the relevant domain (return policy, device troubleshooting, etc.); then it dispatches an SME who is eligible. We define an SME as eligible when they have mastered the required skill in the relevant domain through training. SME routing turns out to be a non-trivial problem for four reasons. First, the NLU model is unlikely to categorize the domain with perfect accuracy. Second, the reliability of SME eligibility as identified by the operations team is subject to training program quality and readiness. Third, the domain taxonomy and SMEs' eligibility status can change in a decentralized way; in reality, it is difficult to keep track of all the eligibility updates, assuming correctness. Finally, eligible SMEs do not guarantee customer satisfaction, leading to noisy feedback signals. All these uncertainties make SME routing a challenging problem.

Figure 1: The CS Routing Problem. The goal of the recommender is to select an agent that will result in the best outcome of this customer contact.
To tackle the complexity, we formulate CS routing as a recommendation problem and use contextual Bandit as a solution. Bandit, a framework for sequential decision making, has been used for online recommendation across companies such as Amazon, Google, Netflix and Yahoo [1]. In a contextual bandit, the decision maker sequentially chooses an arm/action (an SME in our case) based on available contextual information (contact transcript embedding, customer profile, etc.), and observes a reward signal (customer satisfaction, contact transfer, etc.) [2]. The objective is to recommend arms at each step to maximize the expected cumulative reward over time. We apply bandits to the routing problem because they can optimally explore uncertainties and adapt to dynamic changes [3]. In particular, the uncertainties in SME eligibility motivate us to formulate a new type of bandit that utilizes arm-level eligibility signals.

Our contribution. We propose K-Boot, a nonparametric contextual bandit algorithm to model reward and explore, and an Eligibility Controller (EC) algorithm to model arm-level eligibility. K-Boot uses a k-nearest neighbors (k-NN) approach to find similar samples from the past, applies a Nadaraya-Watson kernel regression among the found samples to estimate the reward, and adopts Bootstrap Thompson Sampling as the exploration strategy. The EC component implements a dynamic top-arm filter based on the estimated Spearman's Rank Correlation Coefficient between eligibility and reward. We are interested in this end-to-end nonparametric setup for practicality: robust performance, strong interpretability and simple non-technical maintenance that is friendly to business partners. For instance, the k-NN component makes it easy to investigate which historical contacts support an SME recommendation, and then deploy instant online fixes by trimming unwanted outliers. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits, such as LinUCB [4].

In the remainder of the paper, we review related literature in Section 2, set up the formal problem in Section 3 and detail the proposed algorithms in Section 4. We then present our model results in Section 5, and conclude the paper last.

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18-23, 2022, Seattle, WA, USA.
* Corresponding author.
ruofeng@amazon.com (R. Wen); zengwenj@amazon.com (W. Zeng); yiam@amazon.com (Y. Liu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
2. Related Work

In most bandit applications, the items to recommend are products and webpage content (widgets, news articles, etc.). Recently, we see research efforts in using bandits for routing recommendation in a conversation [5, 6]. In [6], they used a contextual bandit to utilize query and user features along with the conversational context, and built an orchestration to route the dialog. In [5], they built bandit models to assign solutions to a chatbot, given customers' queries on Microsoft's products. This is the most relevant work to our problem. However, they assume an upstream model to provide a set of plausible actions to start with, and the bandit itself does not deal with the uncertainty in arm eligibility. Overall, compared to the extensive bandit literature on product and webpage content recommendation, few bandit works exist for routing recommendation.

Thompson Sampling (TS) is a heuristic for balancing exploration and exploitation in bandits [7]. In the parametric manner, TS is implemented by drawing the reward function parameter θ from a posterior distribution, calculating rewards with the sampled parameters, and selecting the arm that maximizes the estimated reward function. θ is often assumed to follow Conjugate-Exponential Family distributions, resulting in closed-form posteriors. There is also work where variational methods are used to provide an analytical approximation to the posterior and enforce tractability [8]. Bootstrap TS introduces volatility in a different way: it pulls the arm with the highest bootstrap mean, estimated from reward history in a nonparametric way [9]. One drawback of the approach is the computational cost of training one bandit model per Bootstrapped training set, which is difficult for large-scale real-time problems. [10] proposed a general nonparametric bandit algorithm but fell back to the parametric reward function approach when there is context. [11] further improved the exploration efficiency with the Bayesian Bootstrap, but also only for non-contextual bandits. We pair Bootstrap TS with a single k-NN for nonparametric contextual exploration.

Kernel-based Bandits use a nonparametric reward estimator via kernel functions, combined with an exploration policy like TS or UCB. Gaussian Processes (GP) have wide applications in the bandit domain, e.g. GP-UCB [12] and GP-TS [13]. However, GP inference is expensive - the standard approach has O(N^3) complexity, where N is the number of training samples. The most efficient sparse or low-rank GP approach still requires O(N D^2), where D is a hyperparameter indicating the number of inducing variables - see [14] for a review. We use k-NN to enable O(k log N) inference time.

3. Formulation: Contextual Bandit with Arm Eligibility

We define two types of eligibility setup with different data interfaces. (1) Eligibility Scores: for a given context x, a list of eligibility scores e = (e_1, e_2, ...) is observed, where e_j ∈ [0, 1] is for arm a_j. A larger score means higher confidence in the arm being eligible for the current sample/iteration. An intuitive interpretation is that the arms are bidding for themselves. This score setup assumes (the owner of) each arm is able to adaptively assess the dynamic environment in a distributed and independent manner that can be unknown to the bandit. Typical examples include ads bidding and voice assistant skill detection - each ad/skill may be managed via the same API, but by different clients under dynamic competition. (2) Eligibility States: assume the system at any time can be classified into one of L possible states. Each arm claims whether it is eligible under a state, represented by a constant binary vector c_j of length L for arm a_j. If the l-th element of c_j is 1, arm a_j claims to be eligible under the l-th state, and 0 otherwise. Given a context x, a stochastic distribution over the L states is observed, represented by a probability vector p of length L. For CS routing, the state is the type of customer issue, the claim is the SME's corresponding skill-set, and p is the NLU predictive distribution. Note that a score can be computed as the inner product of the claimed eligibility binary vector and the state probability vector, e_j := c_j^T p, so the state setup converges to the score setup. The major practical difference between the two interfaces is the amount of effort on the claiming side: the state interface simplifies the eligibility claims to static binary votes on a finite set of states, while the score interface mandates scoring with context awareness.

Consider a contextual bandit with a set of arms {a_1, a_2, ...}. At the n-th round, context x_n is observed and eligibility score e_n is either observed (score interface) or derived (state interface). After an arm is pulled, a reward r_n ∈ [0, 1] is revealed. The goal is to minimize the expected regret over N rounds: Σ_{n=1}^{N} ( E[r | a*_n, x_n, e_n] − E[r | a_n, x_n, e_n] ), where a*_n is the optimal action for this round in hindsight.
4. Methodology

The proposed algorithms run in order: the Eligibility Controller (EC) to filter out ineligible arms, and K-Boot to find the optimal one within the rest. Below, we first introduce K-Boot to set the base, and then EC.
4.1. K-Boot

K-Boot is a nonparametric bandit model. For a given context x, it estimates the reward for arm a as the Nadaraya-Watson kernel weighted average over the observed rewards of the k nearest neighbors of x from all historical samples where arm a was pulled. We use Bootstrap TS as the exploration strategy. k-NN regression is known to fit well with Bootstrap-Aggregation, with an analytical form of the smoothed conditional mean estimate that does not require actual resampling [15, 16]. However, sampling from the confidence distribution of the mean estimate has received little discussion. We devise a trick to shrink the resampling range from all historical samples to a local wrapper around the k-NN.

K-Boot is detailed in Algorithm 1. At iteration n with context x_n, for the m-th arm a_m, let D_m be the set of historical samples where a_m was pulled, and N_m = |D_m|. If N_m is zero, we sample the reward from a standard uniform distribution (Line 6). If N_m is greater than K, we first build an influential subset D'_m containing the K'-NN of x_n (K < K'). Intuitively, the K'-NN serves as a buffer to cover enough sample variance around the K-NN, so that a Bootstrap on the influential samples is a good approximation of one on all samples. Formally, a sample is considered influential for making inference about x_n if its probability of being in the K-NN of x_n after a Bootstrap on D_m is greater than ε. This probability is computed analytically on Line 9, based on the derived equation (6) in [15]. Here the tolerance hyperparameter ε controls the risk of missing influential samples and thus the fidelity of the approximated Bootstrap TS. In our experiments with N_m ≤ 10^4 and ε = 0.01, K' is empirically well bounded by 2K. This process shrinks all later computation to K' samples, with an overhead neighbor-search cost of O(log N_m) per sample (we used Hnswlib [17] for approximate search). If N_m is less than K, D_m itself is the influential set. We then add two pseudo-samples to D'_m on Lines 14-15, in order to expand the reward's empirical range for Bootstrap exploration [10]. To give the pseudo-samples proper contexts, and thus kernel weights, we set their distance to x as that of a random observation in D'_m, so their weight shrinks as more data is seen. On Lines 16-18, we then draw a Bootstrap resample and select the K nearest samples from it to calculate the estimated reward for a_m as the kernel weighted average. We implement K_h(·,·) as the simplest O(K) Nadaraya-Watson kernel estimator with automatic bandwidth selection [20] - more advanced models can be plugged in.

As we model the reward for each arm separately, as is common practice, K-Boot is flexible to the addition and removal of arms. This benefits applications like ours with decentralized and independent arm management. In addition, K-Boot has simple maintenance: (1) the nonparametric model makes minimal assumptions about the data and has only a single important hyperparameter, K, to balance accuracy and computation; ε = 0.01 works well across our experiments. (2) the algorithm is business-friendly in terms of decision interpretability through neighboring examples, and allows intuitive, instant online modification of model behaviors. For example, CS business owners may hide contacts with undesired trending patterns during a specific period (e.g. under a legacy policy), or let newly available SMEs "inherit" past example contacts or eligibility claims (next section) via domain knowledge (e.g. a rehired agent or additional training programs). The changes in data visibility will instantly and predictably affect the bandit model in production, with clear attribution. This data-driven while human-in-the-loop fungibility is essential to a fast-paced operational business like customer service.
Algorithm 1: K-Boot, a fully nonparametric contextual bandit
Input: Number of iterations N, number of arms M, number of nearest neighbors K, kernel function with bandwidth K_h(·,·), regularized incomplete beta function F_{α,β}(·), approximation tolerance ε.
1  Initialize the sample pool D_m := ∅ and its size N_m := 0, for each arm m = 1, ..., M
2  for n = 1, ..., N do
3      Observe context x_n
4      for m = 1, ..., M do
5          if N_m = 0 then
6              Estimate reward: r̂_{m,n} ~ U(0, 1)
7          else
8              if N_m > K then
9                  K' := min(k'), s.t. Σ_{i=1}^{K} [ F_{i, N_m−i+1}(k'/N_m) − F_{i, N_m−i+1}((k'−1)/N_m) ] > 1 − ε
10                 Find influential neighbors D'_m := the top K' samples in D_m with the largest K_h(x_n, ·)
11             else
12                 Set D'_m := D_m
13             end
14             Draw a random sample (x*, r*) from D'_m
15             Add pseudo-samples: D'_m := D'_m ∪ {(x*, 0), (x*, 1)}
16             Draw a Bootstrap sample D*_m from D'_m
17             Find neighbors D*_{m,K} := the top min(K, |D*_m|) samples in D*_m with the largest K_h(x_n, ·)
18             Estimate reward: r̂_{m,n} := Σ_i r_i K_h(x_n, x_i) / Σ_i K_h(x_n, x_i), summing over D*_{m,K}
19         end
20     end
21     Pull arm m* := arg max_m r̂_{m,n}, and observe the true reward r_n ∈ [0, 1]
22     Update D_{m*} := D_{m*} ∪ {(x_n, r_n)} and N_{m*} := N_{m*} + 1
23 end
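For intuition, a minimal Python sketch of one reward draw for a single arm (Algorithm 1, Lines 5-18) follows. It is an illustration under simplifying assumptions, not the production implementation: brute-force neighbor search stands in for Hnswlib, a fixed-bandwidth Gaussian kernel stands in for the automatic bandwidth selection, and the paper's empirical bound K' ≤ 2K replaces the exact incomplete-beta search of Line 9.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_weights(x, X, h=1.0):
    # Gaussian kernel weights of the rows of X relative to the query x.
    return np.exp(-((X - x) ** 2).sum(axis=1) / (2 * h ** 2))

def kboot_draw(X, r, x_n, K=100):
    """One Bootstrap-TS reward draw for a single arm.
    X: (N_m, d) past contexts of this arm; r: (N_m,) their observed rewards."""
    N_m = len(r)
    if N_m == 0:
        return rng.uniform()                        # Line 6: cold start
    # Lines 8-12: influential subset, here sized by the empirical bound 2K.
    Kp = min(2 * K, N_m)
    keep = np.argsort(-kernel_weights(x_n, X))[:Kp]
    Xp, rp = X[keep], r[keep]
    # Lines 14-15: pseudo-samples placed at a random influential context (so
    # they share its kernel weight), with rewards 0 and 1 to expand the
    # empirical reward range for exploration.
    j = rng.integers(len(rp))
    Xp = np.vstack([Xp, Xp[j], Xp[j]])
    rp = np.concatenate([rp, [0.0, 1.0]])
    # Line 16: bootstrap resample of the influential set.
    b = rng.integers(len(rp), size=len(rp))
    Xb, rb = Xp[b], rp[b]
    # Lines 17-18: K-NN within the resample, Nadaraya-Watson weighted average.
    wb = kernel_weights(x_n, Xb)
    top = np.argsort(-wb)[:min(K, len(rb))]
    return float(rb[top] @ wb[top] / wb[top].sum())
```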




4.2. Eligibility Control

EC is used to recognize arm eligibility and the associated uncertainties, for optimized bandit exploration. In an ordinary contextual bandit model, the eligibility information can be trivially considered as part of the context x - it is assumed to be positively correlated with r, and thus predictive as a plain input to a reward estimator. This assumes the reward model is able to eventually learn the relation between r and e after enough rounds. This trivial approach does not directly utilize eligibility to limit the range of arm exploration, and may unnecessarily explore inappropriate actions with catastrophic regret during cold starts. The other extreme is to strictly follow the eligibility information and ignore the reward feedback: for the eligibility score interface, this could be simply pulling the arm with the highest score as if it were an oracle; for the eligibility state interface, this could be finding the state with the highest probability and then removing arms that did not claim that state. In that case, the empirical data would never be used to validate and correct the potentially biased business logic - a common pitfall in practice.

The main idea of EC is to leave only the top-k arms with the highest eligibility scores before applying a normal bandit, and adjust k periodically based on the empirical correlation between eligibility scores and rewards. We use Spearman's Rank Correlation Coefficient, denoted as ρ. Using top-k arms is a heuristic trade-off between k = M (the trivial case; M is the total number of arms) and k = 1 (the strict rule case). Intuitively, in the case of perfect correlation between score and reward (ρ = 1), k = 1 is the oracle solution. If the eligibility score has zero or even negative correlation (ρ ≤ 0) with reward, k = M is the optimal solution - otherwise the arm exploration is restricted adversarially or randomly for no benefit. In the no-correlation case, the probability of "the best arm by reward is not in the top-k by score", defined as a leak, is linear in k: P(leak | k, ρ = 0) = 1 − k/M. It is obvious that P(leak | k, 0 < ρ < 1) < P(leak | k, ρ = 0). This observation reveals two insights: (1) α := P(leak | k, ρ) characterizes a type of risk in applying a top-k score filter to arms; (2) when controlling the risk of a leak at a certain level, e.g. α = 0.01, k is a function of ρ. The true correlation ρ between reward r and eligibility score e is unknown, but can be estimated from historical observations {(r_i, e_i)}_i. Therefore, EC essentially calculates k such that the risk P(leak | k, ρ̂) is controlled at a given level.
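The ρ = 0 leak rate is easy to sanity-check by simulation; this snippet is ours and assumes only that a zero-correlation score induces an independent random ranking of the arms:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10

# Under rho = 0, the score-rank of the truly best arm is uniform over the M
# positions, so P(leak | k, rho = 0) = 1 - k/M.
for k in (1, 3, 7):
    pos = rng.integers(M, size=100_000)        # 0-indexed score-rank of best arm
    print(k, (pos >= k).mean(), 1 - k / M)     # empirical vs. analytical leak rate
```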
To pair with K-Boot, EC also models P(leak | k, ρ) in a nonparametric fashion to avoid extra assumptions. For ρ, although Spearman's rank correlation coefficient is used for estimation, Kendall's τ or any other rank-based correlation measure applies similarly. Across the arms and rounds, we assume the rank of e is a noisy perturbation of the rank of r, and parameterize this perturbation as "performing p random neighbor inversions". An inversion is simply switching two elements in a sequence, and a neighbor inversion switches two neighboring elements. Note that p = 0 indicates ρ = 1, because the perturbed rank sequence is identical to the original one; p → ∞ is effectively a random permutation, thus ρ = 0.
Algorithm 2: Eligibility Control
Input: Number of iterations N, number of arms M, risk level α, initial score threshold k_0, a pre-computed empirical dictionary G(n, α, ρ) → k.
1  Initialize threshold k := k_0 and observation pool D_e := ∅
2  for n = 1, ..., N do
3      Observe context x_n, and eligibility scores {e_1, ..., e_M}, one for each arm
4      Find the k-th largest element e_[k] among the scores
5      Sample a reward for each arm with e_m ≥ e_[k] (Algorithm 1, Lines 5-18)
6      Pull the arm m* with the highest reward estimate (Algorithm 1, Lines 21-22)
7      Observe the true reward r_n, and update D_e := D_e ∪ {(r_n, e_{m*})}
8      (Periodically) compute ρ̂ from D_e and set k := G(n, α, ρ̂)
9      Set k := M if k ≥ (1 − α)M
10 end
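A minimal sketch of the two EC-specific steps (the top-k score filter of Lines 4-5 and the periodic threshold update of Lines 8-9) might look as follows; the dictionary G is treated as an opaque lookup function, and the helper names are ours:

```python
import numpy as np
from scipy.stats import spearmanr

def eligible_arms(scores, k):
    # Arms whose eligibility score reaches the k-th largest score
    # (ties can keep more than k arms).
    kth = np.sort(scores)[-k]
    return np.flatnonzero(scores >= kth)

def update_threshold(history, G, n, alpha, M):
    # Re-estimate rho from the (reward, score) pairs gathered so far and
    # query the offline dictionary; fall back to k = M (no control) when
    # the score signal looks like pure noise.
    r, e = np.asarray(history, dtype=float).T
    rho_hat = spearmanr(r, e).correlation
    k = G(n, alpha, rho_hat)
    return M if k >= (1 - alpha) * M else k
```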




This setup provides a generative process to simulate the joint distribution of all parameters for a rank sequence of length n: (1) do p random neighbor inversions on the sorted sequence of ranks (1, ..., n); (2) see if there is a leak, namely whether element 1 is now at or beyond the k-th position; (3) compute ρ̂ between the original and current sequences. The three steps can be replicated a sufficient number of times to obtain the final correlation coefficient ρ̂ and risk level α̂, by averaging the individual ρ̂ and counting the frequency of leaks. For different combinations of (k, p, n), as the above process does not depend on any actual data, it can be performed offline in batch to get the corresponding (ρ̂, α̂). The resulting empirical dictionary G : (n, α, ρ) → k is stored, and later queried during online inference. To avoid unnecessary control in the edge case where the eligibility score turns out to be pure noise, k is reset to M (no EC) if G outputs a k greater than the trivial value (1 − α)M. Finally, note that the generative distribution of random neighbor inversions is assumed to be uniform along the whole sequence, yielding the unbiasedness of the estimator ρ̂ even if the observed data points {(r_i, e_i)}_i are censored by the top-k filter policy.

The full online EC process is described in Algorithm 2 (excluding the offline dictionary generation). The eligibility state interface is converted into the score interface before EC, so the scores are the only required inputs. The algorithm takes K-Boot (Algorithm 1) as the bandit counterpart just for demonstration, and applies to any bandit that has arm-independent reward models. EC has a single hyperparameter: the controlled risk of a leak, α. Note that α is the risk of missing the best arm, so its influence on cumulative rewards depends on the arm reward distribution, specifically the gap between the first and second best arms. We leave other types of risk control mechanisms with distributional assumptions to future work.
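A Monte Carlo sketch of one (k, p, n) cell of this offline computation, under the stated assumptions (uniform neighbor inversions, a leak defined on the rank-1 element), could be:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def simulate_cell(n, k, p, reps=2000):
    # Returns (rho_hat, alpha_hat) for one combination of (k, p, n).
    rhos, leaks = [], 0
    for _ in range(reps):
        seq = np.arange(n)                       # sorted rank sequence, 0 = best
        for _ in range(p):                       # p random neighbor inversions
            i = rng.integers(n - 1)
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
        leaks += int(np.argmax(seq == 0) >= k)   # leak: element 1 at/beyond slot k
        rhos.append(spearmanr(np.arange(n), seq).correlation)
    return float(np.mean(rhos)), leaks / reps
```

Sweeping p for each (n, k) and inverting the resulting (ρ̂, α̂) pairs then yields the dictionary G(n, α, ρ) → k queried in Algorithm 2.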
                                                                the eligibility scores are further simulated by perturbing
    The full online EC process is described in Algorithm 2
                                                                the counter-factual reward of each arm: 𝑒 := π‘€π‘Ÿ + πœ€β€² , to
(excluding the offline dictionary generation). The eligi-
                                                                induce the correlation. Here 𝑀 is an arm-specific scaling
bility state interface is converted into the score interface
                                                                coefficient draw from 𝒰(βˆ’0.1, 1). It has a small chance
before EC, so the scores are the only required inputs.
                                                                to leave 𝑒 and π‘Ÿ negatively correlated, as tolerating eligi-
The algorithm takes K-Boot (Algorithm 1) as the bandit
                                                                bility claim error in systems under the overall positive
counterpart just for demonstration, and applies to any
                                                                correlation - the only assumption of EC. The noise term
bandit that has arm-independent reward models. EC has
                                                                πœ€β€² ∼ 𝒩 (0, πœŽπ‘’2 ) controls the correlation effect size. By
a single hyperparameter: the controlled risk of a leak 𝛼.
                                                                setting a proper πœŽπ‘’ , the Spearman’s rank correlation co-
Note 𝛼 is the risk of missing the best arm, so its influence
                                                                efficient 𝜌 between reward and eligibility score can be
on cumulative rewards depends on the arm reward distri-
                                                                fixed at an arbitrary positive value.
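A compact numpy rendering of this generator (our sketch of the formulas above, with the cosine environment h3 picked as the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 10, 5000, 20

a = rng.standard_normal((M, d))               # arm-specific parameters
A = rng.standard_normal((M, d, d))
sigma_r = rng.uniform(0.01, 0.5, size=M)      # arm-specific noise levels

def h(x, m, kind=3):
    if kind == 0: return 0.1 * a[m] @ x                  # h0: linear
    if kind == 1: return 0.05 * (a[m] @ x) ** 2          # h1: quadratic
    if kind == 2: return 0.002 * x @ A[m].T @ A[m] @ x   # h2: inner-product
    return np.cos(3 * a[m] @ x)                          # h3: cosine

X = rng.standard_normal((N, d))               # contexts, i.i.d. N(0, 1)
reward = lambda n, m: h(X[n], m) + rng.normal(0, sigma_r[m])  # noisy observation
```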
Figure 2: Benchmarking K-Boot, LinUCB and NeuralUCB. Left half: synthetic datasets; right half: UCI datasets. Model
hyperparameters are set as the best on synthetic datasets. Metric is averaged across 10 runs with an 80% confidence band.



UCI Classification. Four real-world datasets from the UCI Machine Learning Repository [22] are gathered: covertype, magic, statlog, and mnist. These multi-class classification datasets are converted into multi-armed contextual bandit problems, following the method in [4]: each class is treated as an arm, and the reward is 1 if the bandit pulls the correct arm (identifies the correct class) and 0 otherwise. Each dataset is shuffled and split into 10 mutually exclusive runs with 5000 samples each, fed to the bandit one per round (magic has fewer than 50000 samples, so resampling is performed). A conversion sketch follows below.
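The classification-to-bandit conversion can be sketched as below; fetch_openml is our stand-in for the UCI download, and the shuffling details are an assumption:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Each class becomes an arm; pulling the arm that matches the true label
# yields reward 1, otherwise 0.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
classes = np.unique(y)
idx = np.random.default_rng(0).permutation(len(y))[:5000]   # one shuffled run

def bandit_round(i, pulled_arm):
    # Context for round i and the reward of the pulled arm.
    return X[idx[i]], float(classes[pulled_arm] == y[idx[i]])
```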
CS Routing with Eligibility. Agent skill-level routing is considered: CS agents are assigned to different skill groups by operations teams, and the routing system determines which skill is required to resolve a given customer contact based on the customer profile, policy rules and natural language understanding. As a proof of concept before any online experiments, we formulate an offline bandit dataset with historical observations. About 2 million Amazon US customer service text chats with bot-to-agent routing are gathered. To introduce stochasticity, a historical routing policy emulator is trained on a 95% random subset of the chats, taking the customer-bot transcripts before the routing action as input to predict which skill the system chose. Specifically, the emulator consists of a pretrained GPT-2 transformer encoder [23], with a multi-class sequence classifier head to predict the top 20 skill groups (see the sketch after this paragraph). The model is trained (fine-tuned) for one epoch with learning rate 10^-4 and batch size 32. For the bandit simulation, the final actual resolving skill, after all the potential agent-to-agent transfers since the initial routing, serves as the ground truth - if the routed skill agrees with the final skill, the action is considered accurate and to have avoided a potential transfer. At each round, the bandit receives a 256-dimensional text embedding from the same GPT-2 encoder as the observed context, and a 20-dimensional probabilistic prediction from the emulator as eligibility scores for each arm/skill. If the bandit chooses a skill matching the final one, the reward is 1 and otherwise 0. In this case, the cumulative regret is an upper bound on the number of transferred contacts, because agents assigned to different skills may resolve the contact regardless (e.g. digital order agents are often equipped to resolve retail order issues, so they may not transfer an incorrectly routed contact). The remaining 5% of chats are used to generate 10 bandit runs with 8000 samples each¹. We observe that the empirical ρ across the runs is 0.3492. We leave the following to future work: (1) online experiments to measure real transfers and positive customer reviews as the reward metric; (2) agent-level routing given agent profiles and finer-grained eligibility definitions than skill groups.
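A hedged sketch of such an emulator with the Hugging Face transformers library is below; the transcript string is hypothetical, and the fine-tuning loop (one epoch, learning rate 10^-4, batch size 32) is omitted:

```python
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# GPT-2 encoder with a 20-way sequence classification head over skill groups.
tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                 # GPT-2 has no native pad token
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=20)
model.config.pad_token_id = tok.pad_token_id

batch = tok(["bot: how can I help? customer: my device will not charge"],
            return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits            # (1, 20) skill-group logits
scores = torch.softmax(logits, dim=-1)        # 20-dim eligibility scores
```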
5.1. K-Boot Benchmarks

We compare K-Boot with LinUCB and NeuralUCB on both the synthetic and UCI classification datasets. With the 4 synthetic datasets, we first select the following hyperparameters for each Bandit algorithm of interest: (1) number of nearest neighbors k ∈ {20, 50, 100} for K-Boot; (2) weight parameter for exploration α ∈ {0.1, 1.0, 10} for LinUCB; (3) regularization parameter λ ∈ {0.1, 0.01, 0.001} and exploration parameter ν ∈ {0.2, 0.02, 0.002} for NeuralUCB - other hyperparameters such as learning rate, batch settings and number of layers are set the same as in the original code base².

¹ From another perspective, this offline bandit-with-eligibility simulation setup is equivalent to "using a bandit for online improvement of a black-box probabilistic classifier with unknown accuracy".
² https://github.com/uclaml/NeuralUCB
Figure 3: EC under different Spearman's Correlation ρ between eligibility score and reward, with different risk levels α. The bandit part is K-Boot and eligibility scores are not in the context.



We settle on k = 100, α = 10, (λ, ν) = (0.001, 0.002) for having the lowest cumulative regret averaged across h0-h3 and 10 replicated runs, and carry these values over to compare the actual model performance on the 4 UCI datasets. We believe this setup is more realistic, because most bandit applications do not have the luxury of intensive hyperparameter tuning due to small sample sizes or non-stationary environments. Finally, all experiments in this paper implement online, single-sample model updates. Figure 2 (left half) shows the performance of the models with the final selected hyperparameters. Except for the linear reward scenario, where LinUCB/NeuralUCB has the advantage of correct/close model specification, there is no significant performance difference between K-Boot and NeuralUCB, while LinUCB is inferior. Figure 2 (right half) shows the performance comparison on the UCI real-world datasets. K-Boot and NeuralUCB tie, each winning on two datasets, and LinUCB is consistently the worst. By examining the 80% confidence band (P10-P90 across 10 runs), we find NeuralUCB has larger performance volatility (wider bands) on real data than the other two models, due to learning instability in certain runs.

5.2. Eligibility Control Testing

For the synthetic data, we control σ_e to simulate different levels of reward-reflecting reliability in the eligibility scores, with resulting ρ(r, e) ∈ {0.05, 0.15, 0.3, 0.45, 0.6, 0.75} for each true reward function h0-h3. In the algorithm, we set different risk levels α in EC. Note that α = 0 is effectively not using EC, and we abuse the notation α → 1 to denote the strict top-1 policy: pull the arm with the highest eligibility score.

Figure 3 shows the results of applying EC to K-Boot, with eligibility scores used by EC only (explained later). When the signal in e is weak (low ρ), EC adaptively raises the top-k threshold close to M, achieving the same performance as no EC. As ρ grows, the advantage reveals itself: using EC is consistently better than not. With the right α, the dynamic thresholding may dominate the top-1 policy even under strong eligibility signals. EC performs best with α = 0.5 across all synthetic datasets, and the same value is carried over to the next experiment on the CS Routing data.

EC can be applied to other bandits. Figure 4 shows the same synthetic data results but with EC + LinUCB. Here, eligibility scores are also added to the bandit context. The reason the previous experiment did not take scores as K-Boot input is that k-NN has no native feature selection mechanism: if the eligibility score is close to white noise (low ρ), the model struggles to fit the reward, distracting from the other visual patterns presented. For LinUCB, such a side effect is minimal because the simulated noise is Gaussian, so it is a better condition to test the trivial way versus the EC way of utilizing eligibility. EC still dominates no-EC in most cases. There are a few exceptions for the linear true reward function h0, where α = 0.5 is worse because LinUCB learns so well that taking a high risk of missing the best arm is not cost-effective.
Figure 4: EC under different Spearman's Correlation ρ between eligibility score and reward, with different risk levels α. The bandit part is LinUCB and eligibility scores are part of the context.



An interesting observation is that, while the strict top-1 rule was dominated by K-Boot + EC (Figure 3), it actually beats LinUCB + EC for nonlinear true reward functions, even at ρ = 0.15. This indicates that the power of the bandit model is still a key driver for ML to outperform static rules.

Figure 5 shows the result on the CS Routing data, where EC significantly improves routing accuracy. Routing to the skill with the highest probability from the emulator serves as a stochastic surrogate of the existing policy. The pure K-Boot result, without knowing the emulator outputs, is set as the baseline. With only 8000 samples, the average accuracy of the emulator policy is improved by K-Boot + EC from 53.37% to 57.78%.

Figure 5: CS Routing Results. Metric is averaged across 10 runs with an 80% confidence band.


6. Conclusions

We proposed a nonparametric contextual Bandit algorithm, K-Boot, with arm-level eligibility control (EC) for routing customer contacts to eligible SMEs in real-time. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits. We compared K-Boot with LinUCB and NeuralUCB. When looking at average regret performance over simulation runs, K-Boot and NeuralUCB had a tie in winning scenarios and were comparable in terms of robustness to different data distributions. Both worked better than LinUCB. However, we observed larger performance variability for NeuralUCB as opposed to K-Boot. We further found the EC component improved K-Boot's performance on both the synthetic datasets and a simulated CS routing scenario when an eligibility signal exists.
References

[1] G. Elena, K. Milos, I. Eugene, Survey of multiarmed bandit algorithms applied to recommendation systems, International Journal of Open Information Technologies 9 (2021) 12–27.
[2] Y. Liu, L. Li, A map of bandits for e-commerce, arXiv preprint arXiv:2107.00680 (2021).
[3] T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.
[4] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661–670.
[5] S. Sajeev, J. Huang, N. Karampatziakis, M. Hall, S. Kochman, W. Chen, Contextual bandit applications in a customer support bot, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3522–3530.
[6] S. Upadhyay, M. Agarwal, D. Bounneffouf, Y. Khazaeni, A bandit approach to posterior dialog orchestration under a budget, arXiv preprint arXiv:1906.09384 (2019).
[7] O. Chapelle, L. Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems 24 (2011).
[8] T. Graepel, J. Quiñonero Candela, T. Borchert, R. Herbrich, Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine, in: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track, 2010.
[9] I. Osband, B. Van Roy, Bootstrapped Thompson sampling and deep exploration, arXiv preprint arXiv:1507.00300 (2015).
[10] B. Kveton, C. Szepesvari, S. Vaswani, Z. Wen, T. Lattimore, M. Ghavamzadeh, Garbage in, reward out: Bootstrapping exploration in multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2019, pp. 3601–3610.
[11] C. Riou, J. Honda, Bandit algorithms based on Thompson sampling for bounded reward distributions, in: Algorithmic Learning Theory, PMLR, 2020, pp. 777–826.
[12] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, arXiv preprint arXiv:0912.3995 (2009).
[13] S. R. Chowdhury, A. Gopalan, On kernelized multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 844–853.
[14] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835 (2013).
[15] B. M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74 (2009) 235–255.
[16] G. Biau, F. Cérou, A. Guyader, On the rate of convergence of the bagged nearest neighbor estimate, Journal of Machine Learning Research 11 (2010).
[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.
[18] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.
[19] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: International Conference on Machine Learning, PMLR, 2020, pp. 3887–3896.
[20] B. W. Silverman, Density estimation for statistics and data analysis, Routledge, 2018.
[21] D. Zhou, L. Li, Q. Gu, NeuralUCB: Contextual bandits with neural network-based exploration (2019).
[22] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.