A Nonparametric Contextual Bandit with Arm-level
Eligibility Control for Customer Service Routing
Ruofeng Wen1,*, Wenjun Zeng1 and Yi Liu1
1 Customer Engagement Technology, Amazon


                                          Abstract
Amazon Customer Service (CS) provides real-time support for millions of customer contacts every year. While bot resolvers help automate some traffic, we still see high demand for human agents, also called subject matter experts (SMEs). Customers reach out with questions in different domains (return policy, device troubleshooting, etc.). Depending on their training, not all SMEs are eligible to handle all contacts. Routing contacts to eligible SMEs turns out to be a non-trivial problem because SMEs' domain eligibility is subject to training quality and can change over time. To optimally recommend SMEs while simultaneously learning the true eligibility status, we propose to formulate the routing problem with a nonparametric contextual bandit algorithm (K-Boot) plus an eligibility control (EC) algorithm. K-Boot models reward with a kernel smoother on similar past samples selected by k-NN, and uses Bootstrap Thompson Sampling for exploration. EC filters arms (SMEs) by the initially system-claimed eligibility and dynamically validates the reliability of this information. The proposed K-Boot is a general bandit algorithm, and EC is applicable to other bandits. Our simulation studies show that K-Boot performs on par with state-of-the-art bandit models, and that EC boosts K-Boot performance when a stochastic eligibility signal exists.

Keywords
Bandit, Customer Service Routing, Arm Eligibility, Nonparametric



1. Introduction

In Amazon Customer Service (CS), we dispatch human agents, also called subject matter experts (SMEs), in real-time to handle millions of customer contacts. The SME routing automation process has two steps: first, a natural language understanding (NLU) model processes the customer's input and identifies the relevant domain (return policy, device troubleshooting, etc.); then it dispatches an SME who is eligible. We define an SME as eligible when they have mastered the required skill in the relevant domain through training. SME routing turns out to be a non-trivial problem for four reasons. First, the NLU model is unlikely to categorize the domain with perfect accuracy. Second, the reliability of SME eligibility as identified by the operations team is subject to training program quality and readiness. Third, the domain taxonomy and SMEs' eligibility status can change in a decentralized way; in reality, it is difficult to keep track of all the eligibility updates, assuming correctness. Finally, eligible SMEs do not guarantee customer satisfaction, leading to noisy feedback signals. All these uncertainties make SME routing a challenging problem.

Figure 1: The CS Routing Problem. The goal of the recommender is to select an agent that will result in the best outcome of this customer contact.
To tackle the complexity, we formulate CS routing as a recommendation problem and use contextual Bandit as a solution. Bandit, a framework for sequential decision making, has been used for online recommendation across companies such as Amazon, Google, Netflix and Yahoo [1]. In a contextual bandit, the decision maker sequentially chooses an arm/action (an SME in our case) based on available contextual information (contact transcript embedding, customer profile, etc.), and observes a reward signal (customer satisfaction, contact transfer, etc.) [2]. The objective is to recommend arms at each step to maximize the expected cumulative reward over time. We apply bandits to the routing problem because they can optimally explore uncertainties and adapt to dynamic changes [3]. In particular, the uncertainties in SME eligibility motivate us to formulate a new type of bandit that utilizes arm-level eligibility signals.

Our contribution. We propose K-Boot, a nonparametric contextual bandit algorithm to model reward and explore, and an Eligibility Controller (EC) algorithm to model arm-level eligibility. K-Boot uses a k-nearest neighbors (k-NN) approach to find similar samples from the past, applies a Nadaraya-Watson kernel regression among the found samples to estimate the reward, and adopts Bootstrap Thompson Sampling as the exploration strategy. The EC component implements a dynamic top-arm filter based on the estimated Spearman's Rank Correlation Coefficient between eligibility and reward. We are interested in this end-to-end nonparametric setup for practicality: robust performance, strong interpretability and simple non-technical maintenance that is friendly to business partners. For instance, the k-NN component makes it easy to investigate which historical contacts support an SME recommendation, and then deploy instant online fixes by trimming unwanted outliers. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits, such as LinUCB [4].

In the remainder of the paper, we review related literature in Section 2, set up the formal problem in Section 3 and detail the proposed algorithms in Section 4. We then present our model results in Section 5, and conclude the paper last.

4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18-23, 2022, Seattle, WA, USA.
* Corresponding author.
ruofeng@amazon.com (R. Wen); zengwenj@amazon.com (W. Zeng); yiam@amazon.com (Y. Liu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
2. Related Work

In most bandit applications, the items to recommend are products and webpage content (widgets, news articles, etc.). Recently, we see research efforts in using bandits for routing recommendation in a conversation [5, 6]. In [6], they used a contextual bandit to utilize query and user features along with the conversational context, and built an orchestration to route the dialog. In [5], they built bandit models to assign solutions to a chatbot, given customers' queries on Microsoft's products. This is the most relevant work to our problem. However, they assume an upstream model to provide a set of plausible actions to start with, and the bandit itself does not deal with the uncertainty in arm eligibility. Overall, compared to the extensive bandit literature on product and webpage content recommendation, few bandit works exist for routing recommendation.

Thompson Sampling (TS) is a heuristic for balancing exploration and exploitation in bandits [7]. In the parametric manner, TS is implemented by drawing the reward function parameter θ from a posterior distribution, calculating rewards with the sampled parameters, and selecting the arm that maximizes the estimated reward function. θ is often assumed to follow Conjugate-Exponential Family distributions, resulting in closed-form posteriors. There is also work where variational methods are used to provide an analytical approximation to the posterior and enforce tractability [8]. Bootstrap TS introduces volatility in a different way: it pulls the arm with the highest bootstrap mean, estimated from reward history in a nonparametric way [9]. One drawback of the approach is the computational cost of training one bandit model per Bootstrapped training set, which is difficult for large-scale real-time problems. [10] proposed a general nonparametric bandit algorithm but fell back to the parametric reward function approach when there is context. [11] further improved the exploration efficiency with the Bayesian Bootstrap, but also only for non-contextual bandits. We pair Bootstrap TS with a single k-NN for nonparametric contextual exploration.

Kernel-based Bandits use a nonparametric reward estimator via kernel functions, combined with an exploration policy like TS or UCB. Gaussian Processes (GP) have wide applications in the bandit domain, e.g. GP-UCB [12] and GP-TS [13]. However, GP inference is expensive - the standard approach has O(N^3) complexity, where N is the number of training samples. The most efficient sparse or low-rank GP approach still requires O(N D^2), where D is a hyperparameter indicating the number of inducing variables - see [14] for a review. We use k-NN to enable O(k log N) inference time.

3. Formulation: Contextual Bandit with Arm Eligibility

We define two types of eligibility setup with different data interfaces. (1) Eligibility Scores: for a given context x, a list of eligibility scores e = (e_1, e_2, ...) is observed, where e_j ∈ [0, 1] is for arm a_j. A larger score means higher confidence in the arm being eligible for the current sample/iteration. An intuitive interpretation is that the arms are bidding for themselves. This score setup assumes (the owner of) each arm is able to adaptively assess the dynamic environment in a distributed and independent manner that can be unknown to the bandit. Typical examples include ads bidding and voice assistant skill detection - each ad/skill may be managed via the same API, but by different clients under dynamic competition. (2) Eligibility States: assume the system at any time can be classified into one of L possible states. Each arm claims whether it is eligible under a state, represented by a constant binary vector c_j of length L for arm a_j. If the l-th element of c_j is 1, arm a_j claims to be eligible under the l-th state, and 0 otherwise. Given a context x, a stochastic distribution over the L states is observed, represented by a probability vector p of length L. For CS routing, the state is the type of customer issue, the claim is the SME's corresponding skill-set, and p is the NLU predictive distribution. Note that a score can be computed as the inner product of the claimed eligibility binary vector and the state probability vector, e_j := c_j^T p, so the state setup converges to the score setup. The major practical difference between the two interfaces is the amount of effort on the claiming side: the state interface simplifies the eligibility claims to static binary votes on a finite set of states, while the score interface mandates scoring with context awareness.

Consider a contextual bandit with a set of arms {a_1, a_2, ...}. At the n-th round, context x_n is observed and eligibility score e_n is either observed (score interface) or derived (state interface). After an arm is pulled, a reward r_n ∈ [0, 1] is revealed. The goal is to minimize the expected regret over N rounds: Σ_{n=1}^{N} ( E[r | a*_n, x_n, e_n] − E[r | a_n, x_n, e_n] ), where a*_n is the optimal action for this round in hindsight.
4. Methodology

The proposed algorithms run in order: the Eligibility Controller (EC) to filter out ineligible arms, and K-Boot to find the optimal one within the rest. Below, we first introduce K-Boot to set the base, and then EC.
4.1. K-Boot

K-Boot is a nonparametric bandit model. For a given context x, it estimates the reward for arm a as the Nadaraya-Watson kernel weighted average over the observed rewards of the k nearest neighbors of x from all historical samples where arm a was pulled. We use Bootstrap TS as the exploration strategy. k-NN regression is known to fit well with Bootstrap-Aggregation, with an analytical form of the smoothed conditional mean estimate that does not require actual resampling [15, 16]. However, sampling from the confidence distribution of the mean estimate has received little discussion. We devise a trick to shrink the resampling range from all historical samples to a local wrapper around the k-NN.

K-Boot is detailed in Algorithm 1. At iteration n with context x_n, for the m-th arm a_m, let D_m be the set of historical samples where a_m was pulled, and N_m = |D_m|. If N_m is zero, we sample the reward from a standard uniform distribution (Line 6). If N_m is greater than K, we first build an influential subset D'_m containing the K'-NN of x_n (K < K'). Intuitively, the K'-NN serves as a buffer to cover enough sample variance around the K-NN, so that a Bootstrap on the influential samples is a good approximation of one on all samples. Formally, a sample is considered influential for making inference about x_n if its probability of being in the K-NN of x_n after a Bootstrap on D_m is greater than ε. This probability is computed analytically on Line 9, based on the derived equation (6) in [15]. Here the tolerance hyperparameter ε controls the risk of missing influential samples and thus the fidelity of the approximated Bootstrap TS. In our experiments with N_m ≤ 10^4 and ε = 0.01, K' is empirically well bounded by 2K. This process shrinks all later computation to K' samples, with an overhead neighbor-search cost of O(log N_m) per sample (we used Hnswlib [17] for approximate search). If N_m is less than K, D_m itself is the influential set. We then add two pseudo-samples to D'_m on Lines 14-15, in order to expand the reward's empirical range for Bootstrap exploration [10]. To give the pseudo-samples proper contexts, and thus kernel weights, we set their distance to x as that of a random observation in D'_m, so their weight shrinks as more data is seen. On Lines 16-18, we then draw a Bootstrap resample and select the K nearest samples from it to calculate the estimated reward for a_m as the kernel weighted average. We implement K_h(·,·) as the simplest O(K) Nadaraya-Watson kernel estimator with automatic bandwidth selection [20] - more advanced models can be plugged in.

As we model the reward for each arm separately, as is common practice, K-Boot is flexible to the addition and removal of arms. This benefits applications like ours with decentralized and independent arm management. In addition, K-Boot has simple maintenance: (1) the nonparametric model makes minimal assumptions about the data and has only a single important hyperparameter, K, to balance accuracy and computation; ε = 0.01 works well across our experiments. (2) the algorithm is business-friendly in terms of decision interpretability through neighboring examples, and allows intuitive, instant online modification of model behaviors. For example, CS business owners may hide contacts with undesired trending patterns during a specific period (e.g. under a legacy policy), or let newly available SMEs "inherit" past example contacts or eligibility claims (next section) via domain knowledge (e.g. a rehired agent or additional training programs). The changes in data visibility will instantly and predictably affect the bandit model in production, with clear attribution. This data-driven while human-in-the-loop fungibility is essential to a fast-paced operational business like customer service.
Algorithm 1: K-Boot, a fully nonparametric contextual bandit
Input: Number of iterations N, number of arms M, number of nearest neighbors K, kernel function with bandwidth K_h(·,·), regularized incomplete beta function F_{α,β}(·), approximation tolerance ε.
1  Initialize the sample pool D_m := ∅ and its size N_m := 0, for each arm m = 1, ..., M
2  for n = 1, ..., N do
3      Observe context x_n
4      for m = 1, ..., M do
5          if N_m = 0 then
6              Estimate reward: r̂_{m,n} ~ U(0, 1)
7          else
8              if N_m > K then
9                  K' := min(k'), s.t. Σ_{i=1}^{K} [ F_{i, N_m−i+1}(k'/N_m) − F_{i, N_m−i+1}((k'−1)/N_m) ] > 1 − ε
10                 Find influential neighbors D'_m := the top K' samples in D_m with the largest K_h(x_n, ·)
11             else
12                 Set D'_m := D_m
13             end
14             Draw a random sample (x*, r*) from D'_m
15             Add pseudo-samples: D'_m := D'_m ∪ {(x*, 0), (x*, 1)}
16             Draw a Bootstrap sample D*_m from D'_m
17             Find neighbors D*_{m,K} := the top min(K, |D*_m|) samples in D*_m with the largest K_h(x_n, ·)
18             Estimate reward: r̂_{m,n} := Σ_i r_i K_h(x_n, x_i) / Σ_i K_h(x_n, x_i), summing over D*_{m,K}
19         end
20     end
21     Pull arm m* := arg max_m r̂_{m,n}, and observe the true reward r_n ∈ [0, 1]
22     Update D_{m*} := D_{m*} ∪ {(x_n, r_n)} and N_{m*} := N_{m*} + 1
23 end
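For intuition, a minimal Python sketch of one reward draw for a single arm (Algorithm 1, Lines 5-18) follows. It is an illustration under simplifying assumptions, not the production implementation: brute-force neighbor search stands in for Hnswlib, a fixed-bandwidth Gaussian kernel stands in for the automatic bandwidth selection, and the paper's empirical bound K' ≤ 2K replaces the exact incomplete-beta search of Line 9.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_weights(x, X, h=1.0):
    # Gaussian kernel weights of the rows of X relative to the query x.
    return np.exp(-((X - x) ** 2).sum(axis=1) / (2 * h ** 2))

def kboot_draw(X, r, x_n, K=100):
    """One Bootstrap-TS reward draw for a single arm.
    X: (N_m, d) past contexts of this arm; r: (N_m,) their observed rewards."""
    N_m = len(r)
    if N_m == 0:
        return rng.uniform()                        # Line 6: cold start
    # Lines 8-12: influential subset, here sized by the empirical bound 2K.
    Kp = min(2 * K, N_m)
    keep = np.argsort(-kernel_weights(x_n, X))[:Kp]
    Xp, rp = X[keep], r[keep]
    # Lines 14-15: pseudo-samples placed at a random influential context (so
    # they share its kernel weight), with rewards 0 and 1 to expand the
    # empirical reward range for exploration.
    j = rng.integers(len(rp))
    Xp = np.vstack([Xp, Xp[j], Xp[j]])
    rp = np.concatenate([rp, [0.0, 1.0]])
    # Line 16: bootstrap resample of the influential set.
    b = rng.integers(len(rp), size=len(rp))
    Xb, rb = Xp[b], rp[b]
    # Lines 17-18: K-NN within the resample, Nadaraya-Watson weighted average.
    wb = kernel_weights(x_n, Xb)
    top = np.argsort(-wb)[:min(K, len(rb))]
    return float(rb[top] @ wb[top] / wb[top].sum())
```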




4.2. Eligibility Control

EC is used to recognize arm eligibility and the associated uncertainties, for optimized bandit exploration. In an ordinary contextual bandit model, the eligibility information can be trivially considered as part of the context x - it is assumed to be positively correlated with r, and thus predictive as a plain input to a reward estimator. This assumes the reward model is able to eventually learn the relation between r and e after enough rounds. This trivial approach does not directly utilize eligibility to limit the range of arm exploration, and may unnecessarily explore inappropriate actions with catastrophic regret during cold starts. The other extreme is to strictly follow the eligibility information and ignore the reward feedback: for the eligibility score interface, this could be simply pulling the arm with the highest score as if it were an oracle; for the eligibility state interface, this could be finding the state with the highest probability and then removing arms that did not claim that state. In that case, the empirical data would never be used to validate and correct the potentially biased business logic - a common pitfall in practice.

The main idea of EC is to leave only the top-k arms with the highest eligibility scores before applying a normal bandit, and adjust k periodically based on the empirical correlation between eligibility scores and rewards. We use Spearman's Rank Correlation Coefficient, denoted as ρ. Using top-k arms is a heuristic trade-off between k = M (the trivial case; M is the total number of arms) and k = 1 (the strict rule case). Intuitively, in the case of perfect correlation between score and reward (ρ = 1), k = 1 is the oracle solution. If the eligibility score has zero or even negative correlation (ρ ≤ 0) with reward, k = M is the optimal solution - otherwise the arm exploration is restricted adversarially or randomly for no benefit. In the no-correlation case, the probability of "the best arm by reward is not in the top-k by score", defined as a leak, is linear in k: P(leak | k, ρ = 0) = 1 − k/M. It is obvious that P(leak | k, 0 < ρ < 1) < P(leak | k, ρ = 0). This observation reveals two insights: (1) α := P(leak | k, ρ) characterizes a type of risk in applying a top-k score filter to arms; (2) when controlling the risk of a leak at a certain level, e.g. α = 0.01, k is a function of ρ. The true correlation ρ between reward r and eligibility score e is unknown, but can be estimated from historical observations {(r_i, e_i)}_i. Therefore, EC essentially calculates k such that the risk P(leak | k, ρ̂) is controlled at a given level.
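The ρ = 0 leak rate is easy to sanity-check by simulation; this snippet is ours and assumes only that a zero-correlation score induces an independent random ranking of the arms:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 10

# Under rho = 0, the score-rank of the truly best arm is uniform over the M
# positions, so P(leak | k, rho = 0) = 1 - k/M.
for k in (1, 3, 7):
    pos = rng.integers(M, size=100_000)        # 0-indexed score-rank of best arm
    print(k, (pos >= k).mean(), 1 - k / M)     # empirical vs. analytical leak rate
```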
To pair with K-Boot, EC also models P(leak | k, ρ) in a nonparametric fashion to avoid extra assumptions. For ρ, although Spearman's rank correlation coefficient is used for estimation, Kendall's τ or any other rank-based correlation measure applies similarly. Across the arms and rounds, we assume the rank of e is a noisy perturbation of the rank of r, and parameterize this perturbation as "performing p random neighbor inversions". An inversion is simply switching two elements in a sequence, and a neighbor inversion switches two neighboring elements. Note that p = 0 indicates ρ = 1, because the perturbed rank sequence is identical to the original one; p → ∞ is effectively a random permutation, thus ρ = 0.
Algorithm 2: Eligibility Control
Input: Number of iterations N, number of arms M, risk level α, initial score threshold k_0, a pre-computed empirical dictionary G(n, α, ρ) → k.
1  Initialize threshold k := k_0 and observation pool D_e := ∅
2  for n = 1, ..., N do
3      Observe context x_n, and eligibility scores {e_1, ..., e_M}, one for each arm
4      Find the k-th largest element e_[k] among the scores
5      Sample a reward for each arm with e_m ≥ e_[k] (Algorithm 1, Lines 5-18)
6      Pull the arm m* with the highest reward estimate (Algorithm 1, Lines 21-22)
7      Observe the true reward r_n, and update D_e := D_e ∪ {(r_n, e_{m*})}
8      (Periodically) compute ρ̂ from D_e and set k := G(n, α, ρ̂)
9      Set k := M if k ≥ (1 − α)M
10 end
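A minimal sketch of the two EC-specific steps (the top-k score filter of Lines 4-5 and the periodic threshold update of Lines 8-9) might look as follows; the dictionary G is treated as an opaque lookup function, and the helper names are ours:

```python
import numpy as np
from scipy.stats import spearmanr

def eligible_arms(scores, k):
    # Arms whose eligibility score reaches the k-th largest score
    # (ties can keep more than k arms).
    kth = np.sort(scores)[-k]
    return np.flatnonzero(scores >= kth)

def update_threshold(history, G, n, alpha, M):
    # Re-estimate rho from the (reward, score) pairs gathered so far and
    # query the offline dictionary; fall back to k = M (no control) when
    # the score signal looks like pure noise.
    r, e = np.asarray(history, dtype=float).T
    rho_hat = spearmanr(r, e).correlation
    k = G(n, alpha, rho_hat)
    return M if k >= (1 - alpha) * M else k
```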




This setup provides a generative process to simulate the joint distribution of all parameters for a rank sequence of length n: (1) do p random neighbor inversions on the sorted sequence of ranks (1, ..., n); (2) see if there is a leak, namely whether element 1 is now at or beyond the k-th position; (3) compute ρ̂ between the original and current sequences. The three steps can be replicated a sufficient number of times to obtain the final correlation coefficient ρ̂ and risk level α̂, by averaging the individual ρ̂ and counting the frequency of leaks. For different combinations of (k, p, n), as the above process does not depend on any actual data, it can be performed offline in batch to get the corresponding (ρ̂, α̂). The resulting empirical dictionary G : (n, α, ρ) → k is stored, and later queried during online inference. To avoid unnecessary control in the edge case where the eligibility score turns out to be pure noise, k is reset to M (no EC) if G outputs a k greater than the trivial value (1 − α)M. Finally, note that the generative distribution of random neighbor inversions is assumed to be uniform along the whole sequence, yielding the unbiasedness of the estimator ρ̂ even if the observed data points {(r_i, e_i)}_i are censored by the top-k filter policy.

The full online EC process is described in Algorithm 2 (excluding the offline dictionary generation). The eligibility state interface is converted into the score interface before EC, so the scores are the only required inputs. The algorithm takes K-Boot (Algorithm 1) as the bandit counterpart just for demonstration, and applies to any bandit that has arm-independent reward models. EC has a single hyperparameter: the controlled risk of a leak, α. Note that α is the risk of missing the best arm, so its influence on cumulative rewards depends on the arm reward distribution, specifically the gap between the first and second best arms. We leave other types of risk control mechanisms with distributional assumptions to future work.
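A Monte Carlo sketch of one (k, p, n) cell of this offline computation, under the stated assumptions (uniform neighbor inversions, a leak defined on the rank-1 element), could be:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def simulate_cell(n, k, p, reps=2000):
    # Returns (rho_hat, alpha_hat) for one combination of (k, p, n).
    rhos, leaks = [], 0
    for _ in range(reps):
        seq = np.arange(n)                       # sorted rank sequence, 0 = best
        for _ in range(p):                       # p random neighbor inversions
            i = rng.integers(n - 1)
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
        leaks += int(np.argmax(seq == 0) >= k)   # leak: element 1 at/beyond slot k
        rhos.append(spearmanr(np.arange(n), seq).correlation)
    return float(np.mean(rhos)), leaks / reps
```

Sweeping p for each (n, k) and inverting the resulting (ρ̂, α̂) pairs then yields the dictionary G(n, α, ρ) → k queried in Algorithm 2.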
                                                                the eligibility scores are further simulated by perturbing
    The full online EC process is described in Algorithm 2
                                                                the counter-factual reward of each arm: 𝑒 := π‘€π‘Ÿ + πœ€β€² , to
(excluding the offline dictionary generation). The eligi-
                                                                induce the correlation. Here 𝑀 is an arm-specific scaling
bility state interface is converted into the score interface
                                                                coefficient draw from 𝒰(βˆ’0.1, 1). It has a small chance
before EC, so the scores are the only required inputs.
                                                                to leave 𝑒 and π‘Ÿ negatively correlated, as tolerating eligi-
The algorithm takes K-Boot (Algorithm 1) as the bandit
                                                                bility claim error in systems under the overall positive
counterpart just for demonstration, and applies to any
                                                                correlation - the only assumption of EC. The noise term
bandit that has arm-independent reward models. EC has
                                                                πœ€β€² ∼ 𝒩 (0, πœŽπ‘’2 ) controls the correlation effect size. By
a single hyperparameter: the controlled risk of a leak 𝛼.
                                                                setting a proper πœŽπ‘’ , the Spearman’s rank correlation co-
Note 𝛼 is the risk of missing the best arm, so its influence
                                                                efficient 𝜌 between reward and eligibility score can be
on cumulative rewards depends on the arm reward distri-
                                                                fixed at an arbitrary positive value.
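A compact numpy rendering of this generator (our sketch of the formulas above, with the cosine environment h3 picked as the example):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d = 10, 5000, 20

a = rng.standard_normal((M, d))               # arm-specific parameters
A = rng.standard_normal((M, d, d))
sigma_r = rng.uniform(0.01, 0.5, size=M)      # arm-specific noise levels

def h(x, m, kind=3):
    if kind == 0: return 0.1 * a[m] @ x                  # h0: linear
    if kind == 1: return 0.05 * (a[m] @ x) ** 2          # h1: quadratic
    if kind == 2: return 0.002 * x @ A[m].T @ A[m] @ x   # h2: inner-product
    return np.cos(3 * a[m] @ x)                          # h3: cosine

X = rng.standard_normal((N, d))               # contexts, i.i.d. N(0, 1)
reward = lambda n, m: h(X[n], m) + rng.normal(0, sigma_r[m])  # noisy observation
```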
Figure 2: Benchmarking K-Boot, LinUCB and NeuralUCB. Left half: synthetic datasets; right half: UCI datasets. Model
hyperparameters are set as the best on synthetic datasets. Metric is averaged across 10 runs with an 80% confidence band.



UCI Classification. Four real-world datasets from the UCI Machine Learning Repository [22] are gathered: covertype, magic, statlog, and mnist. These multi-class classification datasets are converted into multi-armed contextual bandit problems, following the method in [4]: each class is treated as an arm, and the reward is 1 if the bandit pulls the correct arm (identifies the correct class) and 0 otherwise. Each dataset is shuffled and split into 10 mutually exclusive runs with 5000 samples each, fed to the bandit one per round (magic has fewer than 50000 samples, so resampling is performed). A conversion sketch follows below.
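The classification-to-bandit conversion can be sketched as below; fetch_openml is our stand-in for the UCI download, and the shuffling details are an assumption:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Each class becomes an arm; pulling the arm that matches the true label
# yields reward 1, otherwise 0.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
classes = np.unique(y)
idx = np.random.default_rng(0).permutation(len(y))[:5000]   # one shuffled run

def bandit_round(i, pulled_arm):
    # Context for round i and the reward of the pulled arm.
    return X[idx[i]], float(classes[pulled_arm] == y[idx[i]])
```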
CS Routing with Eligibility. Agent skill-level routing is considered: CS agents are assigned to different skill groups by operations teams, and the routing system determines which skill is required to resolve a given customer contact based on the customer profile, policy rules and natural language understanding. As a proof of concept before any online experiments, we formulate an offline bandit dataset with historical observations. About 2 million Amazon US customer service text chats with bot-to-agent routing are gathered. To introduce stochasticity, a historical routing policy emulator is trained on a 95% random subset of the chats, taking the customer-bot transcripts before the routing action as input to predict which skill the system chose. Specifically, the emulator consists of a pretrained GPT-2 transformer encoder [23], with a multi-class sequence classifier head to predict the top 20 skill groups (see the sketch after this paragraph). The model is trained (fine-tuned) for one epoch with learning rate 10^-4 and batch size 32. For the bandit simulation, the final actual resolving skill, after all the potential agent-to-agent transfers since the initial routing, serves as the ground truth - if the routed skill agrees with the final skill, the action is considered accurate and to have avoided a potential transfer. At each round, the bandit receives a 256-dimensional text embedding from the same GPT-2 encoder as the observed context, and a 20-dimensional probabilistic prediction from the emulator as eligibility scores for each arm/skill. If the bandit chooses a skill matching the final one, the reward is 1 and otherwise 0. In this case, the cumulative regret is an upper bound on the number of transferred contacts, because agents assigned to different skills may resolve the contact regardless (e.g. digital order agents are often equipped to resolve retail order issues, so they may not transfer an incorrectly routed contact). The remaining 5% of chats are used to generate 10 bandit runs with 8000 samples each¹. We observe that the empirical ρ across the runs is 0.3492. We leave the following to future work: (1) online experiments to measure real transfers and positive customer reviews as the reward metric; (2) agent-level routing given agent profiles and finer-grained eligibility definitions than skill groups.
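A hedged sketch of such an emulator with the Hugging Face transformers library is below; the transcript string is hypothetical, and the fine-tuning loop (one epoch, learning rate 10^-4, batch size 32) is omitted:

```python
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# GPT-2 encoder with a 20-way sequence classification head over skill groups.
tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                 # GPT-2 has no native pad token
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=20)
model.config.pad_token_id = tok.pad_token_id

batch = tok(["bot: how can I help? customer: my device will not charge"],
            return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits            # (1, 20) skill-group logits
scores = torch.softmax(logits, dim=-1)        # 20-dim eligibility scores
```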
5.1. K-Boot Benchmarks

We compare K-Boot with LinUCB and NeuralUCB on both the synthetic and UCI classification datasets. With the 4 synthetic datasets, we first select the following hyperparameters for each Bandit algorithm of interest: (1) number of nearest neighbors k ∈ {20, 50, 100} for K-Boot; (2) weight parameter for exploration α ∈ {0.1, 1.0, 10} for LinUCB; (3) regularization parameter λ ∈ {0.1, 0.01, 0.001} and exploration parameter ν ∈ {0.2, 0.02, 0.002} for NeuralUCB - other hyperparameters such as learning rate, batch settings and number of layers are set the same as in the original code base².

¹ From another perspective, this offline bandit-with-eligibility simulation setup is equivalent to "using a bandit for online improvement of a black-box probabilistic classifier with unknown accuracy".
² https://github.com/uclaml/NeuralUCB
Figure 3: EC under different Spearman's Correlation ρ between eligibility score and reward, with different risk levels α. The bandit part is K-Boot and eligibility scores are not in the context.



We settle on k = 100, α = 10, (λ, ν) = (0.001, 0.002) for having the lowest cumulative regret averaged across h0-h3 and 10 replicated runs, and carry these values over to compare the actual model performance on the 4 UCI datasets. We believe this setup is more realistic, because most bandit applications do not have the luxury of intensive hyperparameter tuning due to small sample sizes or non-stationary environments. Finally, all experiments in this paper implement online, single-sample model updates. Figure 2 (left half) shows the performance of the models with the final selected hyperparameters. Except for the linear reward scenario, where LinUCB/NeuralUCB has the advantage of correct/close model specification, there is no significant performance difference between K-Boot and NeuralUCB, while LinUCB is inferior. Figure 2 (right half) shows the performance comparison on the UCI real-world datasets. K-Boot and NeuralUCB tie, each winning on two datasets, and LinUCB is consistently the worst. By examining the 80% confidence band (P10-P90 across 10 runs), we find NeuralUCB has larger performance volatility (wider bands) on real data than the other two models, due to learning instability in certain runs.

5.2. Eligibility Control Testing

For the synthetic data, we control σ_e to simulate different levels of reward-reflecting reliability in the eligibility scores, with resulting ρ(r, e) ∈ {0.05, 0.15, 0.3, 0.45, 0.6, 0.75} for each true reward function h0-h3. In the algorithm, we set different risk levels α in EC. Note that α = 0 is effectively not using EC, and we abuse the notation α → 1 to denote the strict top-1 policy: pull the arm with the highest eligibility score.

Figure 3 shows the results of applying EC to K-Boot, with eligibility scores used by EC only (explained later). When the signal in e is weak (low ρ), EC adaptively raises the top-k threshold close to M, achieving the same performance as no EC. As ρ grows, the advantage reveals itself: using EC is consistently better than not. With the right α, the dynamic thresholding may dominate the top-1 policy even under strong eligibility signals. EC performs best with α = 0.5 across all synthetic datasets, and the same value is carried over to the next experiment on the CS Routing data.

EC can be applied to other bandits. Figure 4 shows the same synthetic data results but with EC + LinUCB. Here, eligibility scores are also added to the bandit context. The reason the previous experiment did not take scores as K-Boot input is that k-NN has no native feature selection mechanism: if the eligibility score is close to white noise (low ρ), the model struggles to fit the reward, distracting from the other visual patterns presented. For LinUCB, such a side effect is minimal because the simulated noise is Gaussian, so it is a better condition to test the trivial way versus the EC way of utilizing eligibility. EC still dominates no-EC in most cases. There are a few exceptions for the linear true reward function h0, where α = 0.5 is worse because LinUCB learns so well that taking a high risk of missing the best arm is not cost-effective.
Figure 4: EC under different Spearman's Correlation ρ between eligibility score and reward, with different risk levels α. The bandit part is LinUCB and eligibility scores are part of the context.



An interesting observation is that, while the strict top-1 rule was dominated by K-Boot + EC (Figure 3), it actually beats LinUCB + EC for nonlinear true reward functions, even at ρ = 0.15. This indicates that the power of the bandit model is still a key driver for ML to outperform static rules.

Figure 5 shows the result on the CS Routing data, where EC significantly improves routing accuracy. Routing to the skill with the highest probability from the emulator serves as a stochastic surrogate of the existing policy. The pure K-Boot result, without knowing the emulator outputs, is set as the baseline. With only 8000 samples, the average accuracy of the emulator policy is improved by K-Boot + EC from 53.37% to 57.78%.

Figure 5: CS Routing Results. Metric is averaged across 10 runs with an 80% confidence band.


6. Conclusions

We proposed a nonparametric contextual Bandit algorithm, K-Boot, with arm-level eligibility control (EC) for routing customer contacts to eligible SMEs in real-time. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits. We compared K-Boot with LinUCB and NeuralUCB. When looking at average regret performance over simulation runs, K-Boot and NeuralUCB had a tie in winning scenarios and were comparable in terms of robustness to different data distributions. Both worked better than LinUCB. However, we observed larger performance variability for NeuralUCB as opposed to K-Boot. We further found the EC component improved K-Boot's performance on both the synthetic datasets and a simulated CS routing scenario when an eligibility signal exists.
References

[1] G. Elena, K. Milos, I. Eugene, Survey of multiarmed bandit algorithms applied to recommendation systems, International Journal of Open Information Technologies 9 (2021) 12–27.
[2] Y. Liu, L. Li, A map of bandits for e-commerce, arXiv preprint arXiv:2107.00680 (2021).
[3] T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.
[4] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661–670.
[5] S. Sajeev, J. Huang, N. Karampatziakis, M. Hall, S. Kochman, W. Chen, Contextual bandit applications in a customer support bot, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3522–3530.
[6] S. Upadhyay, M. Agarwal, D. Bounneffouf, Y. Khazaeni, A bandit approach to posterior dialog orchestration under a budget, arXiv preprint arXiv:1906.09384 (2019).
[7] O. Chapelle, L. Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems 24 (2011).
[8] T. Graepel, J. Quiñonero Candela, T. Borchert, R. Herbrich, Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine, in: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track, 2010.
[9] I. Osband, B. Van Roy, Bootstrapped Thompson sampling and deep exploration, arXiv preprint arXiv:1507.00300 (2015).
[10] B. Kveton, C. Szepesvari, S. Vaswani, Z. Wen, T. Lattimore, M. Ghavamzadeh, Garbage in, reward out: Bootstrapping exploration in multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2019, pp. 3601–3610.
[11] C. Riou, J. Honda, Bandit algorithms based on Thompson sampling for bounded reward distributions, in: Algorithmic Learning Theory, PMLR, 2020, pp. 777–826.
[12] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, arXiv preprint arXiv:0912.3995 (2009).
[13] S. R. Chowdhury, A. Gopalan, On kernelized multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 844–853.
[14] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835 (2013).
[15] B. M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74 (2009) 235–255.
[16] G. Biau, F. Cérou, A. Guyader, On the rate of convergence of the bagged nearest neighbor estimate, Journal of Machine Learning Research 11 (2010).
[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.
[18] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.
[19] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: International Conference on Machine Learning, PMLR, 2020, pp. 3887–3896.
[20] B. W. Silverman, Density estimation for statistics and data analysis, Routledge, 2018.
[21] D. Zhou, L. Li, Q. Gu, NeuralUCB: Contextual bandits with neural network-based exploration (2019).
[22] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.