=Paper=
{{Paper
|id=Vol-3294/long2
|storemode=property
|title=A Nonparametric Contextual Bandit with Arm-level Eligibility Control for Customer Service Routing
|pdfUrl=https://ceur-ws.org/Vol-3294/long2.pdf
|volume=Vol-3294
|authors=Ruofeng Wen,Wenjun Zeng,Yi Liu
|dblpUrl=https://dblp.org/rec/conf/recsys/WenZL22
}}
==A Nonparametric Contextual Bandit with Arm-level Eligibility Control for Customer Service Routing==
Ruofeng Wen1,*, Wenjun Zeng1 and Yi Liu1
1 Customer Engagement Technology, Amazon
4th Edition of Knowledge-aware and Conversational Recommender Systems (KaRS) Workshop @ RecSys 2022, September 18-23, 2022, Seattle, WA, USA.
* Corresponding author.
ruofeng@amazon.com (R. Wen); zengwenj@amazon.com (W. Zeng); yiam@amazon.com (Y. Liu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract
Amazon Customer Service (CS) provides real-time support for millions of customer contacts every year. While bot resolvers help automate some traffic, we still see high demand for human agents, also called subject matter experts (SMEs). Customers reach out with questions in different domains (return policy, device troubleshooting, etc.). Depending on their training, not all SMEs are eligible to handle all contacts. Routing contacts to eligible SMEs turns out to be a non-trivial problem because SMEs' domain eligibility is subject to training quality and can change over time. To optimally recommend SMEs while simultaneously learning the true eligibility status, we propose to formulate the routing problem with a nonparametric contextual bandit algorithm (K-Boot) plus an eligibility control (EC) algorithm. K-Boot models reward with a kernel smoother on similar past samples selected by k-NN, and uses Bootstrap Thompson Sampling for exploration. EC filters arms (SMEs) by the initially system-claimed eligibility and dynamically validates the reliability of this information. The proposed K-Boot is a general bandit algorithm, and EC is applicable to other bandits. Our simulation studies show that K-Boot performs on par with state-of-the-art bandit models, and EC boosts K-Boot performance when a stochastic eligibility signal exists.
Keywords
Bandit, Customer Service Routing, Arm Eligibility, Nonparametric
1. Introduction
In Amazon Customer Service (CS), we dispatch human agents, also called subject matter experts (SMEs), in real-time to handle millions of customer contacts. The SME routing automation process has two steps: first, a natural language understanding (NLU) model processes the customer's input and identifies the relevant domain (return policy, device troubleshooting, etc.); then an eligible SME is dispatched. We define an SME as eligible when they have mastered the required skill in the relevant domain through training. SME routing turns out to be a non-trivial problem for four reasons. First, the NLU model is unlikely to categorize the domain with perfect accuracy. Second, the reliability of SME eligibility as identified by the operations team is subject to training program quality and readiness. Third, the domain taxonomy and SMEs' eligibility status can change in a decentralized way. In reality, it is difficult to keep track of all the eligibility updates, even assuming they are correct. Finally, eligible SMEs do not guarantee customer satisfaction, leading to noisy feedback signals. All these uncertainties make SME routing a challenging problem.
Figure 1: The CS Routing Problem. The goal of the recommender is to select an agent that will result in the best outcome of this customer contact.
To tackle the complexity, we formulate CS routing as a recommendation problem and use contextual Bandit as a solution. Bandit, a framework for sequential decision making, has been used for online recommendation across companies such as Amazon, Google, Netflix and Yahoo [1]. In contextual bandit, the decision maker sequentially chooses an arm/action (an SME in our case) based on available contextual information (contact transcript embedding, customer profile, etc.), and observes a reward signal (customer satisfaction, contact transfer, etc.) [2]. The objective is to recommend arms at each step to maximize the expected cumulative reward over time.
We apply bandit to the routing problem because it can optimally explore uncertainties and adapt to dynamic changes [3]. In particular, the uncertainties in SME eligibility motivate us to formulate a new type of bandit that utilizes arm-level eligibility signals.
Our contribution. We propose K-Boot, a nonparametric contextual bandit algorithm to model reward and explore, and an Eligibility Controller (EC) algorithm to model arm-level eligibility. K-Boot uses a k-nearest neighbors (k-NN) approach to find similar samples from the past, applies a Nadaraya-Watson kernel regression among the found samples to estimate the reward, and adopts Bootstrap Thompson Sampling as the exploration strategy. The EC component implements a dynamic top-arm filter based on the estimated Spearman's Rank Correlation Coefficient between eligibility and reward. We are interested in this end-to-end nonparametric setup for practicality: robustness in performance, strong interpretability, and simple non-technical maintenance that is friendly to business partners. For instance, the k-NN component makes it easy to investigate which historical contacts support the SME recommendation, and then deploy instant online fixes by trimming unwanted outliers. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm-eligibility for other bandits, such as LinUCB [4].
In the remainder of the paper, we review related literature in Section 2, set up the formal problem in Section 3 and detail the proposed algorithms in Section 4. We then present our model results in Section 5, and conclude the paper last.

2. Related Work

In most bandit applications, the items to recommend are products and webpage content (widgets, news articles, etc.). Recently, we see research efforts in using bandit for routing recommendation in a conversation [5, 6]. In [6], they used contextual bandit to utilize query and user features along with the conversational context and build an orchestration to route the dialog. In [5], they built bandit models to assign solutions to a chatbot given customers' queries on Microsoft's products. This is the most relevant work to our problem. However, they assume an upstream model to provide a set of plausible actions to start with, and the bandit itself does not deal with the uncertainty in arm eligibility. Overall, compared to the extensive bandit literature for product and webpage content recommendation, few bandit works exist for routing recommendation.
Thompson Sampling (TS) is a heuristic for balancing exploration and exploitation in bandits [7]. In the parametric manner, TS is implemented by drawing the reward function parameter θ from a posterior distribution, calculating rewards with the sampled parameters, and selecting the arm that maximizes the estimated reward function. θ is often assumed to follow Conjugate-Exponential Family distributions, resulting in closed-form posteriors. There is also work where variational methods are used to provide an analytical approximation to the posterior and enforce tractability [8]. Bootstrap TS introduces volatility in a different way: it pulls the arm with the highest bootstrap mean, estimated from the reward history in a nonparametric way [9]. One drawback of the approach is the computational cost of training one bandit model per Bootstrapped training set, which is difficult for large-scale real-time problems. [10] proposed a general nonparametric bandit algorithm but fell back to the parametric reward function approach when there is context. [11] further improved the exploration efficiency with Bayesian Bootstrap, but also only for non-contextual bandits. We pair Bootstrap TS with a single k-NN for nonparametric contextual exploration.
Kernel-based Bandit uses a nonparametric reward estimator via kernel functions, combined with an exploration policy like TS or UCB. Gaussian Processes (GP) have wide applications in the bandit domain, e.g. GP-UCB [12] and GP-TS [13]. However, GP inference is expensive - the standard approach has O(N^3) complexity, where N is the number of training samples. The most efficient sparse or low-rank GP approaches still require O(N D^2), where D is a hyperparameter indicating the number of inducing variables - see [14] for a review. We use k-NN to enable O(N log N) inference time.

3. Formulation: Contextual Bandit with Arm Eligibility

We define two types of eligibility setup with different data interfaces. (1) Eligibility Scores: for a given context x, a list of eligibility scores, one per arm, e = (e_1, e_2, ...) is observed, where e_i ∈ [0, 1] is for arm a_i. A larger score means higher confidence in the arm being eligible for the current sample/iteration. An intuitive interpretation is that the arms are bidding for themselves. This score setup assumes (the owner of) each arm is able to adaptively assess the dynamic environment in a distributed and independent manner that can be unknown to the bandit. Typical examples include ads bidding and a voice assistant's skill detection - each ad/skill may be managed via the same API, but by different clients under dynamic competition. (2) Eligibility States: assume the system at any time can be classified into one of L possible states. Each arm claims whether it is eligible under a state, represented by a constant binary vector c_i of length L for arm a_i. If the l-th element of c_i is 1, arm a_i claims to be eligible under the l-th state, and 0 otherwise. Given a context x, a stochastic distribution over the L states is observed, represented by a probability vector p of length L. For CS routing, the state is the type of customer issue, the claim is the SME's corresponding skill-set, and p is the NLU predictive distribution. Note that a score can be computed as the inner product of the claimed eligibility binary vector and the state probability vector: e_i := c_i^T p, so the state setup reduces to the score setup. The major practical difference between the two interfaces is the amount of effort on the claiming side: the state interface simplifies the eligibility claims to static binary votes on a finite set of states, while the score interface mandates scoring with context awareness.
Consider a contextual bandit with a set of arms {a_1, a_2, ...}. At the t-th round, context x_t is observed and the eligibility score vector e_t is either observed (score interface) or derived (state interface). After an arm is pulled, a reward r_t ∈ [0, 1] is revealed. The goal is to minimize the expected regret over T rounds: $\sum_{t=1}^{T} \big(\mathbb{E}[r \mid a_t^*, x_t, e_t] - \mathbb{E}[r \mid a_t, x_t, e_t]\big)$, where a_t^* is the optimal action for this round in hindsight.
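To make the state-to-score conversion and the regret objective concrete, a minimal NumPy sketch follows. The claim matrix, the NLU distribution and the helper name cumulative_regret are illustrative only, not taken from the paper's implementation.

```python
import numpy as np

# State interface: each arm i holds a static binary claim vector c_i over L states;
# at each round the NLU model emits a probability vector p over the same L states.
claims = np.array([[1, 0, 1],    # arm 1 claims states 1 and 3
                   [0, 1, 1],    # arm 2 claims states 2 and 3
                   [1, 1, 0]])   # arm 3 claims states 1 and 2
state_probs = np.array([0.7, 0.2, 0.1])  # p from the NLU model, sums to 1

# Convert the state interface to the score interface: e_i := c_i^T p
scores = claims @ state_probs
print(scores)  # [0.8, 0.3, 0.9]

def cumulative_regret(oracle_rewards, chosen_rewards):
    """Sum_t E[r | a_t^*, x_t, e_t] - E[r | a_t, x_t, e_t], given per-round expected rewards."""
    return float(np.sum(np.asarray(oracle_rewards) - np.asarray(chosen_rewards)))
```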
4. Methodology

The proposed algorithms run in order: the Eligibility Controller (EC) filters out ineligible arms, and K-Boot finds the optimal one within the rest. Below, we first introduce K-Boot to set the base, and then EC.

4.1. K-Boot

K-Boot is a nonparametric bandit model. For a given context x, it estimates the reward for arm a as the Nadaraya-Watson kernel weighted average over the observed rewards of the K nearest neighbors of x among all historical samples where arm a was pulled. We use Bootstrap TS as the exploration strategy. k-NN regression is known to fit well with Bootstrap-Aggregation, with an analytical form of the smoothed conditional mean estimate that does not require actual resampling [15, 16]. However, sampling from the confidence distribution of the mean estimate has seen little discussion. We devise a trick to shrink the resampling range from all historical samples to a local wrapper around the K-NN.
K-Boot is detailed in Algorithm 1. At iteration t with context x_t, for the i-th arm a_i, let S_i be the set of historical samples where a_i was pulled, and N_i = |S_i|. If N_i is zero, we sample the reward from a standard uniform distribution (Line 6). If N_i is greater than K, we first build an influential subset S_i' containing the K'-NN of x_t (K < K'). Intuitively, the K'-NN serves as a buffer to cover enough sample variance around the K-NN, so that a Bootstrap on the influential samples is a good approximation of a Bootstrap on all samples. Formally, a sample is considered influential for making inference about x_t if its probability of being in the K-NN of x_t after a Bootstrap on S_i is greater than ε. This probability is computed analytically on Line 9, based on the derived equation (6) in [15]. Here the tolerance hyperparameter ε controls the risk of missing influential samples and thus the fidelity of the approximated Bootstrap TS. In our experiments with N_i ≤ 10^4 and ε = 0.01, K' is empirically well bounded by 2K. This process shrinks all later computation to K' samples, with an overhead neighbor-search cost of O(log N_i) per sample (we used Hnswlib [17] for approximated search). If N_i is no greater than K, S_i itself is the influential set. We then add two pseudo-samples to S_i' on Lines 14-15, in order to expand the reward's empirical range for Bootstrap exploration [10]. To give the pseudo-samples proper contexts and thus kernel weights, we set their distance to x as that of a random observation in S_i', so their weight shrinks as more data are seen. On Lines 16-18, we then draw a Bootstrap resample and select the K nearest samples from it to calculate the estimated reward for a_i as the kernel weighted average. We implement K_h(·, ·) as the simplest O(K) Nadaraya-Watson kernel estimator with automatic bandwidth selection [20] - more advanced models can be plugged in.
As we model the reward for each arm separately, as is common practice, K-Boot is flexible to the addition and removal of arms. This benefits applications like ours with decentralized and independent arm management. In addition, K-Boot has simple maintenance: (1) the nonparametric model has minimal assumptions on the data and only a single important hyperparameter K to balance between accuracy and computation; ε = 0.01 works well across our experiments. (2) The algorithm is business-friendly in terms of decision interpretability through neighboring examples, and it allows intuitive instant online modification of model behaviors. For example, CS business owners may hide contacts with undesired trending patterns during a specific period (e.g. under a legacy policy), or let newly available SMEs "inherit" past example contacts or eligibility claims (next section) via domain knowledge (e.g. a rehired agent or additional training programs). The changes in data visibility will instantly and predictably affect the bandit model in production, with clear attribution. This data-driven while human-in-the-loop fungibility is essential to fast-paced operational businesses like customer service.
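As a rough illustration of the approximate neighbor search mentioned above, the sketch below builds one Hnswlib index per arm's sample pool. The index parameters (ef_construction, the graph parameter M, ef) and the data are placeholder choices; the paper does not report its exact settings.

```python
import hnswlib
import numpy as np

dim, max_elements = 20, 10_000
index = hnswlib.Index(space="l2", dim=dim)            # one index per arm's sample pool S_i
# note: Hnswlib's graph parameter M is unrelated to the paper's M (number of arms)
index.init_index(max_elements=max_elements, ef_construction=200, M=16)

X_i = np.random.rand(500, dim).astype(np.float32)     # contexts of past samples where arm i was pulled
index.add_items(X_i, np.arange(len(X_i)))
index.set_ef(100)                                      # query-time accuracy/speed trade-off

x_t = np.random.rand(1, dim).astype(np.float32)        # current context
k_prime = 64                                           # K' influential neighbors
labels, distances = index.knn_query(x_t, k=k_prime)    # ids into S_i and their distances
```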
Algorithm 1: K-Boot: a fully nonparametric contextual bandit
Input: Number of iterations T, number of arms M, number of nearest neighbors K, kernel function with bandwidth K_h(·, ·), regularized incomplete beta function F_{α,β}(·), approximation tolerance ε.
1 Initialize sample pool S_i := ∅ and its size N_i := 0, for each arm i = 1, ..., M
2 for t = 1, ..., T do
3   Observe context x_t
4   for i = 1, ..., M do
5     if N_i = 0 then
6       Estimate reward: r̂_{i,t} ∼ U(0, 1)
7     else
8       if N_i > K then
9         $K' := \min(k')$, s.t. $\sum_{j=1}^{K} \big[ F_{j,\, N_i - j + 1}(\tfrac{k'}{N_i}) - F_{j,\, N_i - j + 1}(\tfrac{k' - 1}{N_i}) \big] > 1 - \epsilon$
10        Find influential neighbors S_i' := the top K' samples in S_i, with the largest K_h(x_t, ·)
11      else
12        Set S_i' := S_i
13      end
14      Draw a random sample (x°, r°) from S_i'
15      Add pseudo-samples: S_i' := S_i' ∪ {(x°, 0), (x°, 1)}
16      Draw a Bootstrap sample S_i* from S_i'
17      Find neighbors S*_{i,K} := the top min(K, |S_i*|) samples in S_i*, with the largest K_h(x_t, ·)
18      Estimate reward: $\hat{r}_{i,t} := \sum_j r_j K_h(x_t, x_j) / \sum_j K_h(x_t, x_j)$, summing over S*_{i,K}
19    end
20  end
21  Pull arm i* := arg max_i r̂_{i,t}, and observe the true reward r_t ∈ [0, 1]
22  Update S_{i*} := S_{i*} ∪ {(x_t, r_t)} and N_{i*} := N_{i*} + 1
23 end
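A compact Python sketch of the per-arm estimation step (Lines 8-18) may help readers follow Algorithm 1. It simplifies two pieces relative to the paper: K' is approximated by the empirical 2K bound rather than the exact Beta-function rule, and the kernel bandwidth h is fixed instead of automatically selected.

```python
import numpy as np

rng = np.random.default_rng(0)

def kboot_estimate(x_t, X_pool, r_pool, K=20, h=1.0):
    """One K-Boot reward draw for a single arm; X_pool/r_pool are the contexts and
    rewards in S_i. A simplified sketch, not the paper's exact implementation."""
    X_pool = np.asarray(X_pool, dtype=float)
    r_pool = np.asarray(r_pool, dtype=float)
    n = len(r_pool)
    if n == 0:
        return rng.uniform(0.0, 1.0)                     # Line 6: no data yet
    dists = np.linalg.norm(X_pool - x_t, axis=1)
    k_prime = min(n, 2 * K)                              # crude stand-in for the exact K' rule (Line 9)
    idx = np.argsort(dists)[:k_prime]                    # influential subset S_i'
    d, r = dists[idx], r_pool[idx]
    j = rng.integers(len(idx))                           # Lines 14-15: pseudo-samples placed at a random
    d = np.append(d, [d[j], d[j]])                       # observation's distance, with rewards 0 and 1
    r = np.append(r, [0.0, 1.0])
    boot = rng.integers(len(r), size=len(r))             # Line 16: Bootstrap resample
    order = np.argsort(d[boot])[:K]                      # Line 17: K nearest within the resample
    w = np.exp(-0.5 * (d[boot][order] / h) ** 2)         # Line 18: Nadaraya-Watson weights
    return float(np.sum(w * r[boot][order]) / np.sum(w))

# usage sketch on random data
X, rwd = rng.random((200, 5)), rng.random(200)
print(kboot_estimate(rng.random(5), X, rwd))
```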
4.2. Eligibility Control

EC is used to recognize arm eligibility and the associated uncertainties, for optimized bandit exploration. In an ordinary contextual bandit model, the eligibility information can be trivially treated as part of the context x - it is assumed to be positively correlated with r, and thus predictive as a plain input to a reward estimator. This assumes the reward model is able to eventually learn the relation between e and r after enough rounds. This trivial approach does not directly utilize eligibility to limit the range of arm exploration, and may unnecessarily explore inappropriate actions with catastrophic regret during cold-starts. The other extreme is to strictly follow the eligibility information and ignore the reward feedback: for the eligibility score interface, this could be simply pulling the arm with the highest score as if it were an oracle; for the eligibility state interface, this could be finding the state with the highest probability and then removing arms that did not claim that state. In that case, the empirical data would never be used to validate and correct the potentially biased business logic - a common pitfall in practice.
The main idea of EC is to keep only the top-m arms with the highest eligibility scores before applying a normal bandit, and to adjust m periodically based on the empirical correlation between eligibility scores and rewards. We use Spearman's Rank Correlation Coefficient, denoted as ρ. Using the top-m arms is a heuristic trade-off between m = M (the trivial case; M is the total number of arms) and m = 1 (the strict rule case). Intuitively, in the case of perfect correlation between score and reward (ρ = 1), m = 1 is the oracle solution. If the eligibility score has zero or even negative correlation (ρ ≤ 0) with reward, m = M is the optimal solution - otherwise the arm exploration is restricted adversarially or randomly for no benefit. In the no-correlation case, the probability of "the best arm by reward is not in the top-m by score", defined as a leak, is linear in m: P(leak | m, ρ = 0) = 1 - m/M. It is obvious that P(leak | m, 0 < ρ < 1) < P(leak | m, ρ = 0). This observation reveals two insights: (1) α := P(leak | m, ρ) characterizes a type of risk in applying a top-m score filter to arms; (2) when controlling the risk of a leak at a certain level, e.g. α = 0.01, m is a function of ρ. The true correlation ρ between reward r and eligibility score e is unknown, but can be estimated from historical observations {(e_t, r_t)}_t. Therefore, EC essentially calculates m such that the risk P(leak | m, ρ̂) is controlled at a given level.
To pair with K-Boot, EC also models P(leak | m, ρ) in a nonparametric fashion to avoid extra assumptions. For ρ, although Spearman's rank correlation coefficient is used for estimation, Kendall's τ or any other rank-based correlation measure applies similarly. Across the arms and rounds, we assume the rank of e is a noisy perturbation of the rank of r, and parameterize this perturbation as "performing q random neighbor inversions". An inversion is simply switching two elements in a sequence, and a neighbor inversion switches two neighboring elements. Note that q = 0 indicates ρ = 1 because the perturbed rank sequence is identical to the original one; q → ∞ is effectively a random permutation, thus ρ = 0. This setup provides a generative process to simulate the joint distribution of all parameters for a rank sequence of length M: (1) do q random neighbor inversions on the sorted sequence of ranks (1, ..., M); (2) check whether there is a leak, namely whether element 1 is now at or beyond the m-th position; (3) compute ρ̂ between the original and current sequences. The three steps can be replicated a sufficient number of times to obtain the final correlation coefficient ρ̂ and risk level α̂, by averaging the individual ρ̂ and counting the frequency of leaks. Since the above process does not depend on any actual data, it can be performed offline in batch for different combinations of (M, m, q) to get the corresponding (ρ̂, α̂). The resulting empirical dictionary G : (M, α, ρ) → m is stored, and later queried during online inference. To avoid unnecessary control in the edge case where the eligibility score turns out to be pure noise, m is reset to M (no EC) if G outputs an m greater than the trivial value (1 - α)M. Finally, note that the generative distribution of random neighbor inversions is assumed to be uniform along the whole sequence, yielding unbiasedness of the estimator ρ̂ even if the observed data points {(e_t, r_t)}_t are censored by the top-m filter policy.
The full online EC process is described in Algorithm 2 (excluding the offline dictionary generation). The eligibility state interface is converted into the score interface before EC, so the scores are the only required inputs. The algorithm takes K-Boot (Algorithm 1) as the bandit counterpart just for demonstration, and applies to any bandit that has arm-independent reward models. EC has a single hyperparameter: the controlled risk of a leak, α. Note that α is the risk of missing the best arm, so its influence on cumulative rewards depends on the arm reward distribution, specifically the gap between the first and second best arm. We leave other types of risk control mechanisms with distributional assumptions to future work.

Algorithm 2: Eligibility Control
Input: Number of iterations T, number of arms M, risk level α, initial score threshold m_0, a pre-computed empirical dictionary G(M, α, ρ) → m
1 Initialize threshold m := m_0 and observation pool D := ∅
2 for t = 1, ..., T do
3   Observe context x_t, and eligibility scores {e_1, ..., e_M}, one for each arm
4   Find the m-th largest element e_[m] among the scores
5   Sample a reward estimate for each arm i with e_i ≥ e_[m] (Algorithm 1, Lines 5-18)
6   Pull the arm i* with the highest reward estimate (Algorithm 1, Lines 21-22)
7   Observe the true reward r_t, and update D := D ∪ {(r_t, e_{i*})}
8   (Periodically) compute ρ̂ from D and set m := G(M, α, ρ̂)
9   Set m := M if m ≥ (1 - α)M
10 end
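The offline generation of the dictionary G that Algorithm 2 consumes can be sketched as a small Monte-Carlo routine. The replication count, the grid of q values and the rounding of ρ̂ below are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def simulate(M, q, n_rep=2000):
    """Estimate Spearman's rho and the leak rate for every top-m cut, for a rank
    sequence of length M perturbed by q random neighbor inversions."""
    rhos, leak_counts = [], np.zeros(M)
    base = np.arange(M)                                  # sorted ranks; element 0 is the best
    for _ in range(n_rep):
        seq = base.copy()
        for _ in range(q):                               # (1) q random neighbor inversions
            j = rng.integers(M - 1)
            seq[j], seq[j + 1] = seq[j + 1], seq[j]
        pos_best = int(np.where(seq == 0)[0][0])
        leak_counts[:pos_best] += 1                      # (2) leak for every m <= position of the best element
        rhos.append(spearmanr(base, seq)[0])             # (3) rho between original and perturbed sequences
    return float(np.mean(rhos)), leak_counts / n_rep     # leak rate indexed by m - 1

def build_G(M, alphas, qs=range(0, 200, 5)):
    """Empirical dictionary G(M, alpha, rho) -> m: smallest m whose leak risk is below alpha."""
    G = {}
    for q in qs:
        rho, leak = simulate(M, q)
        for a in alphas:
            m = int(np.argmax(leak <= a)) + 1
            G[(M, a, round(rho, 2))] = m
    return G

G = build_G(M=10, alphas=[0.01, 0.1, 0.5])
```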
5. Experiments

In this section, we evaluate the performance of K-Boot with two benchmarks: LinUCB [4] and NeuralUCB [21]. The experiment setup mostly follows the methodology reported in [21], with 4 synthetic simulations and 4 real-world datasets. EC is tested on synthetic data with an eligibility score setup, then on an Amazon Customer Service routing dataset. The datasets are introduced below.
Synthetic Datasets. The bandit has M = 10 arms, and runs on T = 5000 samples/rounds. It observes a context with d = 20 dimensions, each following an independent N(0, 1). Four types of true reward functions are tested - (1) linear: h_0(x) = 0.1 x^T a; (2) quadratic: h_1(x) = 0.05 (x^T a)^2; (3) inner-product: h_2(x) = 0.002 x^T A^T A x; (4) cosine: h_3(x) = cos(3 x^T a); where each entry of the parameters A ∈ R^{d×d} and a ∈ R^d is drawn from an independent N(0, 1), differently for each arm. The observed reward has noise: r := h(x) + ξ, where ξ ∼ N(0, σ^2) and σ is drawn from U(0.01, 0.5) independently for each arm. 20 runs are replicated with different random seeds.
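A short sketch of this synthetic generator, with the four true reward functions written out, is shown below. Parameter names follow the paper; the random seed and array layout are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 20, 10                                    # context dimension and number of arms

# Per-arm parameters for the four true reward functions
a = rng.standard_normal((M, d))
A = rng.standard_normal((M, d, d))

def true_reward(x, i, kind):
    """h_0..h_3 from the paper, for arm i and context x."""
    if kind == 0:
        return 0.1 * x @ a[i]                     # linear
    if kind == 1:
        return 0.05 * (x @ a[i]) ** 2             # quadratic
    if kind == 2:
        return 0.002 * x @ A[i].T @ A[i] @ x      # inner-product
    return np.cos(3 * x @ a[i])                   # cosine

sigma = rng.uniform(0.01, 0.5, size=M)            # per-arm noise scale
x = rng.standard_normal(d)                        # one context draw
r = true_reward(x, i=0, kind=2) + rng.normal(0, sigma[0])   # observed noisy reward
```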
generative distribution of random neighbor inversions is
where π βΌ π© (0, ππ2 ) and ππ is drawn from π°(0.01, 0.5)
assumed to be uniform along the whole sequence, yield-
independently for each arm. 20 runs are replicated by
ing the unbiasedness of the estimator π Λ even if the ob-
random seeds.
served data points {(ππ , ππ )}π are censored by the top-π
Synthetic Datasets with Eligibility. Given above,
filter policy.
the eligibility scores are further simulated by perturbing
The full online EC process is described in Algorithm 2
the counter-factual reward of each arm: π := π€π + πβ² , to
(excluding the offline dictionary generation). The eligi-
induce the correlation. Here π€ is an arm-specific scaling
bility state interface is converted into the score interface
coefficient draw from π°(β0.1, 1). It has a small chance
before EC, so the scores are the only required inputs.
to leave π and π negatively correlated, as tolerating eligi-
The algorithm takes K-Boot (Algorithm 1) as the bandit
bility claim error in systems under the overall positive
counterpart just for demonstration, and applies to any
correlation - the only assumption of EC. The noise term
bandit that has arm-independent reward models. EC has
πβ² βΌ π© (0, ππ2 ) controls the correlation effect size. By
a single hyperparameter: the controlled risk of a leak πΌ.
setting a proper ππ , the Spearmanβs rank correlation co-
Note πΌ is the risk of missing the best arm, so its influence
efficient π between reward and eligibility score can be
on cumulative rewards depends on the arm reward distri-
fixed at an arbitrary positive value.
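The eligibility score simulation can be sketched as follows; the grid search for a σ_e that hits a target Spearman correlation is an illustrative choice, since the paper does not describe how σ_e was calibrated.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def simulate_scores(rewards, w, sigma_e):
    """Eligibility scores e := w*r + noise for one arm's counter-factual rewards."""
    return w * rewards + rng.normal(0.0, sigma_e, size=len(rewards))

rewards = rng.uniform(size=5000)
w = rng.uniform(-0.1, 1.0)          # occasionally negative by design (tolerated claim error)
target_rho = 0.45
# pick the sigma_e whose resulting Spearman rho is closest to the target
best = min((abs(spearmanr(rewards, simulate_scores(rewards, w, s))[0] - target_rho), s)
           for s in np.linspace(0.01, 2.0, 50))
print("sigma_e giving rho closest to", target_rho, ":", best[1])
```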
Figure 2: Benchmarking K-Boot, LinUCB and NeuralUCB. Left half: synthetic datasets; right half: UCI datasets. Model
hyperparameters are set as the best on synthetic datasets. Metric is averaged across 10 runs with an 80% confidence band.
UCI Classification. Four real-world datasets from the UCI Machine Learning Repository [22] are gathered: covertype, magic, statlog, and mnist. These multi-class classification datasets are converted into multi-armed contextual bandit problems, following the method in [4]: each class is treated as an arm, and the reward is 1 if the bandit pulls the correct arm (identifies the correct class) and 0 otherwise. Each dataset is shuffled and split into 10 mutually exclusive runs with 5000 samples each, fed to the bandit one per round (magic has fewer than 50000 samples, so resampling is performed).
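A minimal sketch of this classification-to-bandit conversion, assuming NumPy arrays in place of the actual UCI data:

```python
import numpy as np

def classification_to_bandit(X, y, rng):
    """Turn a multi-class dataset into a contextual bandit stream, following [4]:
    each class is an arm; pulling the arm equal to the true label yields reward 1, else 0."""
    order = rng.permutation(len(y))                   # shuffle once per run
    for t in order[:5000]:                            # 5000 rounds per run, as in the paper
        context, label = X[t], int(y[t])
        yield context, (lambda arm, truth=label: 1.0 if arm == truth else 0.0)

# usage sketch with random stand-ins for a UCI dataset
rng = np.random.default_rng(0)
X = rng.standard_normal((60_000, 54))
y = rng.integers(0, 7, size=60_000)
for x_t, reward_fn in classification_to_bandit(X, y, rng):
    chosen_arm = 0                                    # a bandit policy would choose this
    r_t = reward_fn(chosen_arm)
```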
CS Routing with Eligibility. Agent skill-level routing is considered: CS agents are assigned to different skill groups by operations teams, and the routing system determines which skill is required to resolve a given customer contact based on the customer profile, policy rules and natural language understanding. As a proof of concept before any online experiments, we formulate an offline bandit dataset with historical observations. About 2 million Amazon US customer service text chats with bot-to-agent routing are gathered. To introduce stochasticity, a historical routing policy emulator is trained on a 95% random subset of the chats, taking the customer-bot transcripts before the routing action as input to predict which skill the system chose. Specifically, the emulator consists of a pretrained GPT-2 transformer encoder [23], with a multi-class sequence classifier head to predict the top 20 skill groups. The model is trained (fine-tuned) for one epoch with learning rate 10^-4 and batch size 32. For the bandit simulation, the final actual resolving skill, after all the potential agent-to-agent transfers since the initial routing, serves as the ground truth - if the routed skill agrees with the final skill, the action is considered accurate and to have avoided a potential transfer. At each round, the bandit receives a 256-dimensional text embedding from the same GPT-2 encoder as the observed context, and a 20-dimensional probabilistic prediction from the emulator as eligibility scores for each arm/skill. If the bandit chooses a skill matching the final one, the reward is 1 and otherwise 0. In this case, the cumulative regret is an upper bound on the number of transferred contacts, because agents assigned to different skills may resolve the contact regardless (e.g. digital order agents are often equipped to resolve retail order issues, so they may not transfer an incorrectly routed contact). The remaining 5% of chats are used to generate 10 bandit runs with 8000 samples each¹. We observe that the empirical ρ across the runs is 0.3492. We leave the following to future work: (1) online experiments to measure real transfers and positive customer reviews as the reward metric; (2) agent-level routing given agent profiles and finer-grained eligibility definitions than skill groups.
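A hedged sketch of such an emulator is shown below, using the Hugging Face transformers GPT-2 encoder with a projection to a 256-dimensional context and a 20-way skill head. The pooling strategy and the projection layer are assumptions; the paper does not specify these details.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, GPT2Model

class RoutingEmulator(nn.Module):
    """Illustrative stand-in for the routing-policy emulator, not the paper's exact model."""
    def __init__(self, n_skills=20, embed_dim=256):
        super().__init__()
        self.encoder = GPT2Model.from_pretrained("gpt2")
        self.project = nn.Linear(self.encoder.config.n_embd, embed_dim)  # 256-d context for the bandit
        self.head = nn.Linear(embed_dim, n_skills)                       # skill-group classifier

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # mean-pool over non-padding tokens (an assumed pooling choice)
        pooled = (hidden * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(1, keepdim=True)
        embedding = self.project(pooled)                     # observed context x_t
        return embedding, self.head(embedding).softmax(-1)   # eligibility scores over skills

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token                    # GPT-2 has no pad token by default
batch = tokenizer(["customer: my kindle will not charge ..."], return_tensors="pt", padding=True)
model = RoutingEmulator()
x_t, skill_probs = model(batch["input_ids"], batch["attention_mask"])
```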
5.1. K-Boot Benchmarks

We compare K-Boot with LinUCB and NeuralUCB on both the synthetic and the UCI classification datasets. On the 4 synthetic datasets, we first select the following hyperparameters for each bandit algorithm of interest: (1) number of nearest neighbors K ∈ {20, 50, 100} for K-Boot; (2) exploration weight parameter α ∈ {0.1, 1.0, 10} for LinUCB; (3) regularization parameter λ ∈ {0.1, 0.01, 0.001} and exploration parameter ν ∈ {0.2, 0.02, 0.002} for NeuralUCB - other hyperparameters such as learning rate, batch settings and number of layers are set the same as in the original code base². We settle on K = 100, α = 10, (λ, ν) = (0.001, 0.002) for having the lowest cumulative regret averaged across h_0-h_3 and 10 replicated runs, and carry these values over to compare the actual model performance on the 4 UCI datasets.
¹ From another perspective, this offline bandit-with-eligibility simulation setup is equivalent to "using a bandit for online improvement of a black-box probabilistic classifier with unknown accuracy".
² https://github.com/uclaml/NeuralUCB
Figure 3: EC under different Spearman's Correlation ρ between eligibility score and reward, with different risk levels α. The bandit part is K-Boot and eligibility scores are not in the context.
We believe this setup is more realistic because most bandit applications do not have the luxury of intensive hyperparameter tuning, due to small sample sizes or non-stationary environments. Finally, all experiments in this paper implement online, single-sample model updates. Figure 2 (left) shows the performance of the models with the final selected hyperparameters. Except for the linear reward scenario, where LinUCB/NeuralUCB has the advantage of correct/close model specification, there is no significant performance difference between K-Boot and NeuralUCB, while LinUCB is inferior. Figure 2 (right) shows the performance comparison on the UCI real-world datasets. K-Boot and NeuralUCB tie, winning on two datasets each, and LinUCB is consistently the worst. By examining the 80% confidence band (P10-P90 across 10 runs), we find NeuralUCB has larger performance volatility (wider bands) on real data than the other two models, due to learning instability in certain runs.

5.2. Eligibility Control Testing

For the synthetic data, we control σ_e to simulate different levels of reward-reflecting reliability in the eligibility scores, with resulting ρ(e, r) ∈ {0.05, 0.15, 0.3, 0.45, 0.6, 0.75}, for each true reward function h_0-h_3. In the algorithm, we set different risk levels α in EC. Note that α = 0 is effectively not using EC, and we abuse the notation α → 1 to denote the strict top-1 policy: pull the arm with the highest eligibility score. Figure 3 shows the results of applying EC to K-Boot, with eligibility scores used by EC only (explained later). When the signal in e is weak (low ρ), EC adaptively raises the top-m threshold close to M, achieving the same performance as no EC. As ρ grows, the advantage reveals itself - using EC is consistently better than not. The dynamic thresholding may dominate the top-1 policy even under strong eligibility signals, with the right α. EC performs best with α = 0.5 across all synthetic datasets, and the same value is carried over to the next experiment on the CS Routing data.
EC can be applied to other bandits. Figure 4 shows the same synthetic data results but with EC + LinUCB. Here eligibility scores are also added to the bandit context. The reason the previous experiment did not take scores as K-Boot input is that k-NN has no native feature selection mechanism. If the eligibility score is close to white noise (low ρ), the model struggles to fit the reward, distracting from the other presented visual patterns. For LinUCB, such a side-effect is minimal because the simulated noise is Gaussian, so it is a better condition to test the difference between the trivial way versus the EC way of utilizing eligibility. EC still dominates no-EC in most cases. There are a few exceptions for the linear true reward function h_0, where α = 0.5 is worse because LinUCB learns so well that taking a high risk of missing the best arm is not cost-effective. An interesting observation is that, while the strict top-1 rule was dominated by K-Boot + EC (Figure 3), it actually beats LinUCB + EC for nonlinear true reward functions even at ρ = 0.15. This indicates that the power of the bandit model is still a key driver for ML to out-perform static rules.
Figure 4: EC under different Spearman's Correlation ρ between eligibility score and reward, with different risk levels α. The bandit part is LinUCB and eligibility scores are part of the context.
Figure 5 shows the result on the CS Routing data, where EC significantly improves routing accuracy. Routing to the skill with the highest probability from the emulator serves as a stochastic surrogate of the existing policy. The pure K-Boot result, without knowing the emulator outputs, is set as a baseline. With only 8000 samples, the average accuracy of the emulator policy is improved by K-Boot + EC from 53.37% to 57.78%.
Figure 5: CS Routing Results. Metric is averaged across 10 runs with an 80% confidence band.

6. Conclusions

We proposed a nonparametric contextual Bandit algorithm, K-Boot, with arm-level eligibility control (EC) for routing customer contacts to eligible SMEs in real-time. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm-eligibility for other bandits. We compared K-Boot with LinUCB and NeuralUCB. When looking at average regret performance over simulation runs, K-Boot and NeuralUCB had a tie in winning scenarios and were comparable in terms of robustness to different data distributions. Both worked better than LinUCB. However, we observed larger performance variability for NeuralUCB as opposed to K-Boot. We further found the EC component improved K-Boot's performance on both synthetic datasets and a simulated CS routing scenario when an eligibility signal exists.

References
[1] G. Elena, K. Milos, I. Eugene, Survey of multi-armed bandit algorithms applied to recommendation systems, International Journal of Open Information Technologies 9 (2021) 12-27.
[2] Y. Liu, L. Li, A map of bandits for e-commerce, arXiv preprint arXiv:2107.00680 (2021).
[3] T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.
[4] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661-670.
[5] S. Sajeev, J. Huang, N. Karampatziakis, M. Hall, S. Kochman, W. Chen, Contextual bandit applications in a customer support bot, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3522-3530.
[6] S. Upadhyay, M. Agarwal, D. Bounneffouf, Y. Khazaeni, A bandit approach to posterior dialog orchestration under a budget, arXiv preprint arXiv:1906.09384 (2019).
[7] O. Chapelle, L. Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems 24 (2011).
[8] T. Graepel, J. Quiñonero Candela, T. Borchert, R. Herbrich, Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine, in: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track, 2010.
[9] I. Osband, B. Van Roy, Bootstrapped Thompson sampling and deep exploration, arXiv preprint arXiv:1507.00300 (2015).
[10] B. Kveton, C. Szepesvari, S. Vaswani, Z. Wen, T. Lattimore, M. Ghavamzadeh, Garbage in, reward out: Bootstrapping exploration in multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2019, pp. 3601-3610.
[11] C. Riou, J. Honda, Bandit algorithms based on Thompson sampling for bounded reward distributions, in: Algorithmic Learning Theory, PMLR, 2020, pp. 777-826.
[12] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, arXiv preprint arXiv:0912.3995 (2009).
[13] S. R. Chowdhury, A. Gopalan, On kernelized multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 844-853.
[14] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835 (2013).
[15] B. M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74 (2009) 235-255.
[16] G. Biau, F. Cérou, A. Guyader, On the rate of convergence of the bagged nearest neighbor estimate, Journal of Machine Learning Research 11 (2010).
[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824-836.
[18] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535-547.
[19] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: International Conference on Machine Learning, PMLR, 2020, pp. 3887-3896.
[20] B. W. Silverman, Density estimation for statistics and data analysis, Routledge, 2018.
[21] D. Zhou, L. Li, Q. Gu, NeuralUCB: Contextual bandits with neural network-based exploration (2019).
[22] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.