<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Nonparametric Contextual Bandit with Arm-level Eligibility Control for Customer Service Routing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruofeng Wen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenjun Zeng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Customer Engagement Technology</institution>
          ,
          <addr-line>Amazon</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>Amazon Customer Service (CS) provides real-time support for millions of customer contacts every year. While the bot resolver helps automate some traffic, we still see high demand for human agents, also called subject matter experts (SMEs). Customers reach out with questions in different domains (return policy, device troubleshooting, etc.). Depending on their training, not all SMEs are eligible to handle all contacts. Routing contacts to eligible SMEs turns out to be a non-trivial problem because SMEs' domain eligibility is subject to training quality and can change over time. To optimally recommend SMEs while simultaneously learning the true eligibility status, we propose to formulate the routing problem with a nonparametric contextual bandit algorithm (K-Boot) plus an eligibility control (EC) algorithm. K-Boot models the reward with a kernel smoother on similar past samples selected by k-NN, and uses Bootstrap Thompson Sampling for exploration. EC filters arms (SMEs) by the initially system-claimed eligibility and dynamically validates the reliability of this information. The proposed K-Boot is a general bandit algorithm, and EC is applicable to other bandits. Our simulation studies show that K-Boot performs on par with state-of-the-art bandit models, and that EC boosts K-Boot performance when a stochastic eligibility signal exists.</p>
      </abstract>
      <kwd-group>
<kwd>Bandit</kwd>
        <kwd>Customer Service Routing</kwd>
        <kwd>Arm Eligibility</kwd>
        <kwd>Nonparametric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>In Amazon Customer Service (CS), we dispatch human agents, also called subject matter experts (SMEs), in real time to handle millions of customer contacts. The SME routing automation process has two steps: first, a natural language understanding (NLU) model processes the customer's input and identifies the relevant domain (return policy, device troubleshooting, etc.); then an eligible SME is dispatched. We define an SME as eligible when they have mastered the required skill in the relevant domain through training. SME routing turns out to be a non-trivial problem for four reasons. First, the NLU model is unlikely to categorize the domain with perfect accuracy. Second, the reliability of SME eligibility as identified by the operations team is subject to training program quality and readiness.</p>
<p>Third, the domain taxonomy and SMEs' eligibility status can change in a decentralized way. In reality, it is difficult to keep track of all the eligibility updates while assuming their correctness. Finally, eligible SMEs do not guarantee customer satisfaction, leading to noisy feedback signals.</p>
      <p>Figure 1: The CS Routing Problem. The goal of the recommender is to select an agent that will result in the best outcome of this customer contact.</p>
      <p>All these uncertainties make SME routing a challenging
problem.</p>
<p>To tackle the complexity, we formulate CS routing as a recommendation problem and use a contextual Bandit as the solution. Bandit, a framework for sequential decision making, has been used for online recommendation across companies such as Amazon, Google, Netflix and Yahoo [<xref ref-type="bibr" rid="ref1">1</xref>]. In a contextual bandit, the decision maker sequentially chooses an arm/action (an SME in our case) based on available contextual information (contact transcript embedding, customer profile, etc.), and observes a reward signal (customer satisfaction, contact transfer, etc.) [2]. The objective is to recommend arms at each step to maximize the expected cumulative reward over time. We apply bandits to the routing problem because they can optimally explore uncertainties and adapt to dynamic changes [3]. In particular, the uncertainties in SME eligibility motivate us to formulate a new type of bandit that utilizes arm-level eligibility signals.</p>
<p>Our contribution. We propose K-Boot, a nonparametric contextual bandit algorithm to model reward and explore, and an Eligibility Controller (EC) algorithm to model arm-level eligibility. K-Boot uses a k-nearest neighbors (k-NN) approach to find similar samples from the past, applies a Nadaraya–Watson kernel regression among the found samples to estimate the reward, and adopts Bootstrap Thompson Sampling as the exploration strategy. The EC component implements a dynamic top-m arm filter based on the estimated Spearman's Rank Correlation Coefficient between eligibility and reward. We are interested in this end-to-end nonparametric setup for practicality: robustness in performance, strong interpretability, and simple non-technical maintenance that is friendly to business partners. For instance, the k-NN component makes it easy to investigate which historical contacts support an SME recommendation, and then deploy instant online fixes by trimming unwanted outliers. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits, such as LinUCB [4].</p>
<p>In the remainder of the paper, we review related literature in Section 2, set up the formal problem in Section 3 and detail the proposed algorithms in Section 4. We then present our model results in Section 5, and conclude the paper last.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
<p>In most bandit applications, the items to recommend are products and webpage content (widgets, news articles, etc.). Recently, we have seen research efforts in using bandits for routing recommendation in a conversation [5, 6]. In [6], they used a contextual bandit to utilize query and user features along with the conversational context, and built an orchestrator to route the dialog. In [5], they built bandit models to assign solutions to customers' queries about Microsoft products in a chatbot. This is the work most relevant to our problem. However, they assume an upstream model that provides a set of plausible actions to start with, and the bandit itself does not deal with the uncertainty in arm eligibility. Overall, compared to the extensive bandit literature on product and webpage content recommendation, there are few bandit works for routing recommendation.</p>
<p>Thompson Sampling (TS) is a heuristic for balancing exploration and exploitation in bandits [7]. In the parametric manner, TS is implemented by drawing the reward function parameter θ from a posterior distribution, calculating rewards with the sampled parameters, and selecting the arm that maximizes the estimated reward function. θ is often assumed to follow Conjugate Exponential Family distributions, resulting in closed-form posteriors. There is also work where variational methods are used to provide an analytical approximation to the posterior and enforce tractability [8]. Bootstrap TS introduces volatility in a different way: it pulls the arm with the highest bootstrap mean, estimated from the reward history in a nonparametric way [9]. One drawback of the approach is the computational cost of training one bandit model per bootstrapped training set, which is difficult for large-scale real-time problems. [10] proposed a general nonparametric bandit algorithm, but fell back to the parametric reward function approach when there is context. [11] further improved the exploration efficiency with the Bayesian Bootstrap, but also only for non-contextual bandits. We pair Bootstrap TS with a single k-NN for nonparametric contextual exploration.</p>
      <p>Kernel-based Bandits use a nonparametric reward estimator via kernel functions, combined with an exploration policy like TS or UCB. Gaussian Processes (GP) have wide applications in the bandit domain, e.g. GP-UCB [12] and GP-TS [13]. However, GP inference is expensive: the standard approach has O(N^3) complexity, where N is the number of training samples. The most efficient sparse or low-rank GP approaches still require O(NM^2), where M is a hyperparameter indicating the number of inducing variables - see [14] for a review. We use k-NN to enable O(k log N) inference time.</p>
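      <p>As a reference point for the parametric flavor of TS described above, a minimal Beta-Bernoulli sketch (non-contextual and purely illustrative; it is not the algorithm proposed in this paper):</p>
      <preformat>
import numpy as np

def beta_bernoulli_ts_step(successes, failures, rng=None):
    """One parametric Thompson Sampling step for K Bernoulli arms:
    draw a reward-probability parameter per arm from its Beta posterior
    and pull the arm with the largest draw."""
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.beta(1.0 + np.asarray(successes), 1.0 + np.asarray(failures))
    return int(np.argmax(theta))
      </preformat>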
    </sec>
    <sec id="sec-3">
      <title>3. Formulation: Contextual Bandit with Arm Eligibility</title>
      <p>We define two types of eligibility setup with different data interfaces. (1) Eligibility Scores: for a given context x, a list of eligibility scores for the arms, e = (e_1, e_2, . . . , e_K), is observed, where e_k ∈ [0, 1] is the score for arm k. A larger score means higher confidence in the arm being eligible for the current sample/iteration. An intuitive interpretation is that the arms are bidding for themselves. This score setup assumes (the owner of) each arm is able to adaptively assess the dynamic environment in a distributed and independent manner that can be unknown to the bandit. Typical examples include ads bidding and voice assistants' skill detection - each ad/skill may be managed via the same API, but by different clients under dynamic competition. (2) Eligibility States: assume the system at any time can be classified into one of S possible states. Each arm claims whether it is eligible under a state, represented by a constant binary vector c_k of length S for arm k. If the s-th element of c_k is 1, arm k claims to be eligible under the s-th state, and 0 otherwise. Given a context x, a stochastic distribution over the S states is observed, represented by a probability vector p of length S. For CS routing, the state is the type of customer issue, the claim is the SME's corresponding skill set, and p is the NLU predictive distribution. Note that a score can be computed as the inner product of the claimed eligibility binary vector and the state probability vector: e_k := c_k⊺p, so the state setup converts to the score setup. The major practical difference between the two interfaces is the amount of effort on the claiming side: the state interface simplifies the eligibility claims to static binary votes on a finite set of states, while the score interface mandates scoring with context awareness.</p>
      <p>Consider a contextual bandit with a set of arms {1, 2, . . . , K}. At the t-th round, a context x_t is observed and an eligibility score vector e_t is either observed (score interface) or derived (state interface). After an arm a_t is pulled, a reward r_t ∈ [0, 1] is revealed. The goal is to minimize the expected regret over T rounds: ∑_{t=1}^{T} (E[r_t | a_t*, x_t, e_t] − E[r_t | a_t, x_t, e_t]), where a_t* is the optimal action for this round in hindsight.</p>
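      <p>To make the two interfaces concrete, the following minimal sketch (with illustrative names, not taken from any released code) derives scores from state claims via e_k := c_k⊺p and accumulates the regret defined above:</p>
      <preformat>
import numpy as np

def scores_from_states(claims: np.ndarray, state_probs: np.ndarray) -> np.ndarray:
    """claims: (K, S) binary matrix; claims[k, s] = 1 if arm k claims eligibility in state s.
    state_probs: length-S probability vector p for the current context.
    Returns the length-K score vector with e_k = c_k^T p."""
    return claims @ state_probs

def cumulative_regret(best_expected_rewards, pulled_expected_rewards):
    """Expected regret over T rounds: sum of the per-round gap between the
    best action in hindsight and the pulled action."""
    best = np.asarray(best_expected_rewards, dtype=float)
    pulled = np.asarray(pulled_expected_rewards, dtype=float)
    return float(np.sum(best - pulled))
      </preformat>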
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The proposed algorithms run in order: the Eligibility Controller (EC) filters out ineligible arms, and K-Boot finds the optimal arm within the rest. Below, we first introduce K-Boot to set the base, and then EC.</p>
      <sec id="sec-4-1">
        <title>4.1. K-Boot</title>
        <p>K-Boot is a nonparametric bandit model. For a given context x, it estimates the reward for an arm as the Nadaraya–Watson kernel weighted average over the observed rewards of the k nearest neighbors of x among the historical samples where that arm was pulled. We use Bootstrap TS as the exploration strategy. k-NN regression is known to fit well with Bootstrap Aggregation, with an analytical form of the smoothed conditional mean estimate that does not require actual resampling [15, 16]. However, sampling from the confidence distribution of the mean estimate has seen little discussion. We devise a trick to shrink the resampling range from all historical samples to a local wrapper around the k-NN.</p>
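        <p>For reference, a small sketch of the reward estimate at the core of K-Boot: a Nadaraya–Watson weighted average over the k nearest historical samples of one arm. The Gaussian kernel and fixed bandwidth are illustrative assumptions; the paper only requires some kernel h(·, ·) with a bandwidth.</p>
        <preformat>
import numpy as np

def gaussian_kernel(dist, bandwidth):
    # Illustrative kernel choice; any kernel h(., .) with a bandwidth would do.
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

def nadaraya_watson_knn(x, X_hist, r_hist, k=50, bandwidth=1.0):
    """Estimate the expected reward at context x for one arm, as the kernel-weighted
    average of the rewards of its k nearest historical samples."""
    X_hist = np.asarray(X_hist, dtype=float)
    r_hist = np.asarray(r_hist, dtype=float)
    dists = np.linalg.norm(X_hist - x, axis=1)
    nn = np.argsort(dists)[: min(k, len(r_hist))]   # indices of the k nearest neighbors
    w = gaussian_kernel(dists[nn], bandwidth) + 1e-12
    return float(np.dot(w, r_hist[nn]) / np.sum(w))
        </preformat>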
        <p>K-Boot is detailed in Algorithm 1. At iteration t with context x_t, for each arm, let D be the set of historical samples where that arm was pulled, and n = |D|.</p>
        <p>If n is zero, we sample the reward from a standard uniform distribution (Line 6). If n is greater than k, we first build an influential subset D′ containing the k′-NN of x_t (k &lt; k′). Intuitively, the k′-NN serves as a buffer that covers enough sample variance around the k-NN, so that a Bootstrap on the influential samples is a good approximation of a Bootstrap on all samples. Formally, a sample is considered influential for making inference about x_t if its probability of being in the k-NN of x_t after a Bootstrap on D is greater than ε. This probability is computed analytically on Line 9, based on the derived equation (6) in [15]. Here the tolerance hyperparameter ε controls the risk of missing influential samples, and thus the fidelity of the approximated Bootstrap TS. In our experiments with n ≤ 10^4 and ε = 0.01, k′ is empirically well bounded by 2k. This process shrinks all later computation to k′ samples, with an overhead neighbor-search cost of O(log n) per sample (we used Hnswlib [17] for approximate search). If n is less than k, D itself is the influential set. We then add two pseudo-samples to D′ on Lines 14-15, in order to expand the reward's empirical range for Bootstrap exploration [10]. To give the pseudo-samples proper contexts and thus kernel weights, we set their distance to x_t as that of a random observation in D′, so their weight shrinks as more data are seen. On Lines 16-18, we then draw a Bootstrap resample and select the k nearest samples from it to calculate the estimated reward for the arm as the kernel weighted average. We implement h(·, ·) as the simplest Nadaraya–Watson kernel estimator with automatic bandwidth selection [20] - more advanced models can be plugged in.</p>
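        <p>The sketch below mirrors one K-Boot scoring pass for a single arm under simplifying assumptions: brute-force neighbor search instead of Hnswlib, a fixed buffer size k′ = 2k instead of the exact beta-function bound, and a Gaussian kernel with fixed bandwidth. It illustrates the mechanics rather than reproducing the authors' implementation.</p>
        <preformat>
import numpy as np

def kboot_arm_sample(x, X_hist, r_hist, k=50, bandwidth=1.0, rng=None):
    """One Bootstrap-Thompson-Sampling reward draw for a single arm, given the
    arm's historical contexts X_hist (n x d) and rewards r_hist (possibly empty)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(r_hist)
    if n == 0:                                    # cold start: uniform reward draw
        return float(rng.uniform(0.0, 1.0))

    X_hist = np.asarray(X_hist, dtype=float)
    r_hist = np.asarray(r_hist, dtype=float)
    dists = np.linalg.norm(X_hist - x, axis=1)

    # Influential subset: in the paper its size k' comes from an exact beta-function
    # bound; 2k is used here as a simple stand-in consistent with the reported 2k cap.
    k_buf = min(n, 2 * k)
    idx = np.argsort(dists)[:k_buf]
    d_inf, r_inf = dists[idx], r_hist[idx]

    # Two pseudo-samples with rewards 0 and 1 expand the empirical range for exploration;
    # their distance to x is copied from a random influential observation.
    j = rng.integers(len(d_inf))
    d_inf = np.concatenate([d_inf, [d_inf[j], d_inf[j]]])
    r_inf = np.concatenate([r_inf, [0.0, 1.0]])

    # Bootstrap resample, then Nadaraya-Watson average over its k nearest members.
    boot = rng.integers(len(r_inf), size=len(r_inf))
    d_b, r_b = d_inf[boot], r_inf[boot]
    nn = np.argsort(d_b)[: min(k, len(d_b))]
    w = np.exp(-0.5 * (d_b[nn] / bandwidth) ** 2) + 1e-12
    return float(np.dot(w, r_b[nn]) / np.sum(w))
        </preformat>
        <p>At each round, one such draw is taken per (eligible) arm and the arm with the highest draw is pulled, which is the Bootstrap TS step.</p>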
        <p>As we model the reward for each arm separately, as is common practice, K-Boot is flexible to the addition and removal of arms. This benefits applications like ours with decentralized and independent arm management. In addition, K-Boot has simple maintenance: (1) the nonparametric model makes minimal assumptions about the data and has only a single important hyperparameter ε to balance accuracy and computation; ε = 0.01 works well across our experiments. (2) The algorithm is business friendly in terms of decision interpretability through neighboring examples, and it allows intuitive, instant online modification of model behavior. For example, CS business owners may hide contacts with undesired trending patterns during a specific period (e.g. under a legacy policy), or let newly available SMEs "inherit" past example contacts or eligibility claims (next section) via domain knowledge (e.g. a rehired agent or additional training programs).</p>
        <p>Changes in data visibility will instantly and predictably affect the bandit model in production, with clear attribution. This data-driven yet human-in-the-loop fungibility is essential to a fast-paced operational business like customer service.</p>
        <preformat>
Algorithm 1: K-Boot, a fully nonparametric contextual bandit
Input: number of iterations T, number of arms K, number of nearest neighbors k,
       kernel function with bandwidth h(·, ·), regularized incomplete beta
       function I(·), approximation tolerance ε
for t = 1, . . . , T do
    observe context x_t
    for each arm with history D (n = |D|) do
        if n = 0 then draw the reward estimate from Uniform(0, 1)
        else
            if n > k then
                k′ := the smallest k′ such that the bootstrap k-NN of x_t is covered
                      by the top k′ neighbors with probability greater than 1 − ε
                      (computed analytically with I(·); equation (6) in [15])
                find influential neighbors D′ := the top k′ samples in D with the largest h(x_t, ·)
            else set D′ := D
            draw a random sample (x⋆, r⋆) from D′
            add pseudo-samples: D′ := D′ ∪ {(x⋆, 0), (x⋆, 1)}
            draw a Bootstrap sample D⋆ from D′
            find neighbors N := the top min(k, |D⋆|) samples in D⋆ with the largest h(x_t, ·)
            estimate reward: r̂ := ∑_i r_i h(x_t, x_i) / ∑_i h(x_t, x_i), summing over N
    pull arm a⋆ := arg max of the reward estimates, and observe the true reward r_t ∈ [0, 1]
    update D_{a⋆} := D_{a⋆} ∪ {(x_t, r_t)} and n_{a⋆} := n_{a⋆} + 1
        </preformat>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Eligibility Control</title>
        <p>EC is used to recognize arm eligibility and the associated uncertainties, for optimized bandit exploration. In an ordinary contextual bandit model, the eligibility information can be trivially treated as part of the context x: it is assumed to be positively correlated with the reward r, and thus predictive as a plain input to a reward estimator. This assumes the reward model is able to eventually learn the relation between e and r after enough rounds. This trivial approach does not directly use eligibility to limit the range of arm exploration, and may unnecessarily explore inappropriate actions with catastrophic regret during cold starts. The other extreme is to strictly follow the eligibility information and ignore the reward feedback: for the eligibility score interface, this could mean simply pulling the arm with the highest score as if it were an oracle; for the eligibility state interface, this could mean finding the state with the highest probability and then removing arms that did not claim that state. In that case, the empirical data would never be used to validate and correct the potentially biased business logic - a common pitfall in practice.</p>
        <p>The main idea of EC is to keep only the top-m arms with the highest eligibility scores before applying a normal bandit, and to adjust m periodically based on the empirical correlation between eligibility scores and rewards. We use Spearman's Rank Correlation Coefficient, denoted as ρ. Using the top-m arms is a heuristic trade-off between m = K (the trivial case; K is the total number of arms) and m = 1 (the strict rule case). Intuitively, in the case of perfect correlation between score and reward (ρ = 1), m = 1 is the oracle solution. If the eligibility score has zero or even negative correlation (ρ ≤ 0) with the reward, m = K is the optimal solution - otherwise the arm exploration is restricted adversarially or randomly for no benefit. In the no-correlation case, the probability that "the best arm by reward is not in the top-m by score", defined as a leak, is linear in m: P(leak | m, ρ = 0) = 1 − m/K. It is obvious that P(leak | m, 0 &lt; ρ &lt; 1) &lt; P(leak | m, ρ = 0). This observation reveals two insights: (1) α := P(leak | m, ρ) characterizes a type of risk in applying a top-m score filter to arms; (2) when controlling the risk of a leak at a certain level, e.g. α = 0.01, m is a function of ρ. The true correlation ρ between reward r and eligibility score e is unknown, but it can be estimated from historical observations {(r_t, e_t)}. Therefore, EC essentially calculates m such that the risk P(leak | m, ρ̂) is controlled at a given level.</p>
        <p>To pair with K-Boot, EC also models P(leak | m, ρ) in a nonparametric fashion to avoid extra assumptions. For the estimation of ρ, although Spearman's rank correlation coefficient is used, Kendall's τ or any other rank-based correlation measure applies similarly. Across the arms and rounds, we assume the rank of e is a noisy perturbation of the rank of r, and parameterize this perturbation as "performing v random neighbor inversions". An inversion is simply switching two elements in a sequence, and a neighbor inversion switches two neighboring elements. Note that v = 0 implies ρ = 1, because the perturbed rank sequence is identical to the original one; v → ∞ is effectively a random permutation and thus ρ = 0. This setup provides a generative process to simulate the joint distribution of all parameters for a rank sequence of length K: (1) do v random neighbor inversions on the sorted rank sequence (1, · · · , K); (2) see if there is a leak, namely whether element 1 has moved beyond the top-m positions; (3) compute ρ̂ between the original and the current sequences. The three steps can be replicated a sufficient number of times to obtain the final correlation coefficient ρ̂ and risk level α̂, by averaging the individual ρ̂ and counting the frequency of leaks. Since this process does not depend on any actual data, it can be performed offline in batch for different combinations of (K, m, v) to get the corresponding (ρ̂, α̂). The resulting empirical dictionary (K, α, ρ̂) → m is stored and queried during online inference. To avoid unnecessary control in the edge case where the eligibility score turns out to be pure noise, m is reset to K (no EC) if the dictionary outputs an m that is greater than the trivial value (1 − α)K. Finally, note that the generative distribution of random neighbor inversions is assumed to be uniform along the whole sequence, yielding the unbiasedness of the estimator ρ̂ even if the observed data points {(r_t, e_t)} are censored by the top-m filter policy.</p>
        <preformat>
Algorithm 2: Eligibility Control
Input: number of iterations T, number of arms K, risk level α, initial score threshold m_0,
       a pre-computed empirical dictionary D: (K, α, ρ) → m
1  initialize threshold m := m_0 and observation pool B := ∅
2  for t = 1, . . . , T do
3      observe context x_t and eligibility scores {e_1, · · · , e_K}, one for each arm
4      find the m-th largest element e_[m] among the scores
5      sample a reward estimate for each arm k with e_k ≥ e_[m] (Algorithm 1, Lines 5-18)
6      pull the arm a⋆ with the highest reward estimate (Algorithm 1, Lines 21-22)
7      observe the true reward r_t, and update B := B ∪ {(r_t, e_{a⋆})}
8      (periodically) compute ρ̂ from B and set m := D(K, α, ρ̂)
9      set m := K if m ≥ (1 − α)K
10 end
        </preformat>
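        <p>A sketch of the offline simulation that produces the empirical dictionary, assuming SciPy's spearmanr; the inversion grid and replication count are illustrative choices rather than values from the paper:</p>
        <preformat>
import numpy as np
from scipy.stats import spearmanr

def simulate_rank_perturbation(K, v, n_rep=2000, rng=None):
    """Apply v random neighbor inversions to the rank sequence (1..K), n_rep times.
    Returns the average Spearman rho between original and perturbed sequences, and
    the empirical distribution of the best arm's position after perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    base = np.arange(1, K + 1)                 # 1 = best arm by reward
    rhos, best_pos = [], []
    for _ in range(n_rep):
        seq = base.copy()
        for _ in range(v):                     # one random neighbor inversion per step
            i = rng.integers(K - 1)
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
        rhos.append(spearmanr(base, seq)[0])
        best_pos.append(int(np.where(seq == 1)[0][0]))   # 0-indexed position of the best arm
    return float(np.mean(rhos)), np.asarray(best_pos)

def build_dictionary(K, alphas, v_grid=range(0, 200, 5), n_rep=2000):
    """Offline table (K, alpha, rho_hat) -> m: the smallest top-m filter whose
    estimated leak risk P(best arm outside top-m) is at most alpha."""
    table = {}
    for v in v_grid:
        rho_hat, best_pos = simulate_rank_perturbation(K, v, n_rep)
        leak_risk = [float(np.mean(best_pos >= m)) for m in range(1, K + 1)]
        for alpha in alphas:
            m_star = next((m for m, a in enumerate(leak_risk, start=1) if a &lt;= alpha), K)
            table[(K, alpha, round(rho_hat, 2))] = m_star
    return table
        </preformat>
        <p>At run time, EC estimates ρ̂ from the observed pool, looks up m in this table for the chosen α, and falls back to m = K whenever the looked-up m exceeds (1 − α)K.</p>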
        <p>The full online EC process is described in Algorithm 2 (excluding the offline dictionary generation). The eligibility state interface is converted into the score interface before EC, so the scores are the only required inputs. The algorithm takes K-Boot (Algorithm 1) as the bandit counterpart just for demonstration, and applies to any bandit that has arm-independent reward models. EC has a single hyperparameter: the controlled risk of a leak, α. Note that α is the risk of missing the best arm, so its influence on cumulative rewards depends on the arm reward distribution, specifically the gap between the first- and second-best arms. We leave other types of risk control mechanisms with distributional assumptions to future work.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>In this section, we evaluate the performance of K-Boot against two benchmarks: LinUCB [4] and NeuralUCB [21]. The experiment setup mostly follows the methodology reported in [21], with 4 synthetic simulations and 4 real-world datasets. EC is tested on synthetic data with an eligibility score setup, then on an Amazon Customer Service routing dataset. The datasets are introduced below.</p>
      <p>Synthetic Datasets. The bandit has K = 10 arms and runs for T = 5000 samples/rounds. It observes a context with d = 20 dimensions, each following an independent N(0, 1). Four types of true reward functions are tested: (1) linear: h_0(x) = 0.1 x⊺a; (2) quadratic: h_1(x) = 0.01 (x⊺a)^2; (3) inner-product: h_2(x) = 0.05 x⊺A⊺Ax; (4) cosine: h_3(x) = cos(3 x⊺a); where each entry of the parameters A ∈ R^{d×d} and a ∈ R^d is drawn from an independent N(0, 1), different for each arm. The observed reward has noise: r := h(x) + η, where η ∼ N(0, σ^2) and σ is drawn from U(0.01, 0.5) independently for each arm. 20 runs are replicated by random seeds.</p>
      <p>Synthetic Datasets with Eligibility. Given the above, the eligibility scores are further simulated by perturbing the counterfactual reward of each arm, e := λr + η′, to induce the correlation. Here λ is an arm-specific scaling coefficient drawn from U(−0.1, 1). It has a small chance of leaving e and r negatively correlated, tolerating eligibility claim errors in systems under the overall positive correlation - the only assumption of EC. The noise term η′ ∼ N(0, σ′^2) controls the correlation effect size. By setting a proper σ′, the Spearman's rank correlation coefficient ρ between reward and eligibility score can be fixed at an arbitrary positive value.</p>
      <p>UCI Classification. Four real-world datasets from the UCI Machine Learning Repository [22] are gathered: covertype, magic, statlog, and mnist. These multi-class classification datasets are converted into multi-arm contextual bandit problems, following the method in [4]: each class is treated as an arm, and the reward is 1 if the bandit pulls the correct arm (identifies the correct class) and 0 otherwise. Each dataset is shuffled and split into 10 mutually exclusive runs with 5000 samples each, fed to the bandit one per round (magic has fewer than 50000 samples, so resampling is performed).</p>
      <p>CS Routing with Eligibility. Agent skill-level routing is considered: CS agents are assigned to different skill groups by operations teams, and the routing system determines which skill is required to resolve a given customer contact based on the customer profile, policy rules and natural language understanding. As a proof of concept before any online experiments, we formulate an offline bandit dataset with historical observations. About 2 million Amazon US customer service text chats with bot-to-agent routing are gathered. To introduce stochasticity, a historical routing policy emulator is trained on a 95% random subset of the chats, taking the customer-bot transcripts before the routing action as input to predict which skill the system chose. Specifically, the emulator consists of a pretrained GPT-2 transformer encoder [23], with a multi-class sequence classifier head predicting the top 20 skill groups. The model is trained (fine-tuned) for one epoch with learning rate 10^-4 and batch size 32. For the bandit simulation, the final actual resolving skill, after all the potential agent-to-agent transfers since the initial routing, serves as the ground truth: if the routed skill agrees with the final skill, the action is considered accurate and as having avoided a potential transfer. At each round, the bandit receives a 256-dimensional text embedding from the same GPT-2 encoder as the observed context, and a 20-dimensional probabilistic prediction from the emulator as the eligibility scores for each arm/skill. If the bandit chooses a skill matching the final one, the reward is 1 and otherwise 0. In this case, the cumulative regret is an upper bound on the number of transferred contacts, because agents assigned to different skills may resolve the contact regardless (e.g. digital order agents are often equipped to resolve retail order issues, so they may not transfer an incorrectly routed contact). The remaining 5% of the chats are used to generate 10 bandit runs with 8000 samples each. From another perspective, this offline bandit-with-eligibility simulation setup is equivalent to "using a bandit for online improvement of a black-box probabilistic classifier with unknown accuracy". We observe that the empirical ρ across the runs is 0.3492. We leave the following to future work: (1) online experiments to measure real transfers and positive customer reviews as the reward metric; (2) agent-level routing given agent profiles and finer-grained eligibility definitions than skill groups.</p>
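      <p>Before the benchmarks, a minimal sketch of the synthetic environment described above, shown for the linear reward case h_0; the per-arm noise scale and the eligibility-score perturbation follow the stated distributions, and all function names are illustrative:</p>
      <preformat>
import numpy as np

def make_synthetic_bandit(K=10, T=5000, d=20, seed=0):
    """Contexts x_t ~ N(0, I_d); per-arm linear reward h_0(x) = 0.1 x^T a_k plus
    Gaussian noise with an arm-specific sigma_k ~ U(0.01, 0.5)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, d))                 # contexts
    a = rng.standard_normal((K, d))                 # arm-specific parameters
    sigma = rng.uniform(0.01, 0.5, size=K)          # arm-specific noise scales
    mean_reward = 0.1 * X @ a.T                     # (T, K) matrix of h_0 values
    noise = rng.standard_normal((T, K)) * sigma
    return X, mean_reward, mean_reward + noise

def make_eligibility_scores(rewards, noise_sd=0.1, seed=0):
    """Eligibility scores for the EC experiments: a scaled counterfactual reward plus
    noise, e_{t,k} = lambda_k * r_{t,k} + eta', with lambda_k ~ U(-0.1, 1)."""
    rng = np.random.default_rng(seed)
    T, K = rewards.shape
    lam = rng.uniform(-0.1, 1.0, size=K)
    return rewards * lam + rng.normal(0.0, noise_sd, size=(T, K))
      </preformat>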
      <sec id="sec-5-1">
        <title>5.1. K-Boot Benchmarks</title>
        <p>We compare K-Boot with LinUCB and NeuralUCB on both the synthetic and the UCI classification datasets. With the 4 synthetic datasets, we first select the following hyperparameters for each Bandit algorithm of interest: (1) the number of nearest neighbors k ∈ {20, 50, 100} for K-Boot; (2) the exploration weight parameter α ∈ {0.1, 1.0, 10} for LinUCB; (3) the regularization parameter λ ∈ {0.1, 0.01, 0.001} and exploration parameter ν ∈ {0.2, 0.02, 0.002} for NeuralUCB - other hyperparameters such as the learning rate, batch settings and number of layers are set the same as in the original code base (https://github.com/uclaml/NeuralUCB). We settle on k = 100, α = 10, (λ, ν) = (0.001, 0.002), per having the lowest cumulative regret averaged across h_0 - h_3 and 10 replicated runs, and carry these values over to compare the actual model performance on the 4 UCI datasets. We believe this setup is more realistic because most bandit applications do not have the luxury of intensive hyperparameter tuning, due to small sample sizes or non-stationary environments. Finally, all experiments in this paper implement online, single-sample model updates. Figure 2 (left) shows the performance of the models with the final selected hyperparameters. Except for the linear reward scenario, where LinUCB/NeuralUCB has the advantage of a correct/close model specification, there is no significant performance difference between K-Boot and NeuralUCB, while LinUCB is inferior. Figure 2 (right) shows the performance comparison on the UCI real-world datasets. K-Boot and NeuralUCB tie, winning on two datasets each, and LinUCB is consistently the worst. By examining the 80% confidence band (P10-P90 across 10 runs), we find that NeuralUCB has larger performance volatility (wider bands) on real data than the other two models, due to learning instability in certain runs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Eligibility Control Testing</title>
        <p>For the synthetic data, we control σ′ to simulate different levels of reward-reflecting reliability in the eligibility scores, with resulting ρ(r, e) ∈ {0.05, 0.15, 0.3, 0.45, 0.6, 0.75} for each true reward function h_0 - h_3. In the algorithm, we set different risk levels α in EC. Note that α = 0 is effectively not using EC, and we abuse the notation α → 1 to denote the strict top-1 policy: pull the arm with the highest eligibility score. Figure 3 shows the results of applying EC to K-Boot, with the eligibility scores used by EC only (explained later). When the signal in e is weak (low ρ), EC adaptively raises the top-m threshold close to K, achieving the same performance as no EC. As ρ grows, the advantage reveals itself: using EC is consistently better than not. The dynamic thresholding may dominate the top-1 policy even under strong eligibility signals, with the right α. EC performs best with α = 0.5 across all synthetic datasets, and the same value is carried over to the next experiment on the CS Routing data.</p>
        <p>EC can be applied to other bandits. Figure 4 shows the same synthetic-data results but with EC + LinUCB. Here the eligibility scores are also added to the bandit context. The reason why the previous experiment did not take scores as K-Boot input is that k-NN has no native feature selection mechanism: if the eligibility score is close to white noise (low ρ), the model struggles to fit the reward, which distracts from the other patterns present. For LinUCB, such side effects are minimal since the simulated noise is Gaussian, so it is a better condition to test the difference between the trivial way versus the EC way of utilizing eligibility. EC still dominates no-EC in most cases. There are a few exceptions for the linear true reward function h_0, where α = 0.5 is worse because LinUCB learns so well that taking a high risk of missing the best arm is not cost-effective. An interesting observation is that, while the strict top-1 rule was dominated by K-Boot + EC (Figure 3), it actually beats LinUCB + EC for the nonlinear true reward functions even at ρ = 0.15. This indicates that the power of the bandit model is still a key driver for ML to out-perform static rules.</p>
        <p>Figure 5 shows the result on the CS Routing data, where EC significantly improves routing accuracy. Routing to the skill with the highest probability from the emulator serves as a stochastic surrogate of the existing policy. The pure K-Boot result, without knowledge of the emulator outputs, is set as a baseline. With only 8000 samples, the average accuracy of the emulator policy is improved by K-Boot + EC from 53.37% to 57.78%.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We proposed a nonparametric contextual Bandit algorithm, K-Boot, with arm-level eligibility control (EC) for routing customer contacts to eligible SMEs in real time. While K-Boot and EC are proposed here as a suite, each can be applied independently - K-Boot is a general Bandit algorithm and EC can control arm eligibility for other bandits. We compared K-Boot with LinUCB and NeuralUCB. Looking at average regret performance over simulation runs, K-Boot and NeuralUCB had a tie in winning scenarios and were comparable in terms of robustness to different data distributions; both worked better than LinUCB. However, we observed larger performance variability for NeuralUCB as opposed to K-Boot. We further found that the EC component improved K-Boot's performance on both the synthetic datasets and a simulated CS routing scenario when an eligibility signal exists.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] Y. Liu, L. Li, A map of bandits for e-commerce, arXiv preprint arXiv:2107.00680 (2021).</p>
      <p>[3] T. Lattimore, C. Szepesvári, Bandit algorithms, Cambridge University Press, 2020.</p>
      <p>[4] L. Li, W. Chu, J. Langford, R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 661–670.</p>
      <p>[5] S. Sajeev, J. Huang, N. Karampatziakis, M. Hall, S. Kochman, W. Chen, Contextual bandit applications in a customer support bot, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining, 2021, pp. 3522–3530.</p>
      <p>[6] S. Upadhyay, M. Agarwal, D. Bounnefouf, Y. Khazaeni, A bandit approach to posterior dialog orchestration under a budget, arXiv preprint arXiv:1906.09384 (2019).</p>
      <p>[7] O. Chapelle, L. Li, An empirical evaluation of Thompson sampling, Advances in Neural Information Processing Systems 24 (2011).</p>
      <p>[8] T. Graepel, J. Quiñonero Candela, T. Borchert, R. Herbrich, Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine, in: Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track, 2010.</p>
      <p>[9] I. Osband, B. Van Roy, Bootstrapped Thompson sampling and deep exploration, arXiv preprint arXiv:1507.00300 (2015).</p>
      <p>[10] B. Kveton, C. Szepesvari, S. Vaswani, Z. Wen, T. Lattimore, M. Ghavamzadeh, Garbage in, reward out: Bootstrapping exploration in multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2019, pp. 3601–3610.</p>
      <p>[11] C. Riou, J. Honda, Bandit algorithms based on Thompson sampling for bounded reward distributions, in: Algorithmic Learning Theory, PMLR, 2020, pp. 777–826.</p>
      <p>[12] N. Srinivas, A. Krause, S. M. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, arXiv preprint arXiv:0912.3995 (2009).</p>
      <p>[13] S. R. Chowdhury, A. Gopalan, On kernelized multi-armed bandits, in: International Conference on Machine Learning, PMLR, 2017, pp. 844–853.</p>
      <p>[14] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835 (2013).</p>
      <p>[15] B. M. Steele, Exact bootstrap k-nearest neighbor learners, Machine Learning 74 (2009) 235–255.</p>
      <p>[16] G. Biau, F. Cérou, A. Guyader, On the rate of convergence of the bagged nearest neighbor estimate, Journal of Machine Learning Research 11 (2010).</p>
      <p>[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 824–836.</p>
      <p>[18] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.</p>
      <p>[19] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, S. Kumar, Accelerating large-scale inference with anisotropic vector quantization, in: International Conference on Machine Learning, PMLR, 2020, pp. 3887–3896.</p>
      <p>[20] B. W. Silverman, Density estimation for statistics and data analysis, Routledge, 2018.</p>
      <p>[21] D. Zhou, L. Li, Q. Gu, NeuralUCB: Contextual bandits with neural network-based exploration (2019).</p>
      <p>[22] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.</p>
      <p>[23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Elena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Milos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Eugene</surname>
          </string-name>
          ,
          <article-title>Survey of multiarmed bandit algorithms applied to recommendation systems</article-title>
          ,
          <source>International Journal of Open Information Technologies</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>12</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>