Learning to Diversify for E-commerce Search with Multi-Armed Bandit
Anjan Goswami, University of California, Davis (agoswami@ucdavis.edu)
Chengxiang Zhai, University of Illinois, Urbana-Champaign (czhai@illinois.edu)
Prasant Mohapatra, University of California, Davis (pmohapatra@ucdavis.edu)

ABSTRACT
Search is central to e-commerce platforms, and diversification of search results is essential to cater to the diverse preferences of the customers. One of the primary metrics of e-commerce businesses is revenue. At the same time, the prices of the products shown influence customer preferences. Hence, diversifying e-commerce search results requires learning the diverse price preferences of the customers while simultaneously maximizing revenue without hurting the relevance of the results. In this paper, we introduce the learning to diversify problem for e-commerce search. We also show that diversification improves the median customer lifetime value (CLV), a critical long-term business metric for an e-commerce business. We design three algorithms for the task. The first two are modifications of algorithms previously developed for the diversification problem in web search. The third is a novel approximate knapsack based semi-bandit algorithm. We derive the regret and pay-off bounds of all three algorithms and conduct experiments with synthetic data and simulation to validate and compare them. We compute revenue, median CLV, and purchase-based mean reciprocal rank (PMRR) under various scenarios, such as user preferences that change over time, to compare the performance of these algorithms. We show that our proposed third algorithm is more practical and efficient than the first two and can produce higher revenue while maintaining a better median CLV and PMRR.

ACM Reference Format:
Anjan Goswami, Chengxiang Zhai, and Prasant Mohapatra. 2019. Learning to Diversify for E-commerce Search with Multi-Armed Bandit. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 11 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION
Search has become a central functionality of e-commerce sites. Typically, large e-commerce platforms have several competing products relevant to a query. For example, Amazon returns more than 1000 products for the query "car seat", about 300 products for the query "smart watch", and almost all the products up to the last page of results are relevant to the query. Many products shown in the first few pages have great reviews, and it is clear that these products have been purchased from Amazon regularly. This scenario illustrates that there are often too many choices for users on a large e-commerce web site. However, it is well researched that user clicks drop dramatically after the first page [30, 39] in both web and e-commerce search. Consequently, a user often does not browse through all the relevant products returned for a search query and can abandon the search if only a few or no relevant products are found in the top results. In web search, such abandonment is known to be reduced by showing a diverse result set to the users [15]. Search abandonment hurts e-commerce platforms even more because the business model depends on actual purchases instead of ad clicks. Hence, diversification of ranking is a critical problem for e-commerce sites. One of the primary business metrics for an e-commerce business is the revenue [17] generated from sales. An attempt at diversification can hurt revenue unless we explicitly formulate the diversification problem to maximize revenue. Moreover, e-commerce businesses use customer lifetime value, or CLV [20, 23], as a critical long-term metric. CLV is the revenue generated by a customer over his or her lifetime with the site. A successful e-commerce site intends to increase the pool of high-CLV customers. Intuitively, by selecting a proper diversification of results, an e-commerce site can cater to the preferences of a larger pool of customers and improve the median CLV of the business. Ensuring revenue maximization without hurting relevance, simultaneously with learning to diversify, requires a unique formulation for e-commerce search. In this paper, we address this problem of diversification of e-commerce search results, and we additionally show that our formulation improves the median CLV. We make three contributions in this paper:
(1) We define the learning to diversify problem for e-commerce search, considering maximization of revenue while keeping the loss in relevance within a bound. Additionally, we show that such a design of the learning to diversify problem also improves median CLV for an e-commerce business.
(2) We present three multi-armed bandit based algorithms for this problem and derive their regret and pay-off bounds. Our third algorithm is a novel approximate knapsack based semi-bandit optimization algorithm, and we show that it performs well for our problem.
(3) We also present a simulation-based evaluation strategy and conduct experiments with synthetic data to show that, under most scenarios, such as changing customer preferences and under the assumption of position bias, our semi-bandit algorithm can maintain the right balance of revenue, median CLV, and mean reciprocal rank [16] based on purchases.
2 BACKGROUND
Diversification of search results is one of the options for managing the uncertainties and ambiguities in understanding the user's information need from search queries [13]. A search ranking function optimized for relevance is not designed to minimize redundancies in the search results [5]. The diversification problem has been identified as one of the most important future research directions in learning to rank [12]. Researchers address this problem with variants of two broad approaches: (1) The first approach defines a measure of similarity based on the contents of the documents and designs a ranking function that uses this measure to generate a diversified search result while hurting relevance as little as possible. (2) The second approach uses a multi-armed bandit based online learning algorithm to learn the diverse user preferences and optimize the ranking based on them.

The second approach does not require defining any similarity measures; instead, it learns from the data. Hence, it is easier to realize in practice. In this paper, our algorithms follow the second approach, because it is much harder to map product similarity measures to user preferences than to learn those preferences from data. Moreover, user preferences can change over time, and a machine learning based approach can better adapt to the changes.

One of the most influential papers on the first approach is by Carbonell et al. [11], where the authors introduce the concept of maximal marginal relevance, which prefers documents that are different from the already chosen documents while reducing the loss in overall relevance.

In another paper, Zhai [44] provides an algorithm for optimizing search results by diversification using risk minimization principles, with correlation among search results as a similarity measure. Agrawal et al. [2] propose a greedy algorithm that reorders the top k search results to jointly maximize the probability of showing diverse documents for a query and minimize the exposure of low-quality documents to the users. The authors also generalize classic ranking metrics such as discounted cumulative gain (DCG) and mean average precision (MAP) to account for diversification in the ranking. Some researchers [13] pose this problem slightly differently, as reducing ambiguity in search queries. Santos et al. [37] provide another similar algorithm that uses additional information, either by reformulating queries or by using queries from related searches, to diversify the search results. Several papers have also been written on diversification of recommender system results using variants of content based similarity measures [42].

The most important paper on the second approach is probably the one by Radlinski et al. [36], where the authors propose an online algorithm based on the classic multi-armed bandit (MAB) paradigm that can learn the user preferences. The authors also provide a baseline greedy algorithm to compare with their proposed multi-armed bandit based algorithm. Two of our algorithms are direct modifications of the algorithms presented in that paper. The MAB algorithm in that paper requires one MAB per rank position, and each MAB can have as many arms as the number of items. This strategy increases the bookkeeping requirements, and there are also overlapping arms for the MABs at the various positions, which is not optimal since one needs to discard the already selected arms for the lower-ranked MABs. Several papers that use the MAB framework or online learning to diversify appear in the domain of news content optimization [1], where diversity is considered an essential factor. One of the interesting papers, by Yue et al. [43], uses a linear submodular bandit paradigm to learn the user preferences for diverse ranking; the use of submodular bandits guarantees the existence of a greedy approximation algorithm. In some other papers, researchers use a linear or nonlinear model to simultaneously learn to rank and diversify [14, 18, 28]. The main problem with such algorithms is that the computation of the variance used to update the rewards in the MAB framework becomes more complex. It is also generally much harder to evaluate the effectiveness of an online algorithm than of traditional batch learning models. Hence, using one ranking function for both learning to rank and diversification may not be practical. It is, however, possible to run an online diversification algorithm on top of traditional learning to rank (LTR) algorithms to reduce the complexity of the engineering system. In e-commerce, the additional challenge is the need to account for revenue maximization, which requires formulating a different learning problem.

We show the improvement in median CLV obtained by diversifying results in e-commerce. We have not found this relationship in any other paper. Extensive literature is available on CLV [9, 29] for interested readers; we omit a survey of CLV modeling literature since our work is not related to it.

3 PROBLEM SETTING
We address the problem of diversifying e-commerce search results from the perspective of the e-commerce firm and also from the perspective of the users. The firm intends to maximize revenue and increase CLV. The customers desire to find relevant products that they may be interested in purchasing. Hence, we define the learning to diversify problem for e-commerce as maximization of revenue while maintaining a relevance threshold. We now introduce some notation. Consider a query q and a corresponding set of n relevant items D = {d_1, d_2, ..., d_n} whose prices are given by {ρ_1, ρ_2, ..., ρ_n}. We also assume that each product has a relevance score with respect to the query, and these scores are given by {s_1, s_2, ..., s_n}. The relevance score is correlated with user clicks, but the degree of correlation can vary across queries and product corpora. Let us denote a set of m users by U = {u_1, ..., u_m}. We also consider a simple user behavior model where we assume that a user browses the items one after another from the top and then either purchases an item and leaves the site, or leaves the site without any purchase after browsing at most the top k (k << n) items shown to him or her. We also assume that we study iterations (search sessions) {1, ..., T}, and that the top k items at an iteration t for a query q are given by (b_1^t, ..., b_k^t), with corresponding products (d(b_1^t), ..., d(b_k^t)). Note that, in our scenario, we do not have specific user information; we only have the query. This is a common scenario on an e-commerce site: customers start browsing for a product and typically log into the site only at purchase time, or sometimes use a guest account for a purchase.
Our goal is to select the best top k items from the n items for the query q such that the result set caters to the diverse preferences of the users. If a user finds a relevant product in the top k, then he or she can decide whether or not to purchase that product. Let us use two indicator variables, z_jt ∈ {0, 1} and x_jt ∈ {0, 1}. The first denotes whether the product d_j is selected in the top k results and shown to a customer, and the second denotes whether or not the product d_j is purchased in iteration t for the given query. The revenue generated for the query in an iteration t can then be computed as

    r_t = Σ_{j=1}^{n} x_{jt} ρ_j

The total revenue is given by R_T = Σ_{t=1}^{T} r_t. We can similarly define a user-level revenue R_{u_i} denoting the total revenue obtained from user u_i in T iterations; then R_T = Σ_{i=1}^{m} R_{u_i}. We also define the cumulative sum of the relevance scores of the top k products, S = Σ_{j=1}^{k} s_j, which we use to keep a bound on relevance.

Given the above setup, the learning to diversify problem in e-commerce can be mapped to maximizing R_{u_i} for every user u_i. However, this is not the same as maximizing R_T, and it is also hard to estimate R_{u_i} without estimating the purchase probabilities of a specific user. A learning to diversify algorithm, on the other hand, can optimize the search result for a larger pool of users and improve the median of the CLV. The optimization problem then requires maximizing the revenue while aiming to learn the diverse preferences of the customers.

Hence, one possible solution is to simultaneously learn to diversify and maximize the total revenue while maintaining a reasonable value of relevance. This approach also requires us to use a threshold on the relevance of the top k products.

Note that using such a relevance threshold is not a new concept: it has been used before in the literature on diverse ranking [11] and, in general, for limiting the recall set in search [22]. It is not hard in practice to establish such thresholds for a query, and this is often done in industry [41] for various purposes.
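To make the bookkeeping behind this notation concrete, the following minimal sketch (our own toy code, not part of the paper's implementation) computes r_t, the running total R_T, and the relevance mass S for a proposed top-k page:

    # prices[j] = rho_j and scores[j] = s_j for the n items of one query;
    # top_k holds the indices shown in an iteration, purchased the indices
    # actually bought (i.e. the j with x_jt = 1).

    def iteration_revenue(purchased, prices):
        # r_t = sum_j x_jt * rho_j
        return sum(prices[j] for j in purchased)

    def relevance_mass(top_k, scores):
        # S = sum of the relevance scores of the k shown products
        return sum(scores[j] for j in top_k)

    prices = [25.0, 120.0, 9.5, 300.0]
    scores = [0.9, 0.6, 0.8, 0.4]
    R_T = 0.0
    for purchased in ([0], [], [1, 2]):      # three toy sessions
        R_T += iteration_revenue(purchased, prices)
    print(R_T, relevance_mass([0, 1, 2], scores))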
4 LEARNING ALGORITHMS
We now describe three algorithms for the problem. The first two are modifications of algorithms proposed in the Radlinski et al. [36] paper, and we use them as baselines for the learning to diversify problem in e-commerce search. The third is a novel knapsack based semi-bandit algorithm that we develop for this problem.

4.1 Revenue Ranked Explore and Commit algorithm (RREC)
This algorithm is similar to the "ranked explore and commit" algorithm (REC) described in the paper by Radlinski et al. [36]. Since we intend to maximize revenue, which is real-valued, instead of click-through rate, which is a binary input, a minor change is required. The algorithm iteratively shows each of the n items at each rank x times and records the purchases over these nx iterations for every product at every rank position. We can then estimate the probability of purchase of every product at every rank. Multiplying the estimated probability of purchase by the price of the product gives the generated revenue. The revenue is real-valued and can be an arbitrary number, so we normalize the prices between 0 and 1 for all the products to estimate a normalized expected revenue. After completing the nx iterations, the algorithm presents the products in order of decreasing expected revenue. Our implementation of the algorithm is shown in Algorithm 1.

This first greedy algorithm maximizes the revenue generated after nx iterations if the user preferences are unchanged. Its main problem is that it assumes the users' preferences do not change with time. The algorithm may achieve excellent performance from the revenue metric's perspective in specific scenarios where customer preferences do not change, particularly after the nx iterations, once it has a reasonable estimate of the purchase probabilities. However, since every product needs to be shown at every position, the regret of this algorithm can be very poor, and consequently the revenue generated in the initial nx iterations can be arbitrarily bad. Hence, this algorithm is quite impractical for actual implementation.

Moreover, in e-commerce, it is particularly unwise to take any risk of making customers unhappy; even controlled experiments need to be conducted very carefully, often with a budget [21]. We present this algorithm as a baseline: a simple greedy algorithm that maximizes the revenue after a sufficiently long exploration phase.

Algorithm 1 Revenue Ranked Explore and Commit algorithm (RREC)
  input: items (d_1, d_2, ..., d_n), parameters ϵ, δ, k
  x ← ⌈2k²/ϵ² log(2k/δ)⌉
  (b_1, b_2, ..., b_k) ← k arbitrary items
  for i = 1, ..., k do                      ▷ at each rank; ∀j: i_j = 0, p_j = 0, r_j = 0
      for c = 1, ..., x do                  ▷ loop x times
          for j = 1, ..., n do              ▷ over every item d_j
              b_i ← d_j
              display b_1, ..., b_k to the user
              i_j = i_j + 1
              if the user purchases b_i then
                  p_j = p_j + 1
              end if
          end for
      end for
      for j = 1, ..., n do
          pr_j = p_j / (i_j + β)            ▷ β ≥ 1 is a constant to avoid division by zero
          mr_j = pr_j × ρ_j × Z             ▷ Z is a normalization constant for the prices of a query
      end for
      j* ← argmax_j mr_j                    ▷ commit to the best document at this rank
      b_i ← d_{j*}
  end for
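The following is a compact Python sketch of the explore-and-commit loop of Algorithm 1. The purchase oracle and all parameter values are our own stand-ins for real user feedback, and duplicate commits across ranks are not handled, so this is an illustration rather than a production implementation:

    import math, random

    def rrec(n, k, prices, purchase_oracle, eps=0.5, delta=0.2, beta=1.0):
        # Revenue Ranked Explore and Commit (Algorithm 1): show every item
        # at every rank x times, estimate purchase rates, commit rank by rank.
        x = math.ceil(2 * k**2 / eps**2 * math.log(2 * k / delta))
        Z = 1.0 / max(prices)                    # normalize prices into [0, 1]
        ranking = random.sample(range(n), k)     # start from k arbitrary items
        for rank in range(k):
            impressions = [0] * n
            purchases = [0] * n
            for _ in range(x):
                for j in range(n):               # place item j at this rank
                    ranking[rank] = j            # (and "display" the page)
                    impressions[j] += 1
                    if purchase_oracle(j, rank):
                        purchases[j] += 1
            rev = [purchases[j] / (impressions[j] + beta) * prices[j] * Z
                   for j in range(n)]            # normalized expected revenue
            ranking[rank] = max(range(n), key=rev.__getitem__)   # commit
        return ranking

    random.seed(1)
    prices = [25.0, 120.0, 9.5, 300.0, 42.0]
    oracle = lambda j, rank: random.random() < 0.02 * (j + 1)    # toy behavior
    print(rrec(n=5, k=3, prices=prices, purchase_oracle=oracle))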
4.2 Revenue Ranked Bandits Algorithm (RRBA)
In this algorithm, we modify Radlinski et al.'s [36] "ranked bandits algorithm" (RBA) to learn to diversify while maximizing the revenue coming from the users. We provide a schematic of the modified algorithm in Algorithm 2. It uses k bandits MAB_1, ..., MAB_k for the k rank positions of a query. Each bandit is assumed to have n arms (a_1, ..., a_n) corresponding to the n products in the recall set for the query. If a user purchases a product d_j from rank position r at an iteration t, we update the score for the arm a_j of MAB_r as follows:

    v_{rj} = (p_{rj} / i_{rj}) × ρ_j × Z + α √(2 ln t / i_{rj})

where p_{rj} is the purchase count and i_{rj} is the impression count of product d_j at rank position r up to that iteration, α is a constant, and Z is a normalization factor for the prices of the products. The algorithm considers the same set of products as arms for all the MABs at all positions. Consequently, once a product is selected by one of the first k − 1 MABs, the same product cannot be selected by the k-th MAB. Hence, except for the MAB at the very first position, the MABs may not be able to select their best arms. This phenomenon makes the algorithm perform poorly in choosing the optimal top k products. Moreover, the authors mention in the paper that the analysis of the regret for the non-binary case is non-trivial, and that the greedy algorithm on which RBA is based can obtain a pay-off that is a factor of (k − ϵ) below optimal for any ϵ. This algorithm can learn the preferences even if they are changing, and unlike RREC it does not require estimating the purchase probability of every product at every position up front. However, a remaining problem with this algorithm is the number of MABs and the amount of bookkeeping required to run it in practice. Typically, such MAB algorithms cannot replace the learning to rank algorithms in practice, as mentioned in the paper by Radlinski et al. [36]. The most practical way of implementing any such bandit optimization is as the topmost layer of a multi-layer ranking architecture, where the set of products ranked by the LTR algorithm in the layer below is handed over to the MAB algorithm.

Algorithm 2 Revenue Ranked Bandits Algorithm (RRBA)
  Initialize MAB_1(n), MAB_2(n), ..., MAB_k(n)            ▷ initialize the MABs
  for t = 1, ..., T do                                    ▷ for T iterations
      for r = 1, ..., k do                                ▷ for every position r
          b̂_r ← selectarm(MAB_r)
          if b̂_r ∈ (b_1^t, ..., b_{r−1}^t) then            ▷ replace repeats
              b_r^t ← arbitrary document from D
          else
              b_r^t ← b̂_r
          end if
      end for
      display (b_1^t, ..., b_k^t) to the user; record purchases
      for r = 1, ..., k do                                ▷ do all updates
          i_{d(b_r^t)} = i_{d(b_r^t)} + 1
          if the user purchases b_r^t and b̂_r = b_r^t then
              p_{d(b_r^t)} = p_{d(b_r^t)} + 1
          end if
          pr_{d(b_r^t)} = p_{d(b_r^t)} / i_{d(b_r^t)}
          mr_{d(b_r^t)} = pr_{d(b_r^t)} × ρ_{d(b_r^t)} × Z
          var_{d(b_r^t)} = α √(2 ln t / i_{d(b_r^t)})
          sc_{d(b_r^t)} = mr_{d(b_r^t)} + var_{d(b_r^t)}
          update MAB_r with arm = b_r^t, reward = sc_{d(b_r^t)}
      end for
  end for
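The arm-score update that RRBA feeds to each positional MAB is exactly the formula above; a minimal sketch of it follows (the constants α and Z here are illustrative assumptions, not values from the paper):

    import math

    def rrba_arm_score(p_rj, i_rj, price, t, alpha=1.0, Z=1.0 / 500):
        # v_rj = (p_rj / i_rj) * rho_j * Z + alpha * sqrt(2 ln t / i_rj)
        # p_rj: purchases of d_j at rank r; i_rj: impressions; Z: price scale.
        return p_rj / i_rj * price * Z + alpha * math.sqrt(2 * math.log(t) / i_rj)

    # e.g. 3 purchases in 40 impressions of a $120 product at iteration t = 500:
    print(rrba_arm_score(p_rj=3, i_rj=40, price=120.0, t=500))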
4.3 Knapsack based bandit algorithm (KPBA)
KPBA is a novel algorithm that we propose in this paper. To overcome the problem of overlapping arms in the k MABs of the RRBA algorithm, we here consider a semi-bandit algorithm [6] that uses a single bandit with n arms and selects the k best arms at every iteration.

It processes feedback for k arms at each iteration. Furthermore, to guarantee better relevance, we introduce a relevance threshold while maximizing the revenue and learning the diverse preferences. The resulting optimization problem turns out to be similar to the well-known exact k-item knapsack problem, or E-kKP [25]. Note that the revenue expression is based on a UCB score similar to our formulation of RRBA. This algorithm does not require as much bookkeeping as RRBA, since it does not need to maintain k MABs. It also does not have the limitation of being unable to select an optimal arm. Furthermore, it keeps a relevance bound, and hence it can perform better on relevance as measured by information retrieval ranking metrics such as mean reciprocal rank (MRR). It is also a bandit paradigm, so it can learn changing preferences of customers.

We now discuss the KPBA formulation. Suppose an item d_j is purchased at an iteration t; then the normalized revenue generated in that iteration can be written as r_{jt} = ρ_j × Z. The algorithm keeps track of n UCB scores for the n products. The UCB score of a product d_j after iteration t can be written as

    v_{jt}^{UCB} = r_{jt} + α √(2 ln t / i_{jt})

where i_{jt} is the impression count of item d_j at iteration t. At each iteration in {1, 2, ..., T}, the products for the top k positions can then be chosen from a knapsack based optimization framework as follows:

    max Σ_{j=1}^{k} v_{jt}^{UCB}
    subject to  Σ_{j=1}^{k} s_j ≥ B                                  (1)

Here B is a threshold on the cumulative sum of the relevance scores, and the v_{jt}^{UCB} are updated after each iteration. In order to solve this problem, we cast problem (1) as a binary integer programming (BIP) problem [19]. We define ŝ_i = 1 − s_i and B̂ = C − B, where C is another constant. Problem (1) can then be rewritten as:

    max_{x_{1t}, ..., x_{nt}}  Σ_{i=1}^{n} x_{it} × v_{it}^{UCB}
    subject to  Σ_{i=1}^{n} x_{it} × ŝ_i ≤ B̂
                Σ_{i=1}^{n} x_{it} = k                               (2)

Problem (2) is known as E-kKP. This problem is NP-hard in the number of arms n. A brute force solution for this problem can be found in O(n^k) time, where k is the number of selected arms.
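For small recall sets, the selection step of problem (2) can be solved by exhaustive search, which is a useful reference implementation even though it is exponential in k. The sketch below is our own illustration with made-up values; the H^{1/2} subroutine discussed next is what makes the step practical:

    import itertools

    def select_top_k_exact(values, scores, k, B):
        # Exact k-item knapsack (problem (2)) by exhaustive search: among
        # all k-subsets whose summed relevance meets the threshold B, pick
        # the one maximizing the summed UCB values. O(C(n, k)); toy sizes only.
        best, best_val = None, float("-inf")
        for subset in itertools.combinations(range(len(values)), k):
            if sum(scores[j] for j in subset) >= B:
                total = sum(values[j] for j in subset)
                if total > best_val:
                    best, best_val = subset, total
        return best                     # None if no feasible k-subset exists

    ucb = [0.9, 1.4, 0.7, 2.1, 1.0]     # v_jt^UCB (toy values)
    scores = [0.9, 0.6, 0.8, 0.4, 0.7]  # relevance scores s_j
    print(select_top_k_exact(ucb, scores, k=3, B=2.0))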
However, there is a 1/2-approximation algorithm named H^{1/2}, proposed in a paper by Caprara et al. [10]. The authors use a well-known LP relaxation of the knapsack problem [32] and the fact that any basic feasible solution of the linear relaxation of the BIP contains at most two fractional weights. If there are no fractional weights, then the optimal basic feasible solution of the LP relaxation is an optimal solution of the BIP problem. If one or two of the n weights have fractional values in the basic feasible solution, then we can select one of the two fractional items, and this still guarantees a 1/2-approximate solution for E-kKP.

The authors provide an analysis of the algorithm and show that it runs in O(n) time. Readers can find the details of the algorithm H^{1/2} and its analysis in the paper by Caprara et al. [10].

We show a schematic of our proposed algorithm in Algorithm 3, where we use the H^{1/2} algorithm as a subroutine.

Algorithm 3 Knapsack based bandit algorithm (KPBA)
  Initialize SemiMAB(n)                         ▷ initialize the semi-bandit
  for t = 1, ..., T do                          ▷ for T iterations
      {b_1^t, ..., b_k^t} ← H^{1/2}(D)           ▷ details of this subroutine are in [10]
      display (b_1^t, ..., b_k^t) to the user; record purchases
      for l = 1, ..., k do                      ▷ update impression counts of b_1^t to b_k^t
          i_l = i_l + 1                         ▷ l = 1, ..., k index the k shown products
          I_l = 0                               ▷ I_l is a boolean purchase indicator
          if the user purchases b_l^t then
              I_l = 1
          end if
      end for
      for l = 1, ..., k do                      ▷ update UCB scores
          v_l = I_l × (p_l / i_l) × ρ_l × Z + α √(2 ln t / i_l)
      end for
  end for
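Putting the pieces together, here is a minimal simulation of the KPBA loop of Algorithm 3, reusing select_top_k_exact from the earlier sketch as a stand-in for the H^{1/2} subroutine, with a toy purchase oracle of our own in place of real users:

    import math, random

    def kpba(T, k, prices, scores, B, purchase_oracle, select, alpha=1.0):
        n = len(prices)
        Z = 1.0 / max(prices)
        shown = [0] * n                        # impression counts i_l
        bought = [0] * n                       # purchase counts p_l
        ucb = [float("inf")] * n               # optimistic initial scores
        for t in range(1, T + 1):
            top_k = select(ucb, scores, k, B)  # H^{1/2} in the real algorithm
            for j in top_k:
                shown[j] += 1
                if purchase_oracle(j):
                    bought[j] += 1
            for j in top_k:                    # UCB score updates
                mean_rev = bought[j] / shown[j] * prices[j] * Z
                ucb[j] = mean_rev + alpha * math.sqrt(2 * math.log(t) / shown[j])
        return ucb

    random.seed(0)
    prices = [25.0, 120.0, 9.5, 300.0, 42.0]
    scores = [0.9, 0.6, 0.8, 0.4, 0.7]
    oracle = lambda j: random.random() < 0.03
    print(kpba(T=200, k=3, prices=prices, scores=scores, B=1.5,
               purchase_oracle=oracle, select=select_top_k_exact))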
5 THEORETICAL ANALYSIS
5.1 The offline optimization problem
It is straightforward to see that the problem of finding the set of k optimal products from n products with binary rewards is equivalent to the maximum coverage problem [36]. The optimal greedy approximation [27, 35] for this problem is a (1 − 1/e)-approximation algorithm.

5.2 RREC
We modify Radlinski et al.'s [36] REC algorithm to work with real-valued rewards; this algorithm serves as a baseline in our paper. Radlinski et al.'s [36] paper mentions that the regret analysis for REC can be extended to the case when the rewards are not binary. They show that REC achieves a payoff of (1 − 1/e − ϵ)OPT − O((k³ n/ϵ) ln(k/δ)) with probability at least (1 − δ). In our case the expression is similar, since we use the same algorithm with real rewards.

5.3 RRBA
Radlinski et al. [36] provide results on the regret and payoff of the RBA algorithm using EXP3 [8] and binary rewards. The combined payoff shown in the paper is (1 − 1/e)OPT − O(k √(nT log n)). The authors comment that in the case of real rewards the payoff can be a factor of (k − ϵ) below optimal, for any ϵ ≥ 0. We can use the same combined payoff for RRBA with real rewards.

5.4 KPBA
Our problem is conceptually similar to the dynamic assortment selection problem [38]. However, assortment selection models often assume that the purchase of a product depends on the selection of a specific set of products. Many papers in this area use the concept of the utility of a set of products and a choice model to model the purchase behavior of the users [33]. In contrast, we assume that users explore each product individually and independently, and that users are not interested in going beyond the top k products if they do not find the desired product. The analysis of our algorithm can nevertheless follow the analysis of algorithms developed for dynamic assortment selection. Specifically, we use the proof of the regret bound from one such paper by Agrawal et al. [3]. The authors use a choice model for purchases and keep showing one assortment of k products until a no-purchase event happens. In our case, the loss in revenue at each iteration t can be expressed as:

    (1 − 1/e) OPT − (1/2) Σ_{j=1}^{n} z_{jt} v_{jt}^{UCB}

The factor 1/2 in the second term comes from the half-approximation exact k-item knapsack algorithm that we use. Intuitively, since v_{jt}^{UCB} follows the properties of the UCB algorithm, each product is bounded by the regret bound of the UCB algorithm [7]. Note that the number of times an item is shown to the customers is on average bounded by tk/n over t iterations. Let r̂_{jt} denote the expected normalized revenue of an item d_j in an iteration t. We can then write:

    Σ_{j=1}^{n} z_{jt} v_{jt}^{UCB} ≥ Σ_{j=1}^{n} z_{jt} r̂_{jt}

    Σ_{j=1}^{n} (z_{jt} v_{jt}^{UCB} − z_{jt} r̂_{jt}) ≤ Σ_{j=1}^{n} α √(2 ln t / i_{jt})

    Σ_{j=1}^{n} (z_{jt} v_{jt}^{UCB} − z_{jt} r̂_{jt}) ≤ O(√(n ln t / (tk)))

    Σ_{j=1}^{n} (z_{jt} v_{jt}^{UCB} − z_{jt} r̂_{jt}) ≤ O(√(nt ln t))

Note that we use the following bound on the expected average:

    Σ_{t=1}^{T} √(1/t) = O(√T)

We can also use a different bound for the UCB algorithm, using lg n instead of lg t.
Our sketch of a derivation follows the proofs in Agrawal et al.'s paper [3]. The bound can probably be made tighter using the techniques used by Auer et al. [7] for proving the bounds of the UCB algorithm. This shows that the regret here can be bounded by O(√(nT lg T)). The payoff can then be bounded by (1 − 1/e)OPT − O(√(nT lg n)) for KPBA, which is similar to the bound of Radlinski et al.'s [36] algorithm with real rewards.

5.5 Comments from the Analysis
From the above analysis, it is clear that RREC has several drawbacks, as also identified by Radlinski et al. [36] in their paper. Moreover, RREC is impractical, since its regret in the first nx iterations can be arbitrarily bad [36]. We therefore expect it to also perform poorly in terms of revenue as well as median CLV. The analysis does not tell us anything about its performance in terms of relevance. RRBA, on the other hand, can clearly be expected to perform better than RREC for revenue, based on the analytical expression of the regret. KPBA can be expected to be similar to RRBA in terms of revenue and median CLV. However, KPBA can be more appealing than RRBA when we have real rewards. Moreover, since KPBA does not compromise relevance beyond a bound while achieving diversification, it can be expected to perform better for the users in terms of relevance metrics.

6 EVALUATION
In this study, we use the following metrics to evaluate the performance of our learning algorithms from the perspectives of revenue, CLV, and relevance:
• Average Revenue per Query (ARQ): The average revenue over all the queries in the experimental study, given by (Σ_{i=1}^{t} R_i) / n, where t is the total number of iterations in the simulation study and n is the number of queries.
• Median Customer Lifetime Value (MCV): Suppose each customer u_j spends r_{u_j}^t on purchasing products in t iterations. The customer value metric is then the median of all customers' spends.
• Mean Reciprocal Rank of Purchases (PMRR): The reciprocal rank of a purchase is the reciprocal of the rank of the product purchased for a given query in a search session. PMRR is the mean of these reciprocal ranks over all search sessions in a period of time.
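All three metrics can be computed directly from a session log; the sketch below (field layout and values are our own toy example) mirrors the definitions above, computing PMRR over the sessions that ended in a purchase:

    import statistics

    # Each session: (query, user, purchased_rank or None, revenue)
    log = [("q1", "u1", 2, 120.0), ("q1", "u2", None, 0.0),
           ("q2", "u1", 1, 25.0),  ("q2", "u3", 4, 300.0)]

    n_queries = len({s[0] for s in log})
    arq = sum(s[3] for s in log) / n_queries        # Average Revenue per Query

    spend = {}                                      # per-user total spend
    for _, user, _, revenue in log:
        spend[user] = spend.get(user, 0.0) + revenue
    mcv = statistics.median(spend.values())         # Median Customer Value

    reciprocal = [1.0 / rank for _, _, rank, _ in log if rank is not None]
    pmrr = sum(reciprocal) / len(reciprocal)        # purchase-based MRR

    print(arq, mcv, pmrr)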
It is clear from the analytical derivations in Section 5 that KPBA and RRBA should both perform better than RREC in terms of ARQ. This also indicates that these two algorithms can exceed RREC in MCV. Moreover, because of the relevance bound in KPBA, its PMRR is expected to be better than RRBA's. However, it is not clear how the PMRR of KPBA compares with that of RREC, whose PMRR can be good particularly after the nx exploration iterations. The difference between KPBA and RRBA in terms of ARQ is also unclear, since their pay-off expressions are similar.

We did not have any historical search log data from a real e-commerce site. We therefore evaluate our algorithms by generating synthetic data and conducting a simulation study. We assume that different users have different price preferences and that these mainly dictate their purchase behavior. In reality, user preference is a complex function of multiple factors associated with the products and other variables, such as the time of year and the financial status of the person. However, this assumption keeps our data generation and simulation simple while still allowing us to evaluate and compare our algorithms in a manner that reveals their characteristics. In our simulation study, we model the biases of a real system. It is known that evaluating bandit based algorithms in an online setting can be very hard [31], and offline counterfactual techniques [24] using historical logs have recently become a topic of research for such evaluations. We have adopted the main ideas of such evaluation techniques in this paper.

We aim to answer the following questions in the simulation study:
(1) How does KPBA compare with RRBA in terms of ARQ, MCV, and PMRR? In particular, we are interested in how much better these metrics can be if we use KPBA instead of RRBA, under scenarios with and without position bias and with changing customer preferences.
(2) Does RREC perform better than RRBA in PMRR, and how does it compare with KPBA under a reasonable assumption on the relevance threshold?
We discuss the experiments after illustrating our data generation methodology.

6.1 Synthetic data generation
Our synthetic e-commerce data consists of N queries and M relevant products per query. We assign prices to the products randomly from a multimodal Gaussian distribution with between 1 and 8 peaks and mean prices between $10 and $500. The purchase rates are generated from a similar distribution with mean peaks between 0.0 and 0.06. The maximum mean peak purchase rate is assigned to the cheapest mean price peak 70% of the time, and the rest of the time it is assigned to one of the other mean price peaks. We additionally generate a relevance score that is linearly correlated with the purchase rates, with a Pearson correlation coefficient between 0.10 and 0.30 and a p-value less than 0.10. In this paper, we use M = 200. Note that in a real scenario M can be very large, but a recall set for a search query is typically cut down to a much smaller size for performance reasons, and we anticipate applying our algorithms on top of a multi-layer ranking architecture. Moreover, this makes our experiments simpler and faster to run without losing generality. Figure 1 shows histograms of the synthetically generated prices along with the relationship between the synthetically generated relevance scores and the product purchase rates for a query. The multimodal price distribution and the weak linear correlation between relevance score and purchase rate represent a common scenario for an e-commerce platform.
Figure 1: Price histograms and their corresponding relevance scores and purchase rates. The plot shows the correlation between relevance score and purchase rate with a fitted line.
6.2 User's price preference model
We assign every user u ∈ U to a preferred price cluster t_u using a Chinese Restaurant Process (CRP) [4] with a parameter θ. If |U| = 20 and θ = 3.0, the average number of price clusters in the CRP is 6.5. We then assign each product to one of these clusters such that cheaper products are assigned to smaller price cluster numbers. We denote the price cluster of a product d_i by t_{d_i}.

6.3 User behavior model
In the simulation, we consider that a user u browses through the top k products shown, one after another. The following equation gives the probability of a user u purchasing a product d_i located at ranking position j:

    pr(u, d_i, j) = c × p_{d_i} × (1 / log₂(j+1))   if t_u = t_{d_i}
                    (1 − c) × p_{d_i}               otherwise

We use c = 0.7. The factor 1/log₂(j+1) is the position bias.
7     RESULTS
We conducted three experimental studies to evaluate the performance of these algorithms. In each iteration of a study, we pick a query and a user uniformly at random. For each study, we run three experiments with each algorithm using the following set-ups: (1) N = 1, |U| = 20, θ = 3.0, and 1000 iterations; (2) N = 10, |U| = 20, θ = 10.0, and 50000 iterations; (3) N = 10, |U| = 100, θ = 10.0, and 50000 iterations. For each experiment, we report the four metrics averaged over 100 runs of the simulator. We describe the experiments in the sections below.
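The users in the larger experiments arrive through a Chinese restaurant process (CRP) with concentration parameter θ. As a quick illustration of that arrival process, the following is a standard CRP sampler; the function name and the representation of cluster preferences are our own.

    import random

    def sample_crp_clusters(num_users, theta, rng=random.Random(0)):
        # Chinese restaurant process: user n joins an existing cluster k with
        # probability n_k / (n + theta) and opens a new cluster with
        # probability theta / (n + theta).
        cluster_sizes = []
        assignments = []
        for n in range(num_users):
            r = rng.random() * (n + theta)
            chosen = len(cluster_sizes)  # default: open a new cluster
            acc = 0.0
            for k, size in enumerate(cluster_sizes):
                acc += size
                if r < acc:
                    chosen = k
                    break
            if chosen == len(cluster_sizes):
                cluster_sizes.append(0)
            cluster_sizes[chosen] += 1
            assignments.append(chosen)
        return assignments

    # e.g., 100 users with theta = 30.0, as in the third set-up
    user_clusters = sample_crp_clusters(100, 30.0)

A larger θ yields more distinct price-preference clusters among the simulated users.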
7.1     Comparison without Position Bias
In this experiment, we do not use the position bias when computing the probability of purchase. The results are summarized in Table 1. We observed that KPBA and RRBA generate significantly higher ARQ and MCV than RREC; however, RREC has better PMRR than RRBA. KPBA, in turn, generates even higher ARQ and MCV than RRBA, and its PMRR is close to that of RREC. We conducted experiments with different numbers of users and iterations and observed a similar trend. The second row for each algorithm shows the results with 10 queries and 50000 iterations; KPBA proves to be substantially better than both other algorithms at 50000 iterations, except that its PMRR is generally similar to RREC's. The third row shows the results with 10 queries, 50000 iterations, and 100 users who arrive through a CRP process with θ = 30.0. KPBA again produces substantially better ARQ and MCV than the other two algorithms, and it also performs very similarly to RREC on the PMRR metric.

7.2     Comparison with Position Bias
In this section, we repeat all the previous experiments with position bias in the user behavior by introducing a logarithmic decay 1/log(k) of the probability of purchase based on the rank of the products, where k is the ranking position of the product in an iteration. We again observe a similar pattern: KPBA shows better performance than RRBA on all metrics. This result is expected, since we already found better PMRR when we ran the experiments without position bias. Table 2 summarizes the experiments; the second and third rows of each algorithm use θ = 10 for the CRP process. Note that the results for KPBA are significantly better than those of RRBA when position bias is present. Thus, KPBA can be the more practical algorithm to deploy.
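To see how gentle this decay is, the factor from the user model in Section 6.3 can be tabulated directly (a quick check, assuming the base-2 logarithm used there):

    import math

    # Position-bias factor 1/log2(j + 1): 1.0 at rank 1, 0.5 at rank 3,
    # and roughly 0.33 at rank 7.
    for j in (1, 2, 3, 7, 10):
        print(j, round(1.0 / math.log2(j + 1), 3))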
Algo    Q    U    Iter    ARQ ($)    MCV ($)    PMRR
RREC    1    20   1000        140        7.0    0.41
RREC   10    20   50000      3500       1486    0.54
RREC   10   100   50000      5039        439    0.52
RRBA    1    20   1000      18567        870    0.26
RRBA   10    20   50000    138556      64483    0.29
RRBA   10   100   50000    144970      14254    0.29
KPBA    1    20   1000      25022       1085    0.41
KPBA   10    20   50000    156941      79621    0.52
KPBA   10   100   50000    174030      16105    0.52
Table 1: Comparison of KPBA, RRBA, and RREC without position bias.

Algo    Q    U    Iter    ARQ ($)    MCV ($)    PMRR
RREC    1    20   1000        158          8    0.37
RREC   10    20   50000      3500       1486    0.54
RREC   10   100   50000      5039        439    0.52
RRBA    1    20   1000      16922        843    0.19
RRBA   10    20   50000     69221      34521    0.29
RRBA   10   100   50000     97534       9917    0.31
KPBA    1    20   1000      21497       1085    0.40
KPBA   10    20   50000    109086      53716    0.53
KPBA   10   100   50000    107091      10655    0.54
Table 2: Comparison of KPBA, RRBA, and RREC with position bias.


Algo    Q    U    Iter    ARQ ($)    MCV ($)    PMRR
RREC    1    20   1000        117          6    0.35
RREC   10    20   50000      3100       1134    0.50
RREC   10   100   50000     17111       1634    0.57
RRBA    1    20   1000      20288       1019    0.17
RRBA   10    20   50000     43654      25345    0.25
RRBA   10   100   50000     95281       9540    0.27
KPBA    1    20   1000      23201       1125    0.49
KPBA   10    20   50000     95754      42876    0.53
KPBA   10   100   50000    103324      10315    0.57
Table 3: Comparison of KPBA, RRBA, and RREC with changing customer preferences.
7.3     Comparison with Changing Customer Preferences
In this section, we experiment with customer preferences that change after every 500 iterations. We summarize the experiments in Table 3. We again notice that the KPBA algorithm still performs better than RRBA on all three metrics, and it compares favorably with RREC on PMRR.
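A minimal way to realize this non-stationarity in the simulator is to redraw each user's price-cluster preference on a fixed schedule. This is a sketch under our own assumptions; the paper does not specify the redraw distribution, so we use a uniform one.

    import random

    def maybe_refresh_preferences(iteration, user_clusters, num_clusters,
                                  period=500, rng=random.Random(0)):
        # Every `period` iterations, redraw every user's preferred price
        # cluster; between refreshes, preferences stay fixed.
        if iteration > 0 and iteration % period == 0:
            for u in range(len(user_clusters)):
                user_clusters[u] = rng.randrange(num_clusters)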
7.4     Comparison of Convergence for the Three Algorithms
To show the growth of the four metrics under the three scenarios above, we track the two main metrics, ARQ and MCV, by collecting them at every 100th iteration of our simulation study. Figure 2 uses 5 queries and 20 users with θ = 3.0 and compares the two metrics without position bias for all three algorithms. Figure 3 shows those two metrics for all three algorithms with position bias. Figure 4 shows three metrics with customer preferences changing every 500 iterations, along with position bias. It is very clear from the plots that KPBA performs consistently better than RRBA and RREC on the ARQ and MCV metrics. Note that all prices are on a log scale in the convergence plots. Figure 4 also shows that KPBA performs similarly well to RREC on the PMRR metric.
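The bookkeeping behind these curves is simple; a sketch of the logging loop we have in mind is below. The simulator object and its step, arq, and mcv methods are hypothetical names of our own, not the authors' code.

    def run_and_log(simulator, num_iterations, log_every=100):
        # Record cumulative ARQ and MCV every `log_every` iterations so the
        # convergence of each algorithm can be plotted (revenue on log scale).
        history = []
        for t in range(1, num_iterations + 1):
            simulator.step()  # one query/user interaction
            if t % log_every == 0:
                history.append((t, simulator.arq(), simulator.mcv()))
        return history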
7.5     Comments on Experimental Evaluation
As expected from the analytical study, KPBA performs well on the ARQ metric compared to RRBA, and RREC performs worst on the revenue metrics. Our simulation studies show that the MCV metric is also significantly better for KPBA than for RRBA and RREC. Moreover, the PMRR values of KPBA are similar to or better than those of RREC and far better than those of RRBA, based on both the simulation results and our theoretical analysis. We also observe that KPBA continues to perform better with more iterations, with more users, with position bias, and with changing user preferences.

8     CONCLUSION
In this paper, we introduce the learning to diversify problem for e-commerce search. We show that, in order to best serve both the company and the customers, e-commerce requires a unique formulation of the learning to diversify problem in which we learn to diversify and maximize revenue simultaneously while ensuring a minimum level of relevance for the top k results. Our theoretical results show that the KPBA algorithm is expected to achieve better ARQ and PMRR than RRBA. Our simulation studies show that KPBA also achieves better MCV than both RRBA and RREC, and that it performs well on PMRR compared to RREC. On the other hand, RRBA is quite poor from the customer's perspective for this problem, since its PMRR is low in all scenarios in our simulations. In essence, KPBA can be an efficient and practical algorithm for diversifying e-commerce search results. We also show that e-commerce companies can improve the CLV by using a diverse ranking strategy; this connection between diversity in ranking and CLV is worth exploring further. KPBA can also potentially be further optimized by formulating a variable-budget knapsack problem in which we simultaneously learn the optimal relevance threshold, and it may be possible to find better probabilistic approximation algorithms for such optimization problems [34]. We anticipate that this paper will motivate further research on diverse ranking for e-commerce search and recommender systems.
Figure 2: Convergence of ARQ and MCV without position bias. The red curves represent RREC, the blue curves RRBA, and the green curves KPBA; the revenue metric uses a log scale.


Figure 3: Convergence of ARQ and MCV with position bias. The red curves represent RREC, the blue curves RRBA, and the green curves KPBA; the revenue metric uses a log scale.


Figure 4: Convergence with customer preferences changing every 500 iterations, along with position bias. The red curves represent RREC, the blue curves RRBA, and the green curves KPBA; the revenue metric uses a log scale.


REFERENCES
 [1] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. 2009. Online models for content optimization. In Advances in Neural Information Processing Systems. 17–24.
 [2] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the second ACM international conference on web search and data mining. ACM, 5–14.
 [3] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. 2016. A Near-Optimal Exploration-Exploitation Approach for Assortment Selection. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 599–600.
 [4] David J Aldous. 1985. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII - 1983. Springer, 1–198.
 [5] Albert Angel and Nick Koudas. 2011. Efficient diversity-aware search. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 781–792.
 [6] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. 2013. Regret in online combinatorial optimization. Mathematics of Operations Research 39, 1 (2013), 31–45.
 [7] Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, Nov (2002), 397–422.
 [8] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32, 1 (2002), 48–77.
 [9] Hans H Bauer, Tomas Falk, and Maik Hammerschmidt. 2006. eTransQual: A transaction process-based approach for capturing service quality in online shopping. Journal of Business Research 59, 7 (2006), 866–875.
[10] Alberto Caprara, Hans Kellerer, and Ulrich Pferschy. 2000. Approximation Algorithms for Knapsack Problems with Cardinality Constraints. European Journal of Operational Research 123 (2000), 333–345.
[11] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 335–336.
[12] Olivier Chapelle, Yi Chang, and T-Y Liu. 2011. Future directions in learning to rank. In Proceedings of the Learning to Rank Challenge. 91–100.
[13] Harr Chen and David R Karger. 2006. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 429–436.
[14] Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. 2011. Contextual Bandits with Linear Payoff Functions. In AISTATS, Vol. 15. 208–214.
[15] Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 659–666.
[16] Nick Craswell. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems. Springer, 1703–1703.
[17] Olayinka David-West. 2016. E-Commerce Management in Emerging Markets. Encyclopedia of E-Commerce Development, Implementation, and Management 1 (2016), 200.
[18] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. 2010. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems. 586–594.
[19] Robert S Garfinkel and George L Nemhauser. 1972. Integer programming. Vol. 4. Wiley, New York.
[20] Jennifer Gimson. 2017. Why Lifetime Value is the Most Important Metric in eCommerce. https://crealytics.com/blog/lifetime-value-important-metric-ecommerce/. (2017). [Online; accessed 27-January-2018].
[21] Anjan Goswami, Wei Han, Zhenrui Wang, and Angela Jiang. 2015. Controlled experiments for decision-making in e-Commerce search. In 2015 IEEE International Conference on Big Data (Big Data). IEEE, 1094–1102.
[22] David Hawking, Paul Thistlewaite, and Nick Craswell. 1997. ANU/ACSys TREC-6 experiments. In TREC. Citeseer, 275–290.
[23] Dipak Jain and Siddhartha S Singh. 2002. Customer lifetime value research in marketing: A review and future directions. Journal of Interactive Marketing 16, 2 (2002), 34–46.
[24] Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 1199–1201.
[25] H. Kellerer, U. Pferschy, and D. Pisinger. 2003. Knapsack Problems. Springer.
[26] Hans Kellerer, Ulrich Pferschy, and David Pisinger. 2004. Introduction to NP-Completeness of knapsack problems. Springer.
[27] Samir Khuller, Anna Moss, and Joseph Seffi Naor. 1999. The budgeted maximum coverage problem. Inform. Process. Lett. 70, 1 (1999), 39–45.
[28] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. 2015. Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem. In COLT. 1141–1154.
[29] Kenneth C Laudon, Carol Guercio Traver, and Alfonso Vidal Romero Elizondo. 2007. E-commerce. Vol. 29. Pearson/Addison Wesley.
[30] Jessica Lee. 2013. No. 1 Position in Google Gets 33% of Search Traffic [Study]. https://searchenginewatch.com/sew/study/2276184/no-1-position-in-google-gets-33-of-search-traffic-study. (2013). [Online; accessed 27-January-2018].
[31] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, 297–306.
[32] Silvano Martello and Paolo Toth. 1990. Knapsack Problems: Algorithms and Computer Implementations. Wiley-Interscience, Hoboken, NJ.
[33] Daniel McFadden. 1980. Econometric models for probabilistic choice among products. Journal of Business (1980), S13–S29.
[34] S Muthukrishnan, Martin Pál, and Zoya Svitkina. 2007. Stochastic models for budget optimization in search-based advertising. In International Workshop on Web and Internet Economics. Springer, 131–142.
[35] George L Nemhauser and Laurence A Wolsey. 1978. Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research 3, 3 (1978), 177–188.
[36] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning Diverse Rankings with Multi-Armed Bandits. In Proceedings of the 25th International Conference on Machine Learning (ICML '08).
[37] Rodrygo L.T. Santos, Craig Macdonald, and Iadh Ounis. 2010. Selectively Diversifying Web Search Results. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10). ACM, New York, NY, USA, 1179–1188.
[38] Denis Sauré and Assaf Zeevi. 2013. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management 15, 3 (2013), 387–404.
[39] Daria Sorokina and Erick Cantu-Paz. 2016. Amazon Search: The Joy of Ranking Products. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). ACM, New York, NY, USA, 459–460.
[40] Unknown. 2013. The Moore's law of e-commerce or the crucial importance of being first. https://www.mabaya.com/the-moores-law-of-e-commerce-or-the-crucial-importance-of-being-first/. (2013). [Online; accessed 27-January-2018].
[41] Unknown. 2017. Relevance Scores: Understanding and Customizing. https://docs.marklogic.com/guide/search-dev/relevance. (2017). [Online; accessed 27-January-2018].
[42] Cong Yu, Laks Lakshmanan, and Sihem Amer-Yahia. 2009. It takes variety to make a world: diversification in recommender systems. In Proceedings of the 12th international conference on extending database technology: Advances in database technology. ACM, 368–378.
[43] Yisong Yue and Carlos Guestrin. 2011. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems. 2483–2491.
[44] ChengXiang Zhai, William W Cohen, and John Lafferty. 2003. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 10–17.