Learning to Diversify for E-commerce Search with Multi-Armed Bandit

Anjan Goswami (agoswami@ucdavis.edu), University of California, Davis
Chengxiang Zhai (czhai@illinois.edu), University of Illinois, Urbana-Champaign
Prasant Mohapatra (pmohapatra@ucdavis.edu), University of California, Davis

ABSTRACT
Search is central to e-commerce platforms, and diversification of search results is essential to cater to the diverse preferences of customers. One of the primary metrics of an e-commerce business is revenue, and the prices of the products shown influence customer preferences. Diversifying e-commerce search results therefore requires learning the diverse price preferences of customers while simultaneously maximizing revenue without hurting the relevance of the results. In this paper, we introduce the learning to diversify problem for e-commerce search. We also show that diversification improves the median customer lifetime value (CLV), a critical long-term business metric for an e-commerce business. We design three algorithms for the task. The first two are modifications of algorithms previously developed for the diversification problem in web search. The third is a novel approximate-knapsack-based semi-bandit algorithm. We derive regret and pay-off bounds for all three algorithms and conduct experiments with synthetic data and simulation to validate and compare them. In the simulation, we compute revenue, median CLV, and purchase-based mean reciprocal rank (PMRR) under various scenarios, including user preferences that change over time. We show that our proposed third algorithm is more practical and efficient than the first two and can produce higher revenue while maintaining a better median CLV and PMRR.

ACM Reference Format:
Anjan Goswami, Chengxiang Zhai, and Prasant Mohapatra. 2019. Learning to Diversify for E-commerce Search with Multi-Armed Bandit. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 11 pages.

Copyright (c) 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org.

1 INTRODUCTION
Search has become a central functionality of e-commerce sites. Typically, large e-commerce platforms have several competing products relevant to a query. For example, Amazon returns more than 1000 products for the query "car seat" and about 300 products for the query "smart watch", and the products remain relevant to the query up to almost the last page of results. Many products shown in the first few pages have great reviews, and it is clear that these products have been purchased regularly from Amazon. This scenario illustrates that there are often too many choices for users on a large e-commerce web site. However, it is well established that user clicks drop dramatically after the first page [30, 39] in both web and e-commerce search. Consequently, a user often does not browse through all the relevant products returned for a search query and can abandon the search if only a few or no relevant products are found in the top results. In web search, such abandonment is known to be reduced by showing a diverse result set to the users [15]. Search abandonment hurts e-commerce platforms even more because the business model depends on actual purchases instead of ad clicks. Hence, diversification of ranking is a critical problem for e-commerce sites. One of the primary business metrics for an e-commerce business is the revenue [17] generated from sales, and an attempt at diversification can hurt revenue unless the diversification problem is explicitly formulated to maximize it. Moreover, e-commerce businesses use customer lifetime value (CLV) [20, 23] as a critical long-term metric: CLV is the revenue generated by a customer over his or her lifetime with the site. A successful e-commerce site intends to increase its pool of high-CLV customers. Intuitively, by selecting a proper diversification of results, an e-commerce site can cater to the preferences of a larger pool of customers and thereby improve the median CLV of the business.
Ensuring maximization of revenue while simultaneously not hurting relevance during learning to diversify requires a unique formulation for e-commerce search. In this paper, we address this problem of diversification of e-commerce search results, and we additionally show that our formulation also improves the median CLV. We make three contributions:
(1) We define the learning to diversify problem for e-commerce search, considering maximization of revenue while keeping the loss in relevance within a bound. We also show that this design of the learning to diversify problem improves median CLV for an e-commerce business.
(2) We present three multi-armed bandit based algorithms for the problem and derive regret and pay-off bounds for them. Our third algorithm is a novel approximate-knapsack-based semi-bandit optimization algorithm, and we show that it performs well for our problem.
(3) We present a simulation-based evaluation strategy and conduct experiments with synthetic data to show that, under most scenarios, such as changing customer preferences and the presence of position bias, our semi-bandit algorithm maintains the right balance of revenue, median CLV, and purchase-based mean reciprocal rank [16].

2 BACKGROUND
Diversification of search results is one of the options for managing the uncertainty and ambiguity in understanding a user's information need from search queries [13]. A search ranking function optimized for relevance is not designed to minimize redundancy in the search results [5], and the diversification problem has been identified as one of the most important future research directions in learning to rank [12]. Researchers address this problem with variants of two broad approaches. (1) The first approach defines a measure of similarity based on the contents of the documents and designs a ranking function that uses this measure to generate a diversified search result while hurting relevance as little as possible. (2) The second approach uses a multi-armed bandit based online learning algorithm to learn the diverse user preferences and optimize the ranking accordingly.
The second approach does not require defining any similarity measure; instead, it learns from the data, which makes it easier to realize in practice. Our algorithms are based on the second approach, because it is much harder to map product similarity measures to user preferences than to learn those preferences from data. Moreover, user preferences can change over time, and a machine learning based approach can better adapt to such changes.
One of the most influential papers on the first approach is by Carbonell et al. [11], where the authors introduce the concept of maximal marginal relevance: documents are selected to be maximally different from the already chosen documents while reducing the loss in overall relevance. In another paper, Zhai et al. [44] provide an algorithm for diversifying search results using risk minimization principles, with correlation among search results serving as the similarity measure. Agrawal et al. [2] propose a greedy algorithm that reorders the top k search results to jointly maximize the probability of showing diverse documents for a query and minimize the exposure of low-quality documents to the users; the authors also generalize classic ranking metrics such as discounted cumulative gain (DCG) and mean average precision (MAP) to account for diversification. Some researchers [13] pose the problem slightly differently, as one of reducing ambiguity in search queries. Santos et al. [37] provide another similar algorithm that uses additional information, either by reformulating queries or by using queries from related searches, to diversify the search results. Several papers also address diversification of recommender system results using variants of content-based similarity measures [42].
The most important paper on the second approach is probably the one by Radlinski et al. [36], where the authors develop an online algorithm based on the classic multi-armed bandit (MAB) paradigm that can learn user preferences; they also provide a baseline greedy algorithm for comparison. Two of our algorithms are direct modifications of the algorithms presented in that paper. The MAB algorithm in that paper requires one MAB per rank position, and each MAB can have as many arms as the number of items. This strategy increases the bookkeeping requirements, and the arms of the MABs at different positions overlap, which is not optimal, since the already selected arms must be discarded for the lower-ranked MABs. Several other papers use a MAB framework or online learning to diversify in the domain of news content optimization [1], where diversity is considered an essential factor. An interesting paper by Yue et al. [43] uses a linear submodular bandit paradigm to learn user preferences for diverse ranking; the use of submodular bandits guarantees the existence of a greedy approximation algorithm. In other papers, researchers use a linear or nonlinear model to simultaneously learn to rank and diversify [14, 18, 28]. The main problem with such algorithms is that computing the variance used to update the rewards in the MAB framework becomes more complex. It is also generally much harder to evaluate the effectiveness of an online algorithm than that of a traditional batch learning model. Hence, using one ranking function for both learning to rank and diversification may not be practical, although it is possible to run an online diversification algorithm on top of a traditional learning to rank (LTR) algorithm to reduce the complexity of the engineering system. In e-commerce, the additional challenge is the need to account for revenue maximization, which requires formulating a different learning problem.
We show the improvement in median CLV obtained by diversifying results in e-commerce; we have not found this relationship in any other paper. Extensive literature is available on CLV [9, 29] for interested readers, but we omit a survey of CLV modeling since our work does not build on it.

3 PROBLEM SETTING
We address the problem of diversifying e-commerce search results from the perspective of the e-commerce firm as well as that of the users. The firm intends to maximize revenue and increase CLV; the customers desire to find relevant products that they may be interested in purchasing. Hence, we define the learning to diversify problem for e-commerce as maximizing revenue while maintaining a relevance threshold.
We now introduce some notation. Consider a query q and a corresponding set of n relevant items D = {d_1, d_2, ..., d_n} whose prices are {ρ_1, ρ_2, ..., ρ_n}. We assume that each product has a relevance score with respect to the query; these scores are {s_1, s_2, ..., s_n}. The relevance score is correlated with user clicks, but the degree of correlation can vary across queries and product corpora. Let U = {u_1, ..., u_m} denote a set of m users. We consider a simple user behavior model in which a user browses the items one after another from the top and then either purchases an item and leaves the site, or leaves the site without any purchase after browsing at most the top k (k << n) items shown to him or her. We study iterations (search sessions) {1, ..., T}; the top k items at iteration t for a query q are (b_{1t}, ..., b_{kt}), and the corresponding products are (d(b_{1t}), ..., d(b_{kt})). Note that in our scenario we do not have specific user information; we only have the query. This is a common situation on an e-commerce site: customers start browsing for a product and typically log into the site only at purchase time, or sometimes use a guest account for the purchase.
Our goal is to select the best top k items from the n items for the query q such that the result set caters to the diverse preferences of the users. If a user finds a relevant product in the top k, he or she can decide whether to purchase it. Let us use two indicator variables, z_{jt} ∈ {0, 1} and x_{jt} ∈ {0, 1}: the first denotes whether product d_j is selected in the top k results and shown to a customer, and the second denotes whether product d_j is purchased in iteration t for the given query. The revenue generated for the query at iteration t is $r_t = \sum_{j=1}^{n} x_{jt}\,\rho_j$, and the total revenue is $R_T = \sum_{t=1}^{T} r_t$. We can similarly define a user-level revenue $R_{u_i}$, the total revenue obtained from user $u_i$ over the T iterations, so that $R_T = \sum_{i=1}^{m} R_{u_i}$. We also define the cumulative sum of the relevance scores of the top k products, $S = \sum_{j=1}^{k} s_j$, to keep a bound on relevance.
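To make the notation concrete, the following is a minimal sketch of one iteration using the indicator variables above. All variable names and the random choices are ours, purely for illustration:

```python
import numpy as np

# Hypothetical illustration of the paper's notation for one iteration t.
rng = np.random.default_rng(0)
n, k = 10, 3
rho = rng.uniform(10, 500, size=n)        # prices {rho_1, ..., rho_n}
s = rng.uniform(0, 1, size=n)             # relevance scores {s_1, ..., s_n}

shown = rng.choice(n, size=k, replace=False)
z = np.zeros(n, dtype=int); z[shown] = 1  # z_{jt}: shown in the top k
x = np.zeros(n, dtype=int)
x[shown[0]] = 1                           # pretend the user buys the first item

r_t = float(np.sum(x * rho))              # per-iteration revenue r_t = sum_j x_{jt} rho_j
S = float(np.sum(s[shown]))               # cumulative relevance of the top k
print(r_t, S)
```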
Given this setup, the learning to diversify problem in e-commerce can be mapped to maximizing $R_{u_i}$ for every user i. This is not the same as maximizing $R_T$, and it is hard to estimate $R_{u_i}$ without estimating the purchase probabilities of a specific user. A learning to diversify algorithm, on the other hand, can optimize the search result for a larger pool of users and improve the median of the CLV. The optimization problem then requires maximizing revenue while learning the diverse preferences of the customers. One possible solution is therefore to simultaneously learn to diversify and maximize total revenue while maintaining a reasonable value of relevance, which requires a threshold on the relevance of the top k products. Using such a relevance threshold is not a new concept: it has been used before in the literature on diverse ranking [11] and, more generally, for limiting the recall set in search [22]. In practice it is not hard to establish such thresholds for a query, and this is often done in industry [41] for various purposes.

4 LEARNING ALGORITHMS
We now describe three algorithms for the problem. The first two are modifications of algorithms proposed by Radlinski et al. [36] and serve as baselines for the learning to diversify problem in e-commerce search. The third is a novel knapsack-based semi-bandit algorithm that we develop for this problem.

4.1 Revenue Ranked Explore and Commit (RREC)
This algorithm is similar to the "ranked explore and commit" (REC) algorithm described by Radlinski et al. [36]. We intend to maximize revenue, which is real-valued, instead of click-through rate, which is binary; this requires a minor change. The algorithm iteratively shows each of the n items at each rank x times and records the purchases over these nx iterations for every product at every rank position. From these counts we estimate the probability of purchase of every product at every rank and multiply it by the price of the product to estimate the generated revenue. Since revenue is real-valued and can be an arbitrary number, we normalize the prices between 0 and 1 for all products to obtain a normalized expected revenue. After completing the nx exploration iterations, the algorithm presents the products in order of decreasing expected revenue. Our implementation is shown in Algorithm 1.
This first greedy algorithm maximizes the revenue generated after the nx iterations if user preferences are unchanged; its main problem is exactly this assumption that users' preferences do not change with time. The algorithm may achieve excellent performance on the revenue metric in the specific scenario where customer preferences do not change after the nx iterations, once it has a reasonable estimate of the purchase probabilities. However, since every product must be shown at every position, the regret of this algorithm can be very poor, and the revenue generated in the initial nx iterations can be arbitrarily bad. Hence, the algorithm is quite impractical for actual implementation. Moreover, in e-commerce it is particularly unwise to take any risk of making customers unhappy; even controlled experiments must be conducted very carefully, often with a budget [21]. We present this algorithm as a baseline: a simple greedy algorithm that maximizes revenue after a sufficiently long exploration.

Algorithm 1: Revenue Ranked Explore and Commit (RREC)
    input: items (d_1, ..., d_n); parameters ϵ, δ, k
    x ← ⌈(2k²/ϵ²) log(2k/δ)⌉
    (b_1, ..., b_k) ← k arbitrary items
    for i = 1, ..., k do                      ▷ at each rank
        for all j: i_j ← 0, p_j ← 0
        for c = 1, ..., x do                  ▷ loop x times
            for j = 1, ..., n do              ▷ over every item d_j
                b_i ← d_j
                display b_1, ..., b_k to the user
                i_j ← i_j + 1
                if the user purchases b_i then p_j ← p_j + 1
        for j = 1, ..., n do
            pr_j ← p_j / (i_j + β)            ▷ β ≥ 1 is a constant to avoid division by zero
            mr_j ← pr_j × ρ_j × Z             ▷ Z is a normalization constant for the prices for a query
        j* ← argmax_j mr_j                    ▷ commit to the best item at this rank
        b_i ← d_{j*}
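For concreteness, the following is a minimal Python sketch of this explore-and-commit loop, assuming a simulation feedback function user_purchases(ranking) that we invent for illustration; it returns the 0-based rank that was purchased, or None:

```python
import numpy as np

def rrec(items, prices, k, x, user_purchases, beta=1.0):
    """Sketch of Revenue Ranked Explore and Commit (RREC).

    items: list of item ids; prices: normalized prices in [0, 1]
    (the paper's rho_j * Z); user_purchases(ranking) is a simulated
    user returning the purchased 0-based rank, or None.
    """
    n = len(items)
    ranking = list(items[:k])                 # start with k arbitrary items
    prices = np.asarray(prices, dtype=float)
    for rank in range(k):                     # commit one rank at a time
        impressions = np.zeros(n)
        purchases = np.zeros(n)
        for _ in range(x):                    # show every item x times at this rank
            for j in range(n):
                ranking[rank] = items[j]
                bought = user_purchases(ranking)
                impressions[j] += 1
                if bought == rank:
                    purchases[j] += 1
        expected_revenue = purchases / (impressions + beta) * prices
        ranking[rank] = items[int(np.argmax(expected_revenue))]   # commit
    return ranking
```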
4.2 Revenue Ranked Bandits Algorithm (RRBA)
In this algorithm, we modify Radlinski et al.'s [36] "ranked bandit algorithm" (RBA) to learn to diversify while maximizing the revenue coming from the users. A schematic of the modified algorithm is shown in Algorithm 2. It uses k bandits MAB_1, ..., MAB_k, one per rank position, for a query. Each bandit is assumed to have n arms (a_1, ..., a_n) corresponding to the n products in the recall set for the query. If a user purchases a product d_j from rank position r at iteration t, we update the score for arm a_j of MAB_r as

    $v_{rj} = \frac{p_{rj}}{i_{rj}}\,\rho_j\,Z + \alpha\sqrt{2\ln t / i_{rj}}$,

where $p_{rj}$ is the purchase count and $i_{rj}$ the impression count of product $d_j$ at rank position r up to that iteration, α is a constant, and Z is a normalization factor for the prices of the products.

Algorithm 2: Revenue Ranked Bandits Algorithm (RRBA)
    initialize MAB_1(n), MAB_2(n), ..., MAB_k(n)
    for t = 1, ..., T do                                ▷ for T iterations
        for r = 1, ..., k do                            ▷ for every position r
            b̂_r ← selectarm(MAB_r)
            if b̂_r ∈ {b_{1t}, ..., b_{(r-1)t}} then    ▷ replace repeats
                b_{rt} ← arbitrary document from D
            else
                b_{rt} ← b̂_r
        display (b_{1t}, ..., b_{kt}) to users; record purchases
        for r = 1, ..., k do                            ▷ do all updates
            i_d(b_{rt}) ← i_d(b_{rt}) + 1
            if the user purchases b_{rt} and b̂_r = b_{rt} then
                p_d(b_{rt}) ← p_d(b_{rt}) + 1
            pr_d(b_{rt}) ← p_d(b_{rt}) / i_d(b_{rt})
            mr_d(b_{rt}) ← pr_d(b_{rt}) × ρ_{b_{rt}} × Z
            var_d(b_{rt}) ← α √(2 ln t / i_d(b_{rt}))
            sc_d(b_{rt}) ← mr_d(b_{rt}) + var_d(b_{rt})
            update MAB_r with arm b_{rt} and reward sc_d(b_{rt})

The algorithm considers the same set of products as arms for all the MABs at all positions. Consequently, once a product is selected by the first r − 1 MABs, it cannot be selected by the r-th MAB. Hence, except for the MAB at the very first position, the MABs may not be able to select their best arms, which makes the algorithm perform poorly at choosing the optimal top k products. Moreover, the authors mention that the regret analysis for the non-binary case is non-trivial and that the greedy algorithm on which RBA is based can obtain a pay-off that is a factor of (k − ϵ) below optimal, for any ϵ. On the positive side, this algorithm can learn the preferences even when they change, and unlike RREC it does not require estimating the purchase probability of every product at every rank. Another problem with this algorithm, however, is the number of MABs and the amount of bookkeeping required to run it in practice. Typically, such MAB algorithms cannot replace learning to rank algorithms, as mentioned in the paper by Radlinski et al. [36]. The most practical way to deploy any bandit algorithm of this kind is as the topmost layer of a multi-layer ranking architecture, where the set of products ranked by the LTR algorithm in the layer below is handed over to the MAB algorithm.
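The sketch below shows one possible realization of the per-position bandit in Python. It is our reading of Algorithm 2, not the paper's code: we accumulate price-weighted purchases so that the empirical mean is the normalized revenue, and we handle repeats by banning already-chosen arms rather than substituting an arbitrary document:

```python
import math
import numpy as np

class PositionUCB:
    """One UCB-style bandit per rank position, sketched after Algorithm 2 (RRBA)."""

    def __init__(self, n, alpha=1.0):
        self.n, self.alpha = n, alpha
        self.impressions = np.zeros(n)
        self.rewards = np.zeros(n)        # price-weighted purchase totals

    def select_arm(self, t, banned):
        scores = np.full(self.n, np.inf)  # unexplored arms get priority
        seen = self.impressions > 0
        scores[seen] = (self.rewards[seen] / self.impressions[seen]
                        + self.alpha * np.sqrt(2 * math.log(t) / self.impressions[seen]))
        scores[list(banned)] = -np.inf    # cannot repeat higher-ranked picks
        return int(np.argmax(scores))

    def update(self, arm, purchased, price_norm):
        self.impressions[arm] += 1
        if purchased:
            self.rewards[arm] += price_norm   # normalized revenue rho_j * Z
```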
4.3 Knapsack based bandit algorithm (KPBA)
KPBA is a novel algorithm that we propose in this paper. To overcome the problem of overlapping arms in the k MABs of RRBA, we consider a semi-bandit algorithm [6] that uses a single bandit with n arms and selects the k best arms at every iteration, processing feedback for all k arms at each iteration. Furthermore, to guarantee better relevance, we introduce a relevance threshold while maximizing the revenue and learning the diverse preferences. The resulting optimization problem is similar to the well-known exact k-item knapsack problem, E-kKP [25]. The revenue expression is based on a UCB score similar to our formulation of RRBA. KPBA does not require as much bookkeeping as RRBA, since it maintains a single bandit rather than k MABs, and it does not suffer from the limitation of being unable to select an optimal arm. Furthermore, it keeps a relevance bound and hence can perform better on information retrieval ranking metrics such as mean reciprocal rank (MRR). Being a bandit algorithm, it can also learn changing customer preferences.
We now present the KPBA formulation. Suppose an item d_j is purchased at iteration t; the normalized revenue generated in that iteration is $r_{jt} = \rho_j \times Z$. The algorithm keeps track of n UCB scores, one per product. The UCB score of product d_j after iteration t is

    $v^{UCB}_{jt} = r_{jt} + \alpha\sqrt{2\ln t / i_{jt}}$,

where $i_{jt}$ is the impression count of item $d_j$ up to iteration t. At each iteration in {1, ..., T}, the products for the top k positions are then chosen from a knapsack-based optimization framework:

    $\max \sum_{j=1}^{k} v^{UCB}_{jt}$  subject to  $\sum_{j=1}^{k} s_j \geq B$.        (1)

Here B is a threshold on the cumulative sum of the relevance scores of the selected products, and the scores $v^{UCB}_{jt}$ are updated after each iteration. To solve this problem, we cast Problem 1 as a binary integer programming (BIP) problem [19]. Defining $\hat{s}_i = 1 - s_i$ and $\hat{B} = C - B$, where C is another constant, Problem 1 can be rewritten as

    $\max_{x_{1t},\dots,x_{nt}} \sum_{i=1}^{n} x_{it}\, v^{UCB}_{it}$  subject to  $\sum_{i=1}^{n} x_{it}\,\hat{s}_i \leq \hat{B}$  and  $\sum_{i=1}^{n} x_{it} = k$.        (2)

Problem 2 is exactly E-kKP, which is NP-hard in the number of arms n; a brute-force solution can be found in $O(n^k)$ time, where k is the number of selected arms. However, Caprara et al. [10] propose a 1/2-approximation algorithm, named $H^{1/2}$, based on a well-known LP relaxation of the knapsack problem [32] and the fact that any basic feasible solution of the linear relaxation of the BIP contains at most two fractional weights. If there are no fractional weights, the optimal basic feasible solution of the LP relaxation is an optimal solution of the BIP. If one or two of the n weights have fractional values in the basic feasible solution, selecting one of the two fractional items still guarantees a 1/2-approximate solution for E-kKP. The authors show that the algorithm runs in O(n) time; the details of $H^{1/2}$ and its analysis can be found in Caprara et al. [10]. A schematic of our proposed algorithm, which uses $H^{1/2}$ as a subroutine, is shown in Algorithm 3.

Algorithm 3: Knapsack based bandit algorithm (KPBA)
    initialize SemiMAB(n)
    for t = 1, ..., T do                             ▷ for T iterations
        {b_1, ..., b_k} ← H^{1/2}(D)                 ▷ details in Caprara et al. [10]
        display (b_{1t}, ..., b_{kt}) to users; record purchases
        for l = 1, ..., k do                         ▷ update impression and purchase counts
            i_l ← i_l + 1
            I_l ← 1 if the user purchases b_l, else 0
            if I_l = 1 then p_l ← p_l + 1
        for l = 1, ..., k do                         ▷ update UCB scores
            v_l ← (p_l / i_l) × ρ_l × Z + α √(2 ln t / i_l)
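The selection step can be sketched via the LP relaxation. The code below is not Caprara et al.'s $H^{1/2}$ verbatim: it solves the relaxation with scipy and rounds the at-most-two fractional variables greedily, which follows the spirit of the 1/2-approximation under the assumptions stated in the comments:

```python
import numpy as np
from scipy.optimize import linprog

def select_top_k(v_ucb, s_hat, B_hat, k):
    """LP-relaxation sketch of the E-kKP selection step (Problem 2).

    Maximize sum_i x_i * v_ucb[i] s.t. sum_i x_i * s_hat[i] <= B_hat,
    sum_i x_i = k, 0 <= x_i <= 1. A basic optimal solution has at most
    two fractional entries [10, 32]; we round them greedily.
    """
    n = len(v_ucb)
    v_ucb = np.asarray(v_ucb, dtype=float)
    res = linprog(c=-v_ucb,                               # linprog minimizes
                  A_ub=np.asarray(s_hat, dtype=float).reshape(1, -1), b_ub=[B_hat],
                  A_eq=np.ones((1, n)), b_eq=[k],
                  bounds=[(0, 1)] * n, method="highs")
    if not res.success:                                   # infeasible threshold: fall back
        return list(np.argsort(v_ucb)[-k:][::-1])
    x = res.x
    chosen = [i for i in range(n) if x[i] > 1 - 1e-6]     # integral picks
    frac = sorted((i for i in range(n) if 1e-6 < x[i] <= 1 - 1e-6),
                  key=lambda i: v_ucb[i], reverse=True)
    for i in frac:                                        # add best-fitting fractional items
        slack = B_hat - sum(s_hat[j] for j in chosen)
        if len(chosen) < k and s_hat[i] <= slack:
            chosen.append(i)
    return chosen[:k]
```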
5 THEORETICAL ANALYSIS

5.1 The offline optimization problem
It is straightforward to see that the problem of finding the set of k optimal products from n products with binary reward is equivalent to the maximum coverage problem [36]. The optimal greedy approximation [27, 35] for this problem is a (1 − 1/e)-approximation algorithm.

5.2 RREC
We modify Radlinski et al.'s [36] REC algorithm to work with real-valued rewards; this algorithm serves as a baseline in our paper. Radlinski et al. [36] mention that the regret analysis of REC can be extended to the case where rewards are not binary. They show that REC achieves a pay-off of $(1 - \frac{1}{e} - \epsilon)\mathrm{OPT} - O\!\big(k^3 \frac{n}{\epsilon^2} \ln(k/\delta)\big)$ with probability at least (1 − δ). In our case the expression is similar, since we use the same algorithm with real rewards.

5.3 RRBA
Radlinski et al. [36] provide regret and pay-off results for the RBA algorithm using EXP3 [8] and binary rewards. The combined pay-off shown in the paper is $(1 - \frac{1}{e})\mathrm{OPT} - O\!\big(k\sqrt{nT\log n}\big)$. The authors comment that in the real-reward case the pay-off can be a factor of (k − ϵ) below optimal, for any ϵ ≥ 0. We can use the same combined pay-off for RRBA with real rewards.

5.4 KPBA
Our problem is conceptually similar to the dynamic assortment selection problem [38]. However, assortment selection models often assume that the purchase of a product depends on the selection of a specific set of products, and many papers in this area use the concept of the utility of a set of products and a choice model of user purchase behavior [33]. We instead assume that users examine each product individually and independently and are not interested in going beyond the top k products if they do not find the desired product. The analysis of our algorithm can nevertheless parallel the analysis of algorithms for dynamic assortment selection. Specifically, we use the regret bound proof from one such paper by Agrawal et al. [3], where the authors use a choice model for purchase and keep showing one assortment of k products until a no-purchase event happens. In our case, the loss in revenue at each iteration t is

    $\Big(1 - \frac{1}{e}\Big)\mathrm{OPT} - \frac{1}{2}\sum_{j=1}^{n} z_{jt}\, v^{UCB}_{jt}$,

where the factor 1/2 in the second term comes from the half-approximation exact k-item knapsack algorithm that we use. Intuitively, since $v^{UCB}_{jt}$ follows the properties of the UCB algorithm, each product is akin to being bounded by the regret bound of the UCB algorithm [7]. Note that the average number of times an item is shown to the customers over t iterations is bounded by $tk/n$. Let $\hat{r}_{jt}$ denote the expected normalized revenue of item $d_j$ at iteration t.
Now we can write the following:

    $\sum_{j=1}^{n} z_{jt}\, v^{UCB}_{jt} \;\geq\; \sum_{j=1}^{n} z_{jt}\, \hat{r}_{jt}$

    $\sum_{j=1}^{n} \big(z_{jt}\, v^{UCB}_{jt} - z_{jt}\, \hat{r}_{jt}\big) \;\leq\; \alpha \sum_{j=1}^{n} \sqrt{2\ln t / i_{jt}} \;\leq\; O\!\big(\sqrt{n\ln t/(tk)}\big)$

    $\sum_{j=1}^{n} \big(z_{jt}\, v^{UCB}_{jt} - z_{jt}\, \hat{r}_{jt}\big) \;\leq\; O\!\big(\sqrt{nt\ln t}\big)$

Note that we use the expected average $\sum_{t=1}^{T} \sqrt{1/t} = O(\sqrt{T})$. We can also use a different bound from the UCB algorithm, using $\lg n$ instead of $\lg t$. Our sketch of the derivation is similar to the proofs in Agrawal et al.'s paper [3]; the bound can probably be made tighter using the techniques used by Auer et al. [7] to prove the bounds of the UCB algorithm. This shows that the regret can be bounded by $O(\sqrt{nT\lg T})$, and the pay-off can then be bounded by $(1 - \frac{1}{e})\mathrm{OPT} - O(\sqrt{nT\lg n})$ for KPBA, which is similar to Radlinski et al.'s [36] bound for real rewards.

5.5 Comments from the Analysis
From the above analysis it is clear that RREC has several drawbacks, as also identified by Radlinski et al. [36]. Moreover, RREC is impractical, since its regret in the first nx iterations can be arbitrarily bad [36]; we therefore expect it to also perform poorly in terms of revenue and median CLV, and the analysis says nothing about its performance in terms of relevance. RRBA, on the other hand, can clearly be concluded to perform better than RREC for revenue based on the analytical regret expression. KPBA can be expected to be similar to RRBA in terms of revenue and median CLV. However, KPBA is more appealing than RRBA when we have real rewards, and since KPBA does not compromise relevance beyond a bound while diversifying, it can be expected to perform better for the users in terms of relevance metrics.

6 EVALUATION
In this study, we use the following metrics to evaluate the performance of our learning algorithms from the perspectives of revenue, CLV, and relevance:
• Average Revenue per Query (ARQ): the average revenue over all queries in the experimental study, given by $\frac{1}{n}\sum_{i=1}^{t} R_i$, where t is the total number of iterations in the simulation study and n is the number of queries.
• Median Customer Lifetime Value (MCV): suppose each customer $u_j$ spends $r^t_{u_j}$ on purchases over t iterations; the customer value is then the median of all customer spends.
• Mean Reciprocal Rank of Purchases (PMRR): the reciprocal rank of the product purchased for a given query in a search session, averaged over all search sessions in a period of time.
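For concreteness, a small sketch of how these three metrics can be computed from simulation logs; the helper names and the zero-contribution convention for sessions without a purchase are our assumptions:

```python
import numpy as np

def arq(revenues_per_iteration, num_queries):
    """Average Revenue per Query: total revenue over the run divided by #queries."""
    return float(np.sum(revenues_per_iteration)) / num_queries

def mcv(spend_by_customer):
    """Median Customer Lifetime Value over all simulated customers.

    spend_by_customer: dict mapping customer id -> total spend."""
    return float(np.median(list(spend_by_customer.values())))

def pmrr(purchase_ranks):
    """Purchase-based MRR: mean of 1/rank over sessions (1-based ranks).

    Sessions without a purchase (rank None) contribute 0 here, an
    assumption on our part rather than the paper's stated convention."""
    return float(np.mean([1.0 / r if r else 0.0 for r in purchase_ranks]))
```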
It is clear from the analytical derivations in Section 5 that KPBA and RRBA both perform better than RREC in terms of ARQ, which also indicates that these two algorithms can exceed RREC in MCV. Moreover, because of the relevance bound in KPBA, its PMRR is expected to be better than that of RRBA. However, it is not clear how the PMRR of KPBA compares with that of RREC, whose PMRR can be good, particularly after the nx exploration iterations. It is also unclear how KPBA and RRBA differ in terms of ARQ, since their pay-off expressions are similar.
We did not have any historical search log data from a real e-commerce site, so we evaluate our algorithms by generating synthetic data and conducting a simulation study. We assume that different users have different price preferences and that these preferences mainly dictate their purchase behavior. In reality, user preference is a complex function of multiple factors associated with the products and other variables, such as the time of year and the financial status of the person. This assumption, however, keeps our data generation and simulation simple while still allowing us to evaluate and compare the algorithms in a manner that reveals their characteristics. In our simulation study, we model the biases of a real system. Evaluating bandit-based algorithms in an online setting is known to be very hard [31], and offline counterfactual techniques [24] using historical logs have recently become a topic of research for such evaluations; we adopt the main ideas of these evaluation techniques in this paper.
We aim to answer the following questions with the simulation study:
(1) How does KPBA compare with RRBA in terms of ARQ, MCV, and PMRR? In particular, how much better are these metrics when we use KPBA instead of RRBA, under scenarios without and with position bias and with changing customer preferences?
(2) Does RREC perform better than RRBA in PMRR, and how does it compare with KPBA under a reasonable assumption on the relevance threshold?
We discuss the experiments after illustrating our data generation methodology.

6.1 Synthetic data generation
Our synthetic e-commerce data consists of N queries and M relevant products per query. We assign prices to the products randomly from a multimodal Gaussian distribution with m = {1, 8} peaks and mean prices between $10 and $500. The purchase rates are generated from a similar distribution with mean peaks between 0.0 and 0.06. The maximum mean purchase-rate peak is assigned to the cheapest mean price peak 70% of the time, and to one of the other mean price peaks the rest of the time. We additionally generate a relevance score that is linearly correlated with the purchase rates, with a Pearson correlation coefficient between 0.10 and 0.30 at a p-value below 0.10. In this paper, we use M = 200. In a real scenario M can be a very large number, but the recall set for a search query typically drops to a much smaller size for performance reasons, and we anticipate applying our algorithms on top of a multi-layer ranking architecture; this also makes running our experiments simpler and faster without losing generality. Figure 1 provides histograms of the synthetically generated prices and shows the relationship between the synthetic relevance scores and the product purchase rates. The multimodal price distribution and the weak linear correlation between relevance score and purchase rate represent a common scenario for an e-commerce platform.
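The following is a sketch of this data generation; the peak locations, noise scales, and mixing details below are illustrative assumptions of ours, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(7)

def synth_products(M=200, price_peaks=(20.0, 480.0), cheap_gets_max_rate=0.7):
    """Sketch of the synthetic catalog of Section 6.1: multimodal Gaussian
    prices in [$10, $500], purchase rates in [0, 0.06], and a relevance
    score weakly correlated with the purchase rate."""
    which = rng.integers(0, len(price_peaks), size=M)       # pick a price peak per item
    prices = np.clip(rng.normal(np.array(price_peaks)[which], 30.0), 10, 500)
    rate_means = np.linspace(0.005, 0.055, len(price_peaks))
    if rng.random() < cheap_gets_max_rate:                  # max rate at the cheapest peak
        per_item_means = rate_means[::-1][which]
    else:                                                   # max rate at some other peak
        per_item_means = rng.permutation(rate_means)[which]
    rates = np.clip(rng.normal(per_item_means, 0.01), 0.0, 0.06)
    # relevance: mostly noise plus a rate-driven component, aiming at the
    # paper's weak Pearson correlation (0.10-0.30) with the purchase rate
    relevance = 0.2 * (rates / rates.max()) + 0.8 * rng.random(M)
    return prices, rates, relevance
```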
6.2 User's price preference model
We assign every user u ∈ U to a preferred price cluster t_u using a Chinese Restaurant Process (CRP) [4] with a parameter θ. For example, with |U| = 20 and θ = 3.0, the average number of price clusters produced by the CRP is 6.5. We then assign each product to one of these clusters such that cheaper products are assigned to a smaller price cluster number. We denote the price cluster of product d_i by t_{d_i}.

[Figure 1: Price histograms and their corresponding relevance scores and purchase rates. The plot shows the correlation between the relevance score and the purchase rate with a fitted line.]

6.3 User behavior model
In the simulation, we consider that a user u browses through the top k products shown, one after another. The probability that user u purchases product d_i located at ranking position j is

    $pr(u, d_i, j) = \begin{cases} c \times p_{d_i} \times \frac{1}{\log_2(j+1)} & \text{if } t_u = t_{d_i} \\ (1-c) \times p_{d_i} & \text{otherwise} \end{cases}$

We use c = 0.7. The factor $\frac{1}{\log_2(j+1)}$ is the position bias.
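A short sketch of this user model: the CRP cluster assignment [4] and the purchase probability above, with function names of our choosing:

```python
import math
import numpy as np

rng = np.random.default_rng(11)

def crp_clusters(num_users, theta):
    """Chinese Restaurant Process: user u joins an existing cluster with
    probability |cluster|/(u + theta), or a new one with theta/(u + theta)."""
    clusters = []
    for u in range(num_users):
        total = u + theta
        probs = [len(c) / total for c in clusters] + [theta / total]
        pick = rng.choice(len(clusters) + 1, p=probs)
        if pick == len(clusters):
            clusters.append([u])
        else:
            clusters[pick].append(u)
    return clusters

def purchase_prob(user_cluster, item_cluster, item_rate, rank, c=0.7, position_bias=True):
    """Purchase probability from Section 6.3; rank is 1-based."""
    if user_cluster == item_cluster:
        bias = 1.0 / math.log2(rank + 1) if position_bias else 1.0
        return c * item_rate * bias
    return (1 - c) * item_rate
```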
7 RESULTS
We conducted three experimental studies to evaluate the performance of the algorithms. In each iteration of a study, we pick a query and a user uniformly at random. Each study runs three experiments per algorithm with the following setups: (1) N = 1, |U| = 20, θ = 3.0, 1000 iterations; (2) N = 10, |U| = 20, θ = 10.0, 50000 iterations; (3) N = 10, |U| = 100, θ = 10.0, 50000 iterations. For each experiment, we report the metrics averaged over 100 runs of the simulator. We describe the experiments in the sections below.

7.1 Comparison without Position Bias
In this experiment, we do not apply position bias when computing the probability of purchase. The results are summarized in Table 1. We observe that KPBA and RRBA generate significantly more ARQ and MCV than RREC; however, RREC has better PMRR than RRBA. KPBA, in turn, achieves even more ARQ and MCV than RRBA, and its PMRR is close to that of RREC. We conducted experiments with different numbers of users and iterations and observed a similar trend. The second row for each algorithm shows the results with 10 queries and 50000 iterations: KPBA proves substantially better than both other algorithms over 50000 iterations, while its PMRR is generally similar to that of RREC. The third row shows the results with 10 queries, 50000 iterations, and 100 users arriving through a CRP process with θ = 10.0. We observe that KPBA again produces substantially better ARQ and MCV than the other two algorithms and performs very similarly to RREC on the PMRR metric.

    Algo   Q    U     Iter     ARQ ($)   MCV ($)   PMRR
    RREC   1    20    1000         140       7.0   0.41
    RREC   10   20    50000       3500      1486   0.54
    RREC   10   100   50000       5039       439   0.52
    RRBA   1    20    1000       18567       870   0.26
    RRBA   10   20    50000     138556     64483   0.29
    RRBA   10   100   50000     144970     14254   0.29
    KPBA   1    20    1000       25022      1085   0.41
    KPBA   10   20    50000     156941     79621   0.52
    KPBA   10   100   50000     174030     16105   0.52
    Table 1: Comparison of KPBA, RRBA and RREC without position bias.

7.2 Comparison with Position Bias
In this section, we repeat all the previous experiments with position bias in the user behavior, introducing a logarithmic decay $\frac{1}{\log_2(j+1)}$ in the probability of purchase based on the ranking position j of the product. We again observe that KPBA performs better than RRBA on all metrics; this is expected, since we already found better PMRR in the experiments without position bias. Table 2 summarizes the experiments. The second and third rows for each algorithm use θ = 10.0 for the CRP process. Note that the results for KPBA are significantly better than those for RRBA when position bias is present; thus, KPBA can be the more practical algorithm for realization.

    Algo   Q    U     Iter     ARQ ($)   MCV ($)   PMRR
    RREC   1    20    1000         158         8   0.37
    RREC   10   20    50000       3500      1486   0.54
    RREC   10   100   50000       5039       439   0.52
    RRBA   1    20    1000       16922       843   0.19
    RRBA   10   20    50000      69221     34521   0.29
    RRBA   10   100   50000      97534      9917   0.31
    KPBA   1    20    1000       21497      1085   0.40
    KPBA   10   20    50000     109086     53716   0.53
    KPBA   10   100   50000     107091     10655   0.54
    Table 2: Comparison of KPBA, RRBA and RREC with position bias.

7.3 Comparison with Changing Customer Preferences
In this section, we experiment with customer preferences that change after every 500 iterations. The experiments are summarized in Table 3. We again notice that the KPBA algorithm still performs better than RRBA on all three metrics, and it compares favorably with RREC on PMRR.

    Algo   Q    U     Iter     ARQ ($)   MCV ($)   PMRR
    RREC   1    20    1000         117         6   0.35
    RREC   10   20    50000       3100      1134   0.50
    RREC   10   100   50000      17111      1634   0.57
    RRBA   1    20    1000       20288      1019   0.17
    RRBA   10   20    50000      43654     25345   0.25
    RRBA   10   100   50000      95281      9540   0.27
    KPBA   1    20    1000       23201      1125   0.49
    KPBA   10   20    50000      95754     42876   0.53
    KPBA   10   100   50000     103324     10315   0.57
    Table 3: Comparison of KPBA, RRBA and RREC with changing customer preferences.

7.4 Comparison of convergence for the three algorithms
To show the growth of the metrics under the three scenarios above, we plot the two main metrics, ARQ and MCV, from our simulation study, collecting them every 100 iterations. Figure 2 uses 5 queries and 20 users with θ = 3.0 and compares the two metrics without position bias for all three algorithms. Figure 3 shows the same two metrics with position bias. Figure 4 shows the metrics with customer preferences changing every 500 iterations, along with position bias. The plots make clear that KPBA performs consistently better than RRBA and RREC on the ARQ and MCV metrics; note that all prices are on a log scale in the convergence plots. Figure 4 also shows that KPBA has similarly good performance on the PMRR metric compared to RREC.

7.5 Comments on experimental evaluation
As expected from the analytical study, KPBA performs well on the ARQ metric compared to RRBA, and RREC performs worst on the revenue metrics. Our simulation studies show that the MCV metric is also significantly better for KPBA than for RRBA and RREC. Moreover, the PMRR values of KPBA are similar to or better than those of RREC and far better than those of RRBA, based on the simulation results and consistent with our theoretical analysis. We also observe that KPBA continues to perform better with more iterations and more users, with position bias, and with changing user preferences.
8 CONCLUSION
In this paper, we introduce the learning to diversify problem for e-commerce search. We show that, in order to serve both the company and the customers best, e-commerce requires a unique formulation of the learning to diversify problem, in which we learn to diversify and maximize revenue simultaneously while ensuring a minimum relevance for the top k results. Our theoretical results show that the KPBA algorithm is expected to have better ARQ and PMRR than RRBA. Our simulation studies show that KPBA also has better MCV than both RRBA and RREC, and that it performs well on PMRR compared to RREC. RRBA, on the other hand, is quite bad from the customer's perspective for this problem, since its PMRR is low in all simulated scenarios. In essence, we show that KPBA can be an efficient and practical algorithm for diversifying e-commerce search results. We also show that e-commerce companies can improve CLV by using a diverse ranking strategy; this connection between diversity in ranking and CLV is worth exploring further in the future. KPBA could also potentially be further optimized by formulating a variable-budget knapsack problem in which the optimal relevance threshold is learned simultaneously, and it may be possible to find a better probabilistic approximation algorithm for such optimization problems [34]. We anticipate that this paper will motivate further research in the area of diverse ranking for e-commerce search and recommender systems.

[Figure 2: ARQ and MCV without position bias. Red curves: RREC; blue: RRBA; green: KPBA. The revenue metric uses a log scale.]
[Figure 3: ARQ and MCV with position bias. Red curves: RREC; blue: RRBA; green: KPBA. The revenue metric uses a log scale.]
[Figure 4: Metrics with changing customer preferences and position bias. Red curves: RREC; blue: RRBA; green: KPBA. The revenue metric uses a log scale.]

REFERENCES
[1] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. 2009. Online models for content optimization. In Advances in Neural Information Processing Systems. 17–24.
[2] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. ACM, 5–14.
[3] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. 2016. A near-optimal exploration-exploitation approach for assortment selection. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 599–600.
[4] David J. Aldous. 1985. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1983. Springer, 1–198.
In École d’Été de evaluation of contextual-bandit-based news article recommendation algorithms. Probabilités de Saint-Flour XIIIâĂŤ1983. Springer, 1–198. In Proceedings of the fourth ACM international conference on Web search and data [5] Albert Angel and Nick Koudas. 2011. Efficient diversity-aware search. In Proceed- mining. ACM, 297–306. ings of the 2011 ACM SIGMOD International Conference on Management of data. [32] Silvano Martello and Paolo Toth. 1990. Knapsack problems: Algorithms and ACM, 781–792. computer interpretations. Hoboken, NJ: Wiley-Interscience (1990). [6] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. 2013. Regret in online [33] Daniel McFadden. 1980. Econometric models for probabilistic choice among combinatorial optimization. Mathematics of Operations Research 39, 1 (2013), products. Journal of Business (1980), S13–S29. 31–45. [34] S Muthukrishnan, Martin Pál, and Zoya Svitkina. 2007. Stochastic models for [7] Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. budget optimization in search-based advertising. In International Workshop on Journal of Machine Learning Research 3, Nov (2002), 397–422. Web and Internet Economics. Springer, 131–142. [8] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 2002. The [35] George L Nemhauser and Laurence A Wolsey. 1978. Best algorithms for approxi- nonstochastic multiarmed bandit problem. SIAM journal on computing 32, 1 mating the maximum of a submodular set function. Mathematics of operations (2002), 48–77. research 3, 3 (1978), 177–188. [9] Hans H Bauer, Tomas Falk, and Maik Hammerschmidt. 2006. eTransQual: A trans- [36] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning Diverse action process-based approach for capturing service quality in online shopping. Rankings with Multi-Armed Bandits. In Proceedings of the 25th International Journal of Business Research 59, 7 (2006), 866–875. Conference on Machine Learning (ICML ’08). [10] Alberto Caprara, Hans Kellerer, and Ulrich Pferschy. 1998. Approximation Algo- [37] Rodrygo L.T. Santos, Craig Macdonald, and Iadh Ounis. 2010. Selectively Diversi- rithms for Knapsack Problems with Cardinality Constraints. European Journal of fying Web Search Results. In Proceedings of the 19th ACM International Conference Operational Research 123 (1998), 2000. on Information and Knowledge Management (CIKM ’10). ACM, New York, NY, [11] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based USA, 1179–1188. reranking for reordering documents and producing summaries. In Proceedings of [38] Denis Sauré and Assaf Zeevi. 2013. Optimal dynamic assortment planning with the 21st annual international ACM SIGIR conference on Research and development demand learning. Manufacturing & Service Operations Management 15, 3 (2013), in information retrieval. ACM, 335–336. 387–404. [12] Olivier Chapelle, Yi Chang, and T-Y Liu. 2011. Future directions in learning to [39] Daria Sorokina and Erick Cantu-Paz. 2016. Amazon Search: The Joy of Ranking rank. In Proceedings of the Learning to Rank Challenge. 91–100. Products. In Proceedings of the 39th International ACM SIGIR Conference on Re- [13] Harr Chen and David R Karger. 2006. Less is more: probabilistic models for re- search and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, trieving fewer relevant documents. In Proceedings of the 29th annual international USA, 459–460. ACM SIGIR conference on Research and development in information retrieval. 
ACM, [40] Unknown. 2013. The Moore?s law of e-commerce or the crucial importance of 429–436. being first. https://www.mabaya.com/the-moores-law-of-e-commerce-or-the- [14] Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. 2011. Contextual Bandits crucial-importance-of-being-first/. (2013). [Online; accessed 27-January-2018]. with Linear Payoff Functions.. In AISTATS, Vol. 15. 208–214. [41] Unknown. 2017. Relevance Scores: Understanding and Customizing. https: [15] Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, //docs.marklogic.com/guide/search-dev/relevance. (2017). [Online; accessed Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity 27-January-2018]. in information retrieval evaluation. In Proceedings of the 31st annual international [42] Cong Yu, Laks Lakshmanan, and Sihem Amer-Yahia. 2009. It takes variety to ACM SIGIR conference on Research and development in information retrieval. ACM, make a world: diversification in recommender systems. In Proceedings of the 12th 659–666. international conference on extending database technology: Advances in database [16] Nick Craswell. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems. technology. ACM, 368–378. Springer, 1703–1703. [43] Yisong Yue and Carlos Guestrin. 2011. Linear submodular bandits and their [17] Olayinka David-West. 2016. E-Commerce Management in Emerging Markets. application to diversified retrieval. In Advances in Neural Information Processing Encyclopedia of E-Commerce Development, Implementation, and Management 1 Systems. 2483–2491. (2016), 200. [44] Cheng Xiang Zhai, William W Cohen, and John Lafferty. 2003. Beyond inde- [18] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. 2010. Para- pendent relevance: methods and evaluation metrics for subtopic retrieval. In metric bandits: The generalized linear case. In Advances in Neural Information Proceedings of the 26th annual international ACM SIGIR conference on Research Processing Systems. 586–594. and development in informaion retrieval. ACM, 10–17. [19] Robert S Garfinkel and George L Nemhauser. 1972. Integer programming. Vol. 4. Wiley New York. [20] Jennifer Gimson. 2017. Why Lifetime Value is the Most Important Metric in eCom- merce. https://crealytics.com/blog/lifetime-value-important-metric-ecommerce/. (2017). [Online; accessed 27-January-2018]. [21] Anjan Goswami, Wei Han, Zhenrui Wang, and Angela Jiang. 2015. Controlled experiments for decision-making in e-Commerce search. In Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 1094–1102. [22] David Hawking, Paul Thistlewaite, and Nick Craswell. 1997. Anu/acsys trec-6 experiments. In TREC. Citeseer, 275–290. [23] Dipak Jain and Siddhartha S Singh. 2002. Customer lifetime value research in marketing: A review and future directions. Journal of interactive marketing 16, 2 (2002), 34–46. [24] Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 1199–1201. [25] H. Kellerer, U. Pferschy., and D. Pisinger. 2003. Knapsack Problems. Springer. [26] Hans Kellerer, Ulrich Pferschy, and David Pisinger. 2004. Introduction to NP- Completeness of knapsack problems. Springer. [27] Samir Khuller, Anna Moss, and Joseph Seffi Naor. 1999. The budgeted maximum coverage problem. Inform. Process. Lett. 70, 1 (1999), 39–45. 
[28] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. 2015. Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem.. In COLT. 1141–1154.