Learning to Diversify for E-commerce Search with Multi-Armed Bandit

Anjan Goswami (agoswami@ucdavis.edu), University of California, Davis
Chengxiang Zhai (czhai@illinois.edu), University of Illinois, Urbana-Champaign
Prasant Mohapatra (pmohapatra@ucdavis.edu), University of California, Davis

ABSTRACT
Search is central to e-commerce platforms, and diversification of search results is essential to cater to the diverse preferences of customers. One of the primary metrics of an e-commerce business is revenue, and the prices of the products shown influence customer preferences. Diversifying e-commerce search results therefore requires learning the diverse price preferences of customers while simultaneously maximizing revenue without hurting the relevance of the results. In this paper, we introduce the learning to diversify problem for e-commerce search. We also show that diversification improves the median customer lifetime value (CLV), a critical long-term business metric for an e-commerce business. We design three algorithms for the task. The first two are modifications of algorithms previously developed for the diversification problem in web search. The third is a novel approximate-knapsack-based semi-bandit algorithm. We derive regret and pay-off bounds for all three algorithms and conduct experiments with synthetic data and simulation to validate and compare them. In the simulation, we compute revenue, median CLV, and purchase-based mean reciprocal rank (PMRR) under various scenarios, including user preferences that change over time. We show that our proposed third algorithm is more practical and efficient than the first two and can produce higher revenue while maintaining a better median CLV and PMRR.

ACM Reference Format:
Anjan Goswami, Chengxiang Zhai, and Prasant Mohapatra. 2019. Learning to Diversify for E-commerce Search with Multi-Armed Bandit. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 11 pages.

Copyright (c) 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org.

1 INTRODUCTION
Search has become a central functionality of e-commerce sites. Typically, large e-commerce platforms have several competing products relevant to a query. For example, Amazon returns more than 1000 products for the query "car seat" and about 300 products for the query "smart watch", and the products remain relevant to the query up to almost the last page of results. Many products shown in the first few pages have great reviews, and it is clear that these products have been purchased regularly from Amazon. This scenario illustrates that there are often too many choices for users on a large e-commerce web site. However, it is well established that user clicks drop dramatically after the first page [30, 39] in both web and e-commerce search. Consequently, a user often does not browse through all the relevant products returned for a search query and can abandon the search if only a few or no relevant products are found in the top results. In web search, such abandonment is known to be reduced by showing a diverse result set to the users [15]. Search abandonment hurts e-commerce platforms even more because the business model depends on actual purchases instead of ad clicks. Hence, diversification of ranking is a critical problem for e-commerce sites. One of the primary business metrics for an e-commerce business is the revenue [17] generated from sales, and an attempt at diversification can hurt revenue unless the diversification problem is explicitly formulated to maximize it. Moreover, e-commerce businesses use customer lifetime value (CLV) [20, 23] as a critical long-term metric: CLV is the revenue generated by a customer over his or her lifetime with the site. A successful e-commerce site intends to increase its pool of high-CLV customers. Intuitively, by selecting a proper diversification of results, an e-commerce site can cater to the preferences of a larger pool of customers and thereby improve the median CLV of the business.
Ensuring maximization of revenue while simultaneously not hurting relevance during learning to diversify requires a unique formulation for e-commerce search. In this paper, we address this problem of diversification of e-commerce search results, and we additionally show that our formulation also improves the median CLV. We make three contributions:
(1) We define the learning to diversify problem for e-commerce search, considering maximization of revenue while keeping the loss in relevance within a bound. We also show that this design of the learning to diversify problem improves median CLV for an e-commerce business.
(2) We present three multi-armed bandit based algorithms for the problem and derive regret and pay-off bounds for them. Our third algorithm is a novel approximate-knapsack-based semi-bandit optimization algorithm, and we show that it performs well for our problem.
(3) We present a simulation-based evaluation strategy and conduct experiments with synthetic data to show that, under most scenarios, such as changing customer preferences and the presence of position bias, our semi-bandit algorithm maintains the right balance of revenue, median CLV, and purchase-based mean reciprocal rank [16].

2 BACKGROUND
Diversification of search results is one of the options for managing the uncertainty and ambiguity in understanding a user's information need from search queries [13]. A search ranking function optimized for relevance is not designed to minimize redundancy in the search results [5], and the diversification problem has been identified as one of the most important future research directions in learning to rank [12]. Researchers address this problem with variants of two broad approaches. (1) The first approach defines a measure of similarity based on the contents of the documents and designs a ranking function that uses this measure to generate a diversified search result while hurting relevance as little as possible. (2) The second approach uses a multi-armed bandit based online learning algorithm to learn the diverse user preferences and optimize the ranking accordingly.
The second approach does not require defining any similarity measure; instead, it learns from the data, which makes it easier to realize in practice. Our algorithms are based on the second approach, because it is much harder to map product similarity measures to user preferences than to learn those preferences from data. Moreover, user preferences can change over time, and a machine learning based approach can better adapt to such changes.
One of the most influential papers on the first approach is by Carbonell et al. [11], where the authors introduce the concept of maximal marginal relevance: documents are selected to be maximally different from the already chosen documents while reducing the loss in overall relevance. In another paper, Zhai et al. [44] provide an algorithm for diversifying search results using risk minimization principles, with correlation among search results serving as the similarity measure. Agrawal et al. [2] propose a greedy algorithm that reorders the top k search results to jointly maximize the probability of showing diverse documents for a query and minimize the exposure of low-quality documents to the users; the authors also generalize classic ranking metrics such as discounted cumulative gain (DCG) and mean average precision (MAP) to account for diversification. Some researchers [13] pose the problem slightly differently, as one of reducing ambiguity in search queries. Santos et al. [37] provide another similar algorithm that uses additional information, either by reformulating queries or by using queries from related searches, to diversify the search results. Several papers also address diversification of recommender system results using variants of content-based similarity measures [42].
The most important paper on the second approach is probably the one by Radlinski et al. [36], where the authors develop an online algorithm based on the classic multi-armed bandit (MAB) paradigm that can learn user preferences; they also provide a baseline greedy algorithm for comparison. Two of our algorithms are direct modifications of the algorithms presented in that paper. The MAB algorithm in that paper requires one MAB per rank position, and each MAB can have as many arms as the number of items. This strategy increases the bookkeeping requirements, and the arms of the MABs at different positions overlap, which is not optimal, since the already selected arms must be discarded for the lower-ranked MABs. Several other papers use a MAB framework or online learning to diversify in the domain of news content optimization [1], where diversity is considered an essential factor. An interesting paper by Yue et al. [43] uses a linear submodular bandit paradigm to learn user preferences for diverse ranking; the use of submodular bandits guarantees the existence of a greedy approximation algorithm. In other papers, researchers use a linear or nonlinear model to simultaneously learn to rank and diversify [14, 18, 28]. The main problem with such algorithms is that computing the variance used to update the rewards in the MAB framework becomes more complex. It is also generally much harder to evaluate the effectiveness of an online algorithm than that of a traditional batch learning model. Hence, using one ranking function for both learning to rank and diversification may not be practical, although it is possible to run an online diversification algorithm on top of a traditional learning to rank (LTR) algorithm to reduce the complexity of the engineering system. In e-commerce, the additional challenge is the need to account for revenue maximization, which requires formulating a different learning problem.
We show the improvement in median CLV obtained by diversifying results in e-commerce; we have not found this relationship in any other paper. Extensive literature is available on CLV [9, 29] for interested readers, but we omit a survey of CLV modeling since our work does not build on it.

3 PROBLEM SETTING
We address the problem of diversifying e-commerce search results from the perspective of the e-commerce firm as well as that of the users. The firm intends to maximize revenue and increase CLV; the customers desire to find relevant products that they may be interested in purchasing. Hence, we define the learning to diversify problem for e-commerce as maximizing revenue while maintaining a relevance threshold.
We now introduce some notation. Consider a query q and a corresponding set of n relevant items D = {d_1, d_2, ..., d_n} whose prices are {ρ_1, ρ_2, ..., ρ_n}. We assume that each product has a relevance score with respect to the query; these scores are {s_1, s_2, ..., s_n}. The relevance score is correlated with user clicks, but the degree of correlation can vary across queries and product corpora. Let U = {u_1, ..., u_m} denote a set of m users. We consider a simple user behavior model in which a user browses the items one after another from the top and then either purchases an item and leaves the site, or leaves the site without any purchase after browsing at most the top k (k << n) items shown to him or her. We study iterations (search sessions) {1, ..., T}; the top k items at iteration t for a query q are (b_{1t}, ..., b_{kt}), and the corresponding products are (d(b_{1t}), ..., d(b_{kt})). Note that in our scenario we do not have specific user information; we only have the query. This is a common situation on an e-commerce site: customers start browsing for a product and typically log into the site only at purchase time, or sometimes use a guest account for the purchase.
Our goal is to select the best top k items from the n items for the query q such that the result set caters to the diverse preferences of the users. If a user finds a relevant product in the top k, he or she can decide whether to purchase it. Let us use two indicator variables, z_{jt} ∈ {0, 1} and x_{jt} ∈ {0, 1}: the first denotes whether product d_j is selected in the top k results and shown to a customer, and the second denotes whether product d_j is purchased in iteration t for the given query. The revenue generated for the query at iteration t is $r_t = \sum_{j=1}^{n} x_{jt}\,\rho_j$, and the total revenue is $R_T = \sum_{t=1}^{T} r_t$. We can similarly define a user-level revenue $R_{u_i}$, the total revenue obtained from user $u_i$ over the T iterations, so that $R_T = \sum_{i=1}^{m} R_{u_i}$. We also define the cumulative sum of the relevance scores of the top k products, $S = \sum_{j=1}^{k} s_j$, to keep a bound on relevance.
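To make the notation concrete, the following is a minimal sketch of one iteration using the indicator variables above. All variable names and the random choices are ours, purely for illustration:

```python
import numpy as np

# Hypothetical illustration of the paper's notation for one iteration t.
rng = np.random.default_rng(0)
n, k = 10, 3
rho = rng.uniform(10, 500, size=n)        # prices {rho_1, ..., rho_n}
s = rng.uniform(0, 1, size=n)             # relevance scores {s_1, ..., s_n}

shown = rng.choice(n, size=k, replace=False)
z = np.zeros(n, dtype=int); z[shown] = 1  # z_{jt}: shown in the top k
x = np.zeros(n, dtype=int)
x[shown[0]] = 1                           # pretend the user buys the first item

r_t = float(np.sum(x * rho))              # per-iteration revenue r_t = sum_j x_{jt} rho_j
S = float(np.sum(s[shown]))               # cumulative relevance of the top k
print(r_t, S)
```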
Given this setup, the learning to diversify problem in e-commerce can be mapped to maximizing $R_{u_i}$ for every user i. This is not the same as maximizing $R_T$, and it is hard to estimate $R_{u_i}$ without estimating the purchase probabilities of a specific user. A learning to diversify algorithm, on the other hand, can optimize the search result for a larger pool of users and improve the median of the CLV. The optimization problem then requires maximizing revenue while learning the diverse preferences of the customers. One possible solution is therefore to simultaneously learn to diversify and maximize total revenue while maintaining a reasonable value of relevance, which requires a threshold on the relevance of the top k products. Using such a relevance threshold is not a new concept: it has been used before in the literature on diverse ranking [11] and, more generally, for limiting the recall set in search [22]. In practice it is not hard to establish such thresholds for a query, and this is often done in industry [41] for various purposes.

4 LEARNING ALGORITHMS
We now describe three algorithms for the problem. The first two are modifications of algorithms proposed by Radlinski et al. [36] and serve as baselines for the learning to diversify problem in e-commerce search. The third is a novel knapsack-based semi-bandit algorithm that we develop for this problem.

4.1 Revenue Ranked Explore and Commit (RREC)
This algorithm is similar to the "ranked explore and commit" (REC) algorithm described by Radlinski et al. [36]. We intend to maximize revenue, which is real-valued, instead of click-through rate, which is binary; this requires a minor change. The algorithm iteratively shows each of the n items at each rank x times and records the purchases over these nx iterations for every product at every rank position. From these counts we estimate the probability of purchase of every product at every rank and multiply it by the price of the product to estimate the generated revenue. Since revenue is real-valued and can be an arbitrary number, we normalize the prices between 0 and 1 for all products to obtain a normalized expected revenue. After completing the nx exploration iterations, the algorithm presents the products in order of decreasing expected revenue. Our implementation is shown in Algorithm 1.
This first greedy algorithm maximizes the revenue generated after the nx iterations if user preferences are unchanged; its main problem is exactly this assumption that users' preferences do not change with time. The algorithm may achieve excellent performance on the revenue metric in the specific scenario where customer preferences do not change after the nx iterations, once it has a reasonable estimate of the purchase probabilities. However, since every product must be shown at every position, the regret of this algorithm can be very poor, and the revenue generated in the initial nx iterations can be arbitrarily bad. Hence, the algorithm is quite impractical for actual implementation. Moreover, in e-commerce it is particularly unwise to take any risk of making customers unhappy; even controlled experiments must be conducted very carefully, often with a budget [21]. We present this algorithm as a baseline: a simple greedy algorithm that maximizes revenue after a sufficiently long exploration.

Algorithm 1: Revenue Ranked Explore and Commit (RREC)
    input: items (d_1, ..., d_n); parameters ϵ, δ, k
    x ← ⌈(2k²/ϵ²) log(2k/δ)⌉
    (b_1, ..., b_k) ← k arbitrary items
    for i = 1, ..., k do                      ▷ at each rank
        for all j: i_j ← 0, p_j ← 0
        for c = 1, ..., x do                  ▷ loop x times
            for j = 1, ..., n do              ▷ over every item d_j
                b_i ← d_j
                display b_1, ..., b_k to the user
                i_j ← i_j + 1
                if the user purchases b_i then p_j ← p_j + 1
        for j = 1, ..., n do
            pr_j ← p_j / (i_j + β)            ▷ β ≥ 1 is a constant to avoid division by zero
            mr_j ← pr_j × ρ_j × Z             ▷ Z is a normalization constant for the prices for a query
        j* ← argmax_j mr_j                    ▷ commit to the best item at this rank
        b_i ← d_{j*}
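For concreteness, the following is a minimal Python sketch of this explore-and-commit loop, assuming a simulation feedback function user_purchases(ranking) that we invent for illustration; it returns the 0-based rank that was purchased, or None:

```python
import numpy as np

def rrec(items, prices, k, x, user_purchases, beta=1.0):
    """Sketch of Revenue Ranked Explore and Commit (RREC).

    items: list of item ids; prices: normalized prices in [0, 1]
    (the paper's rho_j * Z); user_purchases(ranking) is a simulated
    user returning the purchased 0-based rank, or None.
    """
    n = len(items)
    ranking = list(items[:k])                 # start with k arbitrary items
    prices = np.asarray(prices, dtype=float)
    for rank in range(k):                     # commit one rank at a time
        impressions = np.zeros(n)
        purchases = np.zeros(n)
        for _ in range(x):                    # show every item x times at this rank
            for j in range(n):
                ranking[rank] = items[j]
                bought = user_purchases(ranking)
                impressions[j] += 1
                if bought == rank:
                    purchases[j] += 1
        expected_revenue = purchases / (impressions + beta) * prices
        ranking[rank] = items[int(np.argmax(expected_revenue))]   # commit
    return ranking
```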
4.2 Revenue Ranked Bandits Algorithm (RRBA)
In this algorithm, we modify Radlinski et al.'s [36] "ranked bandit algorithm" (RBA) to learn to diversify while maximizing the revenue coming from the users. A schematic of the modified algorithm is shown in Algorithm 2. It uses k bandits MAB_1, ..., MAB_k, one per rank position, for a query. Each bandit is assumed to have n arms (a_1, ..., a_n) corresponding to the n products in the recall set for the query. If a user purchases a product d_j from rank position r at iteration t, we update the score for arm a_j of MAB_r as

    $v_{rj} = \frac{p_{rj}}{i_{rj}}\,\rho_j\,Z + \alpha\sqrt{2\ln t / i_{rj}}$,

where $p_{rj}$ is the purchase count and $i_{rj}$ the impression count of product $d_j$ at rank position r up to that iteration, α is a constant, and Z is a normalization factor for the prices of the products.

Algorithm 2: Revenue Ranked Bandits Algorithm (RRBA)
    initialize MAB_1(n), MAB_2(n), ..., MAB_k(n)
    for t = 1, ..., T do                                ▷ for T iterations
        for r = 1, ..., k do                            ▷ for every position r
            b̂_r ← selectarm(MAB_r)
            if b̂_r ∈ {b_{1t}, ..., b_{(r-1)t}} then    ▷ replace repeats
                b_{rt} ← arbitrary document from D
            else
                b_{rt} ← b̂_r
        display (b_{1t}, ..., b_{kt}) to users; record purchases
        for r = 1, ..., k do                            ▷ do all updates
            i_d(b_{rt}) ← i_d(b_{rt}) + 1
            if the user purchases b_{rt} and b̂_r = b_{rt} then
                p_d(b_{rt}) ← p_d(b_{rt}) + 1
            pr_d(b_{rt}) ← p_d(b_{rt}) / i_d(b_{rt})
            mr_d(b_{rt}) ← pr_d(b_{rt}) × ρ_{b_{rt}} × Z
            var_d(b_{rt}) ← α √(2 ln t / i_d(b_{rt}))
            sc_d(b_{rt}) ← mr_d(b_{rt}) + var_d(b_{rt})
            update MAB_r with arm b_{rt} and reward sc_d(b_{rt})

The algorithm considers the same set of products as arms for all the MABs at all positions. Consequently, once a product is selected by the first r − 1 MABs, it cannot be selected by the r-th MAB. Hence, except for the MAB at the very first position, the MABs may not be able to select their best arms, which makes the algorithm perform poorly at choosing the optimal top k products. Moreover, the authors mention that the regret analysis for the non-binary case is non-trivial and that the greedy algorithm on which RBA is based can obtain a pay-off that is a factor of (k − ϵ) below optimal, for any ϵ. On the positive side, this algorithm can learn the preferences even when they change, and unlike RREC it does not require estimating the purchase probability of every product at every rank. Another problem with this algorithm, however, is the number of MABs and the amount of bookkeeping required to run it in practice. Typically, such MAB algorithms cannot replace learning to rank algorithms, as mentioned in the paper by Radlinski et al. [36]. The most practical way to deploy any bandit algorithm of this kind is as the topmost layer of a multi-layer ranking architecture, where the set of products ranked by the LTR algorithm in the layer below is handed over to the MAB algorithm.
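The sketch below shows one possible realization of the per-position bandit in Python. It is our reading of Algorithm 2, not the paper's code: we accumulate price-weighted purchases so that the empirical mean is the normalized revenue, and we handle repeats by banning already-chosen arms rather than substituting an arbitrary document:

```python
import math
import numpy as np

class PositionUCB:
    """One UCB-style bandit per rank position, sketched after Algorithm 2 (RRBA)."""

    def __init__(self, n, alpha=1.0):
        self.n, self.alpha = n, alpha
        self.impressions = np.zeros(n)
        self.rewards = np.zeros(n)        # price-weighted purchase totals

    def select_arm(self, t, banned):
        scores = np.full(self.n, np.inf)  # unexplored arms get priority
        seen = self.impressions > 0
        scores[seen] = (self.rewards[seen] / self.impressions[seen]
                        + self.alpha * np.sqrt(2 * math.log(t) / self.impressions[seen]))
        scores[list(banned)] = -np.inf    # cannot repeat higher-ranked picks
        return int(np.argmax(scores))

    def update(self, arm, purchased, price_norm):
        self.impressions[arm] += 1
        if purchased:
            self.rewards[arm] += price_norm   # normalized revenue rho_j * Z
```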
4.3 Knapsack based bandit algorithm (KPBA)
KPBA is a novel algorithm that we propose in this paper. To overcome the problem of overlapping arms in the k MABs of RRBA, we consider a semi-bandit algorithm [6] that uses a single bandit with n arms and selects the k best arms at every iteration, processing feedback for all k arms at each iteration. Furthermore, to guarantee better relevance, we introduce a relevance threshold while maximizing the revenue and learning the diverse preferences. The resulting optimization problem is similar to the well-known exact k-item knapsack problem, E-kKP [25]. The revenue expression is based on a UCB score similar to our formulation of RRBA. KPBA does not require as much bookkeeping as RRBA, since it maintains a single bandit rather than k MABs, and it does not suffer from the limitation of being unable to select an optimal arm. Furthermore, it keeps a relevance bound and hence can perform better on information retrieval ranking metrics such as mean reciprocal rank (MRR). Being a bandit algorithm, it can also learn changing customer preferences.
We now present the KPBA formulation. Suppose an item d_j is purchased at iteration t; the normalized revenue generated in that iteration is $r_{jt} = \rho_j \times Z$. The algorithm keeps track of n UCB scores, one per product. The UCB score of product d_j after iteration t is

    $v^{UCB}_{jt} = r_{jt} + \alpha\sqrt{2\ln t / i_{jt}}$,

where $i_{jt}$ is the impression count of item $d_j$ up to iteration t. At each iteration in {1, ..., T}, the products for the top k positions are then chosen from a knapsack-based optimization framework:

    $\max \sum_{j=1}^{k} v^{UCB}_{jt}$  subject to  $\sum_{j=1}^{k} s_j \geq B$.        (1)

Here B is a threshold on the cumulative sum of the relevance scores of the selected products, and the scores $v^{UCB}_{jt}$ are updated after each iteration. To solve this problem, we cast Problem 1 as a binary integer programming (BIP) problem [19]. Defining $\hat{s}_i = 1 - s_i$ and $\hat{B} = C - B$, where C is another constant, Problem 1 can be rewritten as

    $\max_{x_{1t},\dots,x_{nt}} \sum_{i=1}^{n} x_{it}\, v^{UCB}_{it}$  subject to  $\sum_{i=1}^{n} x_{it}\,\hat{s}_i \leq \hat{B}$  and  $\sum_{i=1}^{n} x_{it} = k$.        (2)

Problem 2 is exactly E-kKP, which is NP-hard in the number of arms n; a brute-force solution can be found in $O(n^k)$ time, where k is the number of selected arms. However, Caprara et al. [10] propose a 1/2-approximation algorithm, named $H^{1/2}$, based on a well-known LP relaxation of the knapsack problem [32] and the fact that any basic feasible solution of the linear relaxation of the BIP contains at most two fractional weights. If there are no fractional weights, the optimal basic feasible solution of the LP relaxation is an optimal solution of the BIP. If one or two of the n weights have fractional values in the basic feasible solution, selecting one of the two fractional items still guarantees a 1/2-approximate solution for E-kKP. The authors show that the algorithm runs in O(n) time; the details of $H^{1/2}$ and its analysis can be found in Caprara et al. [10]. A schematic of our proposed algorithm, which uses $H^{1/2}$ as a subroutine, is shown in Algorithm 3.

Algorithm 3: Knapsack based bandit algorithm (KPBA)
    initialize SemiMAB(n)
    for t = 1, ..., T do                             ▷ for T iterations
        {b_1, ..., b_k} ← H^{1/2}(D)                 ▷ details in Caprara et al. [10]
        display (b_{1t}, ..., b_{kt}) to users; record purchases
        for l = 1, ..., k do                         ▷ update impression and purchase counts
            i_l ← i_l + 1
            I_l ← 1 if the user purchases b_l, else 0
            if I_l = 1 then p_l ← p_l + 1
        for l = 1, ..., k do                         ▷ update UCB scores
            v_l ← (p_l / i_l) × ρ_l × Z + α √(2 ln t / i_l)
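The selection step can be sketched via the LP relaxation. The code below is not Caprara et al.'s $H^{1/2}$ verbatim: it solves the relaxation with scipy and rounds the at-most-two fractional variables greedily, which follows the spirit of the 1/2-approximation under the assumptions stated in the comments:

```python
import numpy as np
from scipy.optimize import linprog

def select_top_k(v_ucb, s_hat, B_hat, k):
    """LP-relaxation sketch of the E-kKP selection step (Problem 2).

    Maximize sum_i x_i * v_ucb[i] s.t. sum_i x_i * s_hat[i] <= B_hat,
    sum_i x_i = k, 0 <= x_i <= 1. A basic optimal solution has at most
    two fractional entries [10, 32]; we round them greedily.
    """
    n = len(v_ucb)
    v_ucb = np.asarray(v_ucb, dtype=float)
    res = linprog(c=-v_ucb,                               # linprog minimizes
                  A_ub=np.asarray(s_hat, dtype=float).reshape(1, -1), b_ub=[B_hat],
                  A_eq=np.ones((1, n)), b_eq=[k],
                  bounds=[(0, 1)] * n, method="highs")
    if not res.success:                                   # infeasible threshold: fall back
        return list(np.argsort(v_ucb)[-k:][::-1])
    x = res.x
    chosen = [i for i in range(n) if x[i] > 1 - 1e-6]     # integral picks
    frac = sorted((i for i in range(n) if 1e-6 < x[i] <= 1 - 1e-6),
                  key=lambda i: v_ucb[i], reverse=True)
    for i in frac:                                        # add best-fitting fractional items
        slack = B_hat - sum(s_hat[j] for j in chosen)
        if len(chosen) < k and s_hat[i] <= slack:
            chosen.append(i)
    return chosen[:k]
```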
5 THEORETICAL ANALYSIS

5.1 The offline optimization problem
It is straightforward to see that the problem of finding the set of k optimal products from n products with binary reward is equivalent to the maximum coverage problem [36]. The optimal greedy approximation [27, 35] for this problem is a (1 − 1/e)-approximation algorithm.

5.2 RREC
We modify Radlinski et al.'s [36] REC algorithm to work with real-valued rewards; this algorithm serves as a baseline in our paper. Radlinski et al. [36] mention that the regret analysis of REC can be extended to the case where rewards are not binary. They show that REC achieves a pay-off of $(1 - \frac{1}{e} - \epsilon)\mathrm{OPT} - O\!\big(k^3 \frac{n}{\epsilon^2} \ln(k/\delta)\big)$ with probability at least (1 − δ). In our case the expression is similar, since we use the same algorithm with real rewards.

5.3 RRBA
Radlinski et al. [36] provide regret and pay-off results for the RBA algorithm using EXP3 [8] and binary rewards. The combined pay-off shown in the paper is $(1 - \frac{1}{e})\mathrm{OPT} - O\!\big(k\sqrt{nT\log n}\big)$. The authors comment that in the real-reward case the pay-off can be a factor of (k − ϵ) below optimal, for any ϵ ≥ 0. We can use the same combined pay-off for RRBA with real rewards.

5.4 KPBA
Our problem is conceptually similar to the dynamic assortment selection problem [38]. However, assortment selection models often assume that the purchase of a product depends on the selection of a specific set of products, and many papers in this area use the concept of the utility of a set of products and a choice model of user purchase behavior [33]. We instead assume that users examine each product individually and independently and are not interested in going beyond the top k products if they do not find the desired product. The analysis of our algorithm can nevertheless parallel the analysis of algorithms for dynamic assortment selection. Specifically, we use the regret bound proof from one such paper by Agrawal et al. [3], where the authors use a choice model for purchase and keep showing one assortment of k products until a no-purchase event happens. In our case, the loss in revenue at each iteration t is

    $\Big(1 - \frac{1}{e}\Big)\mathrm{OPT} - \frac{1}{2}\sum_{j=1}^{n} z_{jt}\, v^{UCB}_{jt}$,

where the factor 1/2 in the second term comes from the half-approximation exact k-item knapsack algorithm that we use. Intuitively, since $v^{UCB}_{jt}$ follows the properties of the UCB algorithm, each product is akin to being bounded by the regret bound of the UCB algorithm [7]. Note that the average number of times an item is shown to the customers over t iterations is bounded by $tk/n$. Let $\hat{r}_{jt}$ denote the expected normalized revenue of item $d_j$ at iteration t.
Now we can write the following:

    $\sum_{j=1}^{n} z_{jt}\, v^{UCB}_{jt} \;\geq\; \sum_{j=1}^{n} z_{jt}\, \hat{r}_{jt}$

    $\sum_{j=1}^{n} \big(z_{jt}\, v^{UCB}_{jt} - z_{jt}\, \hat{r}_{jt}\big) \;\leq\; \alpha \sum_{j=1}^{n} \sqrt{2\ln t / i_{jt}} \;\leq\; O\!\big(\sqrt{n\ln t/(tk)}\big)$

    $\sum_{j=1}^{n} \big(z_{jt}\, v^{UCB}_{jt} - z_{jt}\, \hat{r}_{jt}\big) \;\leq\; O\!\big(\sqrt{nt\ln t}\big)$

Note that we use the expected average $\sum_{t=1}^{T} \sqrt{1/t} = O(\sqrt{T})$. We can also use a different bound from the UCB algorithm, using $\lg n$ instead of $\lg t$. Our sketch of the derivation is similar to the proofs in Agrawal et al.'s paper [3]; the bound can probably be made tighter using the techniques used by Auer et al. [7] to prove the bounds of the UCB algorithm. This shows that the regret can be bounded by $O(\sqrt{nT\lg T})$, and the pay-off can then be bounded by $(1 - \frac{1}{e})\mathrm{OPT} - O(\sqrt{nT\lg n})$ for KPBA, which is similar to Radlinski et al.'s [36] bound for real rewards.

5.5 Comments from the Analysis
From the above analysis it is clear that RREC has several drawbacks, as also identified by Radlinski et al. [36]. Moreover, RREC is impractical, since its regret in the first nx iterations can be arbitrarily bad [36]; we therefore expect it to also perform poorly in terms of revenue and median CLV, and the analysis says nothing about its performance in terms of relevance. RRBA, on the other hand, can clearly be concluded to perform better than RREC for revenue based on the analytical regret expression. KPBA can be expected to be similar to RRBA in terms of revenue and median CLV. However, KPBA is more appealing than RRBA when we have real rewards, and since KPBA does not compromise relevance beyond a bound while diversifying, it can be expected to perform better for the users in terms of relevance metrics.

6 EVALUATION
In this study, we use the following metrics to evaluate the performance of our learning algorithms from the perspectives of revenue, CLV, and relevance:
• Average Revenue per Query (ARQ): the average revenue over all queries in the experimental study, given by $\frac{1}{n}\sum_{i=1}^{t} R_i$, where t is the total number of iterations in the simulation study and n is the number of queries.
• Median Customer Lifetime Value (MCV): suppose each customer $u_j$ spends $r^t_{u_j}$ on purchases over t iterations; the customer value is then the median of all customer spends.
• Mean Reciprocal Rank of Purchases (PMRR): the reciprocal rank of the product purchased for a given query in a search session, averaged over all search sessions in a period of time.
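For concreteness, a small sketch of how these three metrics can be computed from simulation logs; the helper names and the zero-contribution convention for sessions without a purchase are our assumptions:

```python
import numpy as np

def arq(revenues_per_iteration, num_queries):
    """Average Revenue per Query: total revenue over the run divided by #queries."""
    return float(np.sum(revenues_per_iteration)) / num_queries

def mcv(spend_by_customer):
    """Median Customer Lifetime Value over all simulated customers.

    spend_by_customer: dict mapping customer id -> total spend."""
    return float(np.median(list(spend_by_customer.values())))

def pmrr(purchase_ranks):
    """Purchase-based MRR: mean of 1/rank over sessions (1-based ranks).

    Sessions without a purchase (rank None) contribute 0 here, an
    assumption on our part rather than the paper's stated convention."""
    return float(np.mean([1.0 / r if r else 0.0 for r in purchase_ranks]))
```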
It is clear from the analytical derivations in Section 5 that KPBA and RRBA both perform better than RREC in terms of ARQ, which also indicates that these two algorithms can exceed RREC in MCV. Moreover, because of the relevance bound in KPBA, its PMRR is expected to be better than that of RRBA. However, it is not clear how the PMRR of KPBA compares with that of RREC, whose PMRR can be good, particularly after the nx exploration iterations. It is also unclear how KPBA and RRBA differ in terms of ARQ, since their pay-off expressions are similar.
We did not have any historical search log data from a real e-commerce site, so we evaluate our algorithms by generating synthetic data and conducting a simulation study. We assume that different users have different price preferences and that these preferences mainly dictate their purchase behavior. In reality, user preference is a complex function of multiple factors associated with the products and other variables, such as the time of year and the financial status of the person. This assumption, however, keeps our data generation and simulation simple while still allowing us to evaluate and compare the algorithms in a manner that reveals their characteristics. In our simulation study, we model the biases of a real system. Evaluating bandit-based algorithms in an online setting is known to be very hard [31], and offline counterfactual techniques [24] using historical logs have recently become a topic of research for such evaluations; we adopt the main ideas of these evaluation techniques in this paper.
We aim to answer the following questions with the simulation study:
(1) How does KPBA compare with RRBA in terms of ARQ, MCV, and PMRR? In particular, how much better are these metrics when we use KPBA instead of RRBA, under scenarios without and with position bias and with changing customer preferences?
(2) Does RREC perform better than RRBA in PMRR, and how does it compare with KPBA under a reasonable assumption on the relevance threshold?
We discuss the experiments after illustrating our data generation methodology.

6.1 Synthetic data generation
Our synthetic e-commerce data consists of N queries and M relevant products per query. We assign prices to the products randomly from a multimodal Gaussian distribution with m = {1, 8} peaks and mean prices between $10 and $500. The purchase rates are generated from a similar distribution with mean peaks between 0.0 and 0.06. The maximum mean purchase-rate peak is assigned to the cheapest mean price peak 70% of the time, and to one of the other mean price peaks the rest of the time. We additionally generate a relevance score that is linearly correlated with the purchase rates, with a Pearson correlation coefficient between 0.10 and 0.30 at a p-value below 0.10. In this paper, we use M = 200. In a real scenario M can be a very large number, but the recall set for a search query typically drops to a much smaller size for performance reasons, and we anticipate applying our algorithms on top of a multi-layer ranking architecture; this also makes running our experiments simpler and faster without losing generality. Figure 1 provides histograms of the synthetically generated prices and shows the relationship between the synthetic relevance scores and the product purchase rates. The multimodal price distribution and the weak linear correlation between relevance score and purchase rate represent a common scenario for an e-commerce platform.
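The following is a sketch of this data generation; the peak locations, noise scales, and mixing details below are illustrative assumptions of ours, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(7)

def synth_products(M=200, price_peaks=(20.0, 480.0), cheap_gets_max_rate=0.7):
    """Sketch of the synthetic catalog of Section 6.1: multimodal Gaussian
    prices in [$10, $500], purchase rates in [0, 0.06], and a relevance
    score weakly correlated with the purchase rate."""
    which = rng.integers(0, len(price_peaks), size=M)       # pick a price peak per item
    prices = np.clip(rng.normal(np.array(price_peaks)[which], 30.0), 10, 500)
    rate_means = np.linspace(0.005, 0.055, len(price_peaks))
    if rng.random() < cheap_gets_max_rate:                  # max rate at the cheapest peak
        per_item_means = rate_means[::-1][which]
    else:                                                   # max rate at some other peak
        per_item_means = rng.permutation(rate_means)[which]
    rates = np.clip(rng.normal(per_item_means, 0.01), 0.0, 0.06)
    # relevance: mostly noise plus a rate-driven component, aiming at the
    # paper's weak Pearson correlation (0.10-0.30) with the purchase rate
    relevance = 0.2 * (rates / rates.max()) + 0.8 * rng.random(M)
    return prices, rates, relevance
```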
6.2 User's price preference model
We assign every user u ∈ U to a preferred price cluster t_u using a Chinese Restaurant Process (CRP) [4] with a parameter θ. For example, with |U| = 20 and θ = 3.0, the average number of price clusters produced by the CRP is 6.5. We then assign each product to one of these clusters such that cheaper products are assigned to a smaller price cluster number. We denote the price cluster of product d_i by t_{d_i}.

[Figure 1: Price histograms and their corresponding relevance scores and purchase rates. The plot shows the correlation between the relevance score and the purchase rate with a fitted line.]

6.3 User behavior model
In the simulation, we consider that a user u browses through the top k products shown, one after another. The probability that user u purchases product d_i located at ranking position j is

    $pr(u, d_i, j) = \begin{cases} c \times p_{d_i} \times \frac{1}{\log_2(j+1)} & \text{if } t_u = t_{d_i} \\ (1-c) \times p_{d_i} & \text{otherwise} \end{cases}$

We use c = 0.7. The factor $\frac{1}{\log_2(j+1)}$ is the position bias.
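A short sketch of this user model: the CRP cluster assignment [4] and the purchase probability above, with function names of our choosing:

```python
import math
import numpy as np

rng = np.random.default_rng(11)

def crp_clusters(num_users, theta):
    """Chinese Restaurant Process: user u joins an existing cluster with
    probability |cluster|/(u + theta), or a new one with theta/(u + theta)."""
    clusters = []
    for u in range(num_users):
        total = u + theta
        probs = [len(c) / total for c in clusters] + [theta / total]
        pick = rng.choice(len(clusters) + 1, p=probs)
        if pick == len(clusters):
            clusters.append([u])
        else:
            clusters[pick].append(u)
    return clusters

def purchase_prob(user_cluster, item_cluster, item_rate, rank, c=0.7, position_bias=True):
    """Purchase probability from Section 6.3; rank is 1-based."""
    if user_cluster == item_cluster:
        bias = 1.0 / math.log2(rank + 1) if position_bias else 1.0
        return c * item_rate * bias
    return (1 - c) * item_rate
```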
7 RESULTS
We conducted three experimental studies to evaluate the performance of the algorithms. In each iteration of a study, we pick a query and a user uniformly at random. Each study runs three experiments per algorithm with the following setups: (1) N = 1, |U| = 20, θ = 3.0, 1000 iterations; (2) N = 10, |U| = 20, θ = 10.0, 50000 iterations; (3) N = 10, |U| = 100, θ = 10.0, 50000 iterations. For each experiment, we report the metrics averaged over 100 runs of the simulator. We describe the experiments in the sections below.

7.1 Comparison without Position Bias
In this experiment, we do not apply position bias when computing the probability of purchase. The results are summarized in Table 1. We observe that KPBA and RRBA generate significantly more ARQ and MCV than RREC; however, RREC has better PMRR than RRBA. KPBA, in turn, achieves even more ARQ and MCV than RRBA, and its PMRR is close to that of RREC. We conducted experiments with different numbers of users and iterations and observed a similar trend. The second row for each algorithm shows the results with 10 queries and 50000 iterations: KPBA proves substantially better than both other algorithms over 50000 iterations, while its PMRR is generally similar to that of RREC. The third row shows the results with 10 queries, 50000 iterations, and 100 users arriving through a CRP process with θ = 10.0. We observe that KPBA again produces substantially better ARQ and MCV than the other two algorithms and performs very similarly to RREC on the PMRR metric.

    Algo   Q    U     Iter     ARQ ($)   MCV ($)   PMRR
    RREC   1    20    1000         140       7.0   0.41
    RREC   10   20    50000       3500      1486   0.54
    RREC   10   100   50000       5039       439   0.52
    RRBA   1    20    1000       18567       870   0.26
    RRBA   10   20    50000     138556     64483   0.29
    RRBA   10   100   50000     144970     14254   0.29
    KPBA   1    20    1000       25022      1085   0.41
    KPBA   10   20    50000     156941     79621   0.52
    KPBA   10   100   50000     174030     16105   0.52
    Table 1: Comparison of KPBA, RRBA and RREC without position bias.

7.2 Comparison with Position Bias
In this section, we repeat all the previous experiments with position bias in the user behavior, introducing a logarithmic decay $\frac{1}{\log_2(j+1)}$ in the probability of purchase based on the ranking position j of the product. We again observe that KPBA performs better than RRBA on all metrics; this is expected, since we already found better PMRR in the experiments without position bias. Table 2 summarizes the experiments. The second and third rows for each algorithm use θ = 10.0 for the CRP process. Note that the results for KPBA are significantly better than those for RRBA when position bias is present; thus, KPBA can be the more practical algorithm for realization.

    Algo   Q    U     Iter     ARQ ($)   MCV ($)   PMRR
    RREC   1    20    1000         158         8   0.37
    RREC   10   20    50000       3500      1486   0.54
    RREC   10   100   50000       5039       439   0.52
    RRBA   1    20    1000       16922       843   0.19
    RRBA   10   20    50000      69221     34521   0.29
    RRBA   10   100   50000      97534      9917   0.31
    KPBA   1    20    1000       21497      1085   0.40
    KPBA   10   20    50000     109086     53716   0.53
    KPBA   10   100   50000     107091     10655   0.54
    Table 2: Comparison of KPBA, RRBA and RREC with position bias.

7.3 Comparison with Changing Customer Preferences
In this section, we experiment with customer preferences that change after every 500 iterations. The experiments are summarized in Table 3. We again notice that the KPBA algorithm still performs better than RRBA on all three metrics, and it compares favorably with RREC on PMRR.

    Algo   Q    U     Iter     ARQ ($)   MCV ($)   PMRR
    RREC   1    20    1000         117         6   0.35
    RREC   10   20    50000       3100      1134   0.50
    RREC   10   100   50000      17111      1634   0.57
    RRBA   1    20    1000       20288      1019   0.17
    RRBA   10   20    50000      43654     25345   0.25
    RRBA   10   100   50000      95281      9540   0.27
    KPBA   1    20    1000       23201      1125   0.49
    KPBA   10   20    50000      95754     42876   0.53
    KPBA   10   100   50000     103324     10315   0.57
    Table 3: Comparison of KPBA, RRBA and RREC with changing customer preferences.

7.4 Comparison of convergence for the three algorithms
To show the growth of the metrics under the three scenarios above, we plot the two main metrics, ARQ and MCV, from our simulation study, collecting them every 100 iterations. Figure 2 uses 5 queries and 20 users with θ = 3.0 and compares the two metrics without position bias for all three algorithms. Figure 3 shows the same two metrics with position bias. Figure 4 shows the metrics with customer preferences changing every 500 iterations, along with position bias. The plots make clear that KPBA performs consistently better than RRBA and RREC on the ARQ and MCV metrics; note that all prices are on a log scale in the convergence plots. Figure 4 also shows that KPBA has similarly good performance on the PMRR metric compared to RREC.

7.5 Comments on experimental evaluation
As expected from the analytical study, KPBA performs well on the ARQ metric compared to RRBA, and RREC performs worst on the revenue metrics. Our simulation studies show that the MCV metric is also significantly better for KPBA than for RRBA and RREC. Moreover, the PMRR values of KPBA are similar to or better than those of RREC and far better than those of RRBA, based on the simulation results and consistent with our theoretical analysis. We also observe that KPBA continues to perform better with more iterations and more users, with position bias, and with changing user preferences.
8 CONCLUSION
In this paper, we introduce the learning to diversify problem for e-commerce search. We show that, in order to serve both the company and the customers best, e-commerce requires a unique formulation of the learning to diversify problem, in which we learn to diversify and maximize revenue simultaneously while ensuring a minimum relevance for the top k results. Our theoretical results show that the KPBA algorithm is expected to have better ARQ and PMRR than RRBA. Our simulation studies show that KPBA also has better MCV than both RRBA and RREC, and that it performs well on PMRR compared to RREC. RRBA, on the other hand, is quite bad from the customer's perspective for this problem, since its PMRR is low in all simulated scenarios. In essence, we show that KPBA can be an efficient and practical algorithm for diversifying e-commerce search results. We also show that e-commerce companies can improve CLV by using a diverse ranking strategy; this connection between diversity in ranking and CLV is worth exploring further in the future. KPBA could also potentially be further optimized by formulating a variable-budget knapsack problem in which the optimal relevance threshold is learned simultaneously, and it may be possible to find a better probabilistic approximation algorithm for such optimization problems [34]. We anticipate that this paper will motivate further research in the area of diverse ranking for e-commerce search and recommender systems.

[Figure 2: ARQ and MCV without position bias. Red curves: RREC; blue: RRBA; green: KPBA. The revenue metric uses a log scale.]
[Figure 3: ARQ and MCV with position bias. Red curves: RREC; blue: RRBA; green: KPBA. The revenue metric uses a log scale.]
[Figure 4: Metrics with changing customer preferences and position bias. Red curves: RREC; blue: RRBA; green: KPBA. The revenue metric uses a log scale.]

REFERENCES
[1] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. 2009. Online models for content optimization. In Advances in Neural Information Processing Systems. 17–24.
[2] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. ACM, 5–14.
[3] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. 2016. A near-optimal exploration-exploitation approach for assortment selection. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 599–600.
[4] David J. Aldous. 1985. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1983. Springer, 1–198.
In École d’Été de evaluation of contextual-bandit-based news article recommendation algorithms. Probabilités de Saint-Flour XIIIâĂŤ1983. Springer, 1–198. In Proceedings of the fourth ACM international conference on Web search and data [5] Albert Angel and Nick Koudas. 2011. Efficient diversity-aware search. In Proceed- mining. ACM, 297–306. ings of the 2011 ACM SIGMOD International Conference on Management of data. [32] Silvano Martello and Paolo Toth. 1990. Knapsack problems: Algorithms and ACM, 781–792. computer interpretations. Hoboken, NJ: Wiley-Interscience (1990). [6] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. 2013. Regret in online [33] Daniel McFadden. 1980. Econometric models for probabilistic choice among combinatorial optimization. Mathematics of Operations Research 39, 1 (2013), products. Journal of Business (1980), S13–S29. 31–45. [34] S Muthukrishnan, Martin Pál, and Zoya Svitkina. 2007. Stochastic models for [7] Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. budget optimization in search-based advertising. In International Workshop on Journal of Machine Learning Research 3, Nov (2002), 397–422. Web and Internet Economics. Springer, 131–142. [8] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 2002. The [35] George L Nemhauser and Laurence A Wolsey. 1978. Best algorithms for approxi- nonstochastic multiarmed bandit problem. SIAM journal on computing 32, 1 mating the maximum of a submodular set function. Mathematics of operations (2002), 48–77. research 3, 3 (1978), 177–188. [9] Hans H Bauer, Tomas Falk, and Maik Hammerschmidt. 2006. eTransQual: A trans- [36] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning Diverse action process-based approach for capturing service quality in online shopping. Rankings with Multi-Armed Bandits. In Proceedings of the 25th International Journal of Business Research 59, 7 (2006), 866–875. Conference on Machine Learning (ICML ’08). [10] Alberto Caprara, Hans Kellerer, and Ulrich Pferschy. 1998. Approximation Algo- [37] Rodrygo L.T. Santos, Craig Macdonald, and Iadh Ounis. 2010. Selectively Diversi- rithms for Knapsack Problems with Cardinality Constraints. European Journal of fying Web Search Results. In Proceedings of the 19th ACM International Conference Operational Research 123 (1998), 2000. on Information and Knowledge Management (CIKM ’10). ACM, New York, NY, [11] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based USA, 1179–1188. reranking for reordering documents and producing summaries. In Proceedings of [38] Denis Sauré and Assaf Zeevi. 2013. Optimal dynamic assortment planning with the 21st annual international ACM SIGIR conference on Research and development demand learning. Manufacturing & Service Operations Management 15, 3 (2013), in information retrieval. ACM, 335–336. 387–404. [12] Olivier Chapelle, Yi Chang, and T-Y Liu. 2011. Future directions in learning to [39] Daria Sorokina and Erick Cantu-Paz. 2016. Amazon Search: The Joy of Ranking rank. In Proceedings of the Learning to Rank Challenge. 91–100. Products. In Proceedings of the 39th International ACM SIGIR Conference on Re- [13] Harr Chen and David R Karger. 2006. Less is more: probabilistic models for re- search and Development in Information Retrieval (SIGIR ’16). ACM, New York, NY, trieving fewer relevant documents. In Proceedings of the 29th annual international USA, 459–460. ACM SIGIR conference on Research and development in information retrieval. 
ACM, [40] Unknown. 2013. The Moore?s law of e-commerce or the crucial importance of 429–436. being first. https://www.mabaya.com/the-moores-law-of-e-commerce-or-the- [14] Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. 2011. Contextual Bandits crucial-importance-of-being-first/. (2013). [Online; accessed 27-January-2018]. with Linear Payoff Functions.. In AISTATS, Vol. 15. 208–214. [41] Unknown. 2017. Relevance Scores: Understanding and Customizing. https: [15] Charles LA Clarke, Maheedhar Kolla, Gordon V Cormack, Olga Vechtomova, //docs.marklogic.com/guide/search-dev/relevance. (2017). [Online; accessed Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and diversity 27-January-2018]. in information retrieval evaluation. In Proceedings of the 31st annual international [42] Cong Yu, Laks Lakshmanan, and Sihem Amer-Yahia. 2009. It takes variety to ACM SIGIR conference on Research and development in information retrieval. ACM, make a world: diversification in recommender systems. In Proceedings of the 12th 659–666. international conference on extending database technology: Advances in database [16] Nick Craswell. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems. technology. ACM, 368–378. Springer, 1703–1703. [43] Yisong Yue and Carlos Guestrin. 2011. Linear submodular bandits and their [17] Olayinka David-West. 2016. E-Commerce Management in Emerging Markets. application to diversified retrieval. In Advances in Neural Information Processing Encyclopedia of E-Commerce Development, Implementation, and Management 1 Systems. 2483–2491. (2016), 200. [44] Cheng Xiang Zhai, William W Cohen, and John Lafferty. 2003. Beyond inde- [18] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. 2010. Para- pendent relevance: methods and evaluation metrics for subtopic retrieval. In metric bandits: The generalized linear case. In Advances in Neural Information Proceedings of the 26th annual international ACM SIGIR conference on Research Processing Systems. 586–594. and development in informaion retrieval. ACM, 10–17. [19] Robert S Garfinkel and George L Nemhauser. 1972. Integer programming. Vol. 4. Wiley New York. [20] Jennifer Gimson. 2017. Why Lifetime Value is the Most Important Metric in eCom- merce. https://crealytics.com/blog/lifetime-value-important-metric-ecommerce/. (2017). [Online; accessed 27-January-2018]. [21] Anjan Goswami, Wei Han, Zhenrui Wang, and Angela Jiang. 2015. Controlled experiments for decision-making in e-Commerce search. In Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 1094–1102. [22] David Hawking, Paul Thistlewaite, and Nick Craswell. 1997. Anu/acsys trec-6 experiments. In TREC. Citeseer, 275–290. [23] Dipak Jain and Siddhartha S Singh. 2002. Customer lifetime value research in marketing: A review and future directions. Journal of interactive marketing 16, 2 (2002), 34–46. [24] Thorsten Joachims and Adith Swaminathan. 2016. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 1199–1201. [25] H. Kellerer, U. Pferschy., and D. Pisinger. 2003. Knapsack Problems. Springer. [26] Hans Kellerer, Ulrich Pferschy, and David Pisinger. 2004. Introduction to NP- Completeness of knapsack problems. Springer. [27] Samir Khuller, Anna Moss, and Joseph Seffi Naor. 1999. The budgeted maximum coverage problem. Inform. Process. Lett. 70, 1 (1999), 39–45. 
[28] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. 2015. Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem.. In COLT. 1141–1154.