Leverage Implicit Feedback for Context-aware Product Search

Keping Bi (1), Choon Hui Teo (2), Yesh Dattatreya (2), Vijai Mohan (2), W. Bruce Croft (1)
(1) Center for Intelligent Information Retrieval, University of Massachusetts Amherst, {kbi,croft}@cs.umass.edu
(2) Search Labs, Amazon, {choonhui,ydatta,vijaim}@amazon.com

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

ABSTRACT

Product search serves as an important entry point for online shopping. In contrast to web search, the retrieved results in product search not only need to be relevant but also should satisfy customers' preferences in order to elicit purchases. Previous work has shown the efficacy of purchase history in personalized product search [3]. However, customers with little or no purchase history do not benefit from personalized product search. Furthermore, preferences extracted from a customer's purchase history are usually long-term and may not always align with her short-term interests. Hence, in this paper, we leverage clicks within a query session, as implicit feedback, to represent users' hidden intents, which further act as the basis for re-ranking subsequent result pages for the query. Modeling user preference with implicit feedback has been studied extensively in recommendation tasks. However, there has been little research on modeling users' short-term interest in product search. We study whether short-term context could help promote users' ideal item in the following result pages for a query. Furthermore, we propose an end-to-end context-aware embedding model which can capture long-term and short-term context dependencies. Our experimental results on the datasets collected from the search log of a commercial product search engine show that short-term context leads to much better performance compared with long-term and no context. Our results also show that our proposed model is more effective than word-based context-aware models.

KEYWORDS

Implicit Feedback, Product Search, Context-aware Search

1 INTRODUCTION

Online shopping has become an important part of people's daily life in recent years. In 2017, e-commerce represented 8.2% of global retail sales (2,197 billion dollars); 46.4% of internet users shop online and nearly one-fourth of them do so at least once a week [34]. Product search engines have become an important starting point for online shopping. A number of consumer surveys have shown that more online shoppers started searches on e-commerce search engines (e.g., Amazon) rather than a generic web search engine (e.g., Google) [10].

In contrast to document retrieval, where relevance is a universal evaluation criterion, a product search system is evaluated based on user purchases that depend on both product relevance and customer preferences. Previous research on product search [7, 8, 19, 38, 42] focused on product relevance. Several attempts [27, 44] were also made to improve customer satisfaction by diversifying search results. Ai et al. [3] introduced a personalized ranking model which takes the users' preferences learned from their historical reviews together with the queries as the basis for ranking. However, their work has several limitations. First, the personalized model cannot cope with situations such as users that have not logged in during searching and thus cannot be identified; users that logged in but do not have enough purchase history; and a single account being shared by several family members. In these cases, user purchase records are either not available or contain substantial noise. Second, given a specific purchase need expressed as a search query, long-term behaviors may not be as informative to indicate the user's preferences as short-term behaviors such as interactions with the retrieved results. These limitations of existing work on product search motivate us to model customers' preferences based on their interactions with search results, which do not require additional customers' information or their purchase history.

Customers' interactions with search results such as clicks can be considered as implicit feedback based on their preferences. In information retrieval (IR), there are extensive studies on how to use users' feedback on the relevance of top retrieved documents to abstract a topic model and retrieve more relevant results [21, 33, 46]. These feedback techniques were shown to be very effective and can also be applied to use implicit feedback such as clicks. In contrast to document retrieval where a user's information need can usually be satisfied by a single click on a relevant result, we observe that, in product search, users tend to paginate to browse more products and make comparisons before they make final purchase decisions. In about 5% to 15% of search traffic, users browse and click results in the previous pages and purchase items in the later result pages. This provides us with the chance to collect user clicks more easily, based on which results shown in the next page can be tailored to meet the users' preferences. We reformulate product search as a dynamic ranking problem, where instead of one-shot ranking based on the query, the unseen products are re-ranked dynamically when users paginate to the next search result page (SERP) based on their implicit feedback collected from previous SERPs.

Traditional relevance feedback (RF) methods, which extract word-based topic models from feedback documents as an expansion to the original queries, have potential word mismatch problems despite their effectiveness [31, 46]. To tackle this problem, we propose an end-to-end context-aware embedding model that can incorporate both long-term and short-term context to predict purchased items. In this way, semantic match and the co-occurrence relationship between clicked and purchased items are both captured in the embeddings. We show the effectiveness of incorporating short-term context against baselines using both no short-term context and word-based context.

In this paper, we leverage implicit feedback as short-term context to provide users with more tailored search results. We first reformulate product search as a dynamic ranking problem, i.e., when users request next SERPs, the remaining unseen results will be re-ranked. We then introduce several context dependency assumptions for the task, and propose an end-to-end context-aware neural embedding model that can represent each assumption by changing the coefficients to combine long-term and short-term context. We further investigated the effect of several factors in the task: short-term context, long-term context, and neural embeddings. Our experimental results on the datasets collected from search logs of a commercial product search engine showed that incorporating short-term context leads to better performance compared with long-term context and no context, and embedding-based models perform better than word-based methods in the task under various settings.

Our contributions can be summarized as follows: (1) we reformulate conventional one-shot ranking to dynamic ranking (i.e., multi-page search) based on user clicks in product search, which has not been studied before; (2) we introduce different context dependency assumptions and propose a simple yet effective end-to-end embedding model to capture different types of dependency; (3) we investigate different aspects in the dynamic ranking task on real search log data and confirm the effectiveness of incorporating short-term context and neural embeddings. Our study on multi-page product search indicates that this is a promising direction and worth more attention.

2 RELATED WORK

Next, we review three lines of research related to our work: product search, session-aware recommendation, and user feedback for information retrieval.

2.1 Product Search

Product search has different characteristics compared with general web search; product information is usually more structured and the evaluation is usually based on purchases rather than clicks. In 2006, Jansen and Molina [15] noted that the links retrieved by an e-commerce search engine are significantly better than those obtained from general search engines. Since the basic properties of products such as brands, categories and price are well-structured, considerable work has been done on searching products based on facets [24, 39]. However, user queries are usually in natural language and hard to structure. To support keyword search, Duan et al. [7, 8] extended the Query Likelihood method [28] by considering the query generated from a mixture of the language model of background corpus and the language model of the products conditioned on their specifications. The ranking function constructed in this approach utilizes exact word matching information whereas vocabulary mismatch between free-form user queries and product descriptions or reviews from other users can still be an issue. Van Gysel et al. [38] noticed this problem and introduced a latent vector space model which matches queries and products in the semantic space. The latent vectors of products and words are learned in an unsupervised way, where vectors of n-grams in the description and reviews of the product are used to predict the product. Later, Ai et al. [3] built a hierarchical embedding model in which learned representations of users, queries, and products are used to predict product purchases and associated reviews.

Other aspects of product search such as popularity, visual preference and diversity have also been studied. Li et al. [22] investigated product retrieval from an economic perspective. Long et al. [25] predicted sales volume of items based on their transaction history and incorporated this complementary signal with relevance for product ranking. The effectiveness of images for product search was also investigated [6, 11]. To satisfy different users' intents behind the same query, efforts on improving result diversity in product retrieval have also been made [27, 44].

In terms of labels for training, there are studies on using clicks as an implicit feedback signal. Wu et al. [42] jointly modeled clicks and purchases in a learning-to-rank framework in order to optimize the gross merchandise volume. To model clicks, they consider click-through rate of an item for a given query in a set of search sessions as the signal for training. Karmaker Santu et al. [19] compared the different effects of exploiting click-rate, add-to-cart ratios, and order rates as labels. They experimented on multiple representative learning-to-rank models in product search with various settings. Our work also uses clicks as implicit feedback signals, but instead of aggregating all the clicks under the same query to get click-through rate, we consider the clicks associated with each query as an indicator of the user's short-term preference behind that query.

Most previous work treats product search as a one-shot ranking problem, where given a query, static results are shown to users regardless of their interaction with the result lists. In a different approach, Hu et al. [14] formulate the user behaviors during searching products as a Markov decision process (MDP) and use reinforcement learning to optimize the accumulative gain (expected price) of user purchases. They define the states in the MDP to be a non-terminal state, from where users continue to browse, and two terminal states, i.e., purchases happen (conversion events) or users abandon the results (abandon events). Their method is essentially online learning and refines the ranking model with large-scale users' behavior data. Although we work on a similar scenario where the results shown on the next page can be revised, they gradually refine an overall ranker that affects all the queries while our model revises results for each individual query based on the estimation of the user preference under the query. Another difference is that they only consider purchases as a deferred signal for training and do not use any clicks in the process. In contrast, we treat clicks as an indicator of user preferences and refine ranking conditioned on the preferences.

2.2 Session-aware Recommendation

In session-aware recommendation, a user's interactions with the previously seen items in the session are used for recommending the next item. Considerable research on session-aware recommendation has been done in application domains such as news, music, movies and products. Many of these works are based on matrix factorization [13, 16, 32].
More recently, session-aware recommendation approaches based on neural networks have shown superior performance. Hidasi et al. [12] model the clickstream in a session with Gated Recurrent Unit (GRU) and predict the next item to recommend in the session. Twardowski [37] also used Recurrent Neural Networks (RNN) but used attributes for item encoding and recommended only on unseen items. Quadrana et al. [30] proposed a hierarchical RNN model, which consists of a session-level GRU to model users' activities within sessions and a user-level GRU to model the evolution of the user across sessions. The updated user representation will affect the session-level GRU to make personalized recommendations. Wu and Yan [41] proposed a two-step ranking method to recommend item lists based on user clicks and views in the session. They treat item ranking as a classification problem and learn the session representation in the first step. With the session representation as context, items are reranked with a list-wise loss proposed in ListNet in the second step. Li et al. [23] adopted the attention mechanism in the RNN encoding process to identify the user's main purpose in the current session. Quadrana et al. [29] reviewed extensive previous work on sequence-aware recommendation and categorized the existing methods in terms of different tasks, goals, and types of context adaption.

The goal of a recommendation system is typically to help users explore items that they may be interested in when they do not have clear purchase needs. On the contrary, a search engine aims to help users find only items that are most relevant to their intent specified in search queries. Relevance plays totally different roles in the two tasks. In addition, the evaluation metrics in recommendation are usually based on clicks [12, 23, 30, 37, 41], whereas product search is evaluated with purchases under a query.

2.3 User Feedback for Information Retrieval

There are studies on two types of user feedback in information retrieval: implicit feedback, which usually considers click-through data as the indicator of document relevance, and explicit feedback, where users are asked to give the relevance judgments of a batch of documents. Joachims et al. [17] found that click-through data as implicit feedback is informative but biased and the relative preferences derived from clicks are accurate on average. To separate click bias from relevance signals, Craswell et al. [5] designed a Cascade Model by assuming that users examine search results from top to bottom; Dupret and Piwowarski [9] proposed a User Browsing Model where results can be skipped according to their examination probability estimated from their positions and last clicks; Chapelle and Zhang [4] constructed a Dynamic Bayesian Network model which incorporates a variable to indicate whether a user is satisfied by a click and leaves the result page. Yue and Joachims [45] defined a dueling bandit problem where reliable relevance signals are collected from users' clicks on interleaved results to optimize the ranking function. Learning an unbiased model directly from biased click-through data has also been studied by incorporating inverse propensity weighting and estimating the propensity [2, 18, 40]. In this work, we model the user preference behind a search query with her clicks and refine the following results shown to this user.

Explicit feedback is also referred to as true relevance feedback (RF) in information retrieval and has been extensively studied. Users are asked to assess the relevance of a batch of documents based on which the retrieval model is refined to find more relevant results. Rocchio [33] is generally credited as the first relevance feedback method, which is based on the vector space model [35]. After the language model approach for IR was proposed [28], the relevance model version 3 (RM3) [21] became one of the state-of-the-art pseudo RF methods that is also effective for relevance feedback. Zamani and Croft [46] incorporate the semantic match between unsupervised trained word embeddings into the language model framework and introduced an embedding-based relevance model (ERM). Although these RF methods can also be applied in our task, we propose an end-to-end neural model for relevance feedback in the context of product search.

[Figure 1: Different assumptions to model different factors as context for purchase prediction. Three graphical models over the nodes u, q, C_{1:t} and B_{t+1}, labeled Long-term Context Dependency, Short-term Context Dependency, and Long-short-term Context Dependency.]

3 CONTEXT-AWARE PRODUCT SEARCH

We reformulate product search as a dynamic re-ranking task where short-term context represented by the clicks in the previous SERPs is considered for re-ranking subsequent result pages. Users' global interests can also be incorporated for re-ranking as long-term context. We first introduce our problem formulation and different assumptions of context dependency models. Then we propose a context-aware embedding model for the task and show how to optimize the model.

3.1 Problem Formulation

A query session¹ is initiated when a user $u$ issues a query $q$ to the search engine. The search results returned by the search engine are typically grouped into pages with a similar number of items. Let $R_t$ be the set of items on the $t$-th search result page ranked by an initial ranker and denote by $R_{1:t}$ the union of $R_1, \cdots, R_t$. For practical purposes, we let the re-ranking candidate set $D_{t+1}$ for page $t+1$ be $R_{1:t+k} \setminus V_{1:t}$ where $k \geq 1$ and $V_{1:t}$ is the set of re-ranked items viewed by the user in the first $t$ pages. Given user $u$, query $q$, and the set of clicked items in the first $t$ pages $C_{1:t}$ as context, the objective is to rank all, if any, purchased items $B_{t+1}$ in $D_{t+1}$ at the top of the next result page.

¹ We refer to the series of user behaviors associated with a query as a query session, i.e., a user issues a query, clicks results, paginates, purchases items and finally ends searching with the query.

3.2 Context Dependency Models

There are three types of context dependencies that one can use to model the likelihood of a user purchasing a product in her query session, namely, long-term context, short-term context, and long-short-term context. Figure 1 shows the graphical models for these context dependencies, where $u$ denotes the latent variable of a user's long-term interest that stays the same across all the search sessions, and clicks in the first $t$ result pages, i.e., $C_{1:t}$, represent the user's short-term preference. Purchased items on and after page $t+1$, i.e., $B_{t+1}$, depend on query $q$ and different types of context under different dependency assumptions.

Long-term Context Dependency. In this assumption, only users' long-term preferences, usually represented by their historical queries and the corresponding purchased items, are used to predict the purchases in their current query sessions. An unshown item $i$ is ranked according to its probability of being purchased given $u$ and $q$, namely $p(i \in B_{t+1} | u, q)$. The advantage of such models is that personalization of search results (as proposed in Ai et al. [3]) can be conducted from the very beginning of a query session when there is no feedback information available. However, this model needs user identity and purchase history, which are not always available. In addition, the long-term context may not be informative to predict a user's final purchases since her current search intent may be totally different from any of her previous searches and purchases.

Short-term Context Dependency. The shortcomings of long-term context can be addressed by focusing on just the short-term context, i.e., the user's actions such as clicks performed within the current query session. This dependency model assumes that given the observed clicks in the first $t$ pages, the items purchased in the subsequent result pages are conditionally independent of the user, as shown in Figure 1. An unseen item $i$ in the query session is re-ranked based on its purchase probability conditioning on $C_{1:t}$ and $q$, i.e., $p(i \in B_{t+1} | C_{1:t}, q)$. In this way, users' short-term preferences are captured and their identity and purchase records are not needed. Users with little or no purchase history and users who have not logged in can benefit directly under such a ranking scheme.

Long-short-term Context Dependency. The third dependency assumption is that purchases in the subsequent result pages depend on both short-term context, e.g., previous clicks in the current query session, and long-term context, such as historical queries and purchases of the user indicated by $u$. An unseen item $i$ after page $t$ is scored according to $p(i \in B_{t+1} | C_{1:t}, q, u)$. This setting considers more information but it also has the drawback of requiring users' identity and purchase history.

We will introduce how to model the three dependency assumptions in the same framework in Section 3.3. In this paper, we focus on the case of non-personalized short-term context and include the other two types of context for comparison.

3.3 Context-aware Embedding Model

We designed a context-aware framework where models under different dependency assumptions can be trained by varying the corresponding coefficients, as shown in Figure 2. To incorporate semantic meanings and avoid the word mismatch between queries and items, we embed queries, items and users into a latent semantic space. Our context-aware embedding model is referred to as CEM. We assume users' preferences are reflected by their implicit feedback, i.e., their clicks associated with the query. Similar to relevance feedback approaches [21, 33] that extract a topic model from assessed relevant documents, our model should capture user preferences from their clicked items, which are implicit positive signals. Components of CEM will be introduced next.

[Figure 2: The structure of our context-aware embedding model (CEM). $w$ represents words in queries or product titles; $C_{1:t}$ denotes the click item set in the first $t$ SERPs, which consists of items $c$; $S_t$ is the overall context of the first $t$ SERPs, a combination of query $q$, user $u$ and clicks $C_{1:t}$; $i$ is an item in the candidate set $D_{t+1}$ for re-ranking from page $t+1$.]

Item Embeddings. We use product titles to represent products since merchants tend to put the most informative, representative text such as the brand, name, size, color, material and even target customers in product titles. In this way, items do not have unique embeddings according to their identifiers and items with the same titles are considered the same. Although this may not be accurate all the time, word representations can be generalized to new items, and we do not need to cope with the cold-start problem. We use the average of title word embeddings of a product as its own embedding, i.e.,

$$E(i) = \frac{\sum_{w \in i} E(w)}{|i|} \tag{1}$$

where $i$ is the item, and $|i|$ is the title length of item $i$. We also evaluated other more complex product title encoding approaches such as non-linear projection of average word embeddings and recurrent neural network on title word sequence, but they did not show superior performance over the simpler one that we use here.

User Embeddings. A lookup table for user embeddings is created and used for training, where each user has a unique representation. This vector is shared across search sessions and updated by the gradient learned from previous user transactions. In this way, the long-term interest of the user is captured and we use the user embeddings as long-term context in our models.

Query Embeddings. Similar to item embeddings, we use the simple average embedding of query words as the representation, which also shows the best performance compared to the non-linear projection and recurrent neural network methods we have tried. The embedding of the query is

$$E(q) = \frac{\sum_{w \in q} E(w)}{|q|} \tag{2}$$

where $|q|$ is the length of query $q$.

Short-term Context Embeddings. We use the set of clicked items to represent user preference behind the query, which we refer to as $E(C_{1:t})$. For sessions associated with a different query $q$ or page number $t$, the clicked items contained in $C_{1:t}$ may differ. We assume the sequence of clicked items does not matter when modeling short-term user preference, i.e., the same set of clicked items should imply the same user preference regardless of the order of them being clicked. There are two reasons for this assumption. One is that the user's purchase need is fixed for a query she issued and is not affected by the order of clicks. The other is that the order of user clicks is usually based on the rank of retrieved products from top to bottom as the user examines each result, which is not affected by user preference in the non-personalized search results. So we represent the set as the centroid of the clicked items in the latent semantic space, where the order of clicks does not make a difference. A simple yet effective way is to consider equal weights of all the items in $C_{1:t}$ so that the centroid is simply the averaged item embeddings:

$$E(C_{1:t}) = \frac{\sum_{i \in C_{1:t}} E(i)}{|C_{1:t}|} \tag{3}$$

where $|C_{1:t}|$ is the number of clicked items in set $C_{1:t}$.

We also tried an attention mechanism to weight each clicked item according to the query and represent the user preference with a weighted combination of clicked items. However, this method is not better than combining clicks with equal weights in our experiments. So we only show simple methods.

In the attention weights of Equation 5, $E(S_t)$ is computed according to Equation 4. This model can also be interpreted as a generative model for an item in the candidate set $D_{t+1}$ given the context $S_t$. In this case, the probability of an item in the candidate set $D_{t+1}$ being generated from the context $S_t$ is computed with a softmax function that takes the dot product score between the embedding of an item and the context as inputs, i.e.,

$$p(i | C_{1:t}, u, q) = score(i | q, u, C_{1:t}) \tag{6}$$

We need to train the model and learn appropriate embeddings of context and items so that the probability of purchased items in $D_{t+1}$, namely $B_{t+1}$, should be larger than that of the other candidate items, i.e., $D_{t+1} \setminus B_{t+1}$. Also, the conditional probability in Equation 6 can be used to compute the likelihood of the observed instance of $C_{1:t}, u, q, B_{t+1}$.

3.4 Model Optimization

The embeddings of queries, users, and items are learned by maximizing the likelihood of observing $B_{t+1}$ given the condition of $C_{1:t}, u, q$, i.e., after user $u$ issued query $q$, she clicked the items in the first $t$ SERPs ($C_{1:t}$), then models are learned by maximizing the likelihood for her to finally purchase the items in $B_{t+1}$, which are shown on and after page $t+1$.
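As an illustration only, the scoring path of Equations 1 to 6 can be sketched in pure Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`mean_vector`, `embed_words`, `overall_context`, `purchase_probabilities`, `log_likelihood`), the two-dimensional toy word embeddings, and the item titles are all hypothetical, and real embeddings would be learned rather than fixed by hand.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors (the averaging in Equations 1-3)."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / count for d in range(dim)]


def embed_words(words: Sequence[str], word_emb: Dict[str, Vector]) -> Vector:
    """Embed a product title or a query as the mean of its word embeddings."""
    return mean_vector([word_emb[w] for w in words])


def overall_context(q_vec: Vector, u_vec: Vector, c_vec: Vector,
                    lam_u: float, lam_c: float) -> Vector:
    """Convex combination of query, user, and click embeddings (Equation 4)."""
    if not (0.0 <= lam_u <= 1.0 and 0.0 <= lam_c <= 1.0 and lam_u + lam_c <= 1.0):
        raise ValueError("need 0 <= lam_u, lam_c <= 1 and lam_u + lam_c <= 1")
    lam_q = 1.0 - lam_u - lam_c
    return [lam_q * q + lam_u * u + lam_c * c
            for q, u, c in zip(q_vec, u_vec, c_vec)]


def purchase_probabilities(context: Vector,
                           candidates: Dict[str, Vector]) -> Dict[str, float]:
    """Softmax over context-item dot products, restricted to the candidate set."""
    logits = {item: sum(s * e for s, e in zip(context, vec))
              for item, vec in candidates.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {item: math.exp(x - top) for item, x in logits.items()}
    total = sum(exps.values())
    return {item: value / total for item, value in exps.items()}


def log_likelihood(probs: Dict[str, float], purchased: Sequence[str]) -> float:
    """Sum of log-probabilities of the purchased items, the quantity being maximized."""
    return sum(math.log(probs[item]) for item in purchased)


# Toy data (assumed for illustration): hand-picked 2-d word embeddings.
WORD_EMB = {"red": [1.0, 0.0], "blue": [0.0, 1.0],
            "shoe": [1.0, 1.0], "boot": [0.9, 1.1]}
query_vec = embed_words(["shoe"], WORD_EMB)
clicks_vec = mean_vector([embed_words(["red", "boot"], WORD_EMB)])  # one clicked item
candidates = {"red_shoe": embed_words(["red", "shoe"], WORD_EMB),
              "blue_shoe": embed_words(["blue", "shoe"], WORD_EMB)}

# Short-term setting: lam_u = 0, so the user vector has no influence.
context = overall_context(query_vec, [0.0, 0.0], clicks_vec, lam_u=0.0, lam_c=0.5)
probs = purchase_probabilities(context, candidates)
print(sorted(candidates, key=probs.get, reverse=True))
```

Setting `lam_c=0.0` in this sketch drops the clicks and leaves only the query (and, if `lam_u > 0`, the user), which mirrors how the coefficients switch between the dependency assumptions. Because the softmax runs only over the small candidate set, the normalization is cheap.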
Overall Context Embeddings. We use a convex combination of user, query, and click embeddings as the representation of overall context $E(S_t)$, i.e.,

$$E(S_t) = (1 - \lambda_u - \lambda_c) E(q) + \lambda_u E(u) + \lambda_c E(C_{1:t}), \quad 0 \leq \lambda_u \leq 1, \; 0 \leq \lambda_c \leq 1, \; \lambda_u + \lambda_c \leq 1 \tag{4}$$

This overall context is then treated as the basis for predicting purchased items in $B_{t+1}$. When $\lambda_c = 0$, $C_{1:t}$ is ignored in the prediction and $S_t$ corresponds to the long-term context shown in Figure 1. When $\lambda_u = 0$, user $u$ does not have impact on the final purchase given $C_{1:t}$. This aligns with the short-term context assumption in Figure 1. When $\lambda_u > 0$, $\lambda_c > 0$, $\lambda_u + \lambda_c \leq 1$, both long-term and short-term context are considered and this matches the type of long-short-term context in Figure 1. So by varying the values of $\lambda_u$ and $\lambda_c$, we can use Equation 4 to model different types of context dependency and do comparisons.

Attention Allocation Model for Items. With the overall context collected from the first $t$ pages, we further construct an attentive model to re-rank the products in the candidate set $D_{t+1}$. This re-ranking process can be considered as an attention allocation problem. Given the context that indicates the user's preference and a set of candidate items that have not been shown to the users yet, the item which attracts more user attention will have higher probability to be purchased. The attention weights then act as the basis for re-ranking. Predicting the probability of each candidate item being purchased can be considered as attention allocation for the items. This idea is also similar to the listwise context model proposed by Ai et al. [1]. They extracted the topic model from top-ranked documents with recurrent neural networks and used it as a local context to re-rank the top documents with their attention weights. The attention weights can be computed as:

$$score(i | q, u, C_{1:t}) = \frac{\exp(E(S_t) \cdot E(i))}{\sum_{i' \in D_{t+1}} \exp(E(S_t) \cdot E(i'))} \tag{5}$$

There are many possible values of $t$ even for the same user $u$ if she purchases multiple products on different result pages under query $q$. These are considered as different data entries. Then the log likelihood of observing purchases in $B_{t+1}$ conditioning on $C_{1:t}, u, q$ in our model can be computed as

$$\mathcal{L}(B_{t+1} | C_{1:t}, u, q) = \log p(B_{t+1} | C_{1:t}, u, q) \propto \log \prod_{i \in B_{t+1}} p(i | C_{1:t}, u, q) \propto \sum_{i \in B_{t+1}} \log p(i | C_{1:t}, u, q) \tag{7}$$

The second step can be inferred if we consider whether an item will be purchased to be independent of another item given the context.

According to Equations 5, 6 and 7, we can optimize the conditional log-likelihood directly. A common problem for the softmax calculation is that the denominator usually involves a large number of values and is impractical to compute. However, this is not a problem in our model since we limit the candidate set $D_{t+1}$ to only some top-ranked items retrieved by the initial ranker so that the computation cost is small.

Similar to previous studies [3, 38], we apply L2 regularization on the embeddings of words and users to avoid overfitting. The final optimization goal can be written as

$$\mathcal{L}' = \sum_{u,q,t} \mathcal{L}(B_{t+1} | C_{1:t}, u, q) + \gamma \Big( \sum_{w} \lVert E(w) \rVert^2 + \sum_{u} \lVert E(u) \rVert^2 \Big) = \sum_{u,q,t} \sum_{i \in B_{t+1}} \log \frac{\exp(E(S_t) \cdot E(i))}{\sum_{i' \in D_{t+1}} \exp(E(S_t) \cdot E(i'))} + \gamma \Big( \sum_{w} \lVert E(w) \rVert^2 + \sum_{u} \lVert E(u) \rVert^2 \Big) \tag{8}$$

where $\gamma$ is the hyper-parameter to control the strength of L2 regularization. The function accumulates entries of all the possible user $u$, query $q$, and the valid page number $t$ for pagination which has clicks in and before page $t$ and purchases after that page. All possible words and users are taken into account in the regularization. When we do not incorporate long-term context, the corresponding parts of $u$ are omitted.

The loss function actually captures the loss of a list and this list-wise loss is similar to AttentionRank proposed by Ai et al. [1]. Because of the softmax function, optimizing the probabilities of relevant instances in $B_{t+1}$ simultaneously minimizes the probabilities of the remaining non-relevant instances. This loss shows superiority over other list-wise losses such as ListMLE [43] and SoftRank [36], which is another reason we adopt this loss.

4 EXPERIMENTAL SETUP

In this section, we introduce our experimental settings of context-aware product search. We first describe how we construct the datasets for experiments. Then we describe the baseline methods and evaluation methodology for comparing different methods. We also introduce the training settings for our model.

4.1 Datasets

We randomly sampled three category-specific datasets, namely, "Toys & Games", "Garden & Outdoor", and "Cell Phones & Accessories", from the logs of a commercial product search engine spanning ten months between years 2017 and 2018. We keep only the query sessions with at least one clicked item on any page before the pages with purchased items. These sessions are difficult for the production model since it could not rank the "right" items on the top so that users purchased items in the second or later result pages. Our datasets include up to a few million query sessions containing several hundred thousand unique queries. When there are multiple purchases in a query session across different result pages, purchases until page $t$ are only considered as clicks and used together with other clicks to predict purchases on and after page $t+1$. Statistics of our datasets are shown in Table 1.

Table 1: Statistics of our collected datasets

                        Toys & Games   Garden & Outdoor   Cell Phones & Accessories
Product title length    13.14±6.46     16.39±7.38         22.02±7.34
Vocabulary size         381,620        1,054,980          194,022
Query session splits
  Train                 91.21%         87.36%             86.57%
  Validation            2.61%          3.66%              4.20%
  Test                  6.18%          8.98%              9.23%

4.2 Evaluation Methodology

We divided each dataset into training, validation, and test sets by the date of the query sessions. The sessions that occurred in the first 34 weeks are used for training, the following 2 weeks for validation and the last 4 weeks for testing. Models were trained with data in the training set; hyper-parameters were tuned according to the model performance on the validation set, and evaluation results on

pages. Nonetheless, we can still evaluate the performance of one-shot re-ranking from page $t+1$ given the context collected from the first $t$ pages. In our experiments, we compare different methods for re-ranking from page 2 and page 3 since earlier re-ranking can influence results at higher positions, which have a larger impact on the ranking performance. As in relevance feedback experiments [26, 33], our evaluation is also based on residual ranking, where the first $t$ result pages are discarded and re-ranking of the unseen items is evaluated. We use the residual ranking evaluation paradigm because the results before re-ranking are retrieved by the same initial ranker and identical for all the re-ranking methods.

Similar to other ranking tasks, we use mean average precision (MAP) at cutoff 100, mean reciprocal rank (MRR) and normalized discounted cumulative gain (NDCG) as ranking metrics. MAP measures the overall performance of a ranker in terms of both precision and recall, which indicates the ability to retrieve more purchased items in the next 100 results and rank them at higher positions. MRR is the average inverse rank for the first purchase in the retrieved items. It indicates the expected number of products users need to browse before finding the ones they are satisfied with. NDCG is a common metric for multiple-label document ranking. Although in our context-aware product search, items only have binary labels indicating whether they were purchased given the context, NDCG still shows how good a rank list is with emphasis on results at top positions compared with the ideal rank list. We use NDCG@10 in our experiments.

4.3 Baselines

We compare our short-term context-aware embedding model (SCEM) with four groups of baselines: retrieval models without context, and long-term, short-term and long-short-term context-aware models.

Production Model (PROD). PROD is essentially a gradient boosted decision tree based model. Comparing with this model indicates the potential gain of our model if deployed online. Note that PROD performs worse on our datasets than on the entire search traffic since we extracted query sessions where the purchased items are in the second or later result pages.

Random (RAND). By randomly shuffling the results in the candidate set, which consists of the top unseen retrieved items by the production model, we get the performance of a random re-ranking strategy. This performance should be the lower bound of any reasonable model.

Popularity (POP). In this method, the products in the candidate set are ranked according to how many times they were purchased in the training set. Popularity is an important factor for product search [25] besides relevance.

Query Likelihood Model (QL). The query likelihood model (QL) [28] is a language model approach for information retrieval. It shows the performance of re-ranking without implicit feedback and is only based on the bag-of-words representation. The smoothing parameter $\mu$ in QL was tuned from {10, 30, 50, 100, 300, 500}.

Query Embedding based Model (QEM).
This model scores an the test set were reported for comparison. item by the generative probability of the item given the embedding Since the datasets are static, it is impossible to evaluate the mod- of a query. When λu = 0, λc = 0, CEM is exactly QEM. els in a truly interactive setting where each subsequent page is Long-term Context-aware Relevance Model (LCRM3). Rel- re-ranked based on the observed clicks on the current and previous evance Model Version 3 (RM3) [21] is an effective method for both Leverage Implicit Feedback for Context-aware Product Search SIGIR 2019 eCom, July 2019, Paris, France pseudo and true relevance feedback. It extracts a bag-of-words lan- We implemented our models with Tensorflow. The models were guage model from a set of feedback documents, expands the original trained for 20 epochs with the batch size set to 256. Adam [20] was query with the most important words from the language model, used as the optimizer and the global norm of parameter gradients and retrieve results again with the expanded query. To capture the was clipped at 5 to avoid unstable gradient updates. After each long-term interest of a user, we use RM3 to extract significant words epoch, the model was evaluated on the validation set and the model from titles of the user’s historical purchased products and refine the with the best performance on the validation set was selected to retrieval results for the user in the test set with the expanded query. be evaluated on the test set. The initial learning rate was selected The weight of the initial query was tuned from {0, 0.2, · · · , 1.0} from {0.01, 0.005, 0.001, 0.0005, 0.0001}. L2 regularization strength and the expansion term count was tuned from {10, 20, · · · , 50}. The γ was tuned from 0.0 to 0.005. λq , λu in Equation 4 were tuned from effect of query weight is shown in Section 5.2. {0, 0.2, · · · , 0.8, 1.0} (λq + λu ≤ 1) to represent various dependency Long-term Context-aware Embedding Model (LCEM). 
When assumptions mentioned in Section 3.2, and the embedding size were λc = 0, 0 < λu ≤ 1, CEM becomes LCEM by considering long-term scanned from {50, 100, · · · , 300}. The effect of λq , λu and embedding context indicated by universal user representations. size are shown in Section 5. Short-term Context-aware Relevance Model (SCRM3). We also use RM3 to extract the user preference behind a query from 5 RESULTS AND DISCUSSION the clicked items in the previous SERPs as short-term context and In this section, we show the performance of the four types of models refine the next SERP. This method uses the same information as our mentioned in Section 4.3. First, we compare the overall retrieval short-term context-aware embedding model, but it represents user performance of various types of models in Section 5.1. Then we preference with a bag-of-words model and only -consider word further study the effect of queries, long-term context and embedding exact match between a candidate item and the user preference size on each model in the following subsections. model. The query weight and expansion term count were tuned in the same range as LC-RM3 and the influence of initial query weight 5.1 Overall Retrieval Performance can be found in Section 5.2. 2 Table 2 shows the performance of different methods on re-ranking Long-short-term Context-aware Embedding Model (LSCEM). items when users paginate to the second and third SERP for Toys & When λu > 0, λc > 0, 0 < λu + λc ≤ 1, both long-term context Games, Garden & Outdoor and Cell Phones & Accessories. Among represented by u and short-term context indicated by Ct are taken all the methods, SCEM and SCRM3 perform better than all the into account in CEM. 
other baselines without using short-term context, including their PROD, RAND, POP, QL, and QEM are retrieval models that rank corresponding retrieval baseline, QEM, and QL respectively, and items based on queries and do not rely on context or user informa- PROD which considers many additional features, showing the ef- tion. These models can be used as the initial ranker for any queries. fectiveness of incorporating short-term context. The second type of rankers consider users’ long-term interests to- In contrast to the effectiveness of short-term context, long-term gether with queries, such as LCEM and LCRM3. These methods context does not help much when combined with queries alone or utilize users’ historical purchases but can only be applied to users together with short-term context. LCRM3 outperforms QL on all who appear in the training set. The third type is feedback models the datasets by a small margin when users’ historical purchases which take users’ clicks in the query session as short-term context are used to represent their preferences. LCEM and LSCEM always and this category includes SCRM3 and our SCEM. In this approach, perform worse than QEM and SCEM by incorporating long-term user identities are not needed. However, they can only be applied context with λu > 0. Note that since only a small portion of users in to search sessions where users click on results and only items from the test set appear in the training set, the re-ranking performance the second result page or later can be refined with the clicks. The of most query sessions in the test set will not be affected. We will fourth category considers both long and short-term context, e.g., elaborate on the effect of long-term context in Section 5.3. LSCEM. The second, third and fourth groups of baseline correspond We found that neural embedding methods are more effective to the dependency assumptions shown in the first, second and third than word-based baselines. 
When implicit feedback is not incorpo- sub-figure in Figure 1 respectively. rated, QEM performs significantly better than QL, sometimes even better than PROD. When clicks are used as context, with neural 4.4 Model Training embeddings, SCEM is much more effective than SCRM3. This shows Query sessions with multiple purchases on different pages are split that semantic match is more beneficial than exact word match for into sub-sessions, one for each page with a purchase. When there top retrieved items in product search. In addition, these embeddings are more than three sub-sessions for a given session, we randomly also carry the popularity information since items purchased more select three in each training epoch. We do so to avoid skewing the in the training data will get more gradients during training. Due dataset with sessions with many purchases. Likewise, we randomly to our model structure, there are also properties that the embed- select five clicked items for constructing short-term context if there dings of items purchased under similar queries or context will be are more than five clicked items in a query session. more alike compared with non-purchased items, and embeddings of clicked and purchased items are also similar. 2 We also implemented the embedding-based relevance model (ERM) [46], which is The relative improvement of SCEM and SCRM3 compared to an extension of RM3 by taking semantic similarities between word embeddings into the production model on Toys & Games is less than the other account, as a context-aware baseline. But it does not perform better than RM3 across different settings. So we did not include it. 3 Due the confidentiality policy, the absolute value of each metric can not be revealed. SIGIR 2019 eCom, July 2019, Paris, France Keping Bi1 , Choon Hui Teo2 , Yesh Dattatreya2 , Vijai Mohan2 , W. 
Bruce Croft1 Table 2: Comparison of baselines and our short-term context embedding model (SCEM) on re-ranking when users paginate to the 2nd and 3rd page. The number is the relative improvement of each method compared with the production model (PROD)3 . ‘− ’ indicates significant worse of each baseline compared with SCEM in student t-test with p ≤ 0.001. Note that difference larger than 3% is approximately significant. The best performance in each column is marked in bold. Performance of Re-ranking from the 2nd Page Toys & Games Garden & Outdoor Cell Phones & Accessories Model MAP MRR N DCG@10 MAP MRR N DCG@10 MAP MRR N DCG@10 PROD 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− RAND -25.70% − -26.83% − -29.23% − -23.40% − -24.16%− -25.73%− -20.15% − -20.93% − -22.73%− POP -15.82%− -15.90%− -17.87%− -9.38%− -9.51%− -9.55%− -8.54%− -8.25%− -11.12%− QL -25.78% − -27.80% − -29.73% − -19.62% − -20.78%− -21.63%− -16.14% − -16.77% − -18.00%− QEM -2.57%− -3.10%− -3.85%− +0.65%− -0.34%− +1.06%− +9.96%− +9.73%− +10.58%− LCRM3 -24.82%− -25.92%− -28.60%− -19.33%− -20.45%− -21.28%− -15.44%− -16.07%− -17.38%− LCEM -2.57%− -3.10%− -3.85%− +0.65%− -0.34%− +1.06%− +9.96%− +9.73%− +10.58%− SCRM3 +12.93% − +9.63% − +9.53% − +25.15% − +23.01% − +23.15%− +18.65%− +16.77%− +17.11%− SCEM +26.59% +24.56% +26.20% +37.43% +35.16% +37.22% +48.99% +47.00% +50.18% LSCEM +26.59% +24.56% +26.20% +37.43% +35.16% +37.22% +48.99% +47.00% +50.18% Performance of Re-ranking from the 3rd Page Toys & Games Garden & Outdoor Cell Phones & Accessories Model MAP MRR N DCG@10 MAP MRR N DCG@10 MAP MRR N DCG@10 PROD 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− 0.00%− RAND -15.45%− -17.97%− -18.96%− -12.29%− -13.71%− -13.97%− -8.75%− -10.05%− -9.55%− POP -4.37%− -5.31%− -5.18%− 2.09%− 1.43%− 3.49%− -0.78%− -1.21%− -1.43%− QL -14.87%− -18.31%− -19.20%− -9.15%− -10.97%− -10.37%− -4.05%− -5.21%− -3.62%− QEM +12.83%− +11.07%− +14.13%− +15.82%− +14.42%− +19.32%− +28.85% − +27.60% − 
+33.92%− LCRM3 -13.99% − -15.82% − -17.20% − -9.02% − -10.73%− -10.04%− -3.26% − -4.48% − -2.85%− LCEM +12.83%− +11.07%− +14.13%− +15.82%− +14.42%− +19.32%− +28.85%− +27.60%− +33.92%− SCRM3 +34.26%− +29.27%− +32.86%− +49.54%− +46.60%− +51.20%− +44.52%− +41.16%− +46.98%− SCEM +51.46% +47.57% +54.77% +63.79% +60.43% +67.79% +85.51% +81.72% +93.85% LSCEM +51.46% +47.57% +54.77% +63.79% +60.43% +67.79% +85.51% +81.72% +93.85% two datasets. There are two possible reasons. First, the production model. In addition, most candidate products are consistent with the model performs better on Toys & Games, compared with Garden & query intent but the final purchase depends on users’ preference. Outdoor, and Cell Phones & Accessories, which can be seen from Popularity, as an important factor that consumers will consider, can the larger advantages compared with random re-ranking. Second, improve the performance upon QL. However, it is still worse than the average clicks in the first two and three SERPs in Toys & Games the production model most of the time. are less than the other two datasets 4 , thus SCEM and SCRM3 can perform better with more implicit feedback information. 5.2 Effect of Short-term Context The relative performance of all the other methods against PROD We investigate the influence of short-term context by varying the is better when re-ranking from page 2 compared with re-ranking value of λc with λu set to 0. The performance of SCRM3 and SCEM from page 3 in terms of all three metrics. Several reasons are shown varies as the interpolation coefficient of short-term context changes as follows. When purchases happen in the third page or later, it since only these two methods utilize the clicks. 
Since re-ranking usually means users cannot find the “right” products in the first from the second or third pages on Toys & Games, Garden and Mo- two pages, which further indicates the production model is worse bile all show similar trends, we only report performance of each for these query sessions. In addition, the ranking quality of PROD method in the setting of re-ranking from second pages on Toys on the third page is worse than on the second page. Another reason & Games, which is shown in Figure 3a. Figure 3a shows that as that SCRM3 and SCEM improve more upon PROD when re-ranking the weight of clicks is set larger, the performance of SCRM3 and from page 3 is that more context becomes available with clicks SCEM goes up consistently. When λc is set to 0, SCRM3 and SCEM collected in the second page and makes the user preference model degenerate to QL and QEM respectively which do not incorporate more robust. short-term context. From another perspective, SCRM3 and SCEM QL performs similarly to RAND on Toys & Games and a little degrade in performance as we increase the weight on queries. For better than RAND on Garden & Outdoor, and Cell Phones & Ac- exact word match based methods, more click signals lead to more cessories, which indicates that relevance captured by exact word improvements for SCRM3, which is also consistent with the fact matching is not the key concern in the rank lists of the production that QL performs similarly to RAND by only considering queries. 4 The specific number of average clicks in the datasets cannot be revealed due to the For embedding-based methods which capture semantic match and confidentiality policy. 
popularity, QEM with queries alone performs similarly to PROD Leverage Implicit Feedback for Context-aware Product Search SIGIR 2019 eCom, July 2019, Paris, France PROD POP QEM SCRM3 PROD POP LCEM SCRM3 30 PROD POP LCEM SCRM3 PROD POP QEM SCRM3 30 RAND QL LCRM3 SCEM 30 RAND QL LCRM3 LSCEM RAND QL LCRM3 LSCEM 30 RAND QL LCRM3 SCEM 20 20 20 20 M AP Lif t (%) M AP Lif t (%) M AP Lif t (%) M AP Lif t (%) 10 10 10 10 0 0 0 0 −10 −10 −10 −10 −20 −20 −20 −20 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 50 100 150 200 250 300 λc λu λu Embedding Size (a) λc (b) λu (c) λu on seen users (d) Embedding size Figure 3: The effect of λc , λu , embedding size on the performance of each model in the collection of Toys & Games when re-ranking from the second SERP for the scenarios where users paginate to page 2. but much better when more context information is incorporated This finding is different from the observation in HEM proposed in SCEM. This indicates that users’ clicks already cover the query by Ai et al. [3], which incorporates user embeddings as users’ long- intent, and also contain additional users’ preference information. term preferences and achieves superior performance compared to not using user embeddings. We hypothesize that this inconsis- 5.3 Effect of Long-term Context tent finding is due to the differences in datasets. HEM was experi- mented on a dataset that is heavily biased to users with multiple Next we study the effect of long-term context indicated by users’ purchases and under a rather simplistic assumption of query gen- global representations E (u) both with and without incorporating eration, where the terms from the category hierarchy of a product short-term context. QEM and LCRM3 only use queries and user are concatenated as the query string. 
Their datasets contain only historical transactions for ranking; LSCEM uses long and short-term hundreds of unique queries and tens of thousands items that are all context (λu + λc is fixed as 1 since we found that query embeddings purchased by multiple users. In contrast, we experimented on the do not contribute to the re-ranking performance when short-term real queries and corresponding user behavior data extracted from context is incorporated). Toys & Games is used again to show the search log. The number of unique queries and items in our experi- sensitivity of each model in terms of λu under the setting of re- ments are hundreds times larger than in their dataset. There is also ranking from the second page. Since there are users in the test set little overlap of users in the training and test set in our datasets, which never appear in the training set, λu does not take effect due to while in their experiments, all the users in the test set are shown in the null representations for these unknown users. In Toys & Games, the training set. only about 13% of all the query sessions in the test set are from users who also appear in the training set. The performance change 5.4 Effect of Embedding Size on the entire test set will be smaller due to the low proportion the models can effect in the test set, so we also include the performance Figure 3d shows the sensitivity of each model in terms of embedding of each model on the subset of data entries associated with users size on Toys & Games, which presents similar trends to the other seen in the training set. Figure 3b and 3c show how each method two datasets. Generally, SCEM and QEM are not sensitive to the performs on the whole test set and the subset respectively with embedding size as long as it is in a reasonable range. To keep the different λu . 
model effective and simple, we use 100 as the embedding size and Figure 3b and 3c show that for LSCEM, as λu becomes larger, report experimental results under this setting in Table 2 and the performance goes down. This indicates that when short-term con- other figures. texts are used, users’ embeddings act like noise and drag down the re-ranking performance. λu has different impacts on the models 6 CONCLUSION AND FUTURE WORK not using clicks. For LCRM3, when we zoom in to only focus on We reformulate product search as a dynamic ranking problem where users that appear in the training set, the performance changes and leverage users’ implicit feedback on the presented products as short- the superiority over QL are more noticeable. The best value of MAP term context and refine the ranking of remaining items when the is achieved when λu = 0.8, which means long-term context benefit users request the next result pages. We then propose an end-to- word-based models with additional information, which can be help- end context-aware neural embedding model to represent various ful for solving the word mismatch problem. In contrast, for LCEM, context dependency assumptions for predicting purchased items. with non-zero λu , it performs worse than only considering queries. Our experimental results indicate that incorporating short-term Embedding models already capture semantic similarities between context is more effective than using long-term context or not using words. In addition, as we mentioned in Section 5.1, they also carry context at all. It is also shown that our neural context-aware model information about popularity since the products purchased more performs better than the state-of-art word-based feedback models. often under the query will get more credits during training. An- For future work, there are several research directions. 
First, it other possible reason is that the number of customers with sessions would be better to evaluate our short-term context re-ranking model of similar intent is low so that the user embedding is misguiding online, in an interactive setting as each result page can be re-ranked the query sessions. Thus, users’ long-term interests do not bring dynamically. Second, other information sources such as images additional information to further improve LCEM on the collections. and price can be included to extract user preferences from their SIGIR 2019 eCom, July 2019, Paris, France Keping Bi1 , Choon Hui Teo2 , Yesh Dattatreya2 , Vijai Mohan2 , W. Bruce Croft1 feedback. Third, we are interested in the use of negative feedback 40th International ACM SIGIR Conference. ACM, 475–484. such as “skips” that can be identified reliably based on subsequent [20] Diederik P Kingma and Jimmy Lei Ba. 2014. Adam: Amethod for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations. user actions. [21] Victor Lavrenko and W Bruce Croft. 2017. Relevance-based language models. In ACM SIGIR Forum, Vol. 51. ACM, 260–267. [22] Beibei Li, Anindya Ghose, and Panagiotis G Ipeirotis. 2011. Towards a theory ACKNOWLEDGMENTS model for product search. In Proceedings of the 20th international conference on This work was supported in part by the Center for Intelligent In- World wide web. ACM, 327–336. [23] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. formation Retrieval and in part by NSF IIS-1715095. Any opinions, Neural attentive session-based recommendation. In Proceedings of the 2017 ACM findings and conclusions or recommendations expressed in this on Conference on Information and Knowledge Management. ACM, 1419–1428. material are those of the authors and do not necessarily reflect [24] Soon Chong Johnson Lim, Ying Liu, and Wing Bun Lee. 2010. 
Multi-facet product information search and retrieval using semantically annotated product family those of the sponsor. ontology. Information Processing & Management 46, 4 (2010), 479–493. [25] Bo Long, Jiang Bian, Anlei Dong, and Yi Chang. 2012. Enhancing product search by best-selling prediction in e-commerce. In Proceedings of the 21st ACM CIKM REFERENCES Conference. ACM, 2479–2482. [1] Qingyao Ai, Keping Bi, Jiafeng Guo, and W Bruce Croft. 2018. Learning a Deep [26] Yuanhua Lv and ChengXiang Zhai. 2009. Adaptive relevance feedback in infor- Listwise Context Model for Ranking Refinement. arXiv preprint arXiv:1804.05936 mation retrieval. In Proceedings of the 18th ACM conference on Information and (2018), 135–144. knowledge management. ACM, 255–264. [2] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Un- [27] Nish Parikh and Neel Sundaresan. 2011. Beyond relevance in marketplace search. biased Learning to Rank with Unbiased Propensity Estimation. arXiv preprint In Proceedings of the 20th ACM CIKM Conference. ACM, 2109–2112. arXiv:1804.05938 (2018). [28] Jay M Ponte and W Bruce Croft. 1998. A language modeling approach to in- [3] Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and W Bruce Croft. 2017. formation retrieval. In Proceedings of the 21st annual international ACM SIGIR Learning a hierarchical embedding model for personalized product search. In conference. ACM, 275–281. Proceedings of the 40th International ACM SIGIR Conference. ACM, 645–654. [29] Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence- [4] Olivier Chapelle and Ya Zhang. 2009. A dynamic bayesian network click model Aware Recommender Systems. ACM Comput. Surv. (2018). for web search ranking. In Proceedings of the 18th international conference on [30] Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. World wide web. ACM, 1–10. 2017. 
Personalizing session-based recommendations with hierarchical recurrent [5] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experi- neural networks. In Proceedings of the Eleventh ACM Conference on Recommender mental comparison of click position-bias models. In Proceedings of the 2008 WSDM Systems. ACM, 130–137. Conference. ACM, 87–94. [31] Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Guido Zuccon. 2016. General- [6] Wei Di, Anurag Bhardwaj, Vignesh Jagadeesh, Robinson Piramuthu, and Elizabeth izing translation models in the probabilistic relevance framework. In Proceedings Churchill. 2014. When relevance is not enough: Promoting visual attractiveness of the 25th ACM CIKM conference. ACM, 711–720. for fashion e-commerce. arXiv preprint arXiv:1406.3561 (2014). [32] Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. [7] Huizhong Duan, ChengXiang Zhai, Jinxing Cheng, and Abhishek Gattani. 2013. 2011. Fast context-aware recommendations with factorization machines. In A probabilistic mixture model for mining and analyzing product search log. In Proceedings of the 34th international ACM SIGIR Conference. ACM, 635–644. Proceedings of the 22nd ACM international conference on Information & Knowledge [33] Joseph John Rocchio. 1971. Relevance feedback in information retrieval. The Management. ACM, 2179–2188. Smart retrieval system-experiments in automatic document processing (1971). [8] Huizhong Duan, ChengXiang Zhai, Jinxing Cheng, and Abhishek Gattani. 2013. [34] Khalid Saleh. 2018. Global Online Retail Spending - Statistics and Trends. https:// Supporting keyword search in product database: a probabilistic approach. Pro- www.invespcro.com/blog/global-online-retail-spending-statistics-and-trends/ ceedings of the VLDB Endowment 6, 14 (2013), 1786–1797. [35] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model [9] Georges E Dupret and Benjamin Piwowarski. 2008. 
A user browsing model to for automatic indexing. Commun. ACM 18, 11 (1975), 613–620. predict search engine click data from past observations.. In Proceedings of the [36] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. 2008. Softrank: 31st annual international ACM SIGIR conference. ACM, 331–338. optimizing non-smooth rank metrics. In Proceedings of the 2008 International [10] Krista Garcia. 2018. More Product Searches Start on Amazon. Conference on Web Search and Data Mining. ACM, 77–86. https://retail.emarketer.com/article/more-product-searches-start-on-amazon/ [37] Bartłomiej Twardowski. 2016. Modelling contextual information in session-aware 5b92c0e0ebd40005bc4dc7ae recommender systems with neural networks. In Proceedings of the 10th ACM [11] Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, and Mohan Kankan- Conference on Recommender Systems. ACM, 273–276. halli. 2018. Multi-modal preference modeling for product search. In 2018 ACM [38] Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2016. Learning Multimedia Conference on Multimedia Conference. ACM, 1865–1873. latent vector spaces for product search. In Proceedings of the 25th ACM CIKM [12] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Conference. ACM, 165–174. 2015. Session-based recommendations with recurrent neural networks. arXiv [39] Damir Vandic, Flavius Frasincar, and Uzay Kaymak. 2013. Facet selection algo- preprint arXiv:1511.06939 (2015). rithms for web product search. In Proceedings of the 22nd ACM CIKM Conference. [13] Balázs Hidasi and Domonkos Tikk. 2016. General factorization framework for ACM, 2327–2332. context-aware recommendations. Data Mining and Knowledge Discovery 30, 2 [40] Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc (2016), 342–371. Najork. 2018. Position bias estimation for unbiased learning to rank in personal [14] Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. 
Reinforce- search. In Proceedings of the Eleventh ACM WSDM Conference. ACM, 610–618. ment Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, [41] Chen Wu and Ming Yan. 2017. Session-aware information embedding for e- and Application. arXiv preprint arXiv:1803.00710 (2018). commerce product recommendation. In Proceedings of the 2017 ACM on Conference [15] Bernard J Jansen and Paulo R Molina. 2006. The effectiveness of Web search on Information and Knowledge Management. ACM, 2379–2382. engines for retrieving relevant ecommerce links. Information Processing & Man- [42] Liang Wu, Diane Hu, Liangjie Hong, and Huan Liu. 2018. Turning Clicks into agement 42, 4 (2006), 1075–1098. Purchases: Revenue Optimization for Product Search in E-Commerce. (2018). [16] Gawesh Jawaheer, Peter Weller, and Patty Kostkova. 2014. Modeling user pref- [43] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise erences in recommender systems: A classification framework for explicit and approach to learning to rank: theory and algorithm. In Proceedings of the 25th implicit user feedback. ACM Transactions on Interactive Intelligent Systems (TiiS) international conference on Machine learning. ACM, 1192–1199. 4, 2 (2014), 8. [44] Jun Yu, Sunil Mohan, Duangmanee Pew Putthividhya, and Weng-Keen Wong. [17] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2014. Latent dirichlet allocation based diversified retrieval for e-commerce search. 2017. Accurately interpreting clickthrough data as implicit feedback. In ACM In Proceedings of the 7th ACM WSDM Conference. ACM, 463–472. SIGIR Forum, Vol. 51. Acm, 4–11. [45] Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information [18] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual learning-to-rank with biased feedback. 
In Proceedings of the Tenth ACM Interna- International Conference on Machine Learning. ACM, 1201–1208. tional Conference on Web Search and Data Mining. ACM, 781–789. [46] Hamed Zamani and W Bruce Croft. 2016. Embedding-based query language [19] Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. 2017. models. In Proceedings of the 2016 ACM ICTIR conference. ACM, 147–156. On application of learning to rank for e-commerce search. In Proceedings of the
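As a supplementary illustration of the attention-based scoring in Equation 5 and the regularized objective in Equation 8, the following is a minimal pure-Python sketch for a single (u, q, t) entry. It is not the implementation used in the paper (which was written in TensorFlow). The function names `attention_scores` and `session_loss`, the toy embeddings, and the choice to return a quantity to be minimized (negative log-likelihood plus the L2 penalty) are assumptions made for illustration only.

```python
import math


def attention_scores(context_emb, item_embs):
    """Softmax over candidate items of the dot product E(S_t) . E(i) (cf. Equation 5)."""
    logits = [sum(c * x for c, x in zip(context_emb, item)) for item in item_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def session_loss(context_emb, item_embs, purchased, reg_embs, gamma):
    """Loss for one (u, q, t) entry, written as a quantity to minimize (assumption):
    negative log-likelihood of the purchased items plus an L2 penalty."""
    scores = attention_scores(context_emb, item_embs)
    nll = -sum(math.log(scores[i]) for i in purchased)
    # reg_embs stands in for the word and user embeddings that are regularized.
    penalty = gamma * sum(v * v for emb in reg_embs for v in emb)
    return nll + penalty


# Hypothetical toy data: a 2-d context vector and three candidate items.
context = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
scores = attention_scores(context, candidates)
# Re-rank candidate indices by descending attention weight.
ranking = sorted(range(len(candidates)), key=lambda i: -scores[i])
```

Because the scores are normalized over the candidate set, raising the score of a purchased item necessarily lowers the scores of the other candidates, which is the list-wise property discussed for the loss.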
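The residual ranking evaluation of Section 4.2 can be sketched as follows, under stated assumptions: items already shown on the first t pages are removed from the ranked list, and MRR, average precision at cutoff 100 and NDCG@10 are computed on the remaining items with binary purchase labels. The helper names and the toy session are hypothetical, and normalizing average precision by the number of purchased items in the residual list is one common convention, not necessarily the one used in the paper.

```python
import math


def residual_labels(ranked_items, seen_items, purchased_items):
    """Drop items shown on earlier pages, then map the rest to binary purchase labels."""
    return [1 if item in purchased_items else 0
            for item in ranked_items if item not in seen_items]


def reciprocal_rank(labels):
    """Inverse rank of the first purchased item, or 0 if there is none."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(labels, cutoff=100):
    """Average precision at a cutoff, normalized by all purchased items (assumption)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels[:cutoff], start=1):
        if rel:
            hits += 1
            total += hits / rank
    n_rel = sum(labels)
    return total / n_rel if n_rel else 0.0


def ndcg(labels, cutoff=10):
    """NDCG with binary gains and a log2 position discount."""
    def dcg(ls):
        return sum(rel / math.log2(rank + 1)
                   for rank, rel in enumerate(ls[:cutoff], start=1))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0


# Hypothetical session: items "a" and "b" were shown on page 1; "d" is purchased later.
labels = residual_labels(["a", "b", "c", "d", "e"], {"a", "b"}, {"d"})
rr, ap, gain = reciprocal_rank(labels), average_precision(labels), ndcg(labels)
```

Averaging these per-session values over all test sessions gives MRR, MAP and NDCG@10; the paper reports them only as relative improvements over PROD.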
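The per-epoch sampling described in Section 4.4 (at most three sub-sessions per query session and at most five clicked items for the short-term context) can be sketched with the standard library. The function name `sample_at_most`, the session layout, and the fixed seed are assumptions for illustration; the paper does not specify how the random selection was implemented.

```python
import random


def sample_at_most(items, limit, rng):
    """Return all items if there are at most `limit`, else a random subset of size `limit`."""
    items = list(items)
    if len(items) <= limit:
        return items
    return rng.sample(items, limit)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Hypothetical session with four purchase pages and seven clicked items.
sub_sessions = ["page2", "page3", "page4", "page5"]
clicked_items = ["i1", "i2", "i3", "i4", "i5", "i6", "i7"]

# Re-drawn in every training epoch so that different subsets are seen over time.
epoch_sub_sessions = sample_at_most(sub_sessions, 3, rng)
epoch_context = sample_at_most(clicked_items, 5, rng)
```

Drawing a fresh subset each epoch keeps sessions with many purchases or clicks from dominating a single pass over the data while still exposing all of their sub-sessions and clicks across epochs.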