Leverage Implicit Feedback for Context-aware Product Search
                      Keping Bi1 , Choon Hui Teo2 , Yesh Dattatreya2 , Vijai Mohan2 , W. Bruce Croft1
                             1 Center for Intelligent Information Retrieval, University of Massachusetts Amherst

                                                                  {kbi,croft}@cs.umass.edu
                                                                    2 Search Labs, Amazon

                                                            {choonhui,ydatta,vijaim}@amazon.com

ABSTRACT                                                                              user purchases that depend on both product relevance and customer
Product search serves as an important entry point for online shop-                    preferences. Previous research on product search [7, 8, 19, 38, 42]
ping. In contrast to web search, the retrieved results in product                     focused on product relevance. Several attempts [27, 44] were also
search not only need to be relevant but also should satisfy cus-                      made to improve customer satisfaction by diversifying search re-
tomers’ preferences in order to elicit purchases. Previous work has                   sults. Ai et al. [3] introduced a personalized ranking model which
shown the efficacy of purchase history in personalized product                        takes the users’ preferences learned from their historical reviews
search [3]. However, customers with little or no purchase history                     together with the queries as the basis for ranking. However, their
do not benefit from personalized product search. Furthermore, pref-                   work has several limitations. First, the personalized model cannot
erences extracted from a customer’s purchase history are usually                      cope with situations such as users who have not logged in while
long-term and may not always align with her short-term interests.                     searching and thus cannot be identified, users who have logged in but
Hence, in this paper, we leverage clicks within a query session, as                   do not have enough purchase history, and a single account being
implicit feedback, to represent users’ hidden intents, which further                  shared by several family members. In these cases, user purchase
act as the basis for re-ranking subsequent result pages for the query.                records are either unavailable or contain substantial noise. Sec-
Modeling user preference with implicit feedback has been studied                      ond, given a specific purchase need expressed as a search query,
extensively in recommendation tasks. However, there has been                          long-term behaviors may not be as informative to indicate the user’s
little research on modeling users’ short-term interest in product                     preferences as short-term behaviors such as interactions with the
search. We study whether short-term context could help promote                        retrieved results. These limitations of existing work on product
users’ ideal item in the following result pages for a query. Further-                 search motivate us to model customers’ preferences based on their
more, we propose an end-to-end context-aware embedding model                          interactions with search results, which do not require additional
which can capture long-term and short-term context dependencies.                      customers’ information or their purchase history.
Our experimental results on the datasets collected from the search                       Customers’ interactions with search results such as clicks can
log of a commercial product search engine show that short-term                        be considered as implicit feedback based on their preferences. In
context leads to much better performance compared with long-term                      information retrieval (IR), there are extensive studies on how to
and no context. Our results also show that our proposed model is                      use users’ feedback on the relevance of top retrieved documents to
more effective than word-based context-aware models.                                  abstract a topic model and retrieve more relevant results [21, 33, 46].
                                                                                      These feedback techniques were shown to be very effective and can
KEYWORDS                                                                              also be applied to use implicit feedback such as clicks. In contrast
                                                                                      to document retrieval, where a user’s information need can usually
Implicit Feedback, Product Search, Context-aware Search
                                                                                      be satisfied by a single click on a relevant result, we observe that,
1 INTRODUCTION                                                                        in product search, users tend to paginate to browse more products
                                                                                      and make comparisons before they make final purchase decisions.
Online shopping has become an important part of people’s daily
                                                                                      In about 5% to 15% of search traffic, users browse and click results
life in recent years. In 2017, e-commerce represented 8.2% of global
                                                                                      in the previous pages and purchase items in the later result pages.
retail sales (2,197 billion dollars); 46.4% of internet users shop online
                                                                                      This provides us with the chance to collect user clicks more easily,
and nearly one-fourth of them do so at least once a week [34].
                                                                                      based on which results shown in the next page can be tailored to
Product search engines have become an important starting point
                                                                                      meet the users’ preferences. We reformulate product search as a
for online shopping. A number of consumer surveys have shown
                                                                                      dynamic ranking problem, where instead of one-shot ranking based
that more online shoppers started searches on e-commerce search
                                                                                      on the query, the unseen products are re-ranked dynamically when
engines (e.g., Amazon) rather than a generic web search engine
                                                                                      users paginate to the next search result page (SERP) based on their
(e.g., Google) [10].
                                                                                      implicit feedback collected from previous SERPs.
    In contrast to document retrieval, where relevance is a universal
                                                                                         Traditional relevance feedback (RF) methods, which extract word-
 evaluation criterion, a product search system is evaluated based on
                                                                                      based topic models from feedback documents as an expansion to the
Copyright © 2019 by the paper’s authors. Copying permitted for private and academic   original queries, have potential word mismatch problems despite
purposes.
                                                                                      their effectiveness [31, 46]. To tackle this problem, we propose an
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.):
Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at   end-to-end context-aware embedding model that can incorporate
http://ceur-ws.org                                                                    both long-term and short-term context to predict purchased items.
                                                                                      In this way, semantic match and the co-occurrence relationship
SIGIR 2019 eCom, July 2019, Paris, France                      Keping Bi1 , Choon Hui Teo2 , Yesh Dattatreya2 , Vijai Mohan2 , W. Bruce Croft1


between clicked and purchased items are both captured in the em-           vector space model which matches queries and products in the se-
beddings. We show the effectiveness of incorporating short-term            mantic space. The latent vectors of products and words are learned
context against baselines using both no short-term context and             in an unsupervised way, where vectors of n-grams in the descrip-
word-based context.                                                        tion and reviews of the product are used to predict the product.
   In this paper, we leverage implicit feedback as short-term context      Later, Ai et al. [3] built a hierarchical embedding model in which
to provide users with more tailored search results. We first reformu-      learned representations of users, queries, and products are used to
late product search as a dynamic ranking problem, i.e., when users         predict product purchases and associated reviews.
request next SERPs, the remaining unseen results will be re-ranked.            Other aspects of product search such as popularity, visual prefer-
We then introduce several context dependency assumptions for the           ence and diversity have also been studied. Li et al. [22] investigated
task, and propose an end-to-end context-aware neural embedding             product retrieval from an economic perspective. Long et al. [25]
model that can represent each assumption by changing the coef-             predicted sales volume of items based on their transaction history
ficients to combine long-term and short-term context. We further           and incorporated this complementary signal with relevance for prod-
investigated the effect of several factors in the task: short-term con-    uct ranking. The effectiveness of images for product search was
text, long-term context, and neural embeddings. Our experimental           also investigated [6, 11]. To satisfy different users’ intents behind
results on the datasets collected from search logs of a commercial         the same query, efforts on improving result diversity in product
product search engine showed that incorporating short-term con-            retrieval have also been made [27, 44].
text leads to better performance compared with long-term context               In terms of labels for training, there are studies on using clicks
and no context, and embedding-based models perform better than             as an implicit feedback signal. Wu et al. [42] jointly modeled clicks
word-based methods in the task under various settings.                     and purchases in a learning-to-rank framework in order to opti-
   Our contributions can be summarized as follows: (1) we refor-           mize the gross merchandise volume. To model clicks, they consider
mulate conventional one-shot ranking to dynamic ranking (i.e.,             click-through rate of an item for a given query in a set of search
multi-page search) based on user clicks in product search, which           sessions as the signal for training. Karmaker Santu et al. [19] com-
has not been studied before; (2) we introduce different context            pared the different effects of exploiting click-rate, add-to-cart ratios,
dependency assumptions and propose a simple yet effective end-to-          order rates as labels. They experimented on multiple representative
end embedding model to capture different types of dependency; (3)          learning to rank models in product search with various settings.
we investigate different aspects in the dynamic ranking task on real       Our work also uses clicks as implicit feedback signals, but instead
search log data and confirmed the effectiveness of incorporating           of aggregating all the clicks under the same query to get click-
short-term context and neural embeddings. Our study on multi-              through rate, we consider the clicks associated with each query as
page product search indicates that this is a promising direction and       an indicator of the user’s short-term preference behind that query.
worth more attention.                                                          Most previous work treats product search as a one-shot ranking
                                                                           problem, where given a query, static results are shown to users
                                                                           regardless of their interaction with the result lists. In a different ap-
2     RELATED WORK                                                         proach, Hu et al. [14] formulate the user behaviors during searching
Next, we review three lines of research related to our work: prod-         products as a Markov decision process (MDP) and use reinforcement
uct search, session-aware recommendation, and user feedback for            learning to optimize the accumulative gain (expected price) of user
information retrieval.                                                     purchases. They define the states in the MDP to be a non-terminal
                                                                           state, from where users continue to browse, and two terminal states,
                                                                           i.e. purchases happen (conversion events) or users abandon the re-
2.1    Product Search                                                      sults (abandon events). Their method is essentially online learning
Product search has different characteristics compared with gen-            and refines the ranking model with large-scale users’ behavior data.
eral web search; product information is usually more structured            Although we work on a similar scenario where the results shown
and the evaluation is usually based on purchases rather than clicks. In    in the next page can be revised, they gradually refine an overall ranker
2006, Jansen and Molina [15] noted that the links retrieved by an          that affects all the queries while our model revises results for each
e-commerce search engine are significantly better than those ob-           individual query based on the estimation of the user preference
tained from general search engines. Since the basic properties of          under the query. Another difference is that they only consider pur-
products such as brands, categories and price are well-structured,         chases as a deferred signal for training and do not use any clicks
considerable work has been done on searching products based on             in the process. In contrast, we treat clicks as an indicator of user
facets [24, 39]. However, user queries are usually in natural lan-         preferences and refine ranking conditioned on the preferences.
guage and hard to structure. To support keyword search, Duan
et al. [7, 8] extended the Query Likelihood method [28] by consid-
ering the query generated from a mixture of the language model of
background corpus and the language model of the products condi-             2.2   Session-aware Recommendation
tioned on their specifications. The ranking function constructed in         In session-aware recommendation, a user’s interactions with the
this approach utilizes exact word matching information whereas              previously seen items in the session are used for recommending
vocabulary mismatch between free-form user queries and prod-                the next item. Considerable research on session-aware recommen-
uct descriptions or reviews from other users can still be an issue.         dation has been done in the application domains such as news,
Van Gysel et al. [38] noticed this problem and introduced a latent          music, movies and products. Many of these works are based on matrix
factorization [13, 16, 32]. More recently, session-aware recommendation
approaches based on neural networks have shown superior
performance. Hidasi et al. [12] model the clickstream in a session
with Gated Recurrent Unit (GRU) and predict the next item to
recommend in the session. Twardowski [37] also used Recurrent
Neural Networks (RNN) but used attributes for item encoding and
recommended only on unseen items. Quadrana et al. [30] proposed
a hierarchical RNN model, which consists of a session-level GRU
to model users’ activities within sessions and a user-level GRU to
model the evolution of the user across sessions. The updated user
representation will affect the session-level GRU to make personalized
recommendations.

[Figure 1: Different assumptions to model different factors as context for purchase prediction. Panels: Long-term Context Dependency, Short-term Context Dependency, Long-short-term Context Dependency. Nodes shown: u, q and $B_{t+1}$ in every panel; $C_{1:t}$ in the short-term and long-short-term panels.]

Wu and Yan [41] proposed a two-step
ranking method to recommend item lists based on user clicks and
views in the session. They treat item ranking as a classification        are asked to assess the relevance of a batch of documents based on
problem and learn the session representation in the first step. With     which the retrieval model is refined to find more relevant results.
the session representation as context, items are reranked with a         Rocchio [33] is generally credited as the first relevance feedback
list-wise loss proposed in ListNet in the second step. Li et al. [23]    method, which is based on the vector space model [35]. After the
adopted the attention mechanism in the RNN encoding process to           language model approach for IR was proposed [28], the rele-
identify the user’s main purpose in the current session. Quadrana        vance model version 3 (RM3) [21] became one of the state-of-the-art
et al. [29] reviewed extensive previous work on sequence-aware           pseudo RF methods that are also effective for relevance feedback.
recommendation and categorized the existing methods in terms of          Zamani and Croft [46] incorporated the semantic match between
different tasks, goals, and types of context adaption.                   unsupervised trained word embeddings into the language model
    The goal of a recommendation system is typically to help users       framework and introduced an embedding-based relevance model
explore items that they may be interested in when they do not have       (ERM). Although these RF methods can also be applied in our task,
clear purchase needs. On the contrary, a search engine aims to help      we propose an end-to-end neural model for relevance feedback in
users find only items that are most relevant to their intent specified   the context of product search.
in search queries. Relevance plays totally different roles in the two
tasks. In addition, the evaluation metrics in recommendation are         3     CONTEXT-AWARE PRODUCT SEARCH
usually based on clicks [12, 23, 30, 37, 41], whereas product search     We reformulate product search as a dynamic re-ranking task where
is evaluated with purchases under a query.                               short-term context represented by the clicks in the previous SERPs
                                                                         is considered for re-ranking subsequent result pages. Users’ global
                                                                         interests can also be incorporated for re-ranking as long-term con-
2.3    User Feedback for Information Retrieval                           text. We first introduce our problem formulation and different
There are studies on two types of user feedback in information           assumptions of context dependency models. Then we propose a
retrieval, implicit feedback which usually considers click-through       context-aware embedding model for the task and show how to
data as the indicator of document relevance and explicit feedback        optimize the model.
where users are asked to give the relevance judgments of a batch
of documents. Joachims et al. [17] found that click-through data as      3.1     Problem Formulation
implicit feedback is informative but biased and the relative prefer-     A query session1 is initiated when a user u issues a query q to the
ences derived from clicks are accurate on average. To separate click     search engine. The search results returned by the search engine are
bias from relevance signals, Craswell et al. [5] designed a Cascade      typically grouped into pages with a similar number of items. Let $R_t$
Model by assuming that users examine search results from top             be the set of items on the t-th search result page ranked by an initial
to bottom; Dupret and Piwowarski [9] proposed a User Browsing            ranker and denote by $R_{1:t}$ the union of $R_1, \cdots, R_t$. For practical
Model where results can be skipped according to their examination        purposes, we let the re-ranking candidate set $D_{t+1}$ for page t + 1
probability estimated from their positions and last clicks; Chapelle     be $R_{1:t+k} \setminus V_{1:t}$, where $k \geq 1$ and $V_{1:t}$ is the set of re-ranked items
and Zhang [4] constructed a Dynamic Bayesian Network model               viewed by the user in the first t pages. Given user u, query q, and the
which incorporates a variable to indicate whether a user is satisfied    set of clicked items in the first t pages $C_{1:t}$ as context, the objective
by a click and leaves the result page. Yue and Joachims [45] de-         is to rank all, if any, purchased items $B_{t+1}$ in $D_{t+1}$ at the top of the
fined a dueling bandit problem where reliable relevance signals are      next result page.
collected from users’ clicks on interleaved results to optimize the
ranking function. Learning an unbiased model directly from biased        3.2     Context Dependency Models
click-through data has also been studied by incorporating inverse
                                                                         There are three types of context dependencies that one can use to
propensity weighting and estimating the propensity [2, 18, 40]. In
                                                                         model the likelihood of a user purchasing a product in her query
this work, we model the user preference behind a search query
with her clicks and refine the following results shown to this user.     1 We refer to the series of user behaviors associated with a query as a query session,
   Explicit feedback is also referred to as true relevance feedback      i.e., a user issues a query, clicks results, paginates, purchases items and finally ends
(RF) in information retrieval and has been extensively studied. Users    searching with the query.
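
The candidate-set definition in Section 3.1 can be illustrated with a minimal Python sketch. It assumes that result pages are lists of item identifiers in initial-ranker order; the function name `candidate_set` and the toy identifiers are hypothetical and are not taken from the original system.

```python
from typing import List, Set


def candidate_set(pages: List[List[str]], viewed: Set[str], t: int, k: int = 1) -> List[str]:
    """Return D_{t+1}: items of the first t + k initial pages minus viewed items.

    `pages[j]` holds R_{j+1}, the (j+1)-th page of the initial ranker.
    `viewed` is V_{1:t}, the items the user has already seen in the first t pages.
    The initial-ranker order is preserved (an assumption of this sketch).
    """
    if t < 0 or k < 1:
        raise ValueError("t must be non-negative and k must be at least 1")
    pool = [item for page in pages[: t + k] for item in page]
    return [item for item in pool if item not in viewed]


# Toy data: three initial result pages with two items each (hypothetical).
PAGES = [["a", "b"], ["c", "d"], ["e", "f"]]
VIEWED_AFTER_PAGE_1 = {"a", "b"}

next_candidates = candidate_set(PAGES, VIEWED_AFTER_PAGE_1, t=1, k=1)
wider_candidates = candidate_set(PAGES, VIEWED_AFTER_PAGE_1, t=1, k=2)
```

A larger k widens the pool from which the next page is re-ranked, at the cost of scoring more items per pagination request.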
session, namely, long-term context, short-term context, and long-
short-term context. Figure 1 shows the graphical models for these
context dependencies, where u denotes the latent variable of a
user’s long-term interest that stays the same across all the search
sessions, and clicks in the first t result pages, i.e., $C_{1:t}$, represent
the user’s short-term preference. Purchased items on and after page
t + 1, i.e., $B_{t+1}$, depend on query q and different types of context
under different dependency assumptions.

[Figure 2: The structure of our context-aware embedding model (CEM). w represents words in queries or product titles; $C_{1:t}$ denotes the clicked item set in the first t SERPs, which consists of items c; $S_t$ is the overall context of the first t SERPs, a combination of query q, user u and clicks $C_{1:t}$; i is an item in the candidate set $D_{t+1}$ for re-ranking from page t + 1. Layer labels: Candidate Items, Overall Context, Clicked Items.]

    Long-term Context Dependency. In this assumption, only
users’ long-term preferences, usually represented by their historical
queries and the corresponding purchased items, are used to predict
the purchases in their current query sessions. An unshown item i is
ranked according to its probability of being purchased given u and
q, namely $p(i \in B_{t+1} \mid u, q)$. The advantage of such models is that
personalization of search results (as proposed in Ai et al. [3]) can be
conducted from the very beginning of a query session when there is
no feedback information available. However, this model needs user
identity and purchase history, which are not always available. In
addition, the long-term context may not be informative to predict a
user’s final purchases since her current search intent may be totally         documents, our model should capture user preferences from their
different from any of her previous searches and purchases.                    clicked items which are implicit positive signals. Components of
    Short-term Context Dependency. The shortcomings of long-                  CEM will be introduced next.
term context can be addressed by focusing on just the short-term                  Item Embeddings. We use product titles to represent products
context, i.e., the user’s actions such as clicks performed within the         since merchants tend to put the most informative, representative
current query session. This dependency model assumes that given               text such as the brand, name, size, color, material and even target
the observed clicks in the first t pages, the items purchased in the          customers in product titles. In this way, items do not have unique
subsequent result pages are conditionally independent of the user,            embeddings according to their identifiers and items with the same
shown in Figure 1. An unseen item i in the query session is re-               titles are considered the same. Although this may not be accurate
ranked based on its purchase probability conditioning on $C_{1:t}$ and        all the time, word representations can be generalized to new items,
q, i.e., $p(i \in B_{t+1} \mid C_{1:t}, q)$. In this way, users’ short-term preferences   and we do not need to cope with the cold-start problem. We use the
are captured and their identity and purchase records are not needed.          average of title word embeddings of a product as its own embedding,
Users with little or no purchase history and who have not logged              i.e.,
in can benefit directly under such a ranking scheme.
    Long-short-term Context Dependency. The third dependency                  \[ E(i) = \frac{\sum_{w \in i} E(w)}{|i|} \tag{1} \]
assumption is that purchases in the subsequent result pages de-               where i is the item, and |i| is the title length of item i. We also
pend on both short-term context, e.g., previous clicks in the current         evaluated other more complex product title encoding approaches
query session, and long-term context, such as historical queries and          such as non-linear projection of average word embeddings and
purchases of the user indicated by u. An unseen item i after page t           recurrent neural network on title word sequence, but they did not
is scored according to $p(i \in B_{t+1} \mid C_{1:t}, q, u)$. This setting considers      show superior performance over the simpler one that we use here.
more information but it also has the drawback of requiring users’                 User Embeddings. A lookup table for user embeddings is cre-
identity and purchase history.                                                ated and used for training, where each user has a unique represen-
    We will introduce how to model the three dependency assump-               tation. This vector is shared across search sessions and updated by
tions in the same framework in Section 3.3. In this paper, we focus           the gradient learned from previous user transactions. In this way,
on the case of non-personalized short-term context and include the            the long-term interest of the user is captured and we use the user
other two types of context for comparison.                                    embeddings as long-term context in our models.
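
As an illustration of Equation 1, the following Python sketch averages title word vectors to obtain an item embedding. The toy vocabulary, the two-dimensional vectors, and the helper name `average_embedding` are assumptions made for this example only; in an end-to-end model the word vectors would be learned rather than fixed by hand.

```python
from typing import Dict, List


def average_embedding(tokens: List[str], table: Dict[str, List[float]]) -> List[float]:
    """Mean of the word vectors of `tokens`, i.e. sum_w E(w) / |tokens|."""
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = table[tok]  # raises KeyError for out-of-vocabulary words
        for k in range(dim):
            total[k] += vec[k]
    return [x / len(tokens) for x in total]


# Hypothetical word-embedding table E(w) with two dimensions.
WORD_VECTORS = {
    "red": [1.0, 0.0],
    "cotton": [0.0, 1.0],
    "shirt": [1.0, 1.0],
}

# Item embedding E(i) for the title "red cotton shirt".
item_vec = average_embedding(["red", "cotton", "shirt"], WORD_VECTORS)
```

Because the embedding depends only on the title words, two items with identical titles map to the same vector, and a new item needs no identifier-specific parameters.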
                                                                                  Query Embeddings. Similar to item embeddings, we use the
                                                                              simple average embedding of query words as the representation,
3.3    Context-aware Embedding Model
                                                                              which also shows the best performance compared to the non-linear
We designed a context-aware framework where models under different            projection and recurrent neural network methods we have tried.
dependency assumptions can be trained by varying the corresponding            The embedding of the query is
coefficients, shown in Figure 2. To incorporate semantic
meanings and avoid the word mismatch between queries and items,                   E(q) = \frac{\sum_{w \in q} E(w)}{|q|}                    (2)
we embed queries, items and users into latent semantic space. Our
context-aware embedding model is referred to as CEM. We assume                where |q| is the length of query q.
users’ preferences are reflected by their implicit feedback, i.e. their           Short-term Context Embeddings. We use the set of clicked
clicks associated with the query. Similar to relevance feedback ap-           items to represent user preference behind the query, which we
proaches [21, 33] that extract a topic model from assessed relevant           refer to as E (C1:t ). For sessions associated with a different query
Leverage Implicit Feedback for Context-aware Product Search                                                SIGIR 2019 eCom, July 2019, Paris, France


q or page number t, the clicked items contained in C1:t may differ.          where E (St ) is computed according to Equation 4. This model can
We assume the sequence of clicked items does not matter when                 also be interpreted as a generative model for an item in the candidate
modeling short-term user preference, i.e., the same set of clicked           set Dt +1 given the context St . In this case, the probability of an
items should imply the same user preference regardless of the order          item in the candidate set Dt +1 being generated from the context
of them being clicked. There are two reasons for this assumption.            St is computed with a softmax function that takes the dot product
One is that the user’s purchase need is fixed for a query she issued         score between the embedding of an item and the context as input,
and is not affected by the order of clicks. The other is that the order      i.e.,
of user clicks is usually based on the rank of retrieved products                                p(i |C1:t , u, q) = score (i |q, u, C1:t )               (6)
from top to bottom as the user examines each result, which is not
affected by user preference in the non-personalized search results.          We need to train the model and learn appropriate embeddings
So we represent the set as the centroid of the clicked items in the          of context and items so that the probability of purchased items
latent semantic space, where the order of clicks does not make a             in Dt +1 , namely Bt +1 , should be larger than the other candidate
difference. A simple yet effective way is to consider equal weights          items, i.e. Dt +1 ⧹Bt +1 . Also, the conditional probability in Equation
of all the items in C1:t so that the centroid is simply averaged item        6 can be used to compute the likelihood of the observed instance
embeddings:                                                                  of C1:t , u, q, Bt +1 .
        E(C_{1:t}) = \frac{\sum_{i \in C_{1:t}} E(i)}{|C_{1:t}|}        (3)      3.4    Model Optimization
where |C1:t | is the number of clicked items in set C1:t .                   The embeddings of queries, users, items are learned by maximizing
   We also tried an attention mechanism to weight each clicked               the likelihood of observing Bt +1 given the condition of C1:t , u, q,
item according to the query and represent the user preference with a         i.e., after user u issued query q, she clicked the items in the first t
weighted combination of clicked items. However, this method is not           SERPs (C1:t ), then models are learned by maximizing the likelihood
better than combining clicks with equal weights in our experiments.          for her to finally purchase the items in Bt +1 , which are shown on and
So we only report results for the equal-weight method.                       after page t + 1. There are many possible values of t even for the same
   Overall Context Embeddings. We use a convex combination                   user u if she purchases multiple products on different result pages
of user, query, and click embeddings as the representation of overall        under query q. These are considered as different data entries. Then
context E (St ). i.e.                                                        the log likelihood of observing purchases in Bt +1 conditioning on
                                                                             C1:t , u, q in our model can be computed as
        E(S_t) = (1 - \lambda_u - \lambda_c) E(q) + \lambda_u E(u) + \lambda_c E(C_{1:t}),      (4)      \mathcal{L}(B_{t+1} | C_{1:t}, u, q) = \log p(B_{t+1} | C_{1:t}, u, q)
        0 \le \lambda_u \le 1, \; 0 \le \lambda_c \le 1, \; \lambda_u + \lambda_c \le 1                        \propto \log \prod_{i \in B_{t+1}} p(i | C_{1:t}, u, q)        (7)
                                                                                                             \propto \sum_{i \in B_{t+1}} \log p(i | C_{1:t}, u, q)
This overall context is then treated as the basis for predicting purchased
items in Bt +1 . When λc = 0, C1:t is ignored in the prediction
and St corresponds to the long-term context shown in Figure 1.
When λu = 0, user u does not have impact on the final purchase
given C1:t . This aligns with the short-term context assumption in           The second step can be inferred if we consider whether an item
Figure 1. When λu > 0, λc > 0, λu + λc ≤ 1, both long-term and               will be purchased is independent of another item given the context.
short-term context are considered and this matches the type of                  According to Equation 5, 6 and 7, we can optimize the condi-
long-short-term context in Figure 1. So by varying the values of λu          tional log-likelihood directly. A common problem for the softmax
and λc , we can use Equation 4 to model different types of context           calculation is that the denominator usually involves a large num-
dependency and do comparisons.                                               ber of values and is impractical to compute. However, this is not
   Attention Allocation Model for Items. With the overall con-               a problem in our model since we limit the candidate set Dt +1 to
text collected from the first t pages, we further construct an atten-        only some top-ranked items retrieved by the initial ranker so that
tive model to re-rank the products in the candidate set Dt +1 . This         the computation cost is small.
re-ranking process can be considered as an attention allocation                 Similar to previous studies [3, 38], we apply L2 regularization on
problem. Given the context that indicates the user’s preference and          the embeddings of words and users to avoid overfitting. The final
a set of candidate items that have not been shown to the users               optimization goal can be written as
yet, the item which attracts more user attention will have higher
probability to be purchased. The attention weights then act as the           \mathcal{L}' = \sum_{u,q,t} \mathcal{L}(B_{t+1} | C_{1:t}, u, q) + \gamma \Big( \sum_{w} \|E(w)\|^2 + \sum_{u} \|E(u)\|^2 \Big)
basis for re-ranking. Predicting the probability of each candidate                 = \sum_{u,q,t} \sum_{i \in B_{t+1}} \log \frac{\exp(E(S_t) \cdot E(i))}{\sum_{i' \in D_{t+1}} \exp(E(S_t) \cdot E(i'))}        (8)
item being purchased can be considered as attention allocation for                   + \gamma \Big( \sum_{w} \|E(w)\|^2 + \sum_{u} \|E(u)\|^2 \Big)
the items. This idea is also similar to the listwise context model
proposed by Ai et al. [1]. They extracted the topic model from top-ranked    where γ is the hyper-parameter to control the strength of L2 regularization.
documents with recurrent neural networks and used it as                      The function accumulates entries of all the possible user
a local context to re-rank the top documents with their attention            u, query q, and the valid page number t for pagination which has
weights. The attention weights can be computed as:

         score(i | q, u, C_{1:t}) = \frac{\exp(E(S_t) \cdot E(i))}{\sum_{i' \in D_{t+1}} \exp(E(S_t) \cdot E(i'))}        (5)
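As a concrete illustration of Equations 2 to 5 and the per-entry log-likelihood in Equation 7, the following NumPy sketch computes the averaged query and click embeddings, the convex combination E(St), and the softmax attention scores over a candidate set. It is a minimal sketch under stated assumptions rather than our TensorFlow implementation: the function names, the 3-dimensional random vectors and the toy vocabulary are hypothetical, and the embeddings are taken as given instead of being learned.

```python
import numpy as np

def mean_embedding(vectors):
    """Average a collection of embedding vectors (Equations 2 and 3)."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

def context_embedding(e_q, e_u, e_c, lambda_u, lambda_c):
    """Convex combination of query, user and click embeddings (Equation 4)."""
    if not (0.0 <= lambda_u <= 1.0 and 0.0 <= lambda_c <= 1.0
            and lambda_u + lambda_c <= 1.0):
        raise ValueError("lambda_u, lambda_c must lie in [0, 1] and sum to at most 1")
    return (1.0 - lambda_u - lambda_c) * e_q + lambda_u * e_u + lambda_c * e_c

def attention_scores(e_s, candidate_embeddings):
    """Softmax over dot products of the context with each candidate (Equation 5)."""
    logits = np.asarray(candidate_embeddings, dtype=float) @ e_s
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def log_likelihood(scores, purchased_indices):
    """Sum of log-probabilities of the purchased candidates (Equation 7)."""
    return float(np.sum(np.log(scores[list(purchased_indices)])))

# Toy data: every value below is made up for illustration only.
rng = np.random.default_rng(0)
word_vecs = {w: rng.normal(size=3) for w in ["wooden", "train", "set"]}
e_q = mean_embedding([word_vecs[w] for w in ["wooden", "train"]])  # E(q)
clicked = rng.normal(size=(2, 3))      # embeddings of items in C_{1:t}
candidates = rng.normal(size=(5, 3))   # embeddings of items in D_{t+1}
e_c = mean_embedding(clicked)          # E(C_{1:t})
e_u = np.zeros(3)                      # ignored because lambda_u = 0 below

# Short-term context only: lambda_u = 0, lambda_c > 0.
e_s = context_embedding(e_q, e_u, e_c, lambda_u=0.0, lambda_c=0.4)
scores = attention_scores(e_s, candidates)
ranking = np.argsort(-scores)          # candidate indices, best first
```

Setting both coefficients to zero reduces the context to E(q), which corresponds to the query-only case, while a positive lambda_u would add the user vector as long-term context.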

           Table 1: Statistics of our collected datasets                                    pages. Nonetheless, we can still evaluate the performance of one-shot
                             Toys & Games   Garden & Outdoor   Cell Phones & Accessories    re-ranking from page t + 1 given the context collected from the
      Product title length     13.14±6.46         16.39±7.38                  22.02±7.34    first t pages. In our experiments, we compare different methods for
      Vocabulary size             381,620          1,054,980                     194,022    re-ranking from page 2 and page 3 since earlier re-ranking can influence
      Query Session Splits                                                                  results at higher positions, which have a larger impact
      Train                        91.21%             87.36%                      86.57%    on the ranking performance. As in relevance feedback experiments
      Validation                    2.61%              3.66%                       4.20%    [26, 33], our evaluation is also based on residual ranking, where the
      Test                          6.18%              8.98%                       9.23%    first t result pages are discarded and the re-ranking of the unseen items
                                                                                            is evaluated. We use the residual ranking evaluation paradigm
clicks in and before page t and purchases after that page. All possi-         because the results before re-ranking are retrieved by the same
ble words and users are taken into account in the regularization.             initial ranker and identical for all the re-ranking methods.
When we do not incorporate long-term context, the corresponding                   Similar to other ranking tasks, we use mean average precision
parts of u are omitted.                                                       (MAP) at cutoff 100, mean reciprocal rank (MRR) and normalized
    The loss function actually captures the loss of a list and this           discounted cumulative gain (NDCG) as ranking metrics. MAP measures
list-wise loss is similar to AttentionRank proposed by Ai et al. [1].         the overall performance of a ranker in terms of both precision
Because of the softmax function, maximizing the probabilities of relevant     and recall, which indicates the ability to retrieve more purchased
instances in Bt +1 simultaneously minimizes the probabilities                 items in the next 100 results and rank them at higher positions. MRR
of the remaining non-relevant instances. This loss shows superiority over     is the average inverse rank for the first purchase in the retrieved
other list-wise losses such as ListMLE [43] and SoftRank [36], which          items. It indicates the expected number of products users need to
is another reason we adopt this loss.                                         browse before finding the ones they are satisfied with. NDCG is a
                                                                              common metric for multiple-label document ranking. Although in
                                                                              our context-aware product search, items only have binary labels
4     EXPERIMENTAL SETUP                                                      indicating whether they were purchased given the context, NDCG
In this section, we introduce our experimental settings of context-aware      still shows how good a rank list is with emphasis on results at top
product search. We first describe how we construct the                        positions compared with the ideal rank list. We use NDCG@10 in
datasets for experiments. Then we describe the baseline methods               our experiments.
and evaluation methodology for comparing different methods. We
also introduce the training settings for our model.
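The residual ranking evaluation and the three metrics described in Section 4.2 can be sketched for a single query session as follows. This is a minimal sketch, assuming binary purchase labels and a hypothetical session whose item identifiers are made up; the function names are ours for illustration, and normalizing average precision by the number of purchased items is one common convention that we assume here. Averaging the per-session values over all test sessions would give MAP, MRR and NDCG@10.

```python
import math

def residual_ranking(ranked_items, seen_items):
    """Drop items already shown on the first t pages, keeping the order."""
    seen = set(seen_items)
    return [item for item in ranked_items if item not in seen]

def reciprocal_rank(ranked_items, purchased):
    """Inverse rank of the first purchased item, 0 if none is retrieved."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in purchased:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_items, purchased, cutoff=100):
    """Average precision at the given cutoff with binary labels."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_items[:cutoff], start=1):
        if item in purchased:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalize by the number of purchased items.
    return precision_sum / len(purchased) if purchased else 0.0

def ndcg(ranked_items, purchased, cutoff=10):
    """NDCG at the given cutoff with binary gains."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked_items[:cutoff], start=1)
              if item in purchased)
    ideal_hits = min(len(purchased), cutoff)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical session: items "a" and "b" were shown on the first page.
ranked = ["a", "b", "c", "d", "e"]
residual = residual_ranking(ranked, seen_items=["a", "b"])
purchased = {"d"}
rr = reciprocal_rank(residual, purchased)
ap = average_precision(residual, purchased)
gain = ndcg(residual, purchased)
```

In this toy session the purchased item sits at rank 2 of the residual list, so the reciprocal rank and the average precision are both 0.5.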
                                                                               4.3   Baselines
4.1     Datasets                                                              We compare our short-term context-aware embedding model (SCEM)
We randomly sampled three category-specific datasets, namely,                 with four groups of baselines: retrieval models that do not use context,
“Toys & Games”, “Garden & Outdoor”, and “Cell Phones &                        and long-term, short-term and long-short-term context-aware models.
Accessories”, from the logs of a commercial product search engine                Production Model (PROD). PROD is essentially a gradient
spanning ten months between years 2017 and 2018. We keep only                 boosted decision tree based model. Comparing with this model
the query sessions with at least one clicked item on any page before          indicates the potential gain of our model if deployed online. Note
the pages with purchased items. These sessions are difficult for the          that PROD performs worse on our datasets than on the entire search
production model since it could not rank the “right” items on the             traffic since we extracted query sessions where the purchased items
top so that users purchased items in the second or later result pages.        are in the second or later result pages.
Our datasets include up to a few million query sessions containing               Random (RAND). By randomly shuffling the results in the
several hundred thousand unique queries. When there are multiple              candidate set which consists of the top unseen retrieved items by
purchases in a query session across different result pages, purchases         the production model, we get the performance of a random re-
until page t are only considered as clicks and used together with             ranking strategy. This performance should be the lower bound of
other clicks to predict purchases on and after page t + 1. Statistics         any reasonable model.
of our datasets are shown in Table 1.                                            Popularity (POP). In this method, the products in the candidate
                                                                              set are ranked according to how many times they were purchased
                                                                              in the training set. Popularity is an important factor for product
4.2     Evaluation Methodology                                                search [25] besides relevance.
We divided each dataset into training, validation, and test sets by              Query Likelihood Model (QL). The query likelihood model
the date of the query sessions. The sessions that occurred in the first 34    (QL) [28] is a language model approach for information retrieval. It
weeks are used for training, the following 2 weeks for validation             shows the performance of re-ranking without implicit feedback and
and the last 4 weeks for testing. Models were trained with data               is only based on the bag-of-words representation. The smoothing
in the training set; hyper-parameters were tuned according to the             parameter µ in QL was tuned from {10, 30, 50, 100, 300, 500}.
model performance on the validation set, and evaluation results on               Query Embedding based Model (QEM). This model scores an
the test set were reported for comparison.                                    item by the generative probability of the item given the embedding
   Since the datasets are static, it is impossible to evaluate the mod-       of a query. When λu = 0, λc = 0, CEM is exactly QEM.
els in a truly interactive setting where each subsequent page is                 Long-term Context-aware Relevance Model (LCRM3). Rel-
re-ranked based on the observed clicks on the current and previous            evance Model Version 3 (RM3) [21] is an effective method for both


pseudo and true relevance feedback. It extracts a bag-of-words language                   We implemented our models with TensorFlow. The models were
model from a set of feedback documents, expands the original                           trained for 20 epochs with the batch size set to 256. Adam [20] was
query with the most important words from the language model,                           used as the optimizer and the global norm of parameter gradients
and retrieves results again with the expanded query. To capture the                    was clipped at 5 to avoid unstable gradient updates. After each
long-term interest of a user, we use RM3 to extract significant words                  epoch, the model was evaluated on the validation set and the model
from titles of the user’s historically purchased products and refine the               with the best performance on the validation set was selected to
retrieval results for the user in the test set with the expanded query.                be evaluated on the test set. The initial learning rate was selected
The weight of the initial query was tuned from {0, 0.2, · · · , 1.0}                   from {0.01, 0.005, 0.001, 0.0005, 0.0001}. L2 regularization strength
and the expansion term count was tuned from {10, 20, · · · , 50}. The                  γ was tuned from 0.0 to 0.005. λq , λu in Equation 4 were tuned from
effect of query weight is shown in Section 5.2.                                        {0, 0.2, · · · , 0.8, 1.0} (λq + λu ≤ 1) to represent various dependency
   Long-term Context-aware Embedding Model (LCEM). When                                assumptions mentioned in Section 3.2, and the embedding size was
λc = 0, 0 < λu ≤ 1, CEM becomes LCEM by considering long-term                          scanned from {50, 100, · · · , 300}. The effects of λq , λu and embedding
context indicated by universal user representations.                                   size are shown in Section 5.
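The gradient clipping step mentioned in the training setup rescales all parameter gradients jointly whenever their global norm exceeds a threshold. The NumPy sketch below illustrates the idea with the threshold of 5 stated above. It is an illustrative re-implementation under our own assumptions, not the TensorFlow code used in the experiments, and the function name and toy gradients are hypothetical.

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in gradients)))
    if global_norm <= max_norm:
        return [g.copy() for g in gradients], global_norm
    scale = max_norm / global_norm
    return [g * scale for g in gradients], global_norm

# Toy gradients for two hypothetical parameter tensors.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Because every tensor is multiplied by the same factor, the direction of the overall update is preserved and only its magnitude is bounded.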
   Short-term Context-aware Relevance Model (SCRM3). We
also use RM3 to extract the user preference behind a query from                        5     RESULTS AND DISCUSSION
the clicked items in the previous SERPs as short-term context and                      In this section, we show the performance of the four types of models
refine the next SERP. This method uses the same information as our                     mentioned in Section 4.3. First, we compare the overall retrieval
short-term context-aware embedding model, but it represents user                       performance of various types of models in Section 5.1. Then we
preference with a bag-of-words model and only considers exact                          further study the effect of queries, long-term context and embedding
word matches between a candidate item and the user preference                          size on each model in the following subsections.
model. The query weight and expansion term count were tuned in
the same range as for LCRM3 and the influence of initial query weight                  5.1      Overall Retrieval Performance
can be found in Section 5.2. 2
                                                                                       Table 2 shows the performance of different methods on re-ranking
   Long-short-term Context-aware Embedding Model (LSCEM).
                                                                                       items when users paginate to the second and third SERP for Toys &
When λu > 0, λc > 0, 0 < λu + λc ≤ 1, both long-term context
                                                                                       Games, Garden & Outdoor and Cell Phones & Accessories. Among
represented by u and short-term context indicated by Ct are taken
                                                                                       all the methods, SCEM and SCRM3 perform better than all the
into account in CEM.
                                                                                       other baselines without using short-term context, including their
   PROD, RAND, POP, QL, and QEM are retrieval models that rank
                                                                                       corresponding retrieval baseline, QEM, and QL respectively, and
items based on queries and do not rely on context or user informa-
                                                                                       PROD which considers many additional features, showing the ef-
tion. These models can be used as the initial ranker for any queries.
                                                                                       fectiveness of incorporating short-term context.
The second type of rankers consider users’ long-term interests to-
                                                                                          In contrast to the effectiveness of short-term context, long-term
gether with queries, such as LCEM and LCRM3. These methods
                                                                                       context does not help much when combined with queries alone or
utilize users’ historical purchases but can only be applied to users
                                                                                       together with short-term context. LCRM3 outperforms QL on all
who appear in the training set. The third type is feedback models
                                                                                       the datasets by a small margin when users’ historical purchases
which take users’ clicks in the query session as short-term context
                                                                                       are used to represent their preferences. LCEM and LSCEM always
and this category includes SCRM3 and our SCEM. In this approach,
                                                                                       perform worse than QEM and SCEM by incorporating long-term
user identities are not needed. However, they can only be applied
                                                                                       context with λu > 0. Note that since only a small portion of users in
to search sessions where users click on results and only items from
                                                                                       the test set appear in the training set, the re-ranking performance
the second result page or later can be refined with the clicks. The
                                                                                       of most query sessions in the test set will not be affected. We will
fourth category considers both long and short-term context, e.g.,
                                                                                       elaborate on the effect of long-term context in Section 5.3.
LSCEM. The second, third and fourth groups of baseline correspond
                                                                                          We found that neural embedding methods are more effective
to the dependency assumptions shown in the first, second and third
                                                                                       than word-based baselines. When implicit feedback is not incorpo-
sub-figure in Figure 1 respectively.
                                                                                       rated, QEM performs significantly better than QL, sometimes even
                                                                                       better than PROD. When clicks are used as context, with neural
4.4     Model Training                                                                 embeddings, SCEM is much more effective than SCRM3. This shows
Query sessions with multiple purchases on different pages are split                    that semantic match is more beneficial than exact word match for
into sub-sessions, one for each page with a purchase. When there                       top retrieved items in product search. In addition, these embeddings
are more than three sub-sessions for a given session, we randomly                      also carry the popularity information since items purchased more
select three in each training epoch. We do so to avoid skewing the                     in the training data will get more gradients during training. Due
dataset with sessions with many purchases. Likewise, we randomly                       to our model structure, there are also properties that the embed-
select five clicked items for constructing short-term context if there                 dings of items purchased under similar queries or context will be
are more than five clicked items in a query session.                                   more alike compared with non-purchased items, and embeddings
                                                                                       of clicked and purchased items are also similar.
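The sub-session construction and per-epoch sampling described at the start of Section 4.4 can be sketched as follows. This is a minimal sketch under our reading of Sections 3.4 and 4.1, stated here as assumptions: for each page p with a purchase, the context covers clicks on pages 1 to t = p − 1 (with earlier purchases treated as clicks), and the targets are the purchases on page p and later. The dictionary layout, function names and item identifiers are hypothetical.

```python
import random

def split_into_sub_sessions(clicks_by_page, purchases_by_page):
    """Create one sub-session per page that contains a purchase."""
    sub_sessions = []
    for page in sorted(purchases_by_page):
        t = page - 1
        context = [item for p in sorted(clicks_by_page) if p <= t
                   for item in clicks_by_page[p]]
        # Purchases up to page t are only treated as clicks.
        context += [item for p in sorted(purchases_by_page) if p <= t
                    for item in purchases_by_page[p]]
        context = list(dict.fromkeys(context))  # de-duplicate, keep order
        targets = [item for p in sorted(purchases_by_page) if p >= page
                   for item in purchases_by_page[p]]
        if context:  # keep only entries with at least one earlier click
            sub_sessions.append({"t": t, "context": context, "targets": targets})
    return sub_sessions

def sample_for_epoch(sub_sessions, rng, max_sub_sessions=3, max_clicks=5):
    """Randomly keep at most three sub-sessions and five clicks per sub-session."""
    if len(sub_sessions) > max_sub_sessions:
        chosen = rng.sample(sub_sessions, max_sub_sessions)
    else:
        chosen = list(sub_sessions)
    sampled = []
    for s in chosen:
        context = s["context"]
        if len(context) > max_clicks:
            context = rng.sample(context, max_clicks)
        sampled.append({"t": s["t"], "context": list(context),
                        "targets": list(s["targets"])})
    return sampled

# Hypothetical session with purchases on pages 2 and 4.
clicks_by_page = {1: ["a", "b"], 2: ["c"], 3: ["d", "e", "f", "g"]}
purchases_by_page = {2: ["c"], 4: ["h"]}
subs = split_into_sub_sessions(clicks_by_page, purchases_by_page)
epoch_entries = sample_for_epoch(subs, random.Random(0))
```

Re-sampling in every epoch lets sessions with many purchases or clicks contribute different entries over time without dominating any single epoch.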
2 We also implemented the embedding-based relevance model (ERM) [46], which is            The relative improvement of SCEM and SCRM3 compared to
an extension of RM3 by taking semantic similarities between word embeddings into       the production model on Toys & Games is less than the other
account, as a context-aware baseline. But it does not perform better than RM3 across
different settings. So we did not include it.                                          3 Due to the confidentiality policy, the absolute value of each metric cannot be revealed.

Table 2: Comparison of baselines and our short-term context embedding model (SCEM) on re-ranking when users paginate to
the 2nd and 3rd page. Each number is the relative improvement of a method over the production model (PROD)3 .
‘−’ indicates that a baseline is significantly worse than SCEM in Student's t-test with p ≤ 0.001. Note that a difference larger
than 3% is approximately significant. The best performance in each column is marked in bold.
                                    Performance of Re-ranking from the 2nd Page
                     Toys & Games                  Garden & Outdoor            Cell Phones & Accessories
Model      MAP        MRR        NDCG@10    MAP        MRR        NDCG@10    MAP        MRR        NDCG@10
PROD       0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−
RAND       -25.70%−   -26.83%−   -29.23%−   -23.40%−   -24.16%−   -25.73%−   -20.15%−   -20.93%−   -22.73%−
POP        -15.82%−   -15.90%−   -17.87%−   -9.38%−    -9.51%−    -9.55%−    -8.54%−    -8.25%−    -11.12%−
QL         -25.78%−   -27.80%−   -29.73%−   -19.62%−   -20.78%−   -21.63%−   -16.14%−   -16.77%−   -18.00%−
QEM        -2.57%−    -3.10%−    -3.85%−    +0.65%−    -0.34%−    +1.06%−    +9.96%−    +9.73%−    +10.58%−
LCRM3      -24.82%−   -25.92%−   -28.60%−   -19.33%−   -20.45%−   -21.28%−   -15.44%−   -16.07%−   -17.38%−
LCEM       -2.57%−    -3.10%−    -3.85%−    +0.65%−    -0.34%−    +1.06%−    +9.96%−    +9.73%−    +10.58%−
SCRM3      +12.93%−   +9.63%−    +9.53%−    +25.15%−   +23.01%−   +23.15%−   +18.65%−   +16.77%−   +17.11%−
SCEM       +26.59%    +24.56%    +26.20%    +37.43%    +35.16%    +37.22%    +48.99%    +47.00%    +50.18%
LSCEM      +26.59%    +24.56%    +26.20%    +37.43%    +35.16%    +37.22%    +48.99%    +47.00%    +50.18%

                                    Performance of Re-ranking from the 3rd Page
                     Toys & Games                  Garden & Outdoor            Cell Phones & Accessories
Model      MAP        MRR        NDCG@10    MAP        MRR        NDCG@10    MAP        MRR        NDCG@10
PROD       0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−     0.00%−
RAND       -15.45%−   -17.97%−   -18.96%−   -12.29%−   -13.71%−   -13.97%−   -8.75%−    -10.05%−   -9.55%−
POP        -4.37%−    -5.31%−    -5.18%−    +2.09%−    +1.43%−    +3.49%−    -0.78%−    -1.21%−    -1.43%−
QL         -14.87%−   -18.31%−   -19.20%−   -9.15%−    -10.97%−   -10.37%−   -4.05%−    -5.21%−    -3.62%−
QEM        +12.83%−   +11.07%−   +14.13%−   +15.82%−   +14.42%−   +19.32%−   +28.85%−   +27.60%−   +33.92%−
LCRM3      -13.99%−   -15.82%−   -17.20%−   -9.02%−    -10.73%−   -10.04%−   -3.26%−    -4.48%−    -2.85%−
LCEM       +12.83%−   +11.07%−   +14.13%−   +15.82%−   +14.42%−   +19.32%−   +28.85%−   +27.60%−   +33.92%−
SCRM3      +34.26%−   +29.27%−   +32.86%−   +49.54%−   +46.60%−   +51.20%−   +44.52%−   +41.16%−   +46.98%−
SCEM       +51.46%    +47.57%    +54.77%    +63.79%    +60.43%    +67.79%    +85.51%    +81.72%    +93.85%
LSCEM      +51.46%    +47.57%    +54.77%    +63.79%    +60.43%    +67.79%    +85.51%    +81.72%    +93.85%


two datasets. There are two possible reasons. First, the production model performs better on Toys & Games
than on Garden & Outdoor and Cell Phones & Accessories, which can be seen from its larger advantage over
random re-ranking. Second, the average number of clicks in the first two and three SERPs in Toys & Games is
lower than in the other two datasets (see footnote 4), and SCEM and SCRM3 perform better with more implicit
feedback information.
   The relative performance of all the other methods against PROD is better when re-ranking from page 3
than when re-ranking from page 2 in terms of all three metrics. There are several reasons. When purchases
happen on the third page or later, it usually means that users could not find the “right” products in the
first two pages, which further indicates that the production model is worse for these query sessions. In
addition, the ranking quality of PROD on the third page is worse than on the second page. Another reason
that SCRM3 and SCEM improve more upon PROD when re-ranking from page 3 is that more context becomes
available with the clicks collected on the second page, which makes the user preference model more robust.
   QL performs similarly to RAND on Toys & Games and a little better than RAND on Garden & Outdoor and Cell
Phones & Accessories, which indicates that relevance captured by exact word matching is not the key concern
in the rank lists of the production model. In addition, most candidate products are consistent with the
query intent, but the final purchase depends on users’ preferences. Popularity, an important factor that
consumers consider, improves the performance upon QL. However, it is still worse than the production model
most of the time.

4 The specific average number of clicks in the datasets cannot be revealed due to the confidentiality policy.

5.2   Effect of Short-term Context
We investigate the influence of short-term context by varying the value of λc with λu set to 0. Only the
performance of SCRM3 and SCEM varies as the interpolation coefficient of short-term context changes, since
only these two methods utilize the clicks. Since re-ranking from the second or third page on Toys & Games,
Garden & Outdoor, and Cell Phones & Accessories all show similar trends, we only report the performance of
each method when re-ranking from the second page on Toys & Games, which is shown in Figure 3a. Figure 3a
shows that as the weight of clicks is set larger, the performance of SCRM3 and SCEM goes up consistently.
When λc is set to 0, SCRM3 and SCEM degenerate to QL and QEM respectively, which do not incorporate
short-term context. From another perspective, SCRM3 and SCEM degrade in performance as we increase the
weight on queries. For exact word match based methods, more click signals lead to more improvements for
SCRM3, which is also consistent with the fact that QL, which only considers queries, performs similarly to
RAND. For embedding-based methods, which capture semantic match and popularity, QEM with queries alone
performs similarly to PROD
Leverage Implicit Feedback for Context-aware Product Search                                                                                                                                      SIGIR 2019 eCom, July 2019, Paris, France

[Figure 3: four line plots with MAP Lift (%) on the y-axis. Panels (a) and (d) show PROD, RAND, POP, QL, QEM, LCRM3, SCRM3, and SCEM; panels (b) and (c) show PROD, RAND, POP, QL, LCEM, LCRM3, SCRM3, and LSCEM. The x-axis is λc in (a) and λu in (b) and (c), each from 0.0 to 1.0, and embedding size from 50 to 300 in (d).]

                  (a) λc          (b) λu          (c) λu on seen users          (d) Embedding size
Figure 3: The effect of λc, λu, and embedding size on the performance of each model on the Toys & Games collection when
re-ranking from the second SERP, for the scenarios where users paginate to page 2.


but much better when more context information is incorporated in SCEM. This indicates that users’ clicks
already cover the query intent and also contain additional information about users’ preferences.

5.3   Effect of Long-term Context
Next we study the effect of long-term context, indicated by users’ global representations E(u), both with
and without incorporating short-term context. LCEM and LCRM3 only use queries and user historical
transactions for ranking; LSCEM uses long- and short-term context (λu + λc is fixed at 1, since we found that
query embeddings do not contribute to the re-ranking performance when short-term context is incorporated).
Toys & Games is used again to show the sensitivity of each model to λu under the setting of re-ranking from
the second page. For users in the test set who never appear in the training set, λu does not take effect
because these unknown users have null representations. In Toys & Games, only about 13% of all the query
sessions in the test set are from users who also appear in the training set. The performance change on the
entire test set will be small because the models can affect only this low proportion of the test set, so we
also include the performance of each model on the subset of data entries associated with users seen in the
training set. Figures 3b and 3c show how each method performs with different λu on the whole test set and on
this subset respectively.
   Figures 3b and 3c show that for LSCEM, as λu becomes larger, performance goes down. This indicates that
when short-term context is used, users’ embeddings act like noise and drag down the re-ranking performance.
λu has different impacts on the models that do not use clicks. For LCRM3, when we zoom in to focus only on
users that appear in the training set, the performance changes and the superiority over QL are more
noticeable. The best value of MAP is achieved when λu = 0.8, which means that long-term context benefits
word-based models with additional information that can help with the word mismatch problem. In contrast,
LCEM with non-zero λu performs worse than when only queries are considered. Embedding models already capture
semantic similarities between words. In addition, as we mentioned in Section 5.1, they also carry
information about popularity, since the products purchased more often under the query get more credit
during training. Another possible reason is that the number of customers with sessions of similar intent is
low, so that the user embedding misguides the query sessions. Thus, users’ long-term interests do not bring
additional information to further improve LCEM on the collections.
   This finding is different from the observation in HEM proposed by Ai et al. [3], which incorporates user
embeddings as users’ long-term preferences and achieves superior performance compared to not using user
embeddings. We hypothesize that this inconsistent finding is due to the differences in datasets. HEM was
evaluated on a dataset that is heavily biased to users with multiple purchases and under a rather simplistic
assumption of query generation, where the terms from the category hierarchy of a product are concatenated as
the query string. Their datasets contain only hundreds of unique queries and tens of thousands of items that
are all purchased by multiple users. In contrast, we experimented on real queries and the corresponding
user behavior data extracted from the search log. The numbers of unique queries and items in our
experiments are hundreds of times larger than in their dataset. There is also little overlap of users
between the training and test sets in our datasets, while in their experiments, all the users in the test
set appear in the training set.

5.4   Effect of Embedding Size
Figure 3d shows the sensitivity of each model to the embedding size on Toys & Games, which presents similar
trends to the other two datasets. Generally, SCEM and QEM are not sensitive to the embedding size as long as
it is in a reasonable range. To keep the model effective and simple, we use 100 as the embedding size and
report experimental results under this setting in Table 2 and the other figures.

6   CONCLUSION AND FUTURE WORK
We reformulate product search as a dynamic ranking problem where we leverage users’ implicit feedback on the
presented products as short-term context and refine the ranking of the remaining items when the users
request the next result pages. We then propose an end-to-end context-aware neural embedding model to
represent various context dependency assumptions for predicting purchased items. Our experimental results
indicate that incorporating short-term context is more effective than using long-term context or not using
context at all. It is also shown that our neural context-aware model performs better than the
state-of-the-art word-based feedback models.
   For future work, there are several research directions. First, it would be better to evaluate our
short-term context re-ranking model online, in an interactive setting, as each result page can be re-ranked
dynamically. Second, other information sources such as images and price can be included to extract user
preferences from their


feedback. Third, we are interested in the use of negative feedback such as “skips” that can be identified
reliably based on subsequent user actions.

ACKNOWLEDGMENTS
This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF
IIS-1715095. Any opinions, findings and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect those of the sponsor.

REFERENCES
 [1] Qingyao Ai, Keping Bi, Jiafeng Guo, and W Bruce Croft. 2018. Learning a Deep Listwise Context Model for
     Ranking Refinement. arXiv preprint arXiv:1804.05936 (2018), 135–144.
 [2] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Unbiased Learning to Rank with
     Unbiased Propensity Estimation. arXiv preprint arXiv:1804.05938 (2018).
 [3] Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and W Bruce Croft. 2017. Learning a hierarchical
     embedding model for personalized product search. In Proceedings of the 40th International ACM SIGIR
     Conference. ACM, 645–654.
 [4] Olivier Chapelle and Ya Zhang. 2009. A dynamic bayesian network click model for web search ranking. In
     Proceedings of the 18th international conference on World wide web. ACM, 1–10.
 [5] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An experimental comparison of click
     position-bias models. In Proceedings of the 2008 WSDM Conference. ACM, 87–94.
 [6] Wei Di, Anurag Bhardwaj, Vignesh Jagadeesh, Robinson Piramuthu, and Elizabeth Churchill. 2014. When
     relevance is not enough: Promoting visual attractiveness for fashion e-commerce. arXiv preprint
     arXiv:1406.3561 (2014).
 [7] Huizhong Duan, ChengXiang Zhai, Jinxing Cheng, and Abhishek Gattani. 2013. A probabilistic mixture model
     for mining and analyzing product search log. In Proceedings of the 22nd ACM international conference on
     Information & Knowledge Management. ACM, 2179–2188.
 [8] Huizhong Duan, ChengXiang Zhai, Jinxing Cheng, and Abhishek Gattani. 2013. Supporting keyword search in
     product database: a probabilistic approach. Proceedings of the VLDB Endowment 6, 14 (2013), 1786–1797.
 [9] Georges E Dupret and Benjamin Piwowarski. 2008. A user browsing model to predict search engine click
     data from past observations. In Proceedings of the 31st annual international ACM SIGIR conference. ACM,
     331–338.
[10] Krista Garcia. 2018. More Product Searches Start on Amazon.
     https://retail.emarketer.com/article/more-product-searches-start-on-amazon/5b92c0e0ebd40005bc4dc7ae
[11] Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Xin-Shun Xu, and Mohan Kankanhalli. 2018. Multi-modal
     preference modeling for product search. In 2018 ACM Multimedia Conference on Multimedia Conference. ACM,
     1865–1873.
[12] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based
     recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[13] Balázs Hidasi and Domonkos Tikk. 2016. General factorization framework for context-aware
     recommendations. Data Mining and Knowledge Discovery 30, 2 (2016), 342–371.
[14] Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement Learning to Rank in
     E-Commerce Search Engine: Formalization, Analysis, and Application. arXiv preprint arXiv:1803.00710
     (2018).
[15] Bernard J Jansen and Paulo R Molina. 2006. The effectiveness of Web search engines for retrieving
     relevant ecommerce links. Information Processing & Management 42, 4 (2006), 1075–1098.
[16] Gawesh Jawaheer, Peter Weller, and Patty Kostkova. 2014. Modeling user preferences in recommender
     systems: A classification framework for explicit and implicit user feedback. ACM Transactions on
     Interactive Intelligent Systems (TiiS) 4, 2 (2014), 8.
[17] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2017. Accurately interpreting
     clickthrough data as implicit feedback. In ACM SIGIR Forum, Vol. 51. ACM, 4–11.
[18] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased
     feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM,
     781–789.
[19] Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. 2017. On application of learning
     to rank for e-commerce search. In Proceedings of the 40th International ACM SIGIR Conference. ACM,
     475–484.
[20] Diederik P Kingma and Jimmy Lei Ba. 2014. Adam: A method for stochastic optimization. In Proc. 3rd Int.
     Conf. Learn. Representations.
[21] Victor Lavrenko and W Bruce Croft. 2017. Relevance-based language models. In ACM SIGIR Forum, Vol. 51.
     ACM, 260–267.
[22] Beibei Li, Anindya Ghose, and Panagiotis G Ipeirotis. 2011. Towards a theory model for product search.
     In Proceedings of the 20th international conference on World wide web. ACM, 327–336.
[23] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive
     session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge
     Management. ACM, 1419–1428.
[24] Soon Chong Johnson Lim, Ying Liu, and Wing Bun Lee. 2010. Multi-facet product information search and
     retrieval using semantically annotated product family ontology. Information Processing & Management 46,
     4 (2010), 479–493.
[25] Bo Long, Jiang Bian, Anlei Dong, and Yi Chang. 2012. Enhancing product search by best-selling
     prediction in e-commerce. In Proceedings of the 21st ACM CIKM Conference. ACM, 2479–2482.
[26] Yuanhua Lv and ChengXiang Zhai. 2009. Adaptive relevance feedback in information retrieval. In
     Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 255–264.
[27] Nish Parikh and Neel Sundaresan. 2011. Beyond relevance in marketplace search. In Proceedings of the
     20th ACM CIKM Conference. ACM, 2109–2112.
[28] Jay M Ponte and W Bruce Croft. 1998. A language modeling approach to information retrieval. In
     Proceedings of the 21st annual international ACM SIGIR conference. ACM, 275–281.
[29] Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM
     Comput. Surv. (2018).
[30] Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing
     session-based recommendations with hierarchical recurrent neural networks. In Proceedings of the
     Eleventh ACM Conference on Recommender Systems. ACM, 130–137.
[31] Navid Rekabsaz, Mihai Lupu, Allan Hanbury, and Guido Zuccon. 2016. Generalizing translation models in
     the probabilistic relevance framework. In Proceedings of the 25th ACM CIKM conference. ACM, 711–720.
[32] Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. Fast context-aware
     recommendations with factorization machines. In Proceedings of the 34th international ACM SIGIR
     Conference. ACM, 635–644.
[33] Joseph John Rocchio. 1971. Relevance feedback in information retrieval. The Smart retrieval
     system-experiments in automatic document processing (1971).
[34] Khalid Saleh. 2018. Global Online Retail Spending - Statistics and Trends.
     https://www.invespcro.com/blog/global-online-retail-spending-statistics-and-trends/
[35] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing.
     Commun. ACM 18, 11 (1975), 613–620.
[36] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. 2008. Softrank: optimizing non-smooth
     rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM,
     77–86.
[37] Bartłomiej Twardowski. 2016. Modelling contextual information in session-aware recommender systems with
     neural networks. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 273–276.
[38] Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2016. Learning latent vector spaces for
     product search. In Proceedings of the 25th ACM CIKM Conference. ACM, 165–174.
[39] Damir Vandic, Flavius Frasincar, and Uzay Kaymak. 2013. Facet selection algorithms for web product
     search. In Proceedings of the 22nd ACM CIKM Conference. ACM, 2327–2332.
[40] Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position bias
     estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM WSDM
     Conference. ACM, 610–618.
[41] Chen Wu and Ming Yan. 2017. Session-aware information embedding for e-commerce product recommendation.
     In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2379–2382.
[42] Liang Wu, Diane Hu, Liangjie Hong, and Huan Liu. 2018. Turning Clicks into Purchases: Revenue
     Optimization for Product Search in E-Commerce. (2018).
[43] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to
     rank: theory and algorithm. In Proceedings of the 25th international conference on Machine learning.
     ACM, 1192–1199.
[44] Jun Yu, Sunil Mohan, Duangmanee Pew Putthividhya, and Weng-Keen Wong. 2014. Latent dirichlet allocation
     based diversified retrieval for e-commerce search. In Proceedings of the 7th ACM WSDM Conference. ACM,
     463–472.
[45] Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a
     dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning.
     ACM, 1201–1208.
[46] Hamed Zamani and W Bruce Croft. 2016. Embedding-based query language models. In Proceedings of the 2016
     ACM ICTIR conference. ACM, 147–156.