Personalized Query Auto-Completion Through a Lightweight Representation of the User Context

Manojkumar Rangasamy Kannadasan, eBay Inc., San Jose, CA (mkannadasan@ebay.com)
Grigor Aslanyan, eBay Inc., San Jose, CA (gaslanyan@ebay.com)

ABSTRACT

Query Auto-Completion (QAC) is a widely used feature in many domains, including web and eCommerce search. This feature suggests full queries based on a prefix of a few characters typed by the user. QAC has been extensively studied in the literature in recent years, and it has been consistently shown that adding personalization features can significantly improve the performance of the QAC model. In this work we propose a novel method for personalized QAC that uses lightweight embeddings learnt through fastText [2, 14]. We construct an embedding for the user context queries, which are the last few queries issued by the user. We also use the same model to get the embeddings for the candidate queries to be ranked. We introduce ranking features that compute the distance between the candidate queries and the context queries in the embedding space. These features are then combined with other commonly used QAC ranking features to learn a ranking model using the state-of-the-art LambdaMART ranker [3]. We apply our method to a large eCommerce search engine (eBay) and show that the ranker with our proposed features significantly outperforms the baselines on all of the offline metrics measured, which include Mean Reciprocal Rank (MRR), Success Rate (SR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). Our baselines include the Most Popular Completion (MPC) model, which is a commonly used baseline in the QAC literature, as well as a ranking model without our proposed features. The ranking model with the proposed features results in a 20-30% improvement over the MPC model on all metrics. We obtain up to a 5% improvement over the baseline ranking model for all sessions, which goes up to about 10% when we restrict to sessions that contain user context. Moreover, our proposed features also significantly outperform text based personalization features studied in the literature before, and adding text based features on top of our proposed embedding based features results only in minor improvements.

ACM Reference Format:
Manojkumar Rangasamy Kannadasan and Grigor Aslanyan. 2019. Personalized Query Auto-Completion Through a Lightweight Representation of the User Context. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 8 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION

Query Auto-Completion (QAC) is a common feature of most modern search engines. It refers to the task of suggesting full queries after the user has typed a prefix of a few characters [6]. QAC can significantly reduce the number of characters typed [26], which is especially helpful to users on mobile devices. QAC can also help reduce the number of spelling errors in queries, and in cases when the user is not really sure how to formulate the query, QAC can be of great help. It has been shown that QAC can greatly improve user satisfaction [24]. Moreover, it can reduce the overall search duration, resulting in a lower load on the search engine [1]. Currently QAC has a wide range of applications, including search (web, eCommerce, email), databases, operating systems, and development environments [6].

Query Auto-Completion has been extensively studied in the literature in recent years. A detailed survey of the work prior to 2016 can be found in [6], which broadly classifies QAC approaches into two main categories - heuristic models and learning based models. Heuristic models use a few different sources for each possible query completion and compute a final score. These approaches do not use a large variety of features. In contrast, learning based approaches treat the problem as a ranking problem and rely on the extensive research in the learning-to-rank (LTR) field [15]. Learning based approaches rely on a large number of features and generally outperform heuristic models [6]. The features for both of these approaches can be broadly characterized into three groups - time-sensitive, context-based, and demography-based. Time-sensitive features model the query popularity and its changes over time, such as weekly patterns. Demographic features, such as gender and age, are typically limited and may be hard to access. In contrast, context based features rely on the user's previous search activity (short term, as well as long term) to suggest new query completions. This data is essentially free, making context-based features an attractive approach for personalizing QAC. Context-based features for LTR models will be the focus of this work.

In this paper we propose a novel method to learn query embeddings [2, 14] using a simple and scalable technique and use them to measure similarity between user context queries and candidate queries to personalize QAC. We learn the embeddings in a semi-supervised fashion using fastText by taking all the queries in a session as a single document. We design features that measure the similarity between the context and candidate queries, which are then incorporated into a learning-to-rank model. We use the state-of-the-art LambdaMART model [3] for ranking candidate queries for QAC.

Even though embedding based features have been studied for QAC in the literature before, as discussed in Section 2, our work makes the following novel contributions:

• A lightweight and scalable way to represent the user's context in the embedding space.
• Simple and robust ranking features based on such embeddings for QAC, which can be used in any heuristic or LTR model.
• Training and evaluation of a pairwise LambdaMART ranker for QAC using the proposed features. We show that our proposed features result in significant improvements in offline metrics compared with state-of-the-art baselines.
• A comparison and combination of text based features with embedding based features, showing that embedding based features result in larger improvements in offline metrics.

The rest of the paper is organized as follows. Section 2 discusses some of the related work in the literature. In Section 3 we describe our methodology. In Section 4 we describe our datasets and experiments. We summarize our work and discuss possible future research in Section 5.

2 RELATED WORK

The user's previously entered text is used for personalized QAC by Bar-Yossef and Kraus [1]. Their method, called NearestCompletion, computes the similarity of query completion candidates to the context queries (the user's previously entered queries), using term-weighted vectors for queries and contexts and applying cosine similarity. This method results in significant improvements in MRR. In addition, the authors proposed the MPC approach, which is based on the overall popularity of the queries matching the given prefix. MPC is a straightforward heuristic approach with good performance and is typically used as a baseline for more complex approaches. We use MPC as one of the baselines in this work as well.

The user's long term search history is used in [5] to selectively personalize QAC, where a trade-off between query popularity and search context is used to encode the ranking signal. Jiang et al. [13] study user reformulation behavior using textual features. Shokouhi [22] studies QAC personalization using a combination of context based textual features and demographic features, and shows that the user's long term search history and location are the most effective for QAC personalization. Su et al. [25] propose the framework EXOS for personalizing QAC, which also relies on textual features (token level). Jiang et al. [8] use history-level, session-level, and query-level textual features for personalized QAC. Cai et al. [4] study features on the observed and predicted search popularity, both for longer and shorter time periods, for learning personalized QAC. Diversification of personalized query suggestions is studied in [7].

Recurrent Neural Networks (RNNs) [11] have also been studied for QAC. Three RNN models - session-based, personalized, and attention-based - have been proposed in [12]. Fiorini and Lu [9] use user history based features as well as time features as input to an RNN model. Park and Chiba [19] use RNNs to specifically improve QAC performance on previously unseen queries. An adaptable language model is proposed in [10] for personalized QAC.

Word embeddings, such as word2vec [17], GloVe [20], and fastText [2, 14], have become increasingly popular in recent years for a large variety of tasks, including computing similarity between words. Embeddings have also been studied in the context of QAC. Specifically, Mitra [18] studies a Convolutional Latent Semantic Model for distributed representations of queries. Query similarity based on word2vec embeddings is studied in [21], where the features are combined with the MPC model. In Section 3 we explain our approach of learning embeddings for the user context in a simple and scalable fashion, and the usage of these embeddings and text based features to personalize QAC.

3 PERSONALIZED QUERY AUTO-COMPLETION WITH REFORMULATION

A search session is defined as a sequence of queries ⟨q1, q2, . . . , qT⟩ issued by a user within a particular time frame. A query consists of a set of tokens. If the user types a prefix pT and ends up issuing the query qT, then the user's context is the sequence of previous queries issued up to time step T, ⟨q1, q2, . . . , qT−1⟩. For example, if the queries issued in a session are ⟨nike, adidas, shoes⟩ and the prefix used to issue the query shoes is sh, then ⟨q1, q2, . . . , qT−1⟩ = ⟨nike, adidas⟩, pT = sh, and qT = shoes. Given a prefix pT, the user context ⟨q1, q2, . . . , qT−1⟩, and candidate queries QT matching the prefix, our goal is to provide a ranking of the queries q ∈ QT such that we have the best ranking for qT. The ranking score can be considered as P(QT | ⟨q1, q2, . . . , qT−1⟩). This can be solved using the learning-to-rank framework.

The influence of user context features on prediction accuracy has already been studied in [13, 18]. In this paper we propose a simple and scalable way to understand the user's context using query embeddings, and use multiple distance related features to compare the user's context to the candidate queries QT.

3.1 Learning Query Representation for Reformulation

Continuous text representations and embeddings for a text can be learnt through both supervised [18] and semi-supervised approaches [2, 14, 17, 20]. In this paper we learn the query representations via semi-supervised techniques. We use the publicly available fastText library [2, 14] for efficient representation learning of the query embeddings. The fastText model learns subword representations while taking into account morphology. The model considers subword units and represents a word by the sum of its character n-gram vectors. The word iphone with character n-grams (n = 3) is represented as:

    “⟨ip”, “iph”, “pho”, “hon”, “one”, “ne⟩”
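This n-gram extraction is easy to sketch in code. Note that fastText itself extracts n-grams over a range of lengths and also keeps the full word as an additional feature; the example above, and the sketch below, use n = 3 only:

    def char_ngrams(word, n=3):
        """Character n-grams of a word, including the boundary symbols
        ("<" and ">") that fastText adds before extraction."""
        padded = "<" + word + ">"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("iphone"))
    # ['<ip', 'iph', 'pho', 'hon', 'one', 'ne>']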
Some of the previous work learns a distinct vector representation for each word, thereby ignoring the internal structure of the words [17]. If we have a dictionary of n-grams of size G, then the set of n-grams in a word w is denoted as G_w ⊂ {1, 2, . . . , G}. We use the skipgram model, where the goal is to independently predict the presence or absence of the context words. The problem is framed as a binary classification task. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary, as described in [2, 14].
For a context word wc we use the binary logistic loss with the negative log likelihood \ell : x \mapsto \log(1 + e^{-x}). The objective function is defined as:

    \sum_{t=1}^{T} \Big[ \sum_{c \in C_t} \ell(s(w_t, w_c)) + \sum_{n \in N_{t,c}} \ell(-s(w_t, n)) \Big]    (1)

where wt is the target word, wc is a context word, Ct is the set of context words at position t, and Nt,c is a set of negative examples sampled from the vocabulary. The scoring function s(w, c) is defined as

    s(w, c) = \sum_{g \in G_w} z_g^T v_c    (2)

where zg is the vector representation of each n-gram g of the word w and vc is the vector representation of the context. Our goal is to learn scalable and lightweight embeddings for queries based on their reformulation behavior across different users. Here we represent all the queries issued in a session ⟨q1, q2, . . . , qT⟩ as one document in the training data. By learning the subword representations from the probability of words appearing in the context of the other words present in the queries of the same session, we provide a simple and scalable way to encode the query reformulation behavior in the embedding space. We are also able to learn the syntactic and semantic similarity between the words of the vocabulary.

We learn query representations by mining 3 days of eBay's search logs to get the query reformulations issued by the users. The query log is segmented into sessions with a 30 minute session boundary, as followed in [13]. Based on this definition of a session boundary, we collect the different queries issued by the users within each session. We remove special characters in the queries and convert them to lowercase. We also filter out sessions with only one query. If the user issues q1, q2, . . . , qT in a session, then all of these queries ⟨q1, q2, . . . , qT⟩ together, separated by whitespace, are considered as one user context. For example, if a session contains the 2 queries “iphone” and “iphone xs case”, then the single document for training will be represented as “iphone iphone xs case”.
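As an illustration, this session-to-document preprocessing can be sketched as follows. The paper does not spell out the exact special-character filter, so the regular expression below is an assumption:

    import re

    def session_to_document(queries):
        """Join all queries of one session into a single lowercased,
        whitespace-separated training document."""
        cleaned = [re.sub(r"[^a-z0-9\s]", " ", q.lower()) for q in queries]
        return " ".join(" ".join(q.split()) for q in cleaned)

    print(session_to_document(["iphone", "iphone XS case"]))
    # iphone iphone xs case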
For training the unsupervised character n-gram representations we consider each user context as one document sample. We tune the model hyperparameters by fixing the dimension of the subword representations to 50, the minimum occurrence of words in the vocabulary to 20, and hierarchical softmax as the loss function. The other hyperparameters of the model are tuned based on the Embedding_Features model described in Section 4.3. The number of unique words in the vocabulary used to train the model is 189,138.
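Using the fastText Python bindings, training with these fixed hyperparameters might look like the sketch below; the input path is a placeholder, and the remaining (tuned) hyperparameters are left at their defaults here:

    import fasttext

    # One user context (all queries of a session, whitespace-separated) per line.
    model = fasttext.train_unsupervised(
        "session_documents.txt",  # placeholder path
        model="skipgram",
        dim=50,        # dimension of the subword representations
        minCount=20,   # minimum occurrence of a word in the vocabulary
        loss="hs",     # hierarchical softmax
    )
    model.save_model("query_reformulation.bin")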
The user context ⟨q1, q2, . . . , qT⟩ is then converted to multiple vector representations. Similar vector representations are also created for all candidate target queries in the dataset.

3.2 User Context Features

In this section we propose different user context features based on the queries issued in the session. Vector representations are created both for the individual queries and for the entire context, taking all queries in the session. vC represents the user context vector and vqT represents the vector for one query at time step T. We develop four features based on the query representations learned in the previous section. We denote these features as Embedding Features. One embedding feature is based on all the queries in the context. Since the median number of searches in a session is approximately 3, we considered up to 3 queries previously issued by the user for generating the remaining embedding features. The Embedding Features are computed as a distance between 2 vectors using cosine similarity [23]; a sketch of the computation follows the list below.

• user_context_cosine: Cosine distance between the user context vector vC and the current target query vector vqT.
• prev_query1_cosine: Cosine distance between the query vector vqT−1 and the current target query vector vqT.
• prev_query2_cosine: Cosine distance between the query vector vqT−2 and the current target query vector vqT.
• prev_query3_cosine: Cosine distance between the query vector vqT−3 and the current target query vector vqT.
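A minimal sketch of these four features, assuming the fastText model trained above. get_sentence_vector averages the L2-normalized word vectors of a string; the cosine distance named in the features is one minus the cosine similarity computed here, and either form works as a ranking feature:

    import numpy as np
    import fasttext

    model = fasttext.load_model("query_reformulation.bin")

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v) / denom if denom else 0.0

    def embedding_features(context_queries, candidate):
        """context_queries: previous queries of the session, most recent last."""
        v_cand = model.get_sentence_vector(candidate)
        # user_context_cosine: the whole context against the candidate query.
        v_ctx = model.get_sentence_vector(" ".join(context_queries))
        feats = {"user_context_cosine": cosine(v_ctx, v_cand)}
        # prev_queryK_cosine: the K-th most recent query against the candidate.
        for k, query in enumerate(reversed(context_queries[-3:]), start=1):
            feats["prev_query%d_cosine" % k] = cosine(
                model.get_sentence_vector(query), v_cand)
        return feats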
In addition to the Embedding_Features, we also developed various Textual_Features comparing the user context and the current target query to be ranked, as defined in Table 1. We categorize them into three groups, namely Token, Query, and Session. There is a large overlap between the features defined in Table 1 and the features defined in [13, 22]. A query qT can contain multiple tokens, and users may add or remove tokens between 2 consecutive queries in a session. Based on analyzing the user sessions, between queries qT and qT−1 tokens can be added and/or removed. This token reformulation behavior can be encoded via 16 features, described in Table 1, representing the effectiveness of the tokens in the context C and the target query qT. Similarly, the Query level features represent how users reformulate the queries in a session through repetition and textual similarity between qT and qT−1. The Session level features represent how users reformulate their queries, without taking into account the relationship to the target query qT.

Table 1: Textual features based on the user context, defined across 3 categories. A subset of the features is highlighted in the table; the rest of the features are derived from them.

Category              Examples
Token (16 features)   ratio of new terms; ratio of used terms; average terms in previous queries; median terms in previous queries; trend of number of terms; unique terms added from last query; unique terms retained from last query; unique terms removed from last query; unique terms added from all previous queries
Query (7 features)    occurrence of terms in previous queries; frequency in previous queries; character n-gram similarity with previous queries; token n-gram similarity with previous queries
Session (3 features)  position in session; unique terms in session; common terms in session

4 EXPERIMENTS

4.1 Dataset and Experiment Setting
[Figure 1 (diagram): a Prefix is expanded by MPC into the Top N Candidates; the Context is encoded by fastText into Embeddings; Baseline_Features, Embedding_Features, and Textual_Features feed the Ranking Model, which produces the Final Ranked List.]

Figure 1: The end to end architecture of the Textual_Embedding model. The architecture for the other models is similar, except that some of the features will be excluded.


We conduct our ranking experiments on a large scale search query dataset sampled from the logs of the eBay search engine. The query log is segmented into sessions with a 30 minute session boundary, as described in [13]. For the ranking experiments we do not filter out sessions containing a single query. This is to make sure that we have a single learning-to-rank model powering sessions both with and without user context. The dataset obtained based on the above logic results in about 53% of the sessions having user context, which gives us good coverage of user context features to learn a global model.

The labeling strategies used in [13, 22] assume there is at least one query in the context, remove target queries qT not matching the prefix pT, and set the first character of the prefix pT based on qT. In our method we use a slightly different labeling strategy for building the personalized QAC. We sample a set of impressions from the search logs. For each issued query qT, we capture the prefix pT that led to the search page; this pair is marked as a positive label. For the same prefix pT we identify the rest of the candidate queries QT \ qT that were shown to the user and did not lead to the impression; they are marked as negative labels. We also retain sessions without user context.

The above training data now consists of labeled prefix-query pairs. To learn the performance of the lightweight query representation of reformulations, we use LambdaMART [3], a boosted tree version of LambdaRank, as the learning-to-rank algorithm. LambdaMART is considered one of the state-of-the-art learning-to-rank algorithms and won the Yahoo! Learning to Rank Challenge (2010) [13]. We use a pairwise ranking model and fine tune our parameters based on the Baseline_Ranker defined in Section 4.2. We fix these parameters to train and evaluate our models across all of our experiments.
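As an illustration of this setup, a pairwise gradient-boosted ranker can be trained with LightGBM's lambdarank objective, a widely used implementation of the LambdaMART family; the paper does not name its implementation, and the file names and hyperparameters below are placeholders:

    import numpy as np
    import lightgbm as lgb

    # X: one feature row per (prefix, candidate query) pair; y: 1 for the
    # clicked query, 0 otherwise; group: candidates shown per prefix impression.
    X = np.load("X.npy")          # placeholder feature matrix
    y = np.load("y.npy")          # placeholder labels
    group = np.load("group.npy")  # placeholder group sizes

    ranker = lgb.LGBMRanker(objective="lambdarank",
                            n_estimators=500,    # placeholder, tuned in practice
                            learning_rate=0.05)  # placeholder, tuned in practice
    ranker.fit(X, y, group=group)

    # At serving time, the top N MPC candidates for a prefix are re-ranked
    # by descending score.
    scores = ranker.predict(X[: int(group[0])])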
4.2 Baseline System

To evaluate our new personalized QAC ranker we establish two baseline ranking algorithms.

• MPC: The Most Popular Completion model [1] provides users with candidate queries ranked by the popularity of the query, where the popularity of a query is defined as the number of times it has been issued by all users in the past (a sketch follows this list).
• Baseline_Ranker: The baseline ranker is a learning-to-rank model built using the same methodology for creating and labeling the dataset. The features used in building the model are prefix features, target query features, and prefix-query features. We refer to these features as Baseline_Features. The hyperparameters used for the LambdaMART model are exactly the same as in all the experiments for the personalized ranker.
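A minimal sketch of MPC; a production system would serve completions from a precomputed index or trie, so the linear scan here is purely illustrative:

    from collections import Counter

    class MostPopularCompletion:
        """Rank completions for a prefix by overall past query frequency."""

        def __init__(self, past_queries):
            self.counts = Counter(past_queries)

        def top_n(self, prefix, n=10):
            matches = [q for q in self.counts if q.startswith(prefix)]
            return sorted(matches, key=self.counts.__getitem__, reverse=True)[:n]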
4.3 Personalized Ranking Models

We have developed three personalized ranking models with different combinations of user context features, as described in Section 3.2. These ranking models are compared against the two baseline rankers by experimentally evaluating the improvements on eBay datasets. The results are presented in Section 4.5.

• Textual: Ranker with Baseline_Features and Textual_Features representing the user context.
• Embedding: Ranker with Baseline_Features, as well as Embedding_Features representing the user context.
• Textual_Embedding: Ranker with Baseline_Features, Textual_Features, and Embedding_Features representing the user context.

For all of the ranking models we first get the top N candidate queries from the MPC model and re-rank them with the ranking model. We show the full end to end architecture for the Textual_Embedding model in Figure 1. The architecture for the other models is similar, except that they only include a subset of the features.

4.4 Evaluation Metrics

The quality of our predictions can be measured using the following metrics (a sketch of their computation follows the list).

• Mean Reciprocal Rank (MRR) - the average of the reciprocal ranks of the target queries in the QAC results. Given a test dataset S, the MRR for an algorithm A is computed as

    MRR(A) = \frac{1}{|S|} \sum_{(C_T, q_T) \in S} \frac{1}{hitRank(A, C_T, q_T)}    (3)

where CT represents the user context at time step T, qT represents the relevant target query, and the function hitRank computes the rank of the relevant query in the order created by algorithm A. The relevant query refers to the clicked query in QAC.
• Success Rate at Top-K (SR@K) - the average percentage of relevant queries ranked at or above position K in the ranked list from QAC. In this paper we consider only SR@1, SR@2, and SR@3.
• Normalized Discounted Cumulative Gain (nDCG) - The Discounted Cumulative Gain (DCG) represents the usefulness, or gain, of a query based on its position in the ranked list from QAC. DCG penalizes the relevance of a query logarithmically based on its position in the ranked list:

    DCG_q = \sum_{i=1}^{P} \frac{2^{rel_i} - 1}{\log_2(i + 1)}    (4)

where i denotes the rank and rel_i is the relevance of the query at rank i. For our purposes rel_i takes values 0 or 1. nDCG is defined as normalized DCG, namely the ratio of DCG to IDCG (the ideal DCG):

    nDCG_q = \frac{DCG_q}{IDCG_q}    (5)

where IDCG_q is the maximum possible value of DCG_q for any ranker. The overall performance of a ranking algorithm A is measured by the average nDCG across all queries in the dataset:

    nDCG = \frac{1}{Q} \sum_{q=1}^{Q} nDCG_q    (6)

• Mean Average Precision (MAP) - the mean of the average precision scores for each query across the entire set of queries:

    MAP = \frac{1}{Q} \sum_{q=1}^{Q} AvgPrecision(q)    (7)
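Since each QAC impression has exactly one relevant (clicked) query, these metrics reduce to simple functions of the clicked query's rank. In this single-relevant-item setting average precision equals the reciprocal rank, which is consistent with the matching MRR and MAP rows in Table 2. A sketch:

    import math

    def hit_rank(ranked, target):
        """1-based rank of the clicked query, or None if it is absent."""
        return ranked.index(target) + 1 if target in ranked else None

    def mrr(sessions):
        """sessions: list of (ranked_candidates, clicked_query) pairs."""
        return sum(1.0 / r for ranked, q in sessions
                   if (r := hit_rank(ranked, q))) / len(sessions)

    def success_rate_at_k(sessions, k):
        return sum(1 for ranked, q in sessions
                   if (r := hit_rank(ranked, q)) and r <= k) / len(sessions)

    def ndcg(ranked, target):
        """With binary relevance and one relevant query, DCG = 1/log2(rank + 1)
        and the ideal DCG is 1, so nDCG collapses to 1/log2(rank + 1)."""
        r = hit_rank(ranked, target)
        return 1.0 / math.log2(r + 1) if r else 0.0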
                                                                         in Table 2, where we have normalized the metrics with respect to
4.5     Results                                                          the MPC model. We show results for the whole test dataset, which
We perform our evaluation in two phases. Firstly, we evaluate the        includes queries with and without user context, as well as the
quality of our query representations. Secondly, we evaluate the user     dataset with user context only. To assess statistical significance
context embeddings against the user context based textual features       we have performed 1,000 bootstrap samples over the test queries
using a Learning to Rank framework [3].                                  and computed 95% confidence intervals using those samples. The
   To evaluate our query representations we sample a few words           metrics, together with the 95% confidence intervals, are plotted in
across different verticals like fashion, electronics, home and garden,   Figure 3, where the plots on the left are for the whole dataset and
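The projection behind Figure 2 can be reproduced along the following lines; the word sample and the t-SNE settings are illustrative assumptions:

    import fasttext
    import numpy as np
    from sklearn.manifold import TSNE

    model = fasttext.load_model("query_reformulation.bin")  # placeholder path
    words = ["samsung", "galaxy", "tv", "adidas", "nike", "iphone"]  # sample
    vectors = np.array([model.get_word_vector(w) for w in words])

    # Project the 50-dimensional subword vectors down to 3 dimensions.
    coords = TSNE(n_components=3, perplexity=2.0,
                  random_state=0).fit_transform(vectors)
    for word, (x, y, z) in zip(words, coords):
        print(f"{word}: ({x:.2f}, {y:.2f}, {z:.2f})")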
Offline metrics MRR, SR@k, nDCG, MAP, and MAP@k are shown in Table 2, where we have normalized the metrics with respect to the MPC model. We show results for the whole test dataset, which includes queries with and without user context, as well as for the dataset with user context only. To assess statistical significance we have performed 1,000 bootstrap samples over the test queries and computed 95% confidence intervals using those samples. The metrics, together with the 95% confidence intervals, are plotted in Figure 3, where the plots on the left are for the whole dataset and the plots on the right are for the context only dataset. We have only plotted one variant of each metric, since the others are very similar.

Table 2: Offline evaluation metrics. We show the ratio of the metrics for four of the ranking models to the MPC model on both test datasets - all data, and data filtered for full coverage of user context.

Dataset             Measure   Baseline_Ranker   Textual_Features   Embedding_Features   Textual_Embedding_Features
Whole               MRR       1.26              1.30               1.31                 1.31
                    nDCG      1.30              1.32               1.33                 1.33
                    SR@1      1.24              1.29               1.31                 1.31
                    SR@3      1.23              1.26               1.27                 1.27
                    MAP       1.26              1.30               1.31                 1.31
                    MAP@1     1.24              1.29               1.31                 1.31
                    MAP@3     1.23              1.27               1.29                 1.29
User Context Only   MRR       1.27              1.34               1.37                 1.37
                    nDCG      1.30              1.35               1.37                 1.37
                    SR@1      1.26              1.37               1.42                 1.41
                    SR@3      1.23              1.30               1.33                 1.34
                    MAP       1.27              1.34               1.37                 1.37
                    MAP@1     1.26              1.37               1.42                 1.41
                    MAP@3     1.24              1.33               1.37                 1.37

[Figure 3 (bar plots with error bars): MRR, MAP@3, SR@3, and NDCG ratios over MPC for the Baseline, Embedding, Textual, and Textual_Embedding models.]

Figure 3: Metrics ratio to MPC for the whole dataset (left) and the user context only dataset (right). Error bars are computed using 1,000 bootstrap samples of the test queries.
Our results show that all of the LTR models result in 20-30% improvements over the MPC model. All three models with contextualization features outperform the Baseline_Ranker on all the metrics, statistically significantly. For example, for MAP@3, Embedding outperforms Baseline_Ranker by 5% on the whole dataset and by 10% on the context only dataset. The Embedding model also outperforms Textual, with an improvement of 1.5% for MAP@3 on the whole dataset and 3% on the context only dataset. The Textual_Embedding model performs very similarly to Embedding, which implies that the embedding based features proposed in this work capture all of the information in the textual features (from the perspective of the ranking model) and provide additional significant improvements.
4.6 Feature Analysis

In this section we analyze the user context embedding features through the partial dependence plots shown in Figure 4. The partial dependence plot for the user_context_cosine feature clearly indicates that the cosine similarity between the user context ⟨q1, q2, . . . , qT−1⟩ and the target query qT has a linear relationship with the target. The embedding features based on the individual time steps (prev_query1_cosine, prev_query2_cosine, prev_query3_cosine) also show a clear monotonic relationship.

[Figure 4: Partial dependence plots for user context embedding features learnt from query reformulation.]

5 SUMMARY AND FUTURE WORK

In this work we have presented a simple and scalable approach to learn lightweight vector representations (embeddings) for the query reformulations in a user session. These query representations exhibit both syntactic and semantic similarities between different queries and enable us to model the user context seamlessly. We have leveraged these lightweight embeddings to represent the user context in our personalized ranker for Query Auto-Completion. Different combinations of user context features are created, including textual and embedding features on top of our baseline ranker. We have applied these personalization features to a large scale commercial search engine (eBay) and experimentally verified significant improvements on all the offline ranking metrics. We have evaluated our personalized ranker on the entire dataset and on a dataset restricted to sessions containing user context, and we see the biggest improvements on the user context filtered dataset. Furthermore, we show that the ranking model with embedding features outperforms the model with textual features, whereas the model with combined textual and embedding features results in only minor improvements on top of the model with embedding features alone. The minor improvements from the textual features are likely due to the session level features, which are agnostic of the queries in the context. As future work, we would like to explore different representation learning techniques, like sent2vec, doc2vec, and sequence models, to understand the user context better and incorporate them in the personalized ranker. We also plan to explore the trade-offs between short term and long term user contexts in QAC. Lastly, the user context vectors provide a simple and scalable way to understand user sessions, which can be utilized for personalizing different parts of search and recommender systems.
REFERENCES
[1] Ziv Bar-Yossef and Naama Kraus. 2011. Context-sensitive Query Auto-completion. In Proceedings of the 20th International Conference on World Wide Web (WWW '11). ACM, New York, NY, USA, 107-116. https://doi.org/10.1145/1963405.1963424
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135-146.
[3] Christopher J. C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An overview. Learning 11, 23-581 (2010), 81.
[4] Fei Cai, Wanyu Chen, and Xinliang Ou. 2017. Learning search popularity for personalized query completion in information retrieval. Journal of Intelligent & Fuzzy Systems 33, 4 (2017), 2427-2435.
[5] Fei Cai and Maarten de Rijke. 2016. Selectively Personalizing Query Auto-Completion. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). ACM, New York, NY, USA, 993-996. https://doi.org/10.1145/2911451.2914686
[6] Fei Cai and Maarten de Rijke. 2016. A Survey of Query Auto Completion in Information Retrieval. Foundations and Trends® in Information Retrieval 10, 4 (2016), 273-363. https://doi.org/10.1561/1500000055
[7] Wanyu Chen, Fei Cai, Honghui Chen, and Maarten de Rijke. 2017. Personalized query suggestion diversification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 817-820.
[8] Jiang Danyang, Fei Cai, and Honghui Chen. 2018. Personalizing Query Auto-Completion for Multi-Session Tasks. 203-207. https://doi.org/10.1109/CCET.2018.8542201
[9] Nicolas Fiorini and Zhiyong Lu. 2018. Personalized neural language models for real-world query auto completion. arXiv preprint arXiv:1804.06439 (2018).
[10] Aaron Jaech and Mari Ostendorf. 2018. Personalized language model for query auto-completion. arXiv preprint arXiv:1804.09661 (2018).
[11] L. C. Jain and L. R. Medsker. 1999. Recurrent Neural Networks: Design and Applications (1st ed.). CRC Press, Inc., Boca Raton, FL, USA.
[12] D. Jiang, W. Chen, F. Cai, and H. Chen. 2018. Neural Attentive Personalization Model for Query Auto-Completion. In 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC). 725-730. https://doi.org/10.1109/IAEAC.2018.8577694
[13] Jyun-Yu Jiang, Yen-Yu Ke, Pao-Yu Chien, and Pu-Jen Cheng. 2014. Learning User Reformulation Behavior for Query Auto-completion. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '14). ACM, New York, NY, USA, 445-454. https://doi.org/10.1145/2600428.2609614
[14] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016).
[15] Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225-331. https://doi.org/10.1561/1500000016
[16] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579-2605.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS '13). Curran Associates Inc., USA, 3111-3119. http://dl.acm.org/citation.cfm?id=2999792.2999959
[18] Bhaskar Mitra. 2015. Exploring session context using distributed representations of queries and reformulations. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 3-12.
[19] Dae Hoon Park and Rikio Chiba. 2017. A neural language model for query auto-completion. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1189-1192.
[20] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532-1543. http://www.aclweb.org/anthology/D14-1162
[21] Taihua Shao, Honghui Chen, and Wanyu Chen. 2018. Query Auto-Completion Based on Word2vec Semantic Similarity. In Journal of Physics: Conference Series, Vol. 1004. IOP Publishing, 012018.
[22] Milad Shokouhi. 2013. Learning to Personalize Query Auto-completion. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '13). ACM, New York, NY, USA, 103-112. https://doi.org/10.1145/2484028.2484076
[23] Amit Singhal. 2001. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin 24, 4 (2001), 35-43.
[24] Yang Song, Dengyong Zhou, and Li-wei He. 2011. Post-ranking Query Suggestion by Diversifying Search Results. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 815-824. https://doi.org/10.1145/2009916.2010025
[25] F. Su, M. Somaiya, S. Mishra, and R. Mukherjee. 2015. EXOS: Expansion on session for enhancing effectiveness of query auto-completion. In 2015 IEEE International Conference on Big Data (Big Data). 1154-1163. https://doi.org/10.1109/BigData.2015.7363869
[26] Aston Zhang, Amit Goyal, Weize Kong, Hongbo Deng, Anlei Dong, Yi Chang, Carl A. Gunter, and Jiawei Han. 2015. adaQAC: Adaptive Query Auto-Completion via Implicit Negative Feedback. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 143-152. https://doi.org/10.1145/2766462.2767697