=Paper=
{{Paper
|id=Vol-2319/paper24
|storemode=property
|title=Realtime Query Completion via Deep Language Models
|pdfUrl=https://ceur-ws.org/Vol-2319/paper24.pdf
|volume=Vol-2319
|authors=Po-Wei Wang,Huan Zhang,Vijai Mohan,Inderjit S. Dhillon,J. Zico Kolter
|dblpUrl=https://dblp.org/rec/conf/sigir/WangZMDK18
}}
==Realtime Query Completion via Deep Language Models==
<pdf width="1500px">https://ceur-ws.org/Vol-2319/paper24.pdf</pdf>
<pre>
                  Realtime query completion via deep language models
                      Po-Wei Wang∗                                                        Huan Zhang∗                                    Vijai Mohan
                Machine Learning Dept                                   Electrical and Computer Engineering                                Amazon
             Carnegie Mellon University                                               UC Davis                                   Palo Alto, CA, United States
             Pittsburgh, PA, United States                                     Davis, CA, United States                             vijaim@amazon.com
                 poweiw@cs.cmu.edu                                             ecezhang@ucdavis.edu

                                                   Inderjit S. Dhillon∗                                          J. Zico Kolter
                                                 Amazon & UT Austin                                       School of Computer Science
                                              Palo Alto, CA, United States                                Carnegie Mellon University
                                                   isd@amazon.com                                         Pittsburgh, PA, United States
                                                                                                               zkolter@cs.cmu.edu
 ABSTRACT                                                                                             1    INTRODUCTION
Search engine users nowadays heavily depend on query completion                                       Search completion is the problem of taking the prefix of a search
and correction to shape their queries. Typically, the completion is                                   query from a user and generating several candidate completions.
done by database lookup which does not understand the context                                         This problem has enormous potential utility and monetary value
and cannot generalize to prefixes not in the database. In this paper,                                 to any search provider: the more accurately an engine can find the
we propose to use unsupervised deep language models to complete                                       desired completions for a user (or indeed, potentially steer the user
and correct the queries given an arbitrary prefix. We address two                                     towards high-value completions), the more quickly it can lead the
main challenges that renders this method practical for large-scale                                    user to their desired goal.
deployment: 1) we propose a modified beam search process which                                            This paper proposes a realtime search completion architecture
integrates with a completion distance based error correction model,                                   based upon deep character-level language models. The basic idea
combining the error correction process (as a potential function)                                      is that instead of looking up possible completions from a generic
together with the language model; and 2) we show how to efficiently                                   database, we perform search under a deep-network-based language
perform our modified beam search process on CPU to complete                                           model to find the most likely completions of the user’s current input.
the queries with error correction in real time, by exploiting the                                     This allows us to integrate the power of deep language models, that
greatly overlapped forward propagation process and conducting                                         have been shown to perform extremely well on complex language
amortized dynamic programming on the search tree, along with                                          modeling and prediction tasks, with the desired goal of finding a
both SIMD-level and thread level parallelism. We outperform the                                       good completion. Although this is a conceptually simple strategy
off-the-shelf Keras implementation by a factor of 50, thus allowing                                   (and one which has been considered before, as we highlight below
us to generate query suggestions in real time (generating top 16                                      in the literature survey), there are two key elements required to
completions within 16 ms). Experiments on two large scale datasets                                    make this of practical use for a search engine provider, which to-
from AOL and Amazon.com show that the method substantially                                            gether make up the primary technical contributions of the paper: 1)
increases hit rate over standard approaches, reduces the memory                                       The completion must be error correcting, able to handle small errors
footprint of database lookup based approach by over two orders of                                     in the user’s initial input and provide completions for the most
magnitude, and is capable of handling tail queries.                                                   likely “correct” input. We propose such an approach that combines
                                                                                                      a character-level language model with an edit-distance-based po-
 KEYWORDS                                                                                             tential function, combining the two using a tree-based beam search
 Query completion, query correction, deep learning, realtime                                          algorithm; 2) The completion must be realtime, able to produce high-
                                                                                                      quality potential completions in time that is not even perceivable
ACM Reference Format:
                                                                                                      to the user. We achieve this by developing an efficient tree-based
Po-Wei Wang, Huan Zhang, Vijai Mohan, Inderjit S. Dhillon, and J. Zico
                                                                                                      version of beam search and an amortized dynamic programming
Kolter. 2018. Realtime query completion via deep language models. In Pro-
ceedings of SIGIR Workshop On eCommerce (SIGIR eCom’18). ACM, New                                     algorithm for error correction based on completion distance (our
York, NY, USA, Article 4, 9 pages. https://doi.org/10.475/123_4                                       proposed editing distance variant) along the search tree, exploiting
                                                                                                      thread-level and SIMD-level CPU-based computation for a single
 ∗ Work performed while at A9.com, an Amazon subsidiary
                                                                                                      query, and through numerous optimizations to the implementation
                                                                                                      that we discuss below.
 Permission to make digital or hard copies of part or all of this work for personal or
Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
 classroom     use is G.
                       granted  withoutS.fee providedM.that copies  areLin,
                                                                         notA.made  or distributed
                                                                                                          We evaluate the method on the AOL search dataset, a dataset
In: J. Degenhardt,        Di Fabbrizio,   Kallumadi,     Kumar,  Y.-C.         Trotman, H. Zhao
(eds.): Proceedings
 for profit           of the SIGIR
             or commercial         2018 eCom
                               advantage  andworkshop,  12 July,
                                               that copies bear2018,   Ann Arbor,
                                                                 this notice  andMichigan,  USA,
                                                                                  the full citation   consisting of over 36 million total search queries, as well as on an
published  at http://ceur-ws.org
 on the first   page. Copyrights for third-party components of this work must be honored.             Amazon product search dataset containing over 100 million user
For all other uses, contact the owner/author(s).
SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA
                                                                                                      search queries. Our proposed method substantially outperforms
© 2018 Copyright held by the owner/author(s).                                                         highly optimized standard search completion algorithms in terms
ACM ISBN 123-4567-24-567/08/06.                                                                       of its hit rate (the benefit of the deep language model and the
https://doi.org/10.475/123_4
SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA            Po-Wei Wang, Huan Zhang, Vijai Mohan, Inderjit S. Dhillon, and J. Zico Kolter


error correction), while being fast enough to execute in real time            Heuristic models. Whitelaw et al. [22] proposed generating can-
for search engines. Our approach is also very memory efficient             didate sets that contain common errors for given prefixes, then
and reduces the memory usage of database lookup based query                searching these based upon the current query. Similarly, Martins
completion system by at least two orders of magnitude; in addition,        and Silva [13] use a ternary search tree to accelerate the search
we can handle tail queries, which are the queries that are rarely          within candidate sets for spelling correction in general. The ap-
seen and for which database lookup based approach cannot give              proaches are nice in that they are easily parallelizable at runtime,
any query completions. The experiments on AOL search dataset               but are relatively “brute force”, and cannot handle previously un-
and code are publicly available online 1 .                                 seen permutations.
                                                                              Learning-based Model. On the learning side, Duan and Hsu [6]
2 RELATED WORK
                                                                           train an n-gram Markov model combined with A* search to deter-
2.1 Background on search completion                                        mine candidate misspelling; this is similar to our approach except
Here we review existing approaches to search query completion              with a much richer language model replacing the simple n-gram
and error correction. Broadly speaking, two types of query comple-         model, which creates several challenges in the search procedure
tions are most relevant to our work, database lookup methods and           itself. Likewise, Xie et al. [23] use a similar character-level model
learning-based approaches.                                                 with attention, but do so in the context of error correcting an entire
                                                                           paragraph of text, and don’t focus on the same realtime aspects
   Database Lookup. One of the most intuitive ways to do query             that we do.
completion is to do a database lookup. That is, given a prefix, we can
fetch all the known queries matching the prefix and return the most        3     BACKGROUND ON QUERY COMPLETION
frequent candidates. This is called the “most popular completion”
                                                                          When a user types any prefix string s in the search engine, the query
(MPC) [1], which corresponds to the maximum likelihood estimator
                                                                          completion function will start to recommend the best r completions,
for P (completion | prefix). The database lookup can be efficiently
                                                                          each denoted ŝ, according to certain metrics. For example, one might
implemented by a trie [9]. For instance, it takes only 15µs to give
                                                                          want to maximize the probability that a recommendation is clicked.
16 suggestions for a query in our own trie-based implementation.
                                                                          The conditional probability can be formulated as
However, due to the long-tail nature [20] of the search queries,
many prefixes might not exist in the database; for example, in the                           P (ŝ | s) := P (completion | prefix),                     (1)
AOL search data, 28% of the queries are unique. An excellent survey        and the goal of query completion in the setting is to find the top
of these current “classical” approaches is given in Cai et al. [3].        r most probable strings ŝ which potentially also maximize some
   Learning-based. In addition to database lookup approaches, in           additional metric, such as the click-through rate.
recent years there have been a number of approaches that use                  Denote s 1:m as the first m characters in string s. We first discuss
learning-based methods for query completion. Sordoni et al. [19]           the query completion in a simplified setting, in which all comple-
use a translation model at the word level to output single-word            tions must contain the prefix exactly, that is: ŝ 1:m = s 1:m , and
search query suggestions, and also model consecutive sessions of                  P (ŝ 1:n | s 1:m ) = P (ŝm+1:n | s 1:m ) = P (ŝm+1:n | ŝ 1:m ),   (2)
the same user. Liu et al. [12] proposed a word-based method for code
completion, but focused solely on greedy stochastic sampling for           where n is the total length of a completion. Note that the probability
the prediction. Mitra and Craswell [14] also used neural networks          is defined in the sequence domain, which contains exponentially
combined with a database-based model to handle tail queries, but           many candidate strings. To simplify the model, we can apply the
focused on CNN approaches that just output the single most likely          conditional probability formula recursively and have
word-level completion. Shokouhi [18] used logistic regression to                                                     n−1
                                                                                                                     Y
learn a personalized query ranking model, specific to individual                           P (ŝm+1:n | ŝ 1:m ) =          P (ŝt +1 | ŝ 1:t ).       (3)
users. All these approaches are relevant but fairly orthogonal to our                                                t =m
own, as we focus here on character-level modeling, beam search,               This way, we only need to model P (ŝt +1 | ŝ 1:t ), that is, the proba-
and realtime completion. Finally, Park and Chiba [15] very recently        bility of the next character under the current prefix. This is precisely
published an approach similar to ours, which uses a character-level        a character-level language model, and we can learn it in an unsuper-
language model for completion. But their approach focuses on the           vised manner using a variety of methods, though here we focus on
use of embeddings (such as word2vec) to produce “intelligent” com-         the popular approach of using recurrent neural networks (RNNs)
pletions that make use of additional context, and the approach does        for this character-level language model. Character-level models
not handle error correction; they also do not report the prediction        are the right fidelity for the search completion task, because they
time of their completions, which is a key driver for our work.             satisfy the customer’s expectation from the user interface, and addi-
                                                                           tionally, word-level models or sequence-to-sequence probabilities
2.2     Error correction for queries                                       would not be able to model probabilities under all partial strings.
Our work also relates to methods on error and spelling correction
approaches, which again are roughly divided into heuristic models          3.1    The Unsupervised Language Model
and learning-based approaches.                                            We first focus on the language model term P (ŝt +1 | ŝ 1:t ), the prob-
                                                                          ability of next character under the current prefix. RNNs in general,
1 https://github.com/xflash96/query_completion
                                                                          and variants like long short term memory networks (LSTMs) [8], are
Realtime query completion via deep language models                                                        SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA


extremely popular for high-fidelity character level modeling, and                             The most naive way to do so is simply via sampling: we sample
achieve state-of-the-art performance for a number of datasets [5].                         the next character (according to its probability of occurrence) given
Since they can be trained from unsupervised data (e.g., just datasets                      the current prefix, until we hit an end-of-sequence (EOS) symbol:
of many unannotated search queries), we can easily adapt the model                                                 For t = m;        ; t++ :
to whatever terms users are actually searching for in the dataset,
with the potential to adapt to new searches, products, etc, simply                                                       ŝt +1 ∼ P (ŝt +1 | ŝ 1:t );
by occasionally retraining the model on all data collected up to the                                                     If ŝt +1 == EOS : break;
current point.                                                                             This method produces output that looks intuitively reasonable.
   Although character-level language modeling is a fairly standard                         However, it is biased toward longer sequences (as we can possibly
approach, we briefly highlight the model we use for completeness.                          miss the EOS symbol even if it has a relatively large probability)
Consider a recurrent neural network with hidden state ht at time t.                        with short-term dependencies and clearly does not generate the
We want to encode the prefix ŝ 1:t and predict the next character                         most probable sequences, because sampling in a greedy fashion is
using ht . We follow fairly standard approaches here and use an                            clearly not the same as sampling from the sequence space.
LSTM model, in particular the specific implementation from the                                That is, we really need to do a better approximate search to
Keras library [4]2 , which is defined by the recurrences                                   get better results. One classic way to do this is to perform beam
         i t = σ (Wx i x t + Whi ht −1 + bi ) ,                                     (4)    search, that is, perform breadth-first search while keeping the top-r
                                                                                         candidates. We illustrate the algorithm as follows:
         ft = σ Wx f x t + Whf ht −1 + bf ,                                         (5)
                                                                                           cand := {s 1:m : 0}, result := { }
         ot = σ (Wxo x t + Who ht −1 + bo ) ,                                       (6)
                                                                                           For t = m; cand is not empty; t++:
         c t = i t ⊙ tanh (Wxc x t + Whc ht −1 + bc ) + ft ⊙ c t −1 ,               (7)
                                                                                                             s 1:t +1 : log P (s 1:t +1 | s 1:m )
                                                                                                           (                                                 )
         ht = ot ⊙ tanh (c t ) ,                                                    (8)         candnew :=                                                     ;
                                                                                                                     for all st +1 ∈ C, for all s 1:t ∈ cand
in which ht , b ∈ Rd , x t ∈ R |C | , ∀t, Wx i ,Wx f ,Wxo ,Wxc and                               cand := the most probable (r − |result|) candidates in candnew ;
Whi ,Whf ,Who ,Whc are the forward kernel and recurrent kernel                                   Move s 1:t +1 from cand to result if st +1 is EOS symbol;
with corresponding dimensions, σ is the sigmoid activation func-
tion, and ⊙ is the element-wise product. We use a one-hot encoding                         By performing beam search we can consistently obtain a more
of characters as input, a two-layer LSTM with 256 to 1024 hidden                           probable set of completions compared to stochastic search.
units (more discussion on these choices below), and for prediction                            However, there are two issues with the above method. First,
of character ŝt +1 , we feed the hidden layer ht to a softmax function                    it does not handle error correction (which is necessary for any
                                                                                           practical type of completion) since the completion always attempts
                                                          exp(w iT ht )                    to find sequences that fit the current prefix exactly.3 Second, as we
 P (ŝt +1 = i | ŝ 1:t ) = softmax (i; Wsoftmaxht ) = P |C |             ,
                                                                    T                      show below, a naive implementation of this model is extremely slow,
                                                        j=1 exp(w j h t )
                                                                        (9)                often taking on the order of one second to produce 16 completions
                                                                                           for a given prefix. Thus, in the next two sections, we present our
for all i in the character set C and train the language model to                           primary technical contributions, which address both these issues.
maximize the log likelihood (minimize the categorical cross-entropy
loss),                                                                                     4     COMPLETION WITH ERROR CORRECTION
                                 X ns       |s |
                                            X                                              Most of the time, query completion is more than completing over a
                 minimize −                         log P (st +1 | s 1:t ),       (10)     fixed prefix. The input prefix might contain mistakes and sometimes
                     W                  |s | t =1                                          we would also like to insert keywords in the prefix. Traditionally,
                                 s ∈S
where S denotes the set of queries, |s | is the length of query s and ns                   the database community handles the two features by first doing a
is the number of times query s appears in the dataset. Further, we                         pass of error correction by matching the input to a typo database
pad all queries with an end-of-sequence symbol to predict whether                          generated by permuting characters, then matching the database
the query is complete.                                                                     again on the permuted terms for insertion completion [17, chap.
                                                                                           14]. Our observation is that with a language-model-based approach,
3.2      Stochastic Search and Beam Search                                                 we can handle the spelling correction and insertion completion all
                                                                                           in one model.
Once we have the language model, we can evaluate the probability
P (ŝm+1:n | ŝ 1:m ) for any prefix ŝ 1:m , but would ideally like to find               4.1     Error correction via the noisy channel
the completion with the highest probability. Enumerating all the
                                                                                                   model
possible strings is not an option because we have exponentially
many candidates. Indeed, finding the best sequence probability,                            Following the convention in the previous section, define s 1:m and
which is called the “decoding problem”, is NP-hard [7], so we have                         ŝ 1:n to be the prefix and completion string of length m and n, respec-
to rely on approximations.                                                                 tively. Different from the previous section, we no longer constrain
                                                                                           3 A character-level LSTM alone only gives the probability of the input/completion
2 Note that, as we describe below, we won’t actually use the Keras library at prediction
                                                                                           sequence. It could not control the trade-off between the probability of user typos and
time, but we do use it for training                                                        the likelihood of a completion, thus an error model is necessary.
SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA                             Po-Wei Wang, Huan Zhang, Vijai Mohan, Inderjit S. Dhillon, and J. Zico Kolter


the beginning of completions to be identical to the prefix so that                          characters are proper completions (only append character after
we can “correct” the user input. Thus, the problem of finding the                           terms).
most probable completion becomes                                                               To perform error correction under the noisy channel model, we
                                                                                            still need to integrate the noisy channel (distance function) with
                             arg max P (ŝ 1:n | s 1:m )                          (11)
                                      ŝ 1:n                                                our LSTM-based language model (the prior), which can only be
                                                                                            evaluated once in the forward direction because of the beam search
Now let us derive the maximum-a-posteriori (MAP) estimate of the
                                                                                            procedure. Recall that the dynamic programming algorithm [21] of
above problem. Using Bayes’ theorem, (11) can be rewritten as
                                                                                            edit (completion) distance costs O (m · t ) to compare two strings of
                                       P (s 1:m | ŝ 1:n )P (ŝ 1:n )                       length m and t. If we apply the algorithm to every candidate in the
                     arg max                                          .           (12)
                             ŝ 1:n            P (s 1:m )                                   beam search for the incremental length t which ranges from 1 to n,
                                                                                            it would add O (|C |rm · n 2 ) overhead to the beam search procedure,
Because s 1:m never changes, P (s 1:m ) can be considered as a constant.
                                                                                            where |C | is the size of character set, r is the number of candidates
Thus, solving (12) is equivalent to maximizing the following:
                                                                                            we keep, and n is the length of the final completion. This overhead
               arg max log P (s 1:m | ŝ 1:n ) + log P (ŝ 1:n ).                 (13)      is not affordable, and we need to modify the dynamic programming
                    ŝ 1:n
                                                                                            algorithm for completion distance to amortize it on the search tree.
This MAP estimate is called the noisy channel model [2, 10] in NLP,
in which the first part log P (s 1:m | ŝ 1:n ) models the noisy channel                    4.3     Amortized Dynamic Programming On the
of user inputs, and the second part log P (ŝ 1:n ) models the prior. For                           Search Tree
example, when using the noisy channel model for error correction
                                                                                           We can exploit the fact that every new candidate in the beam search
in a paragraph, we can assume that users have a constant proba-
                                                                                           procedure originates incrementally from a previous candidate. That
bility to make a typo for each letter. Under such assumption, the
                                                                                           is, only one character is changed. Thus, if we can maintain the last
noisy channel log P (s 1:m | ŝ 1:n ) is proportional to the edit distance
                                                                                           column in the completion distance algorithm, that is distcompl , ∀j,
(Levenshtein distance). For the prior part, we can plug in whatever
                                                                                           for every candidate, we can save the repeated effort in building the
P (ŝ 1:n ) we have for the paragraph, like the n-gram transitional
                                                                                           edit distance table. The resulting algorithm is summarized below:
probability or the language model. However, error correction for
queries is essentially different from that for paragraphs; in query                         cand := {empty string “ ”: 0}, result := { }
completion the user inputs are always incomplete. Thus, we must                             For t = 0; cand is not empty; t++:
perform the completion and error correction at the same time. One                                            (
                                                                                                               ŝ 1:t +1 : log P (s 1:m | ŝ 1:t +1 ) + log P (ŝ 1:t +1 )
                                                                                                                                                                           )
consequence of such a constraint is that we can no longer use the                                candnew :=                                                                  ;
                                                                                                                        for all ŝt +1 ∈ C, for all ŝ 1:t ∈ cand
edit distance function directly.
                                                                                                  cand := the most probable (r − |result|) candidates in candnew ;
4.2    Edit Distance v.s. Completion Distance                                                     Move ŝ 1:t +1 from cand to result if st +1 is EOS symbol;
The edit distance function, which returns the minimum changes                                     Maintain the last col of distnew for P (s 1:m | ŝ 1:t ) ∀ŝ 1:t ∈ cand;
(add/substitute/delete) to transform one string into another, is a
                                                                                            By such bookkeeping, we are able to amortize the completion dis-
natural candidate to measure the number of corrections between
                                                                                            tance algorithm over the beam search procedure, making it n times
user inputs and completions. Assume that the probability by which
                                                                                            faster (from O (|C |rm · n2 ) to O (|C |rm · n)).
users make an error is constant, like 2%. As we mentioned before,
the noisy channel under such an assumption can be written as                                4.4     Extensions
         log P (s 1:m | ŝ 1:n ) = −α · edit distance(s 1:m , ŝ 1:n ),           (14)     While we are using the simple assumption (2% user error rate)
                                                                                           in this paper, the error correction algorithm can be generalized
where α = − log 2%. Note that to handle incomplete prefix and inser-
                                                                                           in various ways [10, chap. 5]. For example, we can plug in the
tion completion, we should not incur penalties for the completions.
                                                                                           frequency statistics in the transition function of edit distance [11]
That is, we should not count the edit distance for adding words after
                                                                                           or learn it directly from the corpus [22]. Finally, we note that this
the last character (of terms) from the user input. This can be done
                                                                                           idea of inserting a noisy channel model naturally generalizes to
by modifying the transition function in the edit distance algorithm.
                                                                                           contexts other than edit distance. For example, many product search
To be specific, we change the penalty to an indicator when dealing
                                                                                           engines wish to drive the user not simply to a high-probability
with the “add” operation in the edit distance algorithm; we define
                                                                                           completion, but to a completion that is likely to lead to an actual
the new transition function to be
                                                                                           sale. By modifying the prior probability to more heavily weight
                  
                  
                  
                   distnew (j-1) + I (s j−1 , last char)                 add;             high-value completions, we can effectively optimize metrics other
 distnew (j)= min  distcompl (j-1) + 1                                                    than simple completion probability using this approach.
                  
                                                                         substitute;
                        compl (j) + 1
                  
                   dist                                                  delete;
                  
                                                                                           5     REALTIME COMPLETION
                                                                                  (15)
                                                                                            Starting with the system as proposed previously, the key challenge
We called the new edit distance function a “completion distance”,                           that remains now is to perform such completions in real time. Re-
in a way that the completion “pokemon go plus” for the prefix                               sponse time is crucial for query completion because unless the user
“poke go” would not incur unwanted penalties, because the added                             can see completions as they type the query, the results will likely
Realtime query completion via deep language models                                                  SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA


have very little value. The bar we set for ourselves in this work is to                the matrix-vector product in the LSTM. By properly moving to
provide 16 candidate completions in about 20 milliseconds on cur-                      batch matrix-matrix operations with a minibatch that contains all r
rent hardware.4 The 20 milliseconds budget, combined with typical                      candidates maintained by beam search, we can substantially speed
network latency, is similar to the pace that a user types in the query;                this up; By grouping together the product between the W matrices
we need to complete faster than typing to make our realtime com-                       and ht for all r candidates maintained by the beam search procedure,
pletion usable in practice. Unfortunately, a naive implementation of                   we can use matrix-matrix products, which have significantly better
beam search with the model trained above (using off-the-shelf im-                      cache efficiency even on the CPU. We use the Intel MKL BLAS, and
plementations), requires more than one second to complete forward                      the total of these optimizations further reduces the running time to
propagation through the network and beam search.                                       75ms. By further parallelizing the updates via 8 OpenMP threads
   In this section, we now provide a detailed breakdown of how we                      brings completion time down to 25 ms.
have empirically improved this performance by a factor of over 50x                        Finally, one of the most subtle but surprising speedups we at-
in order to achieve sub-20-ms completion times.                                        tained was through a slightly tweaked LSTM implementation. With
                                                                                       the optimizations above, computing the sigmoid terms in the LSTM
5.1     LSTM over a Tree                                                               actually took a surprisingly large 30% of the total computation time.
First, we observe that all new candidates in the beam search process                   This is due to the fact that 1) our LSTM implementation uses a hard
are extensions from the old candidates because of the BFS property.                    sigmoid activation, which as a clipping operation requires branch
In this case, the forward propagations would greatly overlap. If we                    prediction; and 2) the fact that the activations we need to apply the
can maintain ht for every old candidate, extending one character                       sigmoid to are not consecutive in the hidden state vector means
for new candidates would require only one forward propagation                          we cannot perform fast vectorized operations. By simply grouping
step. That is, we amortize the LSTM forward propagation over the                       together the terms i t , ft , ot in the hidden state, and by using Intel
search tree. The algorithm is illustrated below.                                       SSE-based operations for the hard sigmoid, we further reduce the
                                                                                       completion time down to 13.3ms, or 16.3ms if we include the error
cand := {s 1:m : (hm , 0)}, result := { };
                                                                                       correction procedure.
For t = m; cand is not empty; t++:
                s 1:t +1 : (ht , log P (s 1:t | s 1:m ) + log P (st +1 | s 1:t )) 
               
                                                                                       6    EXPERIMENTAL RESULTS
                                                                                   
   candnew :=                                                                     
                        for every st +1 ∈ C, for every s 1:t ∈ cand 
                                                                                     We evaluate our method on the AOL search dataset [16], a public
   cand := the most probable r − |result| candidates in candnew                        dataset of real-world searches from 2006, as well as an internal
   Move s 1:t +1 from cand to result if st +1 is EOS symbol                            dataset of product search queries from Amazon.com. The AOL
   Bump ht to ht +1 by one step of LSTM on st +1 , ∀s 1:t +1 ∈cand                     dataset contains 36 million total queries, with 10 million of these
                                                                                       being unique, illustrating the long tail in these search domains. We
Note that the initialization takes O (md 2 ), and the four lines in the                set a maximum sequence length for the queries at 60 characters, as
loop cost O (r |C |d ), O (r |C |), O (r ), and O (rd 2 ), where m is the length       this contained 99.5% of all queries. The Amazon dataset contains
of s 1:m , d is the hidden dimension of LSTM, |C | is the length of                    a random sample of about 110 million product search queries that
character set C, and r is the number of completions required. Using                    users typed on the Amazon.com web site during 2017 (we excluded
this approach, the complexity for computing r completions for d-                       sexually explicit and culturally insensitive or inappropriate queries).
dimensional LSTM reduces from O (n2rd (d +|C |)) to O (nrd (d +|C |))
for sequence with maximum length n. A naive C implementation
shows that the running time for such search drops to 250 ms from                          Training and testing splits. For each example in the dataset, we
over 1 sec.                                                                            choose a random cutting point (always after two characters in
                                                                                       the string), and treat all characters beforehand as the prefix and
5.2     CPU implementation and LSTM tweaks                                             all characters afterwards as the completion. For examples in the
                                                                                       validation and test set, we use these prefixes and actual completions
Although GPUs appear to be most suitable for computation in deep
                                                                                       to evaluate the completions that our method predicts. In the training
learning, for this particular application we found that the CPU is
                                                                                       set, we discard the cutting points and just train on the entire queries.
actually better suited to the task. This is due to the need for branch-
                                                                                          For the AOL dataset, we use a test set size of 330K queries, and
ing and maintaining relatively complex data structures in the beam
                                                                                       use the rest for training. For the Amazon dataset we use a test set
search process, along with the integration of the edit distance com-
                                                                                       size of 1 million queries. We create training and testing splits to
putation. Thus, implementation on a GPU requires a process that
                                                                                       evaluate our method using two different strategies:
frequently shuffles very small amounts of data (each new character)
between the CPU and GPU and can be very inefficient. We thus
                                                                                           • Prefix splitting: sort the queries according to the MD5 hash
implemented the entire beam search, error correction and forward
                                                                                             of the prefix, and then split. This ensures that data in the test
propagation in C on the CPU.
                                                                                             set does not contain an exact prefix match in the training
   However, after moving to a pure CPU implementation, it is the
                                                                                             set.
case that initially about 90% of the time is spent on computing
                                                                                           • Time splitting: For both the AOL and Amazon datasets, we
4 Experiments are carried on an Intel Xeon E5-2670 machine. We use up to 8 threads,
                                                                                             sort the queries by timestamp and split. This mimics making
and test it on the error-corrected query completion model with 512 hidden units. For         predictions online as new data comes in.
each input, 16 suggestions are generated.
SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA               Po-Wei Wang, Huan Zhang, Vijai Mohan, Inderjit S. Dhillon, and J. Zico Kolter

Table 1: Character-level language model cross-entropy loss (see (10)) for the LSTM on AOL search dataset and Amazon Product
Search dataset. We explored both LSTM and GRU as the recurrent units and varied their dimensions from 256 to 1,024.

                                                    Training Loss                                            Validation loss
      Dataset       Split                 LSTM                          GRU                          LSTM                          GRU
                                  256      512      1024       256       512      1024      256       512      1024       256       512      1024
                 Prefix split   .07929    .0691    .06282    .07745    .06920    .06385   .07192    .06405    .05866    .07236    .06528    .06073
       AOL
                 Time split     .07928    .0691    .06279    .07739    .06918    .06373   .07241    .06416    .05904    .07279    .06562    .06108
                 Prefix split   .06635    .0583    .05275    .06463    .05819    .05356   .06099    .05463    .04997    .06116    .05562    .05175
      Amazon
                 Time split     .06624    .0580    .05308    .06454    .05832    .05428   .06076    .05424    .05008    .06058    .05497    .05133


6.1     Training language model                                                 6.2    Runtime evaluation
We trained our character-level language model on the characters                 Compared with the traditional database lookup based query com-
of all the queries in the training set. We trained each model for 3-4           pletion system, one challenge of our deep learning based approach
epochs over the entire dataset, and applied early stopping if the               is its prediction time. Here we summarize the speedups achieved
validation loss did not improve for more than 50, 000 mini-batches.             by the different optimizations discussed in Section 5 in Table 2, and
We used a 2-layer LSTM with 256, 512, 1024 hidden dimensions with               report the time to give 16 suggestions for a prefix. A naive imple-
dropout of 0.5 between the two LSTM layers (no dropout within                   mentation in Keras would result in a prediction time of over one
a single layer), and used Adam optimizer to train with a mini-                  second per query, which is intolerable in the case of completing the
batch size of 256. We use cross-entropy loss weighted by query                  user’s query in real time, where a completion time close to typing
length for training. For each model, we select the learning rate                speed is desired. With all the optimization techniques applied, we
from {10−2 , 10−3 , 10−4 }, weight decay from {10−7 , 10−8 , 10−9 , 0} and      observe over 50X speedup comparing with a naive beam search
gradient norm clipping from {0.01, 0.001, 0.0001, 0.00001}. We also             implementation. These optimizations are crucial to make our deep
conduct experiments on replacing LSTM with Gated Recurrent                      learning based query completion practical as an online service.
Unit (GRU), as GRU has lower computation cost compared to LSTM.                     One interesting point to note is that stochastic search in this set-
Training and validation losses for each datasets, under the two                 ting actually takes three times longer with all the same optimization
different splittings and varying LSTM/GRU dimensions, are shown                 techniques applied than beam search, to generate the same number
in Table 1. We observe that with the same model size, GRU usually               of completion candidates. This is due to the fact that stochastic
shows slightly worse performance than LSTM, and increasing the                  search tends to generate completions that are much longer than
hidden dimension does help in improving model performance.                      those of beam search, interestingly making the “simpler” method
    Our training time on a NVIDIA V100 GPU (on AWS P3 instance)                 here actually substantially slower while giving worse completions
for the AOL dataset is 17, 18 and 24 hours for LSTM with 256, 512,              (which we will evaluate shortly).
and 1024 neurons, respectively. For the Amazon dataset, we remove
all duplicate queries and apply per instance weight as the number               Table 2: The speedups from different optimizations, with an
of occurrences of each query to reduce the number of training                   LSTM of dimension 256. Results were produced on a Xeon
examples to iterate over. The training time for the Amazon dataset              E5-2670 machine, running 8 threads.
is roughly three times longer than the AOL dataset using a batch size
of 256. However, we found that if we increase the batch size to 1024,
                                                                                            Optimization                     Resulting runtime
we can speed up the training by a factor of 2.2 because of increased
                                                                                   Naive beam search implementation                 >1sec
GPU utilization, without noticeable performance loss. As a result,
                                                                                        Tree-based beam search                     250ms
we can run one epoch of the Amazon dataset in approximately 6
                                                                                          Adding MKL BLAS                           75ms
hours, and training on the entire dataset can be done within one
                                                                                        OpenMP parallelization                      25ms
day.
                                                                                    Custom LSTM implementation                     13.3ms
    We evaluated relatively few other architectures for this model,
                                                                                       Adding prefix edit distance                16.3 ms
as the goal here is to use the character-level language model for
                                                                                           Stochastic search                       40 ms
completion rather than attain state-of-the-art results on language
modeling in general. It is worth noting that the validation loss is
lower than the training loss in Table 1, and the two losses becomes                As an online service, it is important to pay attention to the worst-
closer when the LSTM/GRU size is increased, indicating that our                 case performance, especially because the nature of user query pre-
models are still in the regime of under-fitting, and even larger                fixes have a long-tail nature. In Table 3, we measure the prediction
LSTM sizes may be used for improving performance. However, the                  time using 100,000 real user query prefixes from Amazon.com, and
requirement of completing the queries in real-time forbids us from              report the Top-Percentiles (TPS) performance. A TP99 of 12 ms indi-
using a larger model, as we will show shortly in the next section.              cates that 99% of user requests can be served under 12 milliseconds.
                                                                                In Figure 1, we plot the cumulative distribution function (CDF)
                                                                                of prediction time. We observe that the distribution of prediction
                                                                                time given real user query prefixes does have a very long tail. We
Realtime query completion via deep language models                                     SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA


desire that the response time of our query completion service is          the hit rate metric. Table 4 shows the performance of the trie-based
close to user typing speed, thus LSTM with 1024 hidden neurons            approach, beam search, and beam search with error correction
is unsuitable for our use case despite showing the best prediction        under these metrics. Our models generally outperform trie-based
performance.                                                              approaches in all settings, the one exception being probabilistic
                                                                          coverage on the time-based training/testing split. This is possi-
Table 3: Top-Percentile of completion time on the Amazon                  bly due to some amount of shift over time in the search query
dataset with varying LSTM dimension. TPx is the minimum                   terms. And although we cannot generate coverage numbers for the
time (in milliseconds) under which x% of requests have been               error-correction method, the significantly larger hit rate suggests
served. Results were produced on an Intel Xeon E5-2686 v4                 that it is indeed giving better completions than all the alternative
machine, running with 8 threads.                                          approaches.
                                                                             Further, we note that in addition to these numbers, there are a
                                     LSTM Dimension                       few notable disadvantages with trie-based lookup. The trie data
       Top-Percentile                                                     structure we compare to is very memory intensive (requires keeping
                                 256      512      1024
             TP50             5.388 ms 15.74 ms 67.19 ms                  prefixes for all relevant queries in memory), and takes a minimum
             TP90             8.067 ms 24.17 ms 96.08 ms                  of 11 GB of RAM for the entire AOL search data set, and over 50 GB
             TP99             11.14 ms 28.86 ms 114.74 ms                 of RAM for the Amazon dataset. Our deep learning based language
                                                                          model approach uses over two magnitudes less memory, even at
            TP99.9            12.02 ms 30.93 ms 125.63 ms
                                                                          its largest configuration (LSTM-1024), as shown in Table 6. It is
            TP99.99           12.30 ms 31.35 ms 131.84 ms
                                                                          worth mentioning that the Amazon dataset we used in experiments
                                                                          contains user queries that are sampled from one month’s data on
                                                                          amazon.com; if we want to utilize complete data from the month, it
6.3    Performance evaluation                                             can be memory intensive to use the trie-based approach, while our
Finally, we evaluate the actual performance of the completion ap-         deep learning based approach does not have this limitation and in
proaches, both comparing the performance of our beam search               general, a learning based approach can benefit more from having a
method to stochastic search (evaluated by log likelihood under the        bigger dataset.
model), and comparing our completion method to a heavily opti-               Additionally, if a prefix has not been seen before in the dataset,
mized in-memory trie-base completion model, the standard data             the trie-based approach will offer no completions. In our test set
structure for completion given string prefixes.                           of the Amazon dataset, which contains user queries from a time
                                                                          period subsequent to the training data, we observe that there are a
   Stochastic Search vs. Beam Search. In Table 5 we highlight the         substantial fraction of queries which do not appear in the training
performance of beam search versus stochastic search for query             data. A trie-based query approach cannot give these queries as
completion, evaluated in terms of log likelihood under the model.         suggestions, whereas our deep learning based approach can still
Over all models, splitting methods and LSTM sizes, beam search            make suggestions using its language model. Furthermore, the trie-
produces substantially better results in terms of log likelihood; in      based approach is not amenable to error correction in isolation, as
addition, it is 3x faster as mentioned above. Thus we believe that        candidate corrections need to be proposed prior to lookup in the
beam search is necessary in our real-time query completion task,          database; the process of repeatedly generating these candidates and
justifying our efforts on optimizing its runtime. Note that in this       performing the lookups will work for at most 2 edits, whereas our
case we are not including any error correction, as it is not trivial to   approach empirically easily handles completions that include 4-5
integrate this into the stochastic search setting, and we wanted a        edits; this is reflected in the significantly higher hit rate in Table 4.
direct comparison on sample likelihood.
   Our approach vs. database lookup. Finally, we compare our total        7    CONCLUSIONS
approach (beam search with error correction) to a trie-based (i.e.,       In this paper, we have presented a search query completion ap-
prefix lookup) completion model. We compare the approach using            proach based upon character-level deep language models. We pro-
a combination of two metrics: 1) probabilistic coverage, which            posed a method for integrating the approach with an error cor-
is simply the empirical conditional probability of the predicted          rection framework and showed that candidate completions with
completion given the prefix:                                              error correction can be efficiently generated using beam search. We
                                                                          further described several optimizations that enabled the system to
                    X                                                     deliver results in real time, including a CPU-based custom LSTM
                          P̂ (completion i | prefix),             (16)
                                                                          implementation. We demonstrated the effectiveness of our method
                      i
                                                                          on two large-scale datasets from AOL and Amazon, and showed
where P̂ is the empirical probability for the whole dataset (counts       that our proposed deep learning based query completion model
of completion i over all other queries with the same prefix in the        is able to jointly produce better completions than simple prefix
whole dataset); and 2) hit rate, which simply lists the number of         lookup, while simultaneously being able to generate the candidates
times a completion appears in the entire dataset. Because the error       in real time.
correction model adjusts the prefix, it is not possible to compute           Acknowledgment. The authors thank Juzer Arsiwala for his
probabilistic coverage exactly, but we can still get a sense of how       help on preparing Amazon training dataset.
likely the completions are based upon how often they occur using
SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA                      Po-Wei Wang, Huan Zhang, Vijai Mohan, Inderjit S. Dhillon, and J. Zico Kolter

            1.0                                               1.0                                                1.0

            0.8                                               0.8                                                0.8

            0.6                                               0.6                                                0.6
      CDF


                                                        CDF


                                                                                                           CDF
            0.4                                               0.4                                                0.4

            0.2                                               0.2                                                0.2

            0.0                                               0.0                                                0.0
                  0              5            10                    0         10        20        30                   0            50           100
                      prediction time (milliseconds)                    prediction time (milliseconds)                     prediction time (milliseconds)
                            (a) LSTM-256                                    (b) LSTM-512                                       (c) LSTM-1024


Figure 1: The cumulative distribution function (CDF) of completion time for different LSTM sizes. Time is measured by pre-
dicting on 100,000 real user query prefixes from Amazon.com

Table 4: Performance of our language model based methods (LSTM with 256 and 512 dimensions) versus trie-based prefix
lookup on AOL dataset.


                      Train/test Split                 Completion Method                            Probabilistic Coverage               Hit Rate

                                                          Trie-based                                         0.2754                       1482
                                                   Beam Search (LSTM-256)                                    0.4023                       1679
                       Prefix Splitting            Beam Search (LSTM-512)                                    0.4476                       1730
                                           Beam Search w/ error correction (LSTM-256)                           -                         3864
                                           Beam Search w/ error correction (LSTM-512)                           -                         3860
                                                          Trie-based                                         0.4869                       1273
                                                   Beam Search (LSTM-256)                                    0.3089                       1065
                          Time Splitting           Beam Search (LSTM-512)                                    0.3578                       1080
                                           Beam Search w/ error correction (LSTM-256)                           -                         1534
                                           Beam Search w/ error correction (LSTM-512)                           -                         1581


Table 5: Completion negative log likelihood for stochastic                             Table 6: Number of model parameters and memory con-
search vs. beam search (lower is better)                                               sumption of each method. Measurements were done on a
                                                                                       64-bit Linux machine and we record resident set size (RSS)
                                      LSTM Dimension                                   for each program.
 Dataset          Split           256                512
                            Beam Stochastic Beam Stochastic                                                                  Number of           Memory
                                                                                             Dataset      Model
              Prefix        0.984     1.058   0.798      0.840                                                               parameters        Requirement
   AOL
              Time          1.569     1.726   1.229      1.321                                           Trie-based              0M               11 GB
              Prefix        2.708     3.159   2.449      2.780                                           LSTM-256               0.8 M             12 MB
 Amazon                                                                                        AOL
              Time          3.063     3.646   3.259      3.863                                           LSTM-512               3.3 M             30 MB
                                                                                                         LSTM-1024             12.8 M            103 MB
                                                                                                         Trie-based              0M               50 GB
                                                                                                         LSTM-256               1.1 M             14 MB
                                                                                             Amazon
                                                                                                         LSTM-512               3.8 M             35 MB
                                                                                                         LSTM-1024             13.9 M            112 MB
Realtime query completion via deep language models                                                        SIGIR eCom’18, July 2018, Ann Arbor, Michigan, USA


REFERENCES                                                                                [14] Bhaskar Mitra and Nick Craswell. 2015. Query auto-completion for rare prefixes.
 [1] Ziv Bar-Yossef and Naama Kraus. 2011. Context-sensitive query auto-completion.            In Proceedings of the 24th ACM International on Conference on Information and
     In Proceedings of the 20th international conference on World wide web. ACM,               Knowledge Management. ACM, 1755–1758.
     107–116.                                                                             [15] Dae Hoon Park and Rikio Chiba. 2017. A Neural Language Model for Query
 [2] Eric Brill and Robert C Moore. 2000. An improved error model for noisy channel            Auto-Completion. In Proceedings of the 40th International ACM SIGIR Conference
     spelling correction. In Proceedings of the 38th Annual Meeting on Association for         on Research and Development in Information Retrieval. ACM, 1189–1192.
     Computational Linguistics. Association for Computational Linguistics, 286–293.       [16] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search.
 [3] Fei Cai, Maarten De Rijke, et al. 2016. A survey of query auto completion in              In Proceedings of the 1st international conference on Scalable information systems.
     information retrieval. Foundations and Trends® in Information Retrieval 10, 4             ACM, 1.
     (2016), 273–363.                                                                     [17] Toby Segaran and Jeff Hammerbacher. 2009. Beautiful data: the stories behind
 [4] François Chollet et al. 2015. Keras. https://github.com/fchollet/keras. (2015).           elegant data solutions. " O’Reilly Media, Inc.".
 [5] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale        [18] Milad Shokouhi. 2013. Learning to personalize query auto-completion. In Proceed-
     recurrent neural networks. arXiv preprint arXiv:1609.01704 (2016).                        ings of the 36th international ACM SIGIR conference on Research and development
 [6] Huizhong Duan and Bo-June Paul Hsu. 2011. Online spelling correction for query            in information retrieval. ACM, 103–112.
     completion. In Proceedings of the 20th international conference on World wide web.   [19] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob
     ACM, 117–126.                                                                             Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder
 [7] G David Forney. 1973. The Viterbi algorithm. Proc. IEEE 61, 3 (1973), 268–278.            for generative context-aware query suggestion. In Proceedings of the 24th ACM
 [8] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural              International on Conference on Information and Knowledge Management. ACM,
     computation 9, 8 (1997), 1735–1780.                                                       553–562.
 [9] Bo-June Paul Hsu and Giuseppe Ottaviano. 2013. Space-efficient data structures       [20] Idan Szpektor, Aristides Gionis, and Yoelle Maarek. 2011. Improving recommen-
     for top-k completion. In Proceedings of the 22nd international conference on World        dation for long-tail queries via templates. In Proceedings of the 20th international
     Wide Web. ACM, 583–594.                                                                   conference on World wide web. ACM, 47–56.
[10] Dan Jurafsky and James H Martin. [n. d.]. Speech and language processing. Vol. 3.    [21] Robert A Wagner and Michael J Fischer. 1974. The string-to-string correction
[11] Mark D Kernighan, Kenneth W Church, and William A Gale. 1990. A spelling                  problem. Journal of the ACM (JACM) 21, 1 (1974), 168–173.
     correction program based on a noisy channel model. In Proceedings of the 13th        [22] Casey Whitelaw, Ben Hutchinson, Grace Y Chung, and Gerard Ellis. 2009. Using
     conference on Computational linguistics-Volume 2. Association for Computational           the web for language independent spellchecking and autocorrection. In Proceed-
     Linguistics, 205–210.                                                                     ings of the 2009 Conference on Empirical Methods in Natural Language Processing:
[12] Chang Liu, Xin Wang, Richard Shin, Joseph E Gonzalez, and Dawn Song. 2016.                Volume 2-Volume 2. Association for Computational Linguistics, 890–899.
     Neural Code Completion. (2016).                                                      [23] Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y Ng.
[13] Bruno Martins and Mário J Silva. 2004. Spelling correction for search engine              2016. Neural language correction with character-based attention. arXiv preprint
     queries. In Advances in Natural Language Processing. Springer, 372–383.                   arXiv:1603.09727 (2016).

</pre>