Realtime query completion via deep language models

Po-Wei Wang* (Machine Learning Dept, Carnegie Mellon University, Pittsburgh, PA, United States; poweiw@cs.cmu.edu)
Huan Zhang* (Electrical and Computer Engineering, UC Davis, Davis, CA, United States; ecezhang@ucdavis.edu)
Vijai Mohan (Amazon, Palo Alto, CA, United States; vijaim@amazon.com)
Inderjit S. Dhillon* (Amazon & UT Austin, Palo Alto, CA, United States; isd@amazon.com)
J. Zico Kolter (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, United States; zkolter@cs.cmu.edu)

* Work performed while at A9.com, an Amazon subsidiary.

ABSTRACT
Search engine users nowadays heavily depend on query completion and correction to shape their queries. Typically, the completion is done by database lookup, which does not understand the context and cannot generalize to prefixes not in the database. In this paper, we propose to use unsupervised deep language models to complete and correct queries given an arbitrary prefix. We address two main challenges that render this method practical for large-scale deployment: 1) we propose a modified beam search process which integrates a completion-distance-based error correction model, combining the error correction process (as a potential function) with the language model; and 2) we show how to efficiently perform our modified beam search process on CPU to complete queries with error correction in real time, by exploiting the greatly overlapped forward propagation process, conducting amortized dynamic programming on the search tree, and using both SIMD-level and thread-level parallelism. We outperform the off-the-shelf Keras implementation by a factor of 50, allowing us to generate query suggestions in real time (generating the top 16 completions within 16 ms). Experiments on two large-scale datasets from AOL and Amazon.com show that the method substantially increases hit rate over standard approaches, reduces the memory footprint of a database-lookup-based approach by over two orders of magnitude, and is capable of handling tail queries.

KEYWORDS
Query completion, query correction, deep learning, realtime

ACM Reference Format:
Po-Wei Wang, Huan Zhang, Vijai Mohan, Inderjit S. Dhillon, and J. Zico Kolter. 2018. Realtime query completion via deep language models. In Proceedings of SIGIR Workshop On eCommerce (SIGIR eCom'18). ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.475/123_4
1 INTRODUCTION
Search completion is the problem of taking the prefix of a search query from a user and generating several candidate completions. This problem has enormous potential utility and monetary value to any search provider: the more accurately an engine can find the desired completions for a user (or indeed, potentially steer the user towards high-value completions), the more quickly it can lead the user to their desired goal.

This paper proposes a realtime search completion architecture based upon deep character-level language models. The basic idea is that instead of looking up possible completions from a generic database, we perform search under a deep-network-based language model to find the most likely completions of the user's current input. This allows us to integrate the power of deep language models, which have been shown to perform extremely well on complex language modeling and prediction tasks, with the desired goal of finding a good completion. Although this is a conceptually simple strategy (and one which has been considered before, as we highlight below in the literature survey), there are two key elements required to make it of practical use for a search engine provider, which together make up the primary technical contributions of the paper: 1) The completion must be error correcting, able to handle small errors in the user's initial input and provide completions for the most likely "correct" input. We propose such an approach that combines a character-level language model with an edit-distance-based potential function, combining the two using a tree-based beam search algorithm. 2) The completion must be realtime, able to produce high-quality potential completions in time that is not even perceivable to the user. We achieve this by developing an efficient tree-based version of beam search and an amortized dynamic programming algorithm for error correction based on completion distance (our proposed edit-distance variant) along the search tree, by exploiting thread-level and SIMD-level CPU-based computation for a single query, and through numerous optimizations to the implementation that we discuss below.

We evaluate the method on the AOL search dataset, a dataset consisting of over 36 million total search queries, as well as on an Amazon product search dataset containing over 100 million user search queries. Our proposed method substantially outperforms highly optimized standard search completion algorithms in terms of hit rate (the benefit of the deep language model and the error correction), while being fast enough to execute in real time for search engines. Our approach is also very memory efficient and reduces the memory usage of a database-lookup-based query completion system by at least two orders of magnitude; in addition, we can handle tail queries, i.e., queries that are rarely seen and for which database-lookup-based approaches cannot give any completions. The experiments on the AOL search dataset and our code are publicly available online at https://github.com/xflash96/query_completion.
2 RELATED WORK

2.1 Background on search completion
Here we review existing approaches to search query completion and error correction. Broadly speaking, two types of query completion are most relevant to our work: database lookup methods and learning-based approaches.

Database Lookup. One of the most intuitive ways to do query completion is a database lookup. That is, given a prefix, we can fetch all the known queries matching the prefix and return the most frequent candidates. This is called the "most popular completion" (MPC) [1], which corresponds to the maximum likelihood estimator for P(completion | prefix). The database lookup can be efficiently implemented by a trie [9]. For instance, it takes only 15 µs to give 16 suggestions for a query in our own trie-based implementation. However, due to the long-tail nature [20] of search queries, many prefixes might not exist in the database; for example, in the AOL search data, 28% of the queries are unique. An excellent survey of these "classical" approaches is given in Cai et al. [3].

Learning-based. In addition to database lookup approaches, in recent years there have been a number of approaches that use learning-based methods for query completion. Sordoni et al. [19] use a translation model at the word level to output single-word query suggestions, and also model consecutive sessions of the same user. Liu et al. [12] proposed a word-based method for code completion, but focused solely on greedy stochastic sampling for the prediction. Mitra and Craswell [14] also used neural networks combined with a database-based model to handle tail queries, but focused on CNN approaches that just output the single most likely word-level completion. Shokouhi [18] used logistic regression to learn a personalized query ranking model, specific to individual users. All these approaches are relevant but fairly orthogonal to our own, as we focus here on character-level modeling, beam search, and realtime completion. Finally, Park and Chiba [15] very recently published an approach similar to ours, which uses a character-level language model for completion. But their approach focuses on the use of embeddings (such as word2vec) to produce "intelligent" completions that make use of additional context, and the approach does not handle error correction; they also do not report the prediction time of their completions, which is a key driver for our work.

2.2 Error correction for queries
Our work also relates to methods for error and spelling correction, which again are roughly divided into heuristic models and learning-based approaches.

Heuristic models. Whitelaw et al. [22] proposed generating candidate sets that contain common errors for given prefixes, then searching these based upon the current query. Similarly, Martins and Silva [13] use a ternary search tree to accelerate the search within candidate sets for spelling correction in general. These approaches are nice in that they are easily parallelizable at runtime, but are relatively "brute force" and cannot handle previously unseen permutations.

Learning-based Model. On the learning side, Duan and Hsu [6] train an n-gram Markov model combined with A* search to determine candidate misspellings; this is similar to our approach except with a much richer language model replacing the simple n-gram model, which creates several challenges in the search procedure itself. Likewise, Xie et al. [23] use a similar character-level model with attention, but do so in the context of error correcting an entire paragraph of text, and don't focus on the same realtime aspects that we do.

3 BACKGROUND ON QUERY COMPLETION
When a user types any prefix string s in the search engine, the query completion function will start to recommend the best r completions, each denoted ŝ, according to certain metrics. For example, one might want to maximize the probability that a recommendation is clicked. The conditional probability can be formulated as

    P(ŝ | s) := P(completion | prefix),    (1)

and the goal of query completion in this setting is to find the top r most probable strings ŝ, which potentially also maximize some additional metric, such as the click-through rate.

Denote s_{1:m} as the first m characters in string s. We first discuss query completion in a simplified setting, in which all completions must contain the prefix exactly, that is, ŝ_{1:m} = s_{1:m}, and

    P(ŝ_{1:n} | s_{1:m}) = P(ŝ_{m+1:n} | s_{1:m}) = P(ŝ_{m+1:n} | ŝ_{1:m}),    (2)

where n is the total length of a completion. Note that the probability is defined in the sequence domain, which contains exponentially many candidate strings. To simplify the model, we can apply the conditional probability formula recursively and obtain

    P(ŝ_{m+1:n} | ŝ_{1:m}) = ∏_{t=m}^{n−1} P(ŝ_{t+1} | ŝ_{1:t}).    (3)

This way, we only need to model P(ŝ_{t+1} | ŝ_{1:t}), that is, the probability of the next character under the current prefix. This is precisely a character-level language model, and we can learn it in an unsupervised manner using a variety of methods, though here we focus on the popular approach of using recurrent neural networks (RNNs) for this character-level language model. Character-level models are the right fidelity for the search completion task, because they satisfy the customer's expectation from the user interface, and additionally, word-level models or sequence-to-sequence probabilities would not be able to model probabilities under all partial strings.
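As a concrete illustration of the factorization in (3) (using a hypothetical query, not one drawn from the datasets), consider the prefix s = "po" (so m = 2) and the completion ŝ = "pokemon" followed by the end-of-sequence symbol (n = 8); the chain rule reduces the sequence probability to per-character predictions:

    P(ŝ_{3:8} | "po") = P(k | po) · P(e | pok) · P(m | poke) · P(o | pokem) · P(n | pokemo) · P(⟨EOS⟩ | pokemon).

Each factor on the right is exactly what the character-level language model of the next subsection outputs.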
3.1 The Unsupervised Language Model
We first focus on the language model term P(ŝ_{t+1} | ŝ_{1:t}), the probability of the next character under the current prefix. RNNs in general, and variants like long short-term memory networks (LSTMs) [8], are extremely popular for high-fidelity character-level modeling, and achieve state-of-the-art performance on a number of datasets [5]. Since they can be trained from unsupervised data (e.g., just datasets of many unannotated search queries), we can easily adapt the model to whatever terms users are actually searching for in the dataset, with the potential to adapt to new searches, products, etc., simply by occasionally retraining the model on all data collected up to the current point.

Although character-level language modeling is a fairly standard approach, we briefly highlight the model we use for completeness. Consider a recurrent neural network with hidden state h_t at time t. We want to encode the prefix ŝ_{1:t} and predict the next character using h_t. We follow fairly standard approaches here and use an LSTM model, in particular the specific implementation from the Keras library [4],² which is defined by the recurrences

    i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i),    (4)
    f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f),    (5)
    o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o),    (6)
    c_t = i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c) + f_t ⊙ c_{t−1},    (7)
    h_t = o_t ⊙ tanh(c_t),    (8)

in which h_t, b ∈ R^d and x_t ∈ R^{|C|} for all t; W_{xi}, W_{xf}, W_{xo}, W_{xc} and W_{hi}, W_{hf}, W_{ho}, W_{hc} are the forward kernels and recurrent kernels with corresponding dimensions; σ is the sigmoid activation function; and ⊙ is the element-wise product. We use a one-hot encoding of characters as input, a two-layer LSTM with 256 to 1024 hidden units (more discussion on these choices below), and for prediction of character ŝ_{t+1} we feed the hidden layer h_t to a softmax function

    P(ŝ_{t+1} = i | ŝ_{1:t}) = softmax(i; W_{softmax} h_t) = exp(w_i^T h_t) / Σ_{j=1}^{|C|} exp(w_j^T h_t),    (9)

for all i in the character set C, and train the language model to maximize the log likelihood (minimize the categorical cross-entropy loss),

    minimize_W  Σ_{s∈S} (n_s / |s|) Σ_{t=1}^{|s|} −log P(s_{t+1} | s_{1:t}),    (10)

where S denotes the set of queries, |s| is the length of query s, and n_s is the number of times query s appears in the dataset. Further, we pad all queries with an end-of-sequence symbol to predict whether the query is complete.

² Note that, as we describe below, we won't actually use the Keras library at prediction time, but we do use it for training.
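For concreteness, the following is a minimal sketch (our illustration, not the authors' released training code) of such a two-layer character-level LSTM language model written against the Keras API; the vocabulary size and hidden width are placeholder values.

    # Sketch of the two-layer character-level LSTM LM of Eqs. (4)-(9).
    from tensorflow.keras import layers, models

    VOCAB_SIZE = 128   # |C|, the character set size (illustrative)
    HIDDEN_DIM = 512   # d, the LSTM hidden dimension (the paper uses 256-1024)

    model = models.Sequential([
        # One-hot characters arrive as (batch, time, |C|); the time axis may vary.
        layers.LSTM(HIDDEN_DIM, return_sequences=True,
                    input_shape=(None, VOCAB_SIZE)),
        layers.Dropout(0.5),   # dropout between the two LSTM layers (Section 6.1)
        layers.LSTM(HIDDEN_DIM, return_sequences=True),
        # Per-position softmax over the next character, as in Eq. (9).
        layers.TimeDistributed(layers.Dense(VOCAB_SIZE, activation="softmax")),
    ])

    # Categorical cross-entropy is the negative log likelihood in Eq. (10); the
    # per-query weights n_s/|s| can be supplied as sample weights during training.
    model.compile(optimizer="adam", loss="categorical_crossentropy")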
3.2 Stochastic Search and Beam Search
Once we have the language model, we can evaluate the probability P(ŝ_{m+1:n} | ŝ_{1:m}) for any prefix ŝ_{1:m}, but we would ideally like to find the completion with the highest probability. Enumerating all possible strings is not an option because we have exponentially many candidates. Indeed, finding the best sequence probability, which is called the "decoding problem", is NP-hard [7], so we have to rely on approximations.

The most naive way to do so is simply via sampling: we sample the next character (according to its probability of occurrence) given the current prefix, until we hit an end-of-sequence (EOS) symbol:

    For t = m; ; t++:
        ŝ_{t+1} ∼ P(ŝ_{t+1} | ŝ_{1:t});
        If ŝ_{t+1} == EOS: break;

This method produces output that looks intuitively reasonable. However, it is biased toward longer sequences (as we can possibly miss the EOS symbol even if it has a relatively large probability) with short-term dependencies, and it clearly does not generate the most probable sequences, because sampling in a greedy fashion is not the same as sampling from the sequence space.

That is, we really need a better approximate search to get better results. One classic way to do this is to perform beam search, that is, perform breadth-first search while keeping the top-r candidates. We illustrate the algorithm as follows:

    cand := {s_{1:m} : 0}, result := {}
    For t = m; cand is not empty; t++:
        cand_new := { s_{1:t+1} : log P(s_{1:t+1} | s_{1:m}) }  for all s_{t+1} ∈ C, for all s_{1:t} ∈ cand;
        cand := the most probable (r − |result|) candidates in cand_new;
        Move s_{1:t+1} from cand to result if s_{t+1} is the EOS symbol;

By performing beam search we can consistently obtain a more probable set of completions compared to stochastic search.
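The following is a runnable sketch of the loop above (ours, for illustration only); next_log_probs stands in for the trained language model and returns log P(c | s) for every character c in C, including the EOS symbol.

    import heapq

    EOS = "\x03"   # end-of-sequence marker (placeholder choice)

    def beam_search(prefix, next_log_probs, r=16, max_len=60):
        cand = {prefix: 0.0}      # candidate string -> log P(candidate | prefix)
        result = []               # finished completions as (log prob, string)
        while cand and len(result) < r:
            # Expand every candidate by every character in C.
            cand_new = {}
            for s, logp in cand.items():
                for ch, lp in next_log_probs(s).items():
                    cand_new[s + ch] = logp + lp
            # Keep only the (r - |result|) most probable new candidates.
            keep = heapq.nlargest(r - len(result), cand_new.items(),
                                  key=lambda kv: kv[1])
            cand = {}
            for s, logp in keep:
                if s.endswith(EOS):
                    result.append((logp, s[:-1]))   # move finished strings to result
                elif len(s) < max_len:
                    cand[s] = logp
        return sorted(result, reverse=True)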
However, there are two issues with the above method. First, it does not handle error correction (which is necessary for any practical type of completion), since the completion always attempts to find sequences that fit the current prefix exactly.³ Second, as we show below, a naive implementation of this model is extremely slow, often taking on the order of one second to produce 16 completions for a given prefix. Thus, in the next two sections, we present our primary technical contributions, which address both these issues.

³ A character-level LSTM alone only gives the probability of the input/completion sequence. It cannot control the trade-off between the probability of user typos and the likelihood of a completion; thus an error model is necessary.

4 COMPLETION WITH ERROR CORRECTION
Most of the time, query completion is more than completing over a fixed prefix. The input prefix might contain mistakes, and sometimes we would also like to insert keywords into the prefix. Traditionally, the database community handles these two features by first doing a pass of error correction, matching the input to a typo database generated by permuting characters, and then matching the database again on the permuted terms for insertion completion [17, chap. 14]. Our observation is that with a language-model-based approach, we can handle spelling correction and insertion completion all in one model.

4.1 Error correction via the noisy channel model
Following the convention of the previous section, define s_{1:m} and ŝ_{1:n} to be the prefix and completion string of length m and n, respectively. Different from the previous section, we no longer constrain the beginning of completions to be identical to the prefix, so that we can "correct" the user input. Thus, the problem of finding the most probable completion becomes

    arg max_{ŝ_{1:n}} P(ŝ_{1:n} | s_{1:m}).    (11)

Now let us derive the maximum-a-posteriori (MAP) estimate of the above problem. Using Bayes' theorem, (11) can be rewritten as

    arg max_{ŝ_{1:n}} P(s_{1:m} | ŝ_{1:n}) P(ŝ_{1:n}) / P(s_{1:m}).    (12)

Because s_{1:m} never changes, P(s_{1:m}) can be considered a constant. Thus, solving (12) is equivalent to maximizing the following:

    arg max_{ŝ_{1:n}} log P(s_{1:m} | ŝ_{1:n}) + log P(ŝ_{1:n}).    (13)

This MAP estimate is called the noisy channel model [2, 10] in NLP, in which the first part log P(s_{1:m} | ŝ_{1:n}) models the noisy channel of user inputs, and the second part log P(ŝ_{1:n}) models the prior. For example, when using the noisy channel model for error correction in a paragraph, we can assume that users have a constant probability of making a typo on each letter. Under such an assumption, the noisy channel log P(s_{1:m} | ŝ_{1:n}) is proportional to the edit distance (Levenshtein distance). For the prior part, we can plug in whatever P(ŝ_{1:n}) we have for the paragraph, like the n-gram transition probability or the language model. However, error correction for queries is essentially different from that for paragraphs; in query completion the user inputs are always incomplete. Thus, we must perform the completion and error correction at the same time. One consequence of this constraint is that we can no longer use the edit distance function directly.
4.2 Edit Distance vs. Completion Distance
The edit distance function, which returns the minimum number of changes (add/substitute/delete) needed to transform one string into another, is a natural candidate to measure the number of corrections between user inputs and completions. Assume that the probability with which users make an error is constant, say 2%. As we mentioned before, the noisy channel under such an assumption can be written as

    log P(s_{1:m} | ŝ_{1:n}) = −α · edit_distance(s_{1:m}, ŝ_{1:n}),    (14)

where α = −log 2%. Note that to handle incomplete prefixes and insertion completion, we should not incur penalties for the completions. That is, we should not count the edit distance for adding words after the last character (of terms) from the user input. This can be done by modifying the transition function in the edit distance algorithm. To be specific, we change the penalty to an indicator when dealing with the "add" operation; we define the new transition function to be

    dist_new(j) = min { dist_new(j−1) + I(s_{j−1} ≠ last char)   (add);
                        dist_compl(j−1) + 1                      (substitute);
                        dist_compl(j) + 1                        (delete) }.    (15)

We call the new edit distance function a "completion distance", in the sense that the completion "pokemon go plus" for the prefix "poke go" would not incur unwanted penalties, because the added characters are proper completions (characters are only appended after terms).

To perform error correction under the noisy channel model, we still need to integrate the noisy channel (distance function) with our LSTM-based language model (the prior), which can only be evaluated in the forward direction because of the beam search procedure. Recall that the dynamic programming algorithm [21] for edit (completion) distance costs O(m · t) to compare two strings of length m and t. If we applied the algorithm from scratch to every candidate in the beam search for each incremental length t, which ranges from 1 to n, it would add O(|C| r m · n²) overhead to the beam search procedure, where |C| is the size of the character set, r is the number of candidates we keep, and n is the length of the final completion. This overhead is not affordable, and we need to modify the dynamic programming algorithm for completion distance to amortize it over the search tree.

4.3 Amortized Dynamic Programming on the Search Tree
We can exploit the fact that every new candidate in the beam search procedure originates incrementally from a previous candidate. That is, only one character is changed. Thus, if we maintain the last column of the completion distance table, that is, dist_compl(j) for all j, for every candidate, we can save the repeated effort of rebuilding the edit distance table. The resulting algorithm is summarized below:

    cand := {empty string "" : 0}, result := {}
    For t = 0; cand is not empty; t++:
        cand_new := { ŝ_{1:t+1} : log P(s_{1:m} | ŝ_{1:t+1}) + log P(ŝ_{1:t+1}) }  for all ŝ_{t+1} ∈ C, for all ŝ_{1:t} ∈ cand;
        cand := the most probable (r − |result|) candidates in cand_new;
        Move ŝ_{1:t+1} from cand to result if ŝ_{t+1} is the EOS symbol;
        Maintain the last column dist_new for P(s_{1:m} | ŝ_{1:t}), ∀ ŝ_{1:t} ∈ cand;

By such bookkeeping, we are able to amortize the completion distance algorithm over the beam search procedure, making it n times faster (from O(|C| r m · n²) to O(|C| r m · n)).
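The sketch below (ours; it reproduces the transition of Eq. (15) under the assumption that an added character is free exactly when it comes after the end of the typed prefix or after a completed term) shows the per-candidate bookkeeping: each candidate keeps only the last DP column, and extending the candidate by one character updates that column in O(m), giving the O(|C| r m · n) total above.

    def initial_column(prefix):
        # Column for the empty completion: every prefix character must be deleted.
        return list(range(len(prefix) + 1))

    def extend_column(prefix, col, c):
        # col[j] is the completion distance between the current candidate and
        # s_{1:j}; this O(m) update replaces rebuilding the whole table.
        m = len(prefix)

        def add_cost(j):
            # An extra completion character costs nothing after the end of the
            # prefix or right after a finished term (our reading of Eq. (15)).
            return 0 if (j == m or prefix[j] == " ") else 1

        new = [col[0] + add_cost(0)]
        for j in range(1, m + 1):
            new.append(min(col[j] + add_cost(j),               # add
                           col[j - 1] + (c != prefix[j - 1]),  # substitute / match
                           new[j - 1] + 1))                    # delete
        return new

    # Inside the beam search loop, each candidate carries its own column, and the
    # channel term of Eqs. (13)-(14) for that candidate is simply -alpha * column[-1].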
4.4 Extensions
While we use the simple assumption of a 2% user error rate in this paper, the error correction algorithm can be generalized in various ways [10, chap. 5]. For example, we can plug frequency statistics into the transition function of the edit distance [11] or learn it directly from the corpus [22]. Finally, we note that this idea of inserting a noisy channel model naturally generalizes to contexts other than edit distance. For example, many product search engines wish to drive the user not simply to a high-probability completion, but to a completion that is likely to lead to an actual sale. By modifying the prior probability to weight high-value completions more heavily, we can effectively optimize metrics other than simple completion probability using this approach.

5 REALTIME COMPLETION
Starting from the system proposed above, the key challenge that remains is to perform such completions in real time. Response time is crucial for query completion because unless the user can see completions as they type the query, the results will likely have very little value. The bar we set for ourselves in this work is to provide 16 candidate completions in about 20 milliseconds on current hardware.⁴ The 20 millisecond budget, combined with typical network latency, is similar to the pace at which a user types a query; we need to complete faster than typing to make our realtime completion usable in practice. Unfortunately, a naive implementation of beam search with the model trained above (using off-the-shelf implementations) requires more than one second to complete forward propagation through the network and beam search. In this section, we provide a detailed breakdown of how we have empirically improved this performance by a factor of over 50x in order to achieve sub-20-ms completion times.

⁴ Experiments are carried out on an Intel Xeon E5-2670 machine. We use up to 8 threads, and test on the error-corrected query completion model with 512 hidden units. For each input, 16 suggestions are generated.

5.1 LSTM over a Tree
First, we observe that all new candidates in the beam search process are extensions of old candidates because of the BFS property. In this case, the forward propagations greatly overlap. If we maintain h_t for every old candidate, extending a new candidate by one character requires only one forward propagation step. That is, we amortize the LSTM forward propagation over the search tree. The algorithm is illustrated below.

    cand := {s_{1:m} : (h_m, 0)}, result := {};
    For t = m; cand is not empty; t++:
        cand_new := { s_{1:t+1} : (h_t, log P(s_{1:t} | s_{1:m}) + log P(s_{t+1} | s_{1:t})) }  for every s_{t+1} ∈ C, for every s_{1:t} ∈ cand;
        cand := the most probable (r − |result|) candidates in cand_new;
        Move s_{1:t+1} from cand to result if s_{t+1} is the EOS symbol;
        Bump h_t to h_{t+1} by one step of the LSTM on s_{t+1}, ∀ s_{1:t+1} ∈ cand;

Note that the initialization takes O(m d²), and the four lines in the loop cost O(r |C| d), O(r |C|), O(r), and O(r d²), where m is the length of s_{1:m}, d is the hidden dimension of the LSTM, |C| is the size of the character set C, and r is the number of completions required. Using this approach, the complexity of computing r completions for a d-dimensional LSTM reduces from O(n² r d (d + |C|)) to O(n r d (d + |C|)) for sequences with maximum length n. A naive C implementation shows that the running time for such a search drops from over 1 sec to 250 ms.
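A compact sketch of this state caching (in Python for readability; the actual implementation is in C, and encode, lstm_step, and next_log_probs stand in for the trained model):

    def tree_beam_search(prefix, encode, lstm_step, next_log_probs, chars,
                         r=16, eos="\x03"):
        # chars: the character set C, including the eos marker.
        state = encode(prefix)           # (h_m, c_m) after the prefix; O(m d^2), done once
        cand = [(0.0, prefix, state)]    # (log prob, string, cached LSTM state)
        result = []
        while cand and len(result) < r:
            expanded = []
            for logp, s, st in cand:
                lp = next_log_probs(st)          # softmax over C from the cached state
                for ch in chars:
                    expanded.append((logp + lp[ch], s + ch, st))
            expanded.sort(key=lambda x: x[0], reverse=True)
            cand = []
            for logp, s, st in expanded[: r - len(result)]:
                if s.endswith(eos):
                    result.append((logp, s[:-1]))                  # finished completion
                else:
                    cand.append((logp, s, lstm_step(st, s[-1])))   # "bump" h_t to h_{t+1}
        return sorted(result, reverse=True)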
5.2 CPU implementation and LSTM tweaks
Although GPUs appear to be most suitable for computation in deep learning, for this particular application we found that the CPU is actually better suited to the task. This is due to the need for branching and maintaining relatively complex data structures in the beam search process, along with the integration of the edit distance computation. Implementation on a GPU would thus require a process that frequently shuffles very small amounts of data (each new character) between the CPU and GPU, which can be very inefficient. We thus implemented the entire beam search, error correction, and forward propagation in C on the CPU.

However, after moving to a pure CPU implementation, initially about 90% of the time is spent computing the matrix-vector products in the LSTM. By grouping the products between the W matrices and h_t for all r candidates maintained by the beam search procedure into batched matrix-matrix operations, we can substantially speed this up: matrix-matrix products have significantly better cache efficiency, even on the CPU. We use Intel MKL BLAS, and the total of these optimizations further reduces the running time to 75 ms. Further parallelizing the updates via 8 OpenMP threads brings completion time down to 25 ms.

Finally, one of the most subtle but surprising speedups we attained was through a slightly tweaked LSTM implementation. With the optimizations above, computing the sigmoid terms in the LSTM took a surprisingly large 30% of the total computation time. This is due to the facts that 1) our LSTM implementation uses a hard sigmoid activation, which as a clipping operation requires branch prediction; and 2) the activations we need to apply the sigmoid to are not consecutive in the hidden state vector, which means we cannot perform fast vectorized operations. By simply grouping together the terms i_t, f_t, o_t in the hidden state, and by using Intel SSE-based operations for the hard sigmoid, we further reduce the completion time to 13.3 ms, or 16.3 ms if we include the error correction procedure.
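The essence of the two optimizations above (batching all r candidates into one matrix-matrix product, and storing the i/f/o gate pre-activations contiguously so a single vectorized hard-sigmoid pass covers them) can be sketched in NumPy as follows; the weight layout and the hard-sigmoid formula are our assumptions, not the production C code.

    import numpy as np

    def hard_sigmoid(x):
        # Piecewise-linear, clipping-based approximation of the sigmoid.
        return np.clip(0.2 * x + 0.5, 0.0, 1.0)

    def lstm_step_batched(X, H, C, Wx, Wh, b, d):
        # X: (r, |C|) one-hot inputs; H, C: (r, d) previous hidden/cell states.
        # Wx: (|C|, 4d), Wh: (d, 4d), b: (4d,) with gate order [i | f | o | c], so
        # the first 3d columns take the hard sigmoid in one contiguous slice.
        Z = X @ Wx + H @ Wh + b                  # two GEMMs instead of many GEMVs
        gates = hard_sigmoid(Z[:, :3 * d])       # i, f, o handled together
        i, f, o = gates[:, :d], gates[:, d:2 * d], gates[:, 2 * d:3 * d]
        g = np.tanh(Z[:, 3 * d:])                # candidate cell update, Eq. (7)
        C_new = i * g + f * C
        H_new = o * np.tanh(C_new)               # Eq. (8)
        return H_new, C_new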
6 EXPERIMENTAL RESULTS
We evaluate our method on the AOL search dataset [16], a public dataset of real-world searches from 2006, as well as an internal dataset of product search queries from Amazon.com. The AOL dataset contains 36 million total queries, with 10 million of these being unique, illustrating the long tail in these search domains. We set a maximum sequence length for the queries of 60 characters, as this covers 99.5% of all queries. The Amazon dataset contains a random sample of about 110 million product search queries that users typed on the Amazon.com web site during 2017 (we excluded sexually explicit and culturally insensitive or inappropriate queries).

Training and testing splits. For each example in the dataset, we choose a random cutting point (always after the first two characters in the string), and treat all characters before it as the prefix and all characters after it as the completion. For examples in the validation and test sets, we use these prefixes and actual completions to evaluate the completions that our method predicts. In the training set, we discard the cutting points and just train on the entire queries. For the AOL dataset, we use a test set of 330K queries and use the rest for training. For the Amazon dataset we use a test set of 1 million queries. We create training and testing splits to evaluate our method using two different strategies:

• Prefix splitting: sort the queries according to the MD5 hash of the prefix, and then split. This ensures that the test set does not contain an exact prefix match in the training set.
• Time splitting: for both the AOL and Amazon datasets, sort the queries by timestamp and split. This mimics making predictions online as new data comes in.

6.1 Training language model
We trained our character-level language model on the characters of all the queries in the training set. We trained each model for 3-4 epochs over the entire dataset, and applied early stopping if the validation loss did not improve for more than 50,000 mini-batches. We used a 2-layer LSTM with 256, 512, or 1024 hidden dimensions with dropout of 0.5 between the two LSTM layers (no dropout within a single layer), and used the Adam optimizer with a mini-batch size of 256. We use the cross-entropy loss weighted by query length for training. For each model, we select the learning rate from {10⁻², 10⁻³, 10⁻⁴}, weight decay from {10⁻⁷, 10⁻⁸, 10⁻⁹, 0}, and gradient norm clipping from {0.01, 0.001, 0.0001, 0.00001}. We also conduct experiments replacing the LSTM with the Gated Recurrent Unit (GRU), as the GRU has lower computational cost than the LSTM. Training and validation losses for each dataset, under the two different splittings and varying LSTM/GRU dimensions, are shown in Table 1. We observe that with the same model size, the GRU usually shows slightly worse performance than the LSTM, and increasing the hidden dimension does help in improving model performance.

Table 1: Character-level language model cross-entropy loss (see (10)) for the LSTM on the AOL search dataset and the Amazon product search dataset. We explored both LSTM and GRU as the recurrent units and varied their dimensions from 256 to 1,024.

                             Training loss                                        Validation loss
                     LSTM                        GRU                      LSTM                        GRU
Dataset   Split    256      512      1024     256      512      1024     256      512      1024     256      512      1024
AOL       Prefix   .07929   .0691    .06282   .07745   .06920   .06385   .07192   .06405   .05866   .07236   .06528   .06073
AOL       Time     .07928   .0691    .06279   .07739   .06918   .06373   .07241   .06416   .05904   .07279   .06562   .06108
Amazon    Prefix   .06635   .0583    .05275   .06463   .05819   .05356   .06099   .05463   .04997   .06116   .05562   .05175
Amazon    Time     .06624   .0580    .05308   .06454   .05832   .05428   .06076   .05424   .05008   .06058   .05497   .05133
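The following sketch (ours, not the exact training pipeline) shows one way to assemble a training batch consistent with this setup: queries are padded with an EOS symbol as in Section 3.1, characters are one-hot encoded, and each query contributes per-timestep weights n_s/|s| as in Eq. (10).

    import numpy as np

    def make_batch(queries, counts, char_to_idx, eos_idx, max_len=60):
        vocab = len(char_to_idx)
        B = len(queries)
        X = np.zeros((B, max_len, vocab), dtype=np.float32)   # input characters
        Y = np.zeros((B, max_len, vocab), dtype=np.float32)   # next-character targets
        W = np.zeros((B, max_len), dtype=np.float32)          # per-timestep weights
        for k, q in enumerate(queries):
            seq = [char_to_idx[c] for c in q[:max_len - 1]] + [eos_idx]  # EOS padding
            w = counts[q] / len(q)                                       # n_s / |s|
            for t in range(len(seq) - 1):
                X[k, t, seq[t]] = 1.0
                Y[k, t, seq[t + 1]] = 1.0
                W[k, t] = w
        return X, Y, W

These arrays match the (batch, time, |C|) layout expected by the model sketched in Section 3.1; exactly how the per-timestep weights are passed to the training loop depends on the framework version.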
Our training time on an NVIDIA V100 GPU (on an AWS P3 instance) for the AOL dataset is 17, 18, and 24 hours for LSTMs with 256, 512, and 1024 neurons, respectively. For the Amazon dataset, we remove all duplicate queries and apply a per-instance weight equal to the number of occurrences of each query, to reduce the number of training examples to iterate over. The training time for the Amazon dataset is roughly three times longer than for the AOL dataset using a batch size of 256. However, we found that if we increase the batch size to 1024, we can speed up training by a factor of 2.2 because of increased GPU utilization, without noticeable performance loss. As a result, we can run one epoch of the Amazon dataset in approximately 6 hours, and training on the entire dataset can be done within one day.

We evaluated relatively few other architectures for this model, as the goal here is to use the character-level language model for completion rather than to attain state-of-the-art results on language modeling in general. It is worth noting that the validation loss is lower than the training loss in Table 1, and the two losses become closer when the LSTM/GRU size is increased, indicating that our models are still in the regime of under-fitting, and even larger LSTM sizes may improve performance. However, the requirement of completing queries in real time forbids us from using a larger model, as we will show shortly in the next section.

6.2 Runtime evaluation
Compared with a traditional database-lookup-based query completion system, one challenge of our deep learning based approach is its prediction time. Here we summarize the speedups achieved by the different optimizations discussed in Section 5 in Table 2, and report the time to give 16 suggestions for a prefix. A naive implementation in Keras would result in a prediction time of over one second per query, which is intolerable when completing the user's query in real time, where a completion time close to typing speed is desired. With all the optimization techniques applied, we observe an over 50X speedup compared with a naive beam search implementation. These optimizations are crucial to make our deep learning based query completion practical as an online service.

Table 2: The speedups from different optimizations, with an LSTM of dimension 256. Results were produced on a Xeon E5-2670 machine, running 8 threads.

Optimization                         Resulting runtime
Naive beam search implementation     > 1 sec
Tree-based beam search               250 ms
Adding MKL BLAS                      75 ms
OpenMP parallelization               25 ms
Custom LSTM implementation           13.3 ms
Adding prefix edit distance          16.3 ms
Stochastic search                    40 ms

One interesting point to note is that stochastic search in this setting actually takes three times longer than beam search, with all the same optimization techniques applied, to generate the same number of completion candidates. This is because stochastic search tends to generate completions that are much longer than those of beam search, interestingly making the "simpler" method actually substantially slower while giving worse completions (which we will evaluate shortly).
As an online service, it is important to pay attention to worst-case performance, especially because user query prefixes have a long-tail nature. In Table 3, we measure the prediction time using 100,000 real user query prefixes from Amazon.com, and report the top-percentile (TP) performance. A TP99 of 12 ms indicates that 99% of user requests can be served under 12 milliseconds.

Table 3: Top-percentile of completion time on the Amazon dataset with varying LSTM dimension. TPx is the minimum time (in milliseconds) under which x% of requests have been served. Results were produced on an Intel Xeon E5-2686 v4 machine, running with 8 threads.

                         LSTM Dimension
Top-Percentile     256          512          1024
TP50               5.388 ms     15.74 ms     67.19 ms
TP90               8.067 ms     24.17 ms     96.08 ms
TP99               11.14 ms     28.86 ms     114.74 ms
TP99.9             12.02 ms     30.93 ms     125.63 ms
TP99.99            12.30 ms     31.35 ms     131.84 ms

In Figure 1, we plot the cumulative distribution function (CDF) of prediction time. We observe that the distribution of prediction time given real user query prefixes does have a very long tail. We desire that the response time of our query completion service be close to user typing speed; thus the LSTM with 1024 hidden neurons is unsuitable for our use case despite showing the best prediction performance.

Figure 1: The cumulative distribution function (CDF) of completion time for different LSTM sizes: (a) LSTM-256, (b) LSTM-512, (c) LSTM-1024. Time is measured by predicting on 100,000 real user query prefixes from Amazon.com.

6.3 Performance evaluation
Finally, we evaluate the actual performance of the completion approaches, both comparing our beam search method to stochastic search (evaluated by log likelihood under the model), and comparing our completion method to a heavily optimized in-memory trie-based completion model, the standard data structure for completion given string prefixes.

Stochastic Search vs. Beam Search. In Table 5 we highlight the performance of beam search versus stochastic search for query completion, evaluated in terms of log likelihood under the model. Over all models, splitting methods, and LSTM sizes, beam search produces substantially better results in terms of log likelihood; in addition, it is 3x faster, as mentioned above. Thus we believe that beam search is necessary for our real-time query completion task, justifying our efforts to optimize its runtime. Note that in this case we are not including any error correction, as it is not trivial to integrate it into the stochastic search setting, and we wanted a direct comparison on sample likelihood.

Table 5: Completion negative log likelihood for stochastic search vs. beam search (lower is better).

                        LSTM-256                LSTM-512
Dataset   Split      Beam     Stochastic     Beam     Stochastic
AOL       Prefix     0.984    1.058          0.798    0.840
AOL       Time       1.569    1.726          1.229    1.321
Amazon    Prefix     2.708    3.159          2.449    2.780
Amazon    Time       3.063    3.646          3.259    3.863

Our approach vs. database lookup. Finally, we compare our total approach (beam search with error correction) to a trie-based (i.e., prefix lookup) completion model. We compare the approaches using a combination of two metrics: 1) probabilistic coverage, which is simply the empirical conditional probability of the predicted completions given the prefix:

    Σ_i P̂(completion_i | prefix),    (16)

where P̂ is the empirical probability over the whole dataset (counts of completion i relative to all other queries with the same prefix in the whole dataset); and 2) hit rate, which simply counts the number of times a completion appears in the entire dataset. Because the error correction model adjusts the prefix, it is not possible to compute probabilistic coverage exactly for it, but we can still get a sense of how likely the completions are, based upon how often they occur, using the hit rate metric.
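As a rough sketch of how the coverage metric can be computed from a held-out query log (the per-prefix sum follows Eq. (16); averaging over prefixes is our choice of aggregation, and suggest stands in for any completion method):

    from collections import Counter, defaultdict

    def probabilistic_coverage(test_log, suggest, r=16):
        # test_log: iterable of (prefix, full_query) pairs from the held-out data.
        by_prefix = defaultdict(Counter)
        for prefix, query in test_log:
            by_prefix[prefix][query] += 1          # empirical P(completion | prefix)
        total = 0.0
        for prefix, counts in by_prefix.items():
            n = sum(counts.values())
            predicted = set(suggest(prefix)[:r])
            total += sum(c for q, c in counts.items() if q in predicted) / n
        return total / len(by_prefix)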
Table 4 shows the performance of the trie-based approach, beam search, and beam search with error correction under these metrics. Our models generally outperform the trie-based approach in all settings, the one exception being probabilistic coverage on the time-based training/testing split. This is possibly due to some amount of shift over time in the search query terms. And although we cannot generate coverage numbers for the error-correction method, the significantly larger hit rate suggests that it is indeed giving better completions than all the alternative approaches.

Table 4: Performance of our language model based methods (LSTM with 256 and 512 dimensions) versus trie-based prefix lookup on the AOL dataset.

Train/test Split    Completion Method                              Probabilistic Coverage    Hit Rate
Prefix Splitting    Trie-based                                     0.2754                    1482
                    Beam Search (LSTM-256)                         0.4023                    1679
                    Beam Search (LSTM-512)                         0.4476                    1730
                    Beam Search w/ error correction (LSTM-256)     -                         3864
                    Beam Search w/ error correction (LSTM-512)     -                         3860
Time Splitting      Trie-based                                     0.4869                    1273
                    Beam Search (LSTM-256)                         0.3089                    1065
                    Beam Search (LSTM-512)                         0.3578                    1080
                    Beam Search w/ error correction (LSTM-256)     -                         1534
                    Beam Search w/ error correction (LSTM-512)     -                         1581

Further, we note that in addition to these numbers, there are a few notable disadvantages of trie-based lookup. The trie data structure we compare to is very memory intensive (it requires keeping prefixes for all relevant queries in memory), and takes a minimum of 11 GB of RAM for the entire AOL search dataset and over 50 GB of RAM for the Amazon dataset. Our deep learning based language model approach uses over two orders of magnitude less memory, even at its largest configuration (LSTM-1024), as shown in Table 6. It is worth mentioning that the Amazon dataset used in our experiments contains user queries sampled from one month's data on amazon.com; if we wanted to use the complete data from that month, the trie-based approach could become prohibitively memory intensive, while our deep learning based approach does not have this limitation, and in general a learning based approach can benefit more from having a bigger dataset.

Additionally, if a prefix has not been seen before in the dataset, the trie-based approach will offer no completions. In our test set of the Amazon dataset, which contains user queries from a time period subsequent to the training data, we observe that a substantial fraction of queries do not appear in the training data. A trie-based approach cannot give these queries as suggestions, whereas our deep learning based approach can still make suggestions using its language model. Furthermore, the trie-based approach is not amenable to error correction in isolation, as candidate corrections need to be proposed prior to lookup in the database; the process of repeatedly generating these candidates and performing the lookups is workable for at most 2 edits, whereas our approach empirically handles completions that include 4-5 edits with ease; this is reflected in the significantly higher hit rate in Table 4.

Table 6: Number of model parameters and memory consumption of each method. Measurements were done on a 64-bit Linux machine and we record the resident set size (RSS) of each program.

Dataset    Model         Number of parameters    Memory requirement
AOL        Trie-based    0 M                     11 GB
           LSTM-256      0.8 M                   12 MB
           LSTM-512      3.3 M                   30 MB
           LSTM-1024     12.8 M                  103 MB
Amazon     Trie-based    0 M                     50 GB
           LSTM-256      1.1 M                   14 MB
           LSTM-512      3.8 M                   35 MB
           LSTM-1024     13.9 M                  112 MB

7 CONCLUSIONS
In this paper, we have presented a search query completion approach based upon character-level deep language models. We proposed a method for integrating the approach with an error correction framework and showed that candidate completions with error correction can be efficiently generated using beam search. We further described several optimizations that enabled the system to deliver results in real time, including a CPU-based custom LSTM implementation. We demonstrated the effectiveness of our method on two large-scale datasets from AOL and Amazon, and showed that our proposed deep learning based query completion model is able to produce better completions than simple prefix lookup, while simultaneously being able to generate the candidates in real time.

Acknowledgment. The authors thank Juzer Arsiwala for his help in preparing the Amazon training dataset.
REFERENCES
[1] Ziv Bar-Yossef and Naama Kraus. 2011. Context-sensitive query auto-completion. In Proceedings of the 20th International Conference on World Wide Web. ACM, 107–116.
[2] Eric Brill and Robert C Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 286–293.
[3] Fei Cai, Maarten De Rijke, et al. 2016. A survey of query auto completion in information retrieval. Foundations and Trends in Information Retrieval 10, 4 (2016), 273–363.
[4] François Chollet et al. 2015. Keras. https://github.com/fchollet/keras. (2015).
[5] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704 (2016).
[6] Huizhong Duan and Bo-June Paul Hsu. 2011. Online spelling correction for query completion. In Proceedings of the 20th International Conference on World Wide Web. ACM, 117–126.
[7] G David Forney. 1973. The Viterbi algorithm. Proc. IEEE 61, 3 (1973), 268–278.
[8] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[9] Bo-June Paul Hsu and Giuseppe Ottaviano. 2013. Space-efficient data structures for top-k completion. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 583–594.
[10] Dan Jurafsky and James H Martin. [n. d.]. Speech and Language Processing. Vol. 3.
[11] Mark D Kernighan, Kenneth W Church, and William A Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the 13th Conference on Computational Linguistics, Volume 2. Association for Computational Linguistics, 205–210.
[12] Chang Liu, Xin Wang, Richard Shin, Joseph E Gonzalez, and Dawn Song. 2016. Neural Code Completion. (2016).
[13] Bruno Martins and Mário J Silva. 2004. Spelling correction for search engine queries. In Advances in Natural Language Processing. Springer, 372–383.
[14] Bhaskar Mitra and Nick Craswell. 2015. Query auto-completion for rare prefixes. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 1755–1758.
[15] Dae Hoon Park and Rikio Chiba. 2017. A Neural Language Model for Query Auto-Completion. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1189–1192.
[16] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems. ACM, 1.
[17] Toby Segaran and Jeff Hammerbacher. 2009. Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media, Inc.
[18] Milad Shokouhi. 2013. Learning to personalize query auto-completion. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 103–112.
[19] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 553–562.
[20] Idan Szpektor, Aristides Gionis, and Yoelle Maarek. 2011. Improving recommendation for long-tail queries via templates. In Proceedings of the 20th International Conference on World Wide Web. ACM, 47–56.
[21] Robert A Wagner and Michael J Fischer. 1974. The string-to-string correction problem. Journal of the ACM 21, 1 (1974), 168–173.
[22] Casey Whitelaw, Ben Hutchinson, Grace Y Chung, and Gerard Ellis. 2009. Using the web for language independent spellchecking and autocorrection. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 2. Association for Computational Linguistics, 890–899.
[23] Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y Ng. 2016. Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727 (2016).