Query Segmentation via RNNs Encoder-Decoder Framework

Yiu-Chang Lin, Giuseppe Di Fabbrizio, Ankur Datta
Rakuten Institute of Technology, Boston, Massachusetts, USA 02110
yiuchang.lin@rakuten.com, difabbrizio@gmail.com, ankur.datta@rakuten.com

ABSTRACT
Query segmentation is the task of segmenting a Web query into adjacent phrases, typically keywords that form noun phrases or concepts relevant to the search task. In this paper, we describe a research study and preliminary experimental results for query segmentation via a Recurrent Neural Network (RNN) encoder-decoder framework on a public benchmark dataset (Webis-QSeC-10). The resulting segmented queries can be used for several downstream tasks such as improving the performance of relevance ranking in search, better understanding query intent, and suggesting queries for auto-completion.

KEYWORDS
Query segmentation, Recurrent neural network, Encoder-Decoder

ACM Reference format:
Yiu-Chang Lin, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Query Segmentation via RNNs Encoder-Decoder Framework. In Proceedings of SIGIR 2017 eCom, Tokyo, Japan, August 2017, 5 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION
Query segmentation aims to detect semantically consistent phrases that identify entities and concepts in Web search queries, e.g., "[air conditioner][remote control]", "[compact][microwave oven]", and "[iphone 7][cover]". Such phrases are the semantic structural units of a search task and can be exploited by search engines as indivisible units in order to improve retrieval precision or to reformulate queries at the phrase level. Short texts such as Web queries often do not follow grammar rules, so traditional methods based on well-formed English are not applicable.

Query segmentation is one of the most important tasks toward query understanding, a key component of modern search engines for precisely inferring users' intent from queries, since query segments can be further refined into named entities and semantic relations linking head phrases with modifiers.

Both supervised and unsupervised learning techniques have been used to solve the query segmentation task in the past. In the supervised learning category, a Support Vector Machines ranker [10] was used to learn a structured classifier that makes a segmentation decision (yes or no) between each pair of contiguous tokens [3]. Another well-known model that has been successfully applied to a variety of sequence labeling tasks is Conditional Random Fields (CRFs) [14]. CRFs model the conditional probability distribution over a label sequence given an input query, where each token in the query is assigned a label from a pre-defined label set [22]. However, such supervised methods require a large amount of human segmentation labels, which are usually expensive to obtain, and careful feature engineering plays an important role in achieving high segmentation accuracy.

On the other hand, in the unsupervised learning family, several methods have been proposed to either automatically collect segmented queries or train segmentation models from query log data. For example, in the e-commerce domain, query terms are aligned to product attribute terms via user click data, and the ambiguities are resolved using frequency and similarity statistics [12]. Statistical methods based on point-wise mutual information (PMI) [13], n-gram frequency [18], or Multi-Word Expression probability [16] are also popular. One unsupervised approach using generative language models and Wikipedia as an external resource has been reported to have competitive performance [21]. Another unsupervised probabilistic model was proposed to exploit user click-throughs for query segmentation, with the model parameters estimated by an efficient expectation-maximization (EM) algorithm [15].

Recently, Deep Neural Network (DNN) models have shown powerful capabilities on various difficult Natural Language Processing tasks. Especially in end-to-end sequence learning, the Encoder-Decoder network [20], which makes minimal assumptions on the sequence structure, is widely used in machine translation [1, 4, 5]. In this paper, we propose to treat query segmentation as a machine translation task and apply the Encoder-Decoder framework to generate query segments. Preliminary results on the Webis-QSeC-10 dataset are reported.
2 DATA
The Webis Query Segmentation Corpus (Webis-QSeC-10) [8] (available at https://www.uni-weimar.de/de/medien/professuren/medieninformatik/webis/corpora/webis-qsec-10/) consists of 53,437 web queries, and each query has at least 10 segmentations provided by 10 different annotators crowdsourced via Amazon's Mechanical Turk (AMT). A sample of 4,850 queries is published as the training set and the remaining 48,587 queries serve as the testing set, roughly a 1:10 train/test split. Webis-QSeC-10 is sampled from the subset of the AOL query log [19] that consists only of queries with a length of 3 to 10 words. Since 1-word queries cannot be segmented and 2-word queries are typically handled well by proximity features, queries with just 1 or 2 words are excluded. The sampling maintains the query length distribution and the query frequency distribution of the entire AOL query log.

An example query with its segmentations from the training set is shown below, where 1004073900 is the unique query id, followed by a list of (vote, segmentation) pairs indicating the 10 different decisions the AMT workers made for that query.

• 1004073900
• (5, 'graffiti fonts|alphabet'), (3, 'graffiti|fonts|alphabet'), (2, 'graffiti fonts alphabet')

Since each query is segmented by at least 10 annotators and not all of them always agree with each other, to select the reference annotation we apply the break fusion strategy described in [7]. The underlying idea is that annotators should at least agree on specific important segments even if there is no absolute majority on the entire query segmentation. Break fusion simply follows the majority of annotators at each single break position of a query. A break is inserted in case of a tie vote. Consider the following example annotation:

• 5 graffiti fonts|alphabet
• 3 graffiti|fonts|alphabet
• 2 graffiti fonts alphabet

At the first break position (between graffiti and fonts), 7 (5+2) annotators agree with no break. Similarly, 8 (5+3) annotators agree with inserting a break at the second break position (between fonts and alphabet). Therefore the final reference is

• graffiti fonts|alphabet
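Break fusion reduces to a per-position majority vote over the annotators' weighted break decisions. The following is a minimal sketch of that rule, assuming the annotations are given as (vote, segmentation) pairs with segments separated by "|"; the function name and data layout are ours and not part of the Webis-QSeC-10 distribution.

```python
from typing import List, Tuple

def break_fusion(annotations: List[Tuple[int, str]]) -> str:
    """Fuse weighted segmentations of one query into a single reference.

    Each annotation is a (vote_count, segmentation) pair, e.g.
    (5, 'graffiti fonts|alphabet'). For every break position between two
    consecutive words, a break is kept if at least half of the votes placed
    one there; a tie also produces a break.
    """
    # Recover the token sequence from the first segmentation.
    tokens = annotations[0][1].replace('|', ' ').split()
    total_votes = sum(votes for votes, _ in annotations)

    # Count, per break position, how many votes inserted a break there.
    break_votes = [0] * (len(tokens) - 1)
    for votes, segmentation in annotations:
        position = 0
        for segment in segmentation.split('|'):
            position += len(segment.split())
            if position < len(tokens):
                break_votes[position - 1] += votes

    # Majority vote per position; ties insert a break.
    fused = tokens[0]
    for i, word in enumerate(tokens[1:]):
        separator = '|' if 2 * break_votes[i] >= total_votes else ' '
        fused += separator + word
    return fused

# Example from Section 2: only the break between "fonts" and "alphabet" survives.
print(break_fusion([(5, 'graffiti fonts|alphabet'),
                    (3, 'graffiti|fonts|alphabet'),
                    (2, 'graffiti fonts alphabet')]))  # graffiti fonts|alphabet
```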
3 METHODS
In this section, we describe one baseline method [7] and two models, Conditional Random Fields (CRFs) and the Recurrent Neural Networks encoder-decoder framework, which are used in this paper for the query segmentation experiments.

3.1 Wikipedia Titles and Strict Noun Phrases Baseline
This baseline method simply treats only Wikipedia titles and strict noun phrases as query segments. If the query contains more than one overlapping Wikipedia title, the decision rule proposed in [8] is used, which assigns each title a score based on its frequency in the Google n-gram corpus multiplied by its length. For strict noun phrases, similarly, the product of their Web frequencies and length is assigned as the score. Finally, the segmentation with the highest score is chosen.

3.2 Conditional Random Fields
Conditional Random Fields have been widely used in NLP structured prediction tasks, especially sequence labeling such as part-of-speech (POS) tagging and named-entity recognition (NER). Formally, let the input sequence be x = x_1, x_2, ..., x_n and the label sequence be y = y_1, y_2, ..., y_n. We want to model the conditional distribution P(y|x) so that the optimal label sequence can be predicted by solving y* = argmax_y P(y|x). The probabilistic model for sequence CRFs defines a family of conditional probabilities P(y|x, \lambda) over all possible label sequences y given x with the following form:

P(y|x, \lambda) = \frac{\exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)}{Z(x)}

Z(x) = \sum_{y \in Y} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)

where \lambda denotes the model parameters, f_j are the feature functions, and the numerator of P(y|x, \lambda) is composed of potential functions. \lambda can be obtained by maximizing the log-likelihood of the training data with L1 or L2 regularization terms,

L(\lambda) = \sum_{i} \log P(y^{(i)} | x^{(i)}, \lambda)

In order to apply CRFs to the query segmentation task, we introduce the standard Begin, Inside, Outside (BIO) tagging schema to map a segmented query to a sequence of tags. Table 1 shows some example queries from the Webis-QSeC-10 training set with their corresponding BIO tags.

Table 1: Example queries from the Webis-QSeC-10 training set and their corresponding BIO tags.

    segmented query                         BIO tagging
    graffiti fonts | alphabet               graffiti (B) fonts (I) alphabet (B)
    stainless steel | chest freezers        stainless (B) steel (I) chest (B) freezers (I)
    rutgers | online | graduate classes     rutgers (B) online (B) graduate (B) classes (I)
    review | on | breezes                   review (B) on (B) breezes (B)
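As an illustration of the BIO conversion in Table 1 (and of the word uni-gram and bi-gram features later fed to CRFsuite in Section 4.1), the sketch below shows one possible implementation; the function names, padding tokens, and exact feature-string format are our own assumptions, not those used in the original experiments.

```python
def to_bio(segmented_query: str):
    """Map a segmented query such as 'graffiti fonts|alphabet' to BIO tags.

    The first word of every segment is tagged B and the remaining words I;
    no O tag occurs, since every query word belongs to some segment.
    """
    words, tags = [], []
    for segment in segmented_query.split('|'):
        for position, word in enumerate(segment.split()):
            words.append(word)
            tags.append('B' if position == 0 else 'I')
    return words, tags

def token_features(words, t):
    """Word uni-gram and bi-gram features around position t (cf. Section 4.1)."""
    padded = ['<s>', '<s>'] + words + ['</s>', '</s>']
    i = t + 2  # index of word t in the padded sequence
    return {
        'w[-2]': padded[i - 2], 'w[-1]': padded[i - 1], 'w[0]': padded[i],
        'w[+1]': padded[i + 1], 'w[+2]': padded[i + 2],
        'w[-1]|w[0]': padded[i - 1] + '|' + padded[i],
        'w[0]|w[+1]': padded[i] + '|' + padded[i + 1],
    }

words, tags = to_bio('stainless steel|chest freezers')
print(list(zip(words, tags)))
# [('stainless', 'B'), ('steel', 'I'), ('chest', 'B'), ('freezers', 'I')]
```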
3.3 Recurrent Neural Networks
The fundamental idea of Recurrent Neural Networks is that the network contains a feed-back connection, as shown in the left part of Figure 1, so that it can make use of sequential information. RNNs perform the same task for every element in a sequence x, with the output o depending on the computations from the previous state s. This characteristic enables the networks to process sequences and learn sequential structure. Theoretically, RNNs are capable of capturing arbitrarily long-distance dependencies, but in practice they are limited to looking back only a few steps, a limitation known as the vanishing/exploding gradient problem [2].

The right part of Figure 1 shows a typical RNN and its forward computation structure after being unfolded into a full network within the sequence window t-1, t, and t+1. Assume that the input sequence x is a sentence consisting of n words, x_1, x_2, ..., x_n. x_t is the input token at position t, and it can be represented as a typical one-hot vector or a word embedding of dimension d. s_t, the corresponding hidden state or "memory", is calculated from the previous hidden state and the input at the current step. In this case, we would like to predict the next word given x_1, x_2, ..., x_{t-1}, so o_t would be a vector of probabilities across the vocabulary. The following equations explicitly describe the computation of an RNN:

s_t = f(U x_t + W s_{t-1})
o_t = \mathrm{softmax}(V s_t)

where the function f is a nonlinearity such as tanh or ReLU. U, V, and W are matrices (model parameters) and can be optimized through back-propagation. Usually s_{-1}, which is required to calculate the first hidden state, is initialized to a zero vector.

Figure 1: A recurrent neural network and the unfolding in time of the computation involved in its forward computation.
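A minimal NumPy sketch of the forward computation above follows; the dimensions, seed, and variable names are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(x_seq, U, W, V):
    """Run s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t) over a sequence.

    x_seq: list of input vectors (one-hot or embeddings) of dimension d
    U: (h, d), W: (h, h), V: (vocab, h); s_{-1} is the zero vector.
    """
    h = W.shape[0]
    s = np.zeros(h)
    states, outputs = [], []
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)
        states.append(s)
        outputs.append(softmax(V @ s))
    return states, outputs

# Toy usage: vocabulary of 5 one-hot tokens, hidden size 8.
rng = np.random.default_rng(0)
d, h, vocab = 5, 8, 5
U, W, V = rng.normal(size=(h, d)), rng.normal(size=(h, h)), rng.normal(size=(vocab, h))
one_hot = lambda i: np.eye(d)[i]
_, outs = rnn_forward([one_hot(1), one_hot(3), one_hot(0)], U, W, V)
print(outs[-1].round(3))  # probability distribution over the 5-token vocabulary
```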
3.4 RNNs Encoder-Decoder Framework
Figure 2 shows a typical encoder-decoder framework, a model consisting of two separate RNNs called the encoder and the decoder. The encoder reads an input sequence one item at a time and outputs a vector at each step (ignored in Figure 2). The final output of the encoder serves as the context vector, and the decoder uses this context vector to generate a sequence of outputs. In the context of machine translation, the encoder first processes a variable-length input word sequence from the source language and builds a fixed-length vector representation (the context vector). Conditioned on this encoded representation, the decoder produces a variable-length word sequence in the target language. In an ideal case, the context vector can be considered the meaning of the sequence in a latent semantic space, and this idea can be extended beyond sequences. For example, in image captioning tasks, the encoder-decoder framework takes an image as input and produces a text description as output. In the reverse direction, image generation tasks take a text description as input and output a generated image.

Figure 2: A RNN encoder-decoder framework and its application to query segmentation.

To fit the query segmentation task into the encoder-decoder framework, we treat the original query as an input sequence from one language and the segmented query as an output sequence from the other language. The target vocabulary is therefore the same as the source vocabulary except for one additional break token, i.e.,

\mathrm{Vocab}_{target} = \mathrm{Vocab}_{source} \cup \{ "|" \}

In practice, the queries and their segmentations combined are treated as a parallel corpus for training. In the testing phase, the encoder first calculates the context vector and the decoder then generates output tokens one at a time from Vocab_target.
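A compact PyTorch sketch of how this query-to-segmentation "translation" could be wired up is shown below. It assumes plain nn.RNN units and an embedding layer for readability (the paper itself feeds one-hot vectors, Section 4.1); all class and variable names are ours, and the training loop is reduced to a single teacher-forced loss computation.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder RNN summarizes the query; decoder RNN emits tokens, including '|'."""
    def __init__(self, vocab_size, hidden_size=1024, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.RNN(hidden_size, hidden_size, num_layers,
                              nonlinearity='tanh', batch_first=True)
        self.decoder = nn.RNN(hidden_size, hidden_size, num_layers,
                              nonlinearity='tanh', batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src, tgt_in):
        # src: (batch, src_len) query token ids; tgt_in: (batch, tgt_len) shifted target
        _, context = self.encoder(self.embed(src))    # context: final hidden states
        dec_states, _ = self.decoder(self.embed(tgt_in), context)
        return self.out(dec_states)                   # (batch, tgt_len, vocab)

# Toy vocabulary: source words plus the extra break token on the target side.
itos = ['<pad>', '<sos>', '<eos>', '|', 'graffiti', 'fonts', 'alphabet']
stoi = {w: i for i, w in enumerate(itos)}

model = EncoderDecoder(vocab_size=len(itos), hidden_size=32, num_layers=2)
src = torch.tensor([[stoi['graffiti'], stoi['fonts'], stoi['alphabet']]])
tgt = torch.tensor([[stoi['<sos>'], stoi['graffiti'], stoi['fonts'],
                     stoi['|'], stoi['alphabet'], stoi['<eos>']]])
logits = model(src, tgt[:, :-1])                      # teacher forcing
loss = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits).flatten(0, 1), tgt[:, 1:].flatten())
loss.backward()                                       # an optimizer step would follow
print(logits.shape)                                   # torch.Size([1, 5, 7])
```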
4 EXPERIMENTS
The Webis-QSeC-10 corpus [8] comprises 53,437 web queries, and each of them has at least 10 segmentations. The reference segmentation is obtained as described in Section 2. There are 4,850 queries in the training set and 48,587 queries in the testing set. To quantify the segmentation results of the different algorithms, we adopt query level and break level accuracy [7] as the evaluation metrics. At query level, given a query q, its reference segmentation S, and the output segmentation S' from the model, the query accuracy is 1 if S' = S and 0 otherwise. At break level, a decision whether a break needs to be inserted is made for every two consecutive words in the query. The break accuracy is defined as the ratio of correct decisions over all break positions in q with respect to S'. Theoretically, there exist 2^{k-1} valid segmentations for a query q of k words, and (k^2 - k)/2 potential segments that contain at least two keywords from q.

4.1 Model Parameters
In our experiment, we use CRFsuite [17] for optimizing the CRF model parameters, and the following set of word uni-gram and bi-gram features is utilized:

• uni-gram: x_{-2}, x_{-1}, x, x_{+1}, x_{+2}
• bi-gram: x_{-1} x, x x_{+1}

For the RNN encoder-decoder, the following loss function, optimizer, and parameters are used:

• Word representation: one-hot vector
• RNN hidden layer size: 1024
• RNN number of layers: 2
• RNN activation function: tanh
• Loss function: negative log-likelihood loss
• Optimizer: Adam
• Learning rate: 0.0001
• Dropout rate: 0.05
• Epochs: 50,000

4.2 RNNs Encoder-Decoder Loss
Parameter optimization is performed with the Adam optimizer using negative log-likelihood as the loss function. Adam (Adaptive Moment Estimation) [11] is an algorithm for first-order gradient-based optimization of stochastic objective functions that computes adaptive learning rates for each parameter. Adam keeps an exponentially decaying average of both past gradients and past squared gradients. The loss value on the training set is recorded every 200 epochs; the training loss decreases steadily with the number of epochs and eventually converges (Figure 3).

Figure 3: Negative log-likelihood loss for the RNN encoder-decoder on the training set. The loss is recorded every 200 epochs.

4.3 Results
Table 2 shows the query-level and break-level accuracy of Wikipedia titles (WT), Wikipedia titles + strict noun phrases (WT+SNP), Conditional Random Fields, and the RNN encoder-decoder. WT+SNP has the best accuracy among the four methods at both levels. CRF performs better than the WT baseline in terms of both query level and break level accuracy. The RNN encoder-decoder framework, however, does not perform as well in this case as it does in other tasks such as machine translation and image captioning.

Table 2: Query level and break level accuracy on the Webis-QSeC-10 test set.

    Method                 query accuracy    break accuracy
    WT [7]                 0.431             0.769
    WT+SNP [7]             0.585             0.837
    CRF                    0.465             0.814
    RNN Encoder-Decoder    0.421             0.664
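For reference, the two evaluation measures reduce to a few lines of code. This sketch follows the definitions in Section 4 (exact-match query accuracy and per-position break accuracy); the helper names are our own.

```python
def break_decisions(segmentation: str):
    """True/False break decision at each of the k-1 positions of a k-word query."""
    decisions = []
    for segment in segmentation.split('|'):
        n = len(segment.split())
        decisions.extend([False] * (n - 1) + [True])
    return decisions[:-1]   # no decision after the last word

def query_accuracy(reference: str, predicted: str) -> float:
    """1 if the predicted segmentation matches the reference exactly, else 0."""
    return float(break_decisions(reference) == break_decisions(predicted))

def break_accuracy(reference: str, predicted: str) -> float:
    """Fraction of break positions on which the prediction agrees with the reference."""
    ref, pred = break_decisions(reference), break_decisions(predicted)
    return sum(r == p for r, p in zip(ref, pred)) / len(ref)

print(query_accuracy('graffiti fonts|alphabet', 'graffiti|fonts|alphabet'))  # 0.0
print(break_accuracy('graffiti fonts|alphabet', 'graffiti|fonts|alphabet'))  # 0.5
```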
5 DISCUSSION
The first two methods (WT and WT+SNP) in Table 2 are unsupervised but require external knowledge resources, e.g., Wikipedia titles, Google n-gram frequencies, and Web n-gram frequencies. On the other hand, both CRFs and the RNN encoder-decoder are supervised machine learning methods relying on human annotation. Since the training set consists of only 4,850 annotated queries, roughly one tenth of the size of the Webis-QSeC-10 testing set, the supervised methods cannot benefit from a large amount of training data. In addition to the small training set, short query length is another key factor that limits the power of the RNN encoder-decoder in query segmentation. Web queries are typically short and less structured compared to the standard sentences in machine translation corpora. Therefore, the remarkable capacity of RNNs for capturing long-distance dependencies is not that effective in this task. Although CRFs outperform the RNN encoder-decoder, one disadvantage of CRFs is that they require human-designed features, as opposed to RNNs, which require no feature engineering.

6 CONCLUSION AND FUTURE WORK
Query segmentation is crucial for a search engine to better understand query intent and return higher quality search results. This paper provides a study on fitting the query segmentation task into an RNN encoder-decoder framework and describes preliminary experimental results compared with other baselines. The RNN does not perform as expected due to the lack of training data and the short nature of queries. However, three feasible future directions might help improve the RNN encoder-decoder framework on query segmentation.

The first direction is to automatically collect a large amount of segmented queries via users' implicit feedback from query logs, as proposed in [12]. This would address the challenge of limited training data discussed in Section 5. Another direction is to replace the RNN units in the encoder-decoder framework with GRUs [6] or LSTMs [9] and add an attention mechanism [1] at the encoder, giving the decoder a way to "pay attention" to different parts of the input while decoding. Finally, substituting pre-trained word embeddings for the current one-hot word vectors will both reduce the input dimension and provide the network with a richer word representation.

ACKNOWLEDGMENTS
The authors would like to thank Dr. Martin Potthast from Bauhaus-Universität Weimar for kindly providing the full Webis Query Segmentation Corpus used for our modeling experiments. The authors would also like to thank the anonymous reviewers for their feedback and helpful advice.

REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[3] Shane Bergsma and Qin Iris Wang. 2007. Learning noun phrase query segmentation. In EMNLP-CoNLL, Vol. 7. 819–826.
[4] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[7] Matthias Hagen, Martin Potthast, Anna Beyer, and Benno Stein. 2012. Towards optimum query segmentation: in doubt without. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 1015–1024.
[8] Matthias Hagen, Martin Potthast, Benno Stein, and Christof Bräutigam. 2011. Query segmentation revisited. In Proceedings of the 20th International Conference on World Wide Web. ACM, 97–106.
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[10] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 133–142.
[11] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[12] Julia Kiseleva, Qi Guo, Eugene Agichtein, Daniel Billsus, and Wei Chai. 2010. Unsupervised query segmentation using click data: preliminary results. In Proceedings of the 19th International Conference on World Wide Web. ACM, 1131–1132.
[13] Giridhar Kumaran and Vitor R. Carvalho. 2009. Reducing long queries using query quality predictors. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 564–571.
[14] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Vol. 1. 282–289.
[15] Yanen Li, Bo-Jun Paul Hsu, ChengXiang Zhai, and Kuansan Wang. 2011. Unsupervised query segmentation using clickthrough for information retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 285–294.
[16] Nikita Mishra, Rishiraj Saha Roy, Niloy Ganguly, Srivatsan Laxman, and Monojit Choudhury. 2011. Unsupervised query segmentation using only query logs. In Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 91–92.
[17] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). http://www.chokkan.org/software/crfsuite/
[18] Nish Parikh, Prasad Sriram, and Mohammad Al Hasan. 2013. On segmentation of ecommerce queries. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1137–1146.
[19] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems (InfoScale). ACM.
[20] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[21] Bin Tan and Fuchun Peng. 2008. Unsupervised query segmentation using generative language models and Wikipedia. In Proceedings of the 17th International Conference on World Wide Web. ACM, 347–356.
[22] Xiaohui Yu and Huxia Shi. 2009. Query segmentation using conditional random fields. In Proceedings of the First International Workshop on Keyword Search on Structured Data. ACM, 21–26.