Query Segmentation via RNNs Encoder-Decoder Framework

Yiu-Chang Lin, Giuseppe Di Fabbrizio, Ankur Datta
Rakuten Institute of Technology, Boston, Massachusetts, USA 02110
yiuchang.lin@rakuten.com, difabbrizio@gmail.com, ankur.datta@rakuten.com

ABSTRACT
Query segmentation is the task of segmenting a Web query into adjacent phrases, typically keywords that form noun phrases or concepts relevant to the search task. In this paper, we describe a research study and preliminary experimental results for query segmentation via a Recurrent Neural Network (RNN) encoder-decoder framework on a public benchmark dataset (Webis-QSeC-10). The resulting segmented queries can be used for several downstream tasks such as improving the performance of relevance ranking in search, better understanding query intent, and suggesting queries for auto-completion.

KEYWORDS
Query segmentation, Recurrent neural network, Encoder-Decoder

ACM Reference format:
Yiu-Chang Lin, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Query Segmentation via RNNs Encoder-Decoder Framework. In Proceedings of SIGIR 2017 eCom, Tokyo, Japan, August 2017, 5 pages.

Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org

1 INTRODUCTION
Query segmentation aims to detect semantically consistent phrases that identify entities and concepts in Web search queries, e.g., "[air conditioner][remote control]", "[compact][microwave oven]", and "[iphone 7][cover]". Such phrases are the semantic structural units of a search task and can be exploited by search engines as indivisible units in order to improve retrieval precision or to reformulate queries at the phrase level. Short texts such as Web queries often do not follow grammar rules, so traditional methods based on well-formed English are not applicable.

Query segmentation is one of the most important tasks toward query understanding, a key component of modern search engines for precisely inferring users' intent from queries, since query segments can be further refined into named entities and semantic relations linking head phrases with modifiers.

Both supervised and unsupervised learning techniques have been used to solve the query segmentation task in the past. In the supervised learning category, a Support Vector Machines ranker [10] was used to learn a structured classifier that makes a segmentation decision (yes or no) between each pair of contiguous tokens [3]. Another well-known model that has been successfully applied to a variety of sequence labeling tasks is Conditional Random Fields (CRFs) [14]. CRFs model the conditional probability distribution over a label sequence given an input query, where each token in the query is assigned a label from a pre-defined label set [22]. However, such supervised methods require a large amount of human segmentation labels, which are usually expensive to obtain, and careful feature engineering plays an important role in achieving high segmentation accuracy.

On the other hand, in the unsupervised learning family, several methods have been proposed to either automatically collect segmented queries or train segmentation models from query log data. For example, in the e-commerce domain, query terms are aligned to product attribute terms via user click data, and the ambiguities are resolved using frequency and similarity statistics [12]. Statistical methods based on point-wise mutual information (PMI) [13], n-gram frequency [18], or Multi-Word Expression probability [16] are also popular. One unsupervised approach using generative language models and Wikipedia as an external resource has been reported to have competitive performance [21]. Another unsupervised probabilistic model was proposed to exploit user click-throughs for query segmentation, with the model parameters estimated by an efficient expectation-maximization (EM) algorithm [15].

Recently, Deep Neural Network (DNN) models have shown powerful capabilities on various difficult Natural Language Processing tasks. Especially in end-to-end sequence learning, the Encoder-Decoder network [20], which makes minimal assumptions on the sequence structure, is widely used in machine translation [1, 4, 5]. In this paper, we propose to treat query segmentation as a machine translation task and apply the Encoder-Decoder framework to generate query segments. Preliminary results on the Webis-QSeC-10 dataset are reported.
2 DATA
The Webis Query Segmentation Corpus (Webis-QSeC-10) [8] (available at https://www.uni-weimar.de/de/medien/professuren/medieninformatik/webis/corpora/webis-qsec-10/) consists of 53,437 web queries, and each query has at least 10 segmentations provided by 10 different annotators crowdsourced via Amazon's Mechanical Turk (AMT). A sample of 4,850 queries is published as the training set and the remaining 48,587 queries serve as the testing set, roughly a 1:10 train/test split. Webis-QSeC-10 is sampled from the subset of the AOL query log [19] that consists only of queries with a length of 3 to 10 words. Since 1-word queries cannot be segmented and 2-word queries are typically handled well by proximity features, queries with just 1 or 2 words are excluded. The sampling maintains the query length distribution and the query frequency distribution of the entire AOL query log.

An example query with its segmentations from the training set is shown below, where 1004073900 is the unique query id, followed by a list of (vote, segmentation) pairs indicating the 10 different decisions the AMT workers made for that query.

• 1004073900
• (5, 'graffiti fonts|alphabet'), (3, 'graffiti|fonts|alphabet'), (2, 'graffiti fonts alphabet')

Since each query is segmented by at least 10 annotators and not all of them always agree with each other, to select the reference annotation we apply the break fusion strategy described in [7]. The underlying idea is that annotators should at least agree on specific important segments even if there is no absolute majority on the entire query segmentation. Break fusion simply follows the majority of annotators at each single break position of a query. A break is inserted in case of a tie vote. Consider the following example annotation:

• 5 graffiti fonts|alphabet
• 3 graffiti|fonts|alphabet
• 2 graffiti fonts alphabet

At the first break position (between graffiti and fonts), 7 (5+2) annotators agree with no break. Similarly, 8 (5+3) annotators agree with inserting a break at the second break position (between fonts and alphabet). Therefore the final reference is

• graffiti fonts|alphabet
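Break fusion reduces to a per-position majority vote over the annotators' weighted break decisions. The following is a minimal sketch of that rule, assuming the annotations are given as (vote, segmentation) pairs with segments separated by "|"; the function name and data layout are ours and not part of the Webis-QSeC-10 distribution.

```python
from typing import List, Tuple

def break_fusion(annotations: List[Tuple[int, str]]) -> str:
    """Fuse weighted segmentations of one query into a single reference.

    Each annotation is a (vote_count, segmentation) pair, e.g.
    (5, 'graffiti fonts|alphabet'). For every break position between two
    consecutive words, a break is kept if at least half of the votes placed
    one there; a tie also produces a break.
    """
    # Recover the token sequence from the first segmentation.
    tokens = annotations[0][1].replace('|', ' ').split()
    total_votes = sum(votes for votes, _ in annotations)

    # Count, per break position, how many votes inserted a break there.
    break_votes = [0] * (len(tokens) - 1)
    for votes, segmentation in annotations:
        position = 0
        for segment in segmentation.split('|'):
            position += len(segment.split())
            if position < len(tokens):
                break_votes[position - 1] += votes

    # Majority vote per position; ties insert a break.
    fused = tokens[0]
    for i, word in enumerate(tokens[1:]):
        separator = '|' if 2 * break_votes[i] >= total_votes else ' '
        fused += separator + word
    return fused

# Example from Section 2: only the break between "fonts" and "alphabet" survives.
print(break_fusion([(5, 'graffiti fonts|alphabet'),
                    (3, 'graffiti|fonts|alphabet'),
                    (2, 'graffiti fonts alphabet')]))  # graffiti fonts|alphabet
```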
3 METHODS
In this section, we describe one baseline method [7] and two models, Conditional Random Fields (CRFs) and the Recurrent Neural Networks encoder-decoder framework, which are used in this paper for the query segmentation experiments.

3.1 Wikipedia Titles and Strict Noun Phrases Baseline
This baseline method simply treats only Wikipedia titles and strict noun phrases as query segments. If the query contains more than one overlapping Wikipedia title, the decision rule proposed in [8] is used, which assigns each title a score based on its frequency in the Google n-gram corpus multiplied by its length. For strict noun phrases, similarly, the product of their Web frequencies and length is assigned as the score. Finally, the segmentation with the highest score is chosen.

3.2 Conditional Random Fields
Conditional Random Fields have been widely used in NLP structured prediction tasks, especially sequence labeling such as part-of-speech (POS) tagging and named-entity recognition (NER). Formally, let the input sequence be x = x_1, x_2, ..., x_n and the label sequence be y = y_1, y_2, ..., y_n. We want to model the conditional distribution P(y|x) so that the optimal label sequence can be predicted by solving y* = argmax_y P(y|x). The probabilistic model for sequence CRFs defines a family of conditional probabilities P(y|x, \lambda) over all possible label sequences y given x with the following form:

P(y|x, \lambda) = \frac{\exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)}{Z(x)}

Z(x) = \sum_{y \in Y} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)

where \lambda denotes the model parameters, f_j are the feature functions, and the numerator of P(y|x, \lambda) is composed of potential functions. \lambda can be obtained by maximizing the log-likelihood of the training data with L1 or L2 regularization terms,

L(\lambda) = \sum_{i} \log P(y^{(i)} | x^{(i)}, \lambda)

In order to apply CRFs to the query segmentation task, we introduce the standard Begin, Inside, Outside (BIO) tagging schema to map a segmented query to a sequence of tags. Table 1 shows some example queries from the Webis-QSeC-10 training set with their corresponding BIO tags.

Table 1: Example queries from the Webis-QSeC-10 training set and their corresponding BIO tags.

    segmented query                         BIO tagging
    graffiti fonts | alphabet               graffiti (B) fonts (I) alphabet (B)
    stainless steel | chest freezers        stainless (B) steel (I) chest (B) freezers (I)
    rutgers | online | graduate classes     rutgers (B) online (B) graduate (B) classes (I)
    review | on | breezes                   review (B) on (B) breezes (B)
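As an illustration of the BIO conversion in Table 1 (and of the word uni-gram and bi-gram features later fed to CRFsuite in Section 4.1), the sketch below shows one possible implementation; the function names, padding tokens, and exact feature-string format are our own assumptions, not those used in the original experiments.

```python
def to_bio(segmented_query: str):
    """Map a segmented query such as 'graffiti fonts|alphabet' to BIO tags.

    The first word of every segment is tagged B and the remaining words I;
    no O tag occurs, since every query word belongs to some segment.
    """
    words, tags = [], []
    for segment in segmented_query.split('|'):
        for position, word in enumerate(segment.split()):
            words.append(word)
            tags.append('B' if position == 0 else 'I')
    return words, tags

def token_features(words, t):
    """Word uni-gram and bi-gram features around position t (cf. Section 4.1)."""
    padded = ['<s>', '<s>'] + words + ['</s>', '</s>']
    i = t + 2  # index of word t in the padded sequence
    return {
        'w[-2]': padded[i - 2], 'w[-1]': padded[i - 1], 'w[0]': padded[i],
        'w[+1]': padded[i + 1], 'w[+2]': padded[i + 2],
        'w[-1]|w[0]': padded[i - 1] + '|' + padded[i],
        'w[0]|w[+1]': padded[i] + '|' + padded[i + 1],
    }

words, tags = to_bio('stainless steel|chest freezers')
print(list(zip(words, tags)))
# [('stainless', 'B'), ('steel', 'I'), ('chest', 'B'), ('freezers', 'I')]
```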
3.3 Recurrent Neural Networks
The fundamental idea of Recurrent Neural Networks is that the network contains a feed-back connection, as shown in the left part of Figure 1, so that it can make use of sequential information. RNNs perform the same task for every element in a sequence x, with the output o depending on the computations from the previous state s. This characteristic enables the networks to process sequences and learn sequential structure. Theoretically, RNNs are capable of capturing arbitrarily long-distance dependencies, but in practice they are limited to looking back only a few steps, a limitation known as the vanishing/exploding gradient problem [2].

The right part of Figure 1 shows a typical RNN and its forward computation structure after being unfolded into a full network within the sequence window t-1, t, and t+1. Assume that the input sequence x is a sentence consisting of n words, x_1, x_2, ..., x_n. x_t is the input token at position t, and it can be represented as a typical one-hot vector or a word embedding of dimension d. s_t, the corresponding hidden state or "memory", is calculated from the previous hidden state and the input at the current step. In this case, we would like to predict the next word given x_1, x_2, ..., x_{t-1}, so o_t would be a vector of probabilities across the vocabulary. The following equations explicitly describe the computation of an RNN:

s_t = f(U x_t + W s_{t-1})
o_t = \mathrm{softmax}(V s_t)

where the function f is a nonlinearity such as tanh or ReLU. U, V, and W are matrices (model parameters) and can be optimized through back-propagation. Usually s_{-1}, which is required to calculate the first hidden state, is initialized to a zero vector.

Figure 1: A recurrent neural network and the unfolding in time of the computation involved in its forward computation.
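A minimal NumPy sketch of the forward computation above follows; the dimensions, seed, and variable names are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(x_seq, U, W, V):
    """Run s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t) over a sequence.

    x_seq: list of input vectors (one-hot or embeddings) of dimension d
    U: (h, d), W: (h, h), V: (vocab, h); s_{-1} is the zero vector.
    """
    h = W.shape[0]
    s = np.zeros(h)
    states, outputs = [], []
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)
        states.append(s)
        outputs.append(softmax(V @ s))
    return states, outputs

# Toy usage: vocabulary of 5 one-hot tokens, hidden size 8.
rng = np.random.default_rng(0)
d, h, vocab = 5, 8, 5
U, W, V = rng.normal(size=(h, d)), rng.normal(size=(h, h)), rng.normal(size=(vocab, h))
one_hot = lambda i: np.eye(d)[i]
_, outs = rnn_forward([one_hot(1), one_hot(3), one_hot(0)], U, W, V)
print(outs[-1].round(3))  # probability distribution over the 5-token vocabulary
```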
3.4 RNNs Encoder-Decoder Framework
Figure 2 shows a typical encoder-decoder framework, a model consisting of two separate RNNs called the encoder and the decoder. The encoder reads an input sequence one item at a time and outputs a vector at each step (ignored in Figure 2). The final output of the encoder serves as the context vector, and the decoder uses this context vector to generate a sequence of outputs. In the context of machine translation, the encoder first processes a variable-length input word sequence from the source language and builds a fixed-length vector representation (the context vector). Conditioned on this encoded representation, the decoder produces a variable-length word sequence in the target language. In an ideal case, the context vector can be considered the meaning of the sequence in a latent semantic space, and this idea can be extended beyond sequences. For example, in image captioning tasks, the encoder-decoder framework takes an image as input and produces a text description as output. In the reverse direction, image generation tasks take a text description as input and output a generated image.

Figure 2: A RNN encoder-decoder framework and its application to query segmentation.

To fit the query segmentation task into the encoder-decoder framework, we treat the original query as an input sequence from one language and the segmented query as an output sequence from the other language. The target vocabulary is therefore the same as the source vocabulary except for one additional break token, i.e.,

\mathrm{Vocab}_{target} = \mathrm{Vocab}_{source} \cup \{ "|" \}

In practice, the queries and their segmentations combined are treated as a parallel corpus for training. In the testing phase, the encoder first calculates the context vector and the decoder then generates output tokens one at a time from Vocab_target.
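A compact PyTorch sketch of how this query-to-segmentation "translation" could be wired up is shown below. It assumes plain nn.RNN units and an embedding layer for readability (the paper itself feeds one-hot vectors, Section 4.1); all class and variable names are ours, and the training loop is reduced to a single teacher-forced loss computation.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder RNN summarizes the query; decoder RNN emits tokens, including '|'."""
    def __init__(self, vocab_size, hidden_size=1024, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.RNN(hidden_size, hidden_size, num_layers,
                              nonlinearity='tanh', batch_first=True)
        self.decoder = nn.RNN(hidden_size, hidden_size, num_layers,
                              nonlinearity='tanh', batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src, tgt_in):
        # src: (batch, src_len) query token ids; tgt_in: (batch, tgt_len) shifted target
        _, context = self.encoder(self.embed(src))    # context: final hidden states
        dec_states, _ = self.decoder(self.embed(tgt_in), context)
        return self.out(dec_states)                   # (batch, tgt_len, vocab)

# Toy vocabulary: source words plus the extra break token on the target side.
itos = ['<pad>', '<sos>', '<eos>', '|', 'graffiti', 'fonts', 'alphabet']
stoi = {w: i for i, w in enumerate(itos)}

model = EncoderDecoder(vocab_size=len(itos), hidden_size=32, num_layers=2)
src = torch.tensor([[stoi['graffiti'], stoi['fonts'], stoi['alphabet']]])
tgt = torch.tensor([[stoi['<sos>'], stoi['graffiti'], stoi['fonts'],
                     stoi['|'], stoi['alphabet'], stoi['<eos>']]])
logits = model(src, tgt[:, :-1])                      # teacher forcing
loss = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits).flatten(0, 1), tgt[:, 1:].flatten())
loss.backward()                                       # an optimizer step would follow
print(logits.shape)                                   # torch.Size([1, 5, 7])
```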
4 EXPERIMENTS
The Webis-QSeC-10 corpus [8] comprises 53,437 web queries, and each of them has at least 10 segmentations. The reference segmentation is obtained as described in Section 2. There are 4,850 queries in the training set and 48,587 queries in the testing set. To quantify the segmentation results of the different algorithms, we adopt query level and break level accuracy [7] as the evaluation metrics. At query level, given a query q, its reference segmentation S, and the output segmentation S' from the model, the query accuracy is 1 if S' = S and 0 otherwise. At break level, a decision whether a break needs to be inserted is made for every two consecutive words in the query. The break accuracy is defined as the ratio of correct decisions over all break positions in q with respect to S'. Theoretically, there exist 2^{k-1} valid segmentations for a query q of k words, and (k^2 - k)/2 potential segments that contain at least two keywords from q.

4.1 Model Parameters
In our experiment, we use CRFsuite [17] for optimizing the CRF model parameters, and the following set of word uni-gram and bi-gram features is utilized:

• uni-gram: x_{-2}, x_{-1}, x, x_{+1}, x_{+2}
• bi-gram: x_{-1} x, x x_{+1}

For the RNN encoder-decoder, the following loss function, optimizer, and parameters are used:

• Word representation: one-hot vector
• RNN hidden layer size: 1024
• RNN number of layers: 2
• RNN activation function: tanh
• Loss function: negative log-likelihood loss
• Optimizer: Adam
• Learning rate: 0.0001
• Dropout rate: 0.05
• Epochs: 50,000

4.2 RNNs Encoder-Decoder Loss
Parameter optimization is performed with the Adam optimizer using negative log-likelihood as the loss function. Adam (Adaptive Moment Estimation) [11] is an algorithm for first-order gradient-based optimization of stochastic objective functions that computes adaptive learning rates for each parameter. Adam keeps an exponentially decaying average of both past gradients and past squared gradients. The loss value on the training set is recorded every 200 epochs; the training loss decreases steadily with the number of epochs and eventually converges (Figure 3).

Figure 3: Negative log-likelihood loss for the RNN encoder-decoder on the training set. The loss is recorded every 200 epochs.

4.3 Results
Table 2 shows the query-level and break-level accuracy of Wikipedia titles (WT), Wikipedia titles + strict noun phrases (WT+SNP), Conditional Random Fields, and the RNN encoder-decoder. WT+SNP has the best accuracy among the four methods at both levels. CRF performs better than the WT baseline in terms of both query level and break level accuracy. The RNN encoder-decoder framework, however, does not perform as well in this case as it does in other tasks such as machine translation and image captioning.

Table 2: Query level and break level accuracy on the Webis-QSeC-10 test set.

    Method                 query accuracy    break accuracy
    WT [7]                 0.431             0.769
    WT+SNP [7]             0.585             0.837
    CRF                    0.465             0.814
    RNN Encoder-Decoder    0.421             0.664
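For reference, the two evaluation measures reduce to a few lines of code. This sketch follows the definitions in Section 4 (exact-match query accuracy and per-position break accuracy); the helper names are our own.

```python
def break_decisions(segmentation: str):
    """True/False break decision at each of the k-1 positions of a k-word query."""
    decisions = []
    for segment in segmentation.split('|'):
        n = len(segment.split())
        decisions.extend([False] * (n - 1) + [True])
    return decisions[:-1]   # no decision after the last word

def query_accuracy(reference: str, predicted: str) -> float:
    """1 if the predicted segmentation matches the reference exactly, else 0."""
    return float(break_decisions(reference) == break_decisions(predicted))

def break_accuracy(reference: str, predicted: str) -> float:
    """Fraction of break positions on which the prediction agrees with the reference."""
    ref, pred = break_decisions(reference), break_decisions(predicted)
    return sum(r == p for r, p in zip(ref, pred)) / len(ref)

print(query_accuracy('graffiti fonts|alphabet', 'graffiti|fonts|alphabet'))  # 0.0
print(break_accuracy('graffiti fonts|alphabet', 'graffiti|fonts|alphabet'))  # 0.5
```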
5 DISCUSSION
The first two methods (WT and WT+SNP) in Table 2 are unsupervised but require external knowledge resources, e.g., Wikipedia titles, Google n-gram frequencies, and Web n-gram frequencies. On the other hand, both CRFs and the RNN encoder-decoder are supervised machine learning methods relying on human annotation. Since the training set consists of only 4,850 annotated queries, roughly one tenth of the size of the Webis-QSeC-10 testing set, the supervised methods cannot benefit from a large amount of training data. In addition to the small training set, short query length is another key factor that limits the power of the RNN encoder-decoder in query segmentation. Web queries are typically short and less structured compared to the standard sentences in machine translation corpora. Therefore, the remarkable capacity of RNNs for capturing long-distance dependencies is not that effective in this task. Although CRFs outperform the RNN encoder-decoder, one disadvantage of CRFs is that they require human-designed features, as opposed to RNNs, which require no feature engineering.

6 CONCLUSION AND FUTURE WORK
Query segmentation is crucial for a search engine to better understand query intent and return higher quality search results. This paper provides a study on fitting the query segmentation task into an RNN encoder-decoder framework and describes preliminary experimental results compared with other baselines. The RNN does not perform as expected due to the lack of training data and the short nature of queries. However, three feasible future directions might help improve the RNN encoder-decoder framework on query segmentation.

The first direction is to automatically collect a large amount of segmented queries via users' implicit feedback from query logs, as proposed in [12]. This would address the challenge of limited training data discussed in Section 5. Another direction is to replace the RNN units in the encoder-decoder framework with GRUs [6] or LSTMs [9] and add an attention mechanism [1] at the encoder, giving the decoder a way to "pay attention" to different parts of the input while decoding. Finally, substituting pre-trained word embeddings for the current one-hot word vectors will both reduce the input dimension and provide the network with a richer word representation.

ACKNOWLEDGMENTS
The authors would like to thank Dr. Martin Potthast from Bauhaus-Universität Weimar for kindly providing the full Webis Query Segmentation Corpus used for our modeling experiments. The authors would also like to thank the anonymous reviewers for their feedback and helpful advice.

REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[2] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5, 2 (1994), 157–166.
[3] Shane Bergsma and Qin Iris Wang. 2007. Learning noun phrase query segmentation. In EMNLP-CoNLL, Vol. 7. 819–826.
[4] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
[5] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[7] Matthias Hagen, Martin Potthast, Anna Beyer, and Benno Stein. 2012. Towards optimum query segmentation: in doubt without. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 1015–1024.
[8] Matthias Hagen, Martin Potthast, Benno Stein, and Christof Bräutigam. 2011. Query segmentation revisited. In Proceedings of the 20th International Conference on World Wide Web. ACM, 97–106.
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[10] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 133–142.
[11] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[12] Julia Kiseleva, Qi Guo, Eugene Agichtein, Daniel Billsus, and Wei Chai. 2010. Unsupervised query segmentation using click data: preliminary results. In Proceedings of the 19th International Conference on World Wide Web. ACM, 1131–1132.
[13] Giridhar Kumaran and Vitor R. Carvalho. 2009. Reducing long queries using query quality predictors. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 564–571.
[14] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Vol. 1. 282–289.
[15] Yanen Li, Bo-Jun Paul Hsu, ChengXiang Zhai, and Kuansan Wang. 2011. Unsupervised query segmentation using clickthrough for information retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 285–294.
[16] Nikita Mishra, Rishiraj Saha Roy, Niloy Ganguly, Srivatsan Laxman, and Monojit Choudhury. 2011. Unsupervised query segmentation using only query logs. In Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 91–92.
[17] Naoaki Okazaki. 2007. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). http://www.chokkan.org/software/crfsuite/
[18] Nish Parikh, Prasad Sriram, and Mohammad Al Hasan. 2013. On segmentation of ecommerce queries. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 1137–1146.
[19] Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems (InfoScale). ACM.
[20] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[21] Bin Tan and Fuchun Peng. 2008. Unsupervised query segmentation using generative language models and Wikipedia. In Proceedings of the 17th International Conference on World Wide Web. ACM, 347–356.
[22] Xiaohui Yu and Huxia Shi. 2009. Query segmentation using conditional random fields. In Proceedings of the First International Workshop on Keyword Search on Structured Data. ACM, 21–26.