Towards Optimum Query Segmentation: In Doubt Without∗
(Extended Abstract)

Matthias Hagen, Martin Potthast, Anna Beyer, Benno Stein
Bauhaus-Universität Weimar, 99421 Weimar, Germany
<first name>.<last name>@uni-weimar.de

∗ The original paper with all omitted details appeared in CIKM 2012 [2].

Copyright remains with the authors and/or original copyright holders.
DIR 2013, April 26, 2013, Delft, The Netherlands.

ABSTRACT

Query segmentation is the problem of identifying compound concepts or phrases in a query. We conduct the first large-scale study of human segmentation behavior, introduce robust accuracy measures, and develop a hybrid algorithmic segmentation approach based on the idea that, in cases of doubt, it is often better to (partially) leave queries without any segmentation.

1. INTRODUCTION

Keyword queries are the predominant way of expressing information needs on the web. Search engines nowadays rely on tools that help them interpret, correct, classify, and reformulate every submitted query in a split second before the actual document retrieval begins. We study one such tool that identifies indivisible sequences of keywords in a query (e.g., new york times) that users could have included in double quotes; this is the task of query segmentation. Our contributions include the first large-scale analysis of human segmentation behavior (50,000 queries, each segmented by 10 annotators), showing that different segmentation strategies should be applied to different types of queries. In particular, a good strategy often is to refrain from segmenting too many keywords (i.e., in doubt without segmentation).

2. NOTATION AND RELATED WORK

A query q is a sequence (w_1, ..., w_k) of k keywords. Every contiguous subsequence of q forms a potential segment. A valid segmentation of q consists of disjoint segments whose concatenation yields q again. The problem of query segmentation is the automatic identification of the “best” valid segmentation, where “best” refers to segmentations that humans would choose or that maximize retrieval performance. Note that a valid segmentation determines for each pair ⟨w, w′⟩ of consecutive keywords in q whether or not there should be a segment break between w and w′. Hence, there are 2^(k−1) valid segmentations of a k-keyword query and k(k−1)/2 potential segments with at least two keywords.

Risvik et al. [5] were the first to propose an algorithm for query segmentation, based on pointwise mutual information. Later on, more sophisticated approaches like the supervised learning method by Bergsma and Wang [1] combined many features (web and query log frequencies, POS tags, etc.). Recently, efficiency issues have become more important [3], and evaluation has moved away from simple accuracy against human segmentations towards retrieval impact analyses [4].
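To make this notation concrete, here is a small Python sketch (our own illustration, not part of the original paper) that enumerates the valid segmentations of a query by deciding for each of the k−1 gaps between consecutive keywords whether to place a break; the number of enumerated segmentations is exactly 2^(k−1).

from itertools import product

def valid_segmentations(keywords):
    """Enumerate all valid segmentations of a query.

    A valid segmentation is obtained by deciding for each of the
    k-1 gaps between consecutive keywords whether to insert a
    segment break, so there are 2^(k-1) of them in total.
    """
    k = len(keywords)
    segmentations = []
    for breaks in product([False, True], repeat=k - 1):
        segmentation, segment = [], [keywords[0]]
        for word, is_break in zip(keywords[1:], breaks):
            if is_break:
                segmentation.append(segment)
                segment = [word]
            else:
                segment.append(word)
        segmentation.append(segment)
        segmentations.append([" ".join(s) for s in segmentation])
    return segmentations

# Example: a 3-keyword query has 2^(3-1) = 4 valid segmentations,
# among them ['new york', 'times'].
for s in valid_segmentations(["new", "york", "times"]):
    print(s)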
3. HOW HUMANS SEGMENT

Our study of human segmentation behavior is based on the Webis-QSeC-10 corpus [3], consisting of 53,437 web queries (3–10 keywords) with at least 10 different annotators per query. One of our intentions is to compare human quoting on noun phrase queries with that on other queries. Since automatic POS tagging of short queries is a difficult task, we restrict our analysis to strict noun phrases (SNP) composed only of nouns, numbers, adjectives, and articles. These parts of speech can be identified reliably using, for instance, Qtag.¹ About 47% of the queries are tagged as SNP queries.

¹ http://phrasys.net/uob/om/software

Our study of how humans quote queries yields the following major findings: (1) SNP queries are segmented more often than others, (2) in segmented SNP queries more keywords are contained in segments, and (3) annotators agree more on short queries, but unanimity is an exception (many queries do not even have a segmentation supported by an absolute majority of annotators). These findings suggest that algorithms aiming at accuracy against human segmentations should take the query type into account. The second implication is to carefully reconsider the traditional accuracy measures (some of which are based on annotator unanimity).

4. ACCURACY MEASURES REVISITED

Segmentation accuracy is typically measured against a corpus of human segmentations on three levels: query accuracy (the ratio of correctly segmented queries), segment accuracy (precision and recall of the computed segments), and break accuracy (the ratio of correct decisions between pairs of consecutive words). The crucial point is the choice of the reference segmentation from the corpus. Traditionally, the reference is the segmentation that best fits the computed one (i.e., the one with the highest break accuracy), without any further considerations. We argue that for corpora with many annotators per query (e.g., the Webis-QSeC-10) this is an oversimplification, and that scoring references from a set of weighted alternatives should be an integral part of accuracy measurement.

Given a query q and a list of m reference segmentations (S_1, ..., S_m) from m different annotators, we propose the following two strategies to select a reference segmentation. (1) Weighted Best Fit: select the S_i chosen by an absolute majority of annotators if there is one. Otherwise, select the S_i as the traditional best-fit strategy does (i.e., the S_i maximizing break accuracy); in that case, the obtained accuracy values are weighted by the ratio of the votes allotted to S_i to the maximum number of votes on any segmentation in (S_1, ..., S_m). (2) Break Fusion: instead of selecting a reference segmentation from (S_1, ..., S_m), fuse them into one. For each pair of consecutive words in q, a segment break is inserted if at least half of the annotators inserted one; otherwise, no break is inserted.

To demonstrate the impact of the new reference schemes, we apply them in a comparison of the segmentation algorithms from the literature (results in the full paper). With our new schemes, many of the relative accuracy differences between segmentation algorithms increase, and more of these differences become statistically significant. Hence, the new reference selectors provide a more robust means of evaluating segmentation accuracy.
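Both reference-selection strategies are straightforward to operationalize. The following Python sketch is our own reading of the description above, under the assumption that each segmentation is represented as a break vector, i.e., a tuple of booleans over the k−1 gaps of the query; the function names are ours, and we treat an absolute-majority reference as carrying full weight.

from collections import Counter

def break_fusion(references):
    """Fuse m annotator break vectors into a single reference:
    a break is kept at a gap if at least half of the annotators
    inserted one there."""
    m = len(references)
    gaps = zip(*references)  # per-gap votes across annotators
    return tuple(sum(votes) * 2 >= m for votes in gaps)

def weighted_best_fit(computed, references):
    """Select the reference for a computed break vector.

    Returns (reference, weight): the absolute-majority segmentation
    if one exists (full weight), otherwise the reference with the
    highest break accuracy w.r.t. the computed segmentation, weighted
    by its votes relative to the most-voted-for segmentation.
    """
    votes = Counter(references)
    top_ref, top_votes = votes.most_common(1)[0]
    if top_votes * 2 > len(references):  # absolute majority
        return top_ref, 1.0

    def break_accuracy(reference):
        return sum(c == r for c, r in zip(computed, reference)) / len(reference)

    best = max(votes, key=break_accuracy)
    return best, votes[best] / top_votes

As a usage example, break_fusion([(True, False), (True, True), (False, False)]) yields (True, False): two of three annotators break at the first gap, only one at the second.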
5. HYBRID QUERY SEGMENTATION

The decision whether or not to introduce segments into a query is a risky one: a bad segmentation leads to bad search results or none at all, whereas a good one improves them. Since keeping users safe from algorithm errors is a core principle at most search engines, and since even a small error probability yields millions of failed searches given billions of searches per day, a risk-averse strategy is the way to go. In doubt, it is always safer to do without any query segmentation. This observation suggests a hybrid strategy that treats different types of queries in different ways. One of the main findings on human segmentation behavior is to distinguish SNP queries from others. As potential strategies for either type, we consider algorithms from the literature as well as two newly developed baselines that segment only Wikipedia titles (WT) or only Wikipedia titles and SNPs (WT+SNP), following our dictionary-based scheme [3].

6. EVALUATION

In our evaluation, we compare instances of hybrid query segmentation to traditional approaches with respect to three performance measures. (1) We measure segmentation accuracy using the Webis-QSeC-10. (2) We measure retrieval performance in a TREC setting using the commercial search engine Bing and the Indri ClueWeb09 search engine hosted at Carnegie Mellon University.² (3) We measure runtime performance and memory footprint.

² http://boston.lti.cs.cmu.edu/Services/batchquery

We have systematically combined traditional segmentation algorithms (including the option “none” of not segmenting) to form instances of hybrid segmentation. As expected, there is no one-size-fits-all combination that maximizes performance with respect to all of the above measures. The following table shows the best-performing combinations.

             Hybrid segmentation instance
Query type   HYB-A (accuracy)   HYB-B (Bing)   HYB-I (Indri)
SNP          [3] (= WT+SNP)     None           None
other        WT                 WT             [3]

In what follows, we give brief descriptions of the experimental results (more details in the full paper). An explanation for the variant HYB-A can be found in our analysis of human quoting behavior: there, it is shown that accuracy-oriented algorithms should segment SNP queries more aggressively (more keywords in segments) than other queries, which in turn should be segmented conservatively (fewer keywords in segments). This is exactly the strategy of HYB-A. On SNP queries, the algorithm [3] aggressively segments all phrases that appear at least 40 times on the web, whereas on the other queries the WT baseline conservatively segments only Wikipedia titles.

With respect to retrieval performance, we evaluate on those TREC topics of the Web tracks 2009–2011 and the Million Query track 2009 that have at least one document judged relevant and at least 3 keywords (61 topics from the Web tracks, 294 from the Million Query track). Our results suggest that different search engines (i.e., retrieval models) each require specifically tailored hybrid segmentation algorithms. Otherwise, query segmentation may not improve significantly over not segmenting at all.

The main findings of evaluating accuracy and retrieval performance are the following: (1) better accuracy does not necessarily improve retrieval performance, and (2) SNP queries can often be left unsegmented in terms of retrieval performance. However, there is a grain of salt: our TREC experiments are small-scale compared to the number of queries that went into measuring accuracy. The retrieval performance experiments should be scaled up significantly in order to draw more reliable conclusions. In any case, our experiments have shown that the decision of when to segment at all is an important one.

Besides accuracy and retrieval performance, runtime and memory consumption are also crucial criteria for judging the applicability of a segmentation algorithm in a real-world setting. Runtime is typically measured as throughput in queries per second, while memory consumption concerns the data needed for operation. Regarding throughput, a pointwise mutual information baseline is by far the fastest approach (albeit with bad accuracy and retrieval performance). The WT and WT+SNP baselines are faster than [3] since they sum up fewer weights of potential segments. The hybrid approaches are slowest due to the POS tagging step. With respect to memory consumption, the WT baseline needs an order of magnitude less data than mutual information or WT+SNP, which in turn need much less than [3]. Taking into account the rumored monthly throughput of major search engines of about 100 billion queries (i.e., about 40,000 queries per second), all segmentation approaches can easily handle such a load when run on a small cluster of standard PCs.
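To illustrate the hybrid framework of Section 5, the following Python sketch shows the type-based dispatch. It is a minimal sketch under our own assumptions: the POS tagger, the tag names, and the per-type strategy functions are placeholders, not the actual components used in the experiments.

# SNP parts of speech: nouns, numbers, adjectives, and articles.
# The tag names are illustrative; a real tagger such as Qtag uses its own tag set.
SNP_TAGS = {"NOUN", "NUM", "ADJ", "DET"}

def is_snp_query(keywords, pos_tag):
    """A query is a strict noun phrase (SNP) if every keyword is
    tagged as a noun, number, adjective, or article."""
    return all(tag in SNP_TAGS for tag in pos_tag(keywords))

def hybrid_segment(keywords, pos_tag, segment_snp, segment_other):
    """Dispatch the query to the strategy chosen for its type,
    e.g., the WT+SNP baseline for SNP queries and WT for others."""
    if is_snp_query(keywords, pos_tag):
        return segment_snp(keywords)
    return segment_other(keywords)

def leave_unsegmented(keywords):
    """'In doubt without': no multi-keyword segments at all, as the
    HYB-B and HYB-I instances do for SNP queries."""
    return [[w] for w in keywords]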
7. CONCLUSION AND OUTLOOK

Our study of human query segmentation behavior inspired a new hybrid framework that treats SNP queries differently from other queries and that can be tailored to mimic human query quoting better than the state-of-the-art algorithms. However, an important and somewhat unexpected outcome of the complementary TREC-style evaluation is that maximizing segmentation accuracy does not necessarily maximize retrieval performance as well. Nevertheless, we show the flexibility of the hybrid framework and optimize it for two retrieval models. There, not segmenting SNP queries at all is best, opposing our finding that humans quote SNP queries more aggressively.

We hypothesize that query segmentation is especially beneficial on long non-SNP queries, which are currently underrepresented in the TREC corpora. Hence, scaling up the retrieval performance evaluation with a broad range of retrieval models is an important future direction. This could shed light on the question of why SNP queries apparently are better off without any segmentation. One starting point could be an analysis of the best segmentations for different retrieval models in order to better understand what differentiates a “perfect” retrieval-oriented segmentation from those of the algorithms developed so far.

8. REFERENCES

[1] S. Bergsma and Q. Wang. Learning noun phrase query segmentation. In EMNLP-CoNLL 2007, pp. 819–826.
[2] M. Hagen, M. Potthast, A. Beyer, and B. Stein. Towards optimum query segmentation: In doubt without. In CIKM 2012, pp. 1015–1024.
[3] M. Hagen, M. Potthast, B. Stein, and C. Bräutigam. Query segmentation revisited. In WWW 2011, pp. 97–106.
[4] Y. Li, B.-J. P. Hsu, C. Zhai, and K. Wang. Unsupervised query segmentation using clickthrough for information retrieval. In SIGIR 2011, pp. 285–294.
[5] K. Risvik, T. Mikolajewski, and P. Boros. Query segmentation for web search. In WWW 2003 (Posters).