Reliability and Validity of Query Intent Assessments

Compressed version of a paper accepted for publication in JASIST
DIR 2013, April 26, 2013, Delft, The Netherlands

Suzan Verberne (s.verberne@cs.ru.nl), Maarten van der Heijden (m.vanderheijden@cs.ru.nl), Max Hinne (mhinne@cs.ru.nl), Maya Sappelli (m.sappelli@cs.ru.nl), Saskia Koldijk (saskia.koldijk@tno.nl), Eduard Hoenkamp (hoenkamp@acm.org), Wessel Kraaij (w.kraaij@cs.ru.nl)

Keywords: Query intent classification, User studies, Data collection, Validation

1. INTRODUCTION

The quality of a search engine critically depends on its ability to present results that are an adequate response to the user's query and intent. If the intent (or the most likely intent) behind a query is known, a search engine can improve retrieval results by adapting the presented results to the more specific intent instead of to the underspecified query [6]. Several studies have proposed classification schemes for query intent. Broder [3] suggested that the intent of a query can be either informational, navigational or transactional. He estimated percentages for each of the categories by presenting AltaVista users with a brief questionnaire about the purpose of their search after they submitted their query. After manual classification of 1,000 queries he warned that "inferring the user intent from the query is at best an inexact science, but usually a wild guess." Later, many expansions and alternative schemes were proposed, and more dimensions were added.

In many existing intent recognition studies, training and test data for automatic intent recognition have been created in the form of annotations by external assessors who are not the searchers themselves [2, 1, 4]. Post-hoc intent annotation by external assessors is not ideal; nevertheless, intent annotations from external judges are widely used in the community for evaluation or training purposes. Therefore it is important for the field to get a better understanding of the quality of this process as an approximation of first-hand annotation by the searchers themselves. Some annotation studies have investigated the reliability of query intent annotations by measuring the agreement between two external assessors on the same query set [1, 4]. What these studies do not measure is the validity of the judgments.

In this paper, we aim to measure the validity of query intent assessments, i.e. how well an external assessor can estimate the underlying intent of a searcher's query. We use a classification scheme to describe search intent.

2. OUR INTENT CLASSIFICATION SCHEME

We introduce a multi-dimensional classification scheme of query intent that is inspired by and uses aspects from [3], [2], [4] and [5]. Our classification scheme consists of the following dimensions of search intent:

1. Topic: categorical; a fixed set of categories from the well-known Open Directory Project (ODP), giving a general idea of what the query is about.
2. Action type: categorical, consisting of informational, navigational and transactional. This is the categorisation by Broder.
3. Modus: categorical, consisting of image, video, map, text and other. This dimension is based on [5].
4. Source authority sensitivity: 4-point ordinal scale (high sensitivity: relevance strongly depends on the authority of the source).
5. Spatial sensitivity: 4-point ordinal scale (high sensitivity: relevance strongly depends on location).
6. Time sensitivity: 4-point ordinal scale (high sensitivity: relevance strongly depends on time/date).
7. Specificity: 4-point ordinal scale (high specificity: very specific results desired; low specificity: explorative goal).
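To make the scheme concrete, here is a minimal sketch of how a single annotation could be represented in code. This is our own illustration, not the authors' implementation; in particular, we assume the 4-point scales run from 1 (not sensitive) to 4 (highly sensitive).

    from dataclasses import dataclass

    ACTION_TYPES = {"informational", "navigational", "transactional"}  # Broder
    MODI = {"image", "video", "map", "text", "other"}

    @dataclass
    class IntentAnnotation:
        """One searcher's (or assessor's) intent judgment for a single query."""
        query: str
        topic: str                  # an ODP category, e.g. "recreation"
        action_type: str            # dimension 2, validated below
        modus: str                  # dimension 3, validated below
        authority_sensitivity: int  # dimensions 4-7: ordinal values on the
        spatial_sensitivity: int    # 4-point scales described above (here
        time_sensitivity: int       # assumed to run from 1, low, to 4, high)
        specificity: int

        def __post_init__(self):
            assert self.action_type in ACTION_TYPES
            assert self.modus in MODI
            for value in (self.authority_sensitivity, self.spatial_sensitivity,
                          self.time_sensitivity, self.specificity):
                assert 1 <= value <= 4, "ordinal dimensions use a 4-point scale"

    # Example: a location-sensitive query as its searcher might annotate it.
    annotation = IntentAnnotation(
        query="restaurants delft", topic="recreation",
        action_type="informational", modus="map",
        authority_sensitivity=1, spatial_sensitivity=4,
        time_sensitivity=2, specificity=3)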
3. EXPERIMENTS

In order to obtain labeled queries from search engine users, we created a plugin for the Mozilla Firefox web browser. After installation by the user, the plugin locally logs all queries submitted to Google. We asked colleagues (all academic scientists and PhD students) to participate in our experiment. Participants were asked to occasionally (at a self-chosen moment) annotate the queries they had submitted in the previous 48 hours, using a form that presented our intent classification scheme. To guarantee that no sensitive information was involuntarily submitted, participants were allowed to skip any query they did not want to submit.

In total, 11 participants enrolled in the experiment. Together, they annotated 605 queries with their query intent, of which 135 were duplicates. On average, each searcher annotated 55 queries (standard deviation = 73). The three topic categories that were used most frequently in the set of annotated queries were computer, science and recreation.

To obtain labels from external assessors we used the same form as was used by the participants. Four of the authors acted as external assessors; all queries were assessed by at least two assessors.

4. RESULTS

Table 1: Reliability and validity of query intent assessments in terms of Cohen's κ, averaged over the assessor pairs (reliability) and the assessor-searcher pairs (validity). Values of κ >= 0.4 (at least moderate agreement) are marked with *.

  Dimension                      Reliability (stdev)   Validity (stdev)
  Topic                          0.56 (0.19) *         0.42 (0.16) *
  Action type                    0.29 (0.20)           0.09 (0.08)
  Modus                          0.41 (0.14) *         0.22 (0.10)
  Source authority sensitivity   0.05 (0.05)           0.10 (0.03)
  Time sensitivity               0.48 (0.08) *         0.14 (0.04)
  Spatial sensitivity            0.69 (0.07) *         0.41 (0.04) *
  Specificity                    0.26 (0.10)           0.05 (0.09)

In order to answer the question "How reliable is our intent classification scheme as an instrument for measuring search intent?", we calculated the interobserver reliability as the agreement between the external assessors using Cohen's κ. The middle column of Table 1 shows the average agreement over the assessor pairs for each dimension. For only one of the seven dimensions in our classification scheme (spatial sensitivity) was substantial agreement (0.6 or higher) reached. For four of the seven, at least moderate agreement (0.4 or higher) was reached: at least moderately reliable query intent classification is possible for the dimensions topic, modus, time sensitivity and spatial sensitivity.
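The agreement statistic itself is standard; below is a minimal, self-contained sketch of unweighted Cohen's κ for one dimension and one pair of assessors (the averages in Table 1 would then be plain means over pairs). The example labels are invented for illustration.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Unweighted Cohen's kappa for two assessors' labels on one
        dimension, given in the same query order."""
        assert len(labels_a) == len(labels_b) > 0
        n = len(labels_a)
        # Observed agreement: fraction of queries labeled identically.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from the assessors' marginal label frequencies.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Two assessors labeling the same five queries on the modus dimension:
    print(cohens_kappa(["image", "text", "text", "map", "text"],
                       ["image", "text", "other", "map", "text"]))  # ~0.71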
In order to answer the question "How valid are the intent classifications by external assessors?", we compared the intent classifications by the external assessors to the intent classifications by the searchers themselves. We calculated κ-scores per dimension for each assessor-searcher pair. The rightmost column of Table 1 shows the average agreement over the assessor-searcher pairs. The table shows that moderately valid query intent classification is possible on two of the seven dimensions from our classification scheme: topic and spatial sensitivity. The difference between the inter-assessor agreement and the assessor-searcher agreement was significant on all dimensions.

Our experiments suggest that classification of queries into topic categories can be done reliably, even though we had 17 different topics to choose from. This is good news for a future implementation of automatic query classification, because topic plays an important role in query disambiguation and personalisation. The second reliable dimension, spatial sensitivity, is an important dimension for local search: every web search takes place at a physical location, and there are types of queries for which this location is relevant (e.g. the search for restaurants or events). The finding that external assessors can reach moderate agreement with the searcher on this dimension shows the feasibility of recognizing that a query is sensitive to location. The search engine can respond by promoting search results that match the location.

For the implementation of intent classification in a search engine, training data is needed: the features are the query terms (the textual content of the query) and the labels are the values for the dimensions in the classification scheme. Analysis of the queries shows that for many intent dimensions, there is no direct connection between words in the query and the intent of the query. For example, in the 33 queries that were annotated by the searcher with the image modus (e.g. "photosynthesis"; "coen swijnenberg") there were no occurrences of words such as 'image' or 'picture', and only 2 of the 90 queries that were annotated with a high temporal sensitivity contained a time-related query word. This means that for automatic classification, it is difficult to generalize over queries. However, the most likely intent can still be learned for individual queries by following the diversification approach in the ranking of the search results: the engine can learn the probability of intents for specific queries by counting clicks on different types of results. This approach requires a huge number of clicks to be recorded (which is possible for large search engines such as Google), and the long tail of low-frequency queries will not be served.
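The click-counting idea can be sketched as follows. This is our own minimal illustration, not the paper's system: the result-type categories reuse the modus labels, and the add-one smoothing (which falls back to a uniform distribution for unseen long-tail queries, matching the caveat above) is an assumption.

    from collections import defaultdict, Counter

    class ClickIntentModel:
        """Estimates P(modus | query) by counting clicks on result types."""

        def __init__(self):
            self.clicks = defaultdict(Counter)  # query -> {result type: count}

        def record_click(self, query, result_type):
            self.clicks[query.lower()][result_type] += 1

        def intent_distribution(self, query, alpha=1.0,
                                types=("image", "video", "map", "text", "other")):
            # Add-one smoothing; an unseen query gets a uniform distribution,
            # which illustrates why the long tail is not served by this approach.
            counts = self.clicks[query.lower()]
            total = sum(counts.values()) + alpha * len(types)
            return {t: (counts[t] + alpha) / total for t in types}

    model = ClickIntentModel()
    for _ in range(2):
        model.record_click("photosynthesis", "image")
    model.record_click("photosynthesis", "text")
    print(model.intent_distribution("photosynthesis"))  # "image" most probable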
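For the long tail, the supervised setup described above (query terms as features, one label per dimension) remains the alternative. A minimal bag-of-words sketch follows; given the weak word-intent connection just noted, it illustrates the setup rather than a recommended solution, and the toy training data and the choice of scikit-learn are our own.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy data: queries with searcher-provided labels for the modus dimension.
    # One such classifier would be trained per dimension of the scheme.
    queries = ["photosynthesis", "coen swijnenberg", "restaurants delft",
               "train times amsterdam", "cat pictures"]
    modus_labels = ["image", "image", "map", "text", "image"]

    classifier = make_pipeline(CountVectorizer(), LogisticRegression())
    classifier.fit(queries, modus_labels)
    print(classifier.predict(["dog pictures"]))  # generalizes only via shared words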
5. CONCLUSIONS

We found that four of the seven dimensions in our classification scheme could be annotated moderately reliably (κ > 0.4): topic, modus, time sensitivity and spatial sensitivity. An important finding is that queries could not reliably be classified according to the dimension 'action type', which is the original Broder classification. Of the four reliable dimensions, only the annotations on the topic and spatial sensitivity dimensions were valid (κ > 0.4) when compared to the searchers' annotations. This shows that the agreement between external assessors is not a good estimator of the validity of the intent classifications.

In conclusion, we showed that Broder was right to warn that "inferring the user intent from the query is at best an inexact science, but usually a wild guess". Therefore, we encourage the research community to consider, where possible, using query intent classifications by the searchers themselves as test data.

6. REFERENCES

[1] A. Ashkan, C. Clarke, E. Agichtein, and Q. Guo. Classifying and characterizing query intent. In Advances in Information Retrieval, pages 578–586, 2009.
[2] R. Baeza-Yates, L. Calderón-Benavides, and C. González-Caro. The intention behind web queries. In F. Crestani, P. Ferragina, and M. Sanderson, editors, String Processing and Information Retrieval, LNCS 4209, pages 98–109. Springer-Verlag, Berlin Heidelberg, 2006.
[3] A. Broder. A taxonomy of web search. ACM SIGIR Forum, 36:3–10, 2002.
[4] C. González-Caro, L. Calderón-Benavides, R. Baeza-Yates, L. Tansini, and D. Dubhashi. Web queries: The tip of the iceberg of the user's intent. In Workshop on User Modeling for Web Applications, WSDM 2011, 2011.
[5] S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of genre and domain intents. In Information Retrieval Technology, pages 399–409, 2010.
[6] R. White, P. Bennett, and S. Dumais. Predicting short-term interests using activity-based search context. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1009–1018. ACM, 2010.