Reliability and Validity of Query Intent Assessments

Compressed version of a paper accepted for publication in JASIST
DIR 2013, April 26, 2013, Delft, The Netherlands

Suzan Verberne (s.verberne@cs.ru.nl), Maarten van der Heijden (m.vanderheijden@cs.ru.nl), Max Hinne (mhinne@cs.ru.nl), Maya Sappelli (m.sappelli@cs.ru.nl), Saskia Koldijk (saskia.koldijk@tno.nl), Eduard Hoenkamp (hoenkamp@acm.org), Wessel Kraaij (w.kraaij@cs.ru.nl)

Keywords: Query intent classification, User studies, Data collection, Validation

1. INTRODUCTION

The quality of a search engine critically depends on its ability to present results that are an adequate response to the user's query and intent. If the intent (or the most likely intent) behind a query is known, a search engine can improve retrieval results by adapting the presented results to the more specific intent instead of to the underspecified query [6]. Several studies have proposed classification schemes for query intent. Broder [3] suggested that the intent of a query can be either informational, navigational or transactional. He estimated percentages for each of the categories by presenting AltaVista users with a brief questionnaire about the purpose of their search after they submitted their query. After manual classification of 1,000 queries he warned that "inferring the user intent from the query is at best an inexact science, but usually a wild guess." Later, many expansions and alternative schemes were proposed, and more dimensions were added.

In many existing intent recognition studies, training and test data for automatic intent recognition have been created in the form of annotations by external assessors who are not the searchers themselves [2, 1, 4]. Post-hoc intent annotation by external assessors is not ideal; nevertheless, intent annotations from external judges are widely used in the community for evaluation or training purposes. Therefore it is important for the field to get a better understanding of the quality of this process as an approximation of first-hand annotation by the searchers themselves. Some annotation studies have investigated the reliability of query intent annotations by measuring the agreement between two external assessors on the same query set [1, 4]. What these studies do not measure is the validity of the judgments.

In this paper, we aim to measure the validity of query intent assessments, i.e. how well an external assessor can estimate the underlying intent of a searcher's query. We use a classification scheme to describe search intent.

2. OUR INTENT CLASSIFICATION SCHEME

We introduce a multi-dimensional classification scheme of query intent that is inspired by and uses aspects from [3], [2], [4] and [5]. Our classification scheme consists of the following dimensions of search intent:

1. Topic: categorical; a fixed set of categories from the well-known Open Directory Project (ODP), giving a general idea of what the query is about.
2. Action type: categorical, consisting of informational, navigational and transactional. This is the categorisation by Broder.
3. Modus: categorical, consisting of image, video, map, text and other. This dimension is based on [5].
4. Source authority sensitivity: 4-point ordinal scale (high sensitivity: relevance strongly depends on the authority of the source).
5. Spatial sensitivity: 4-point ordinal scale (high sensitivity: relevance strongly depends on location).
6. Time sensitivity: 4-point ordinal scale (high sensitivity: relevance strongly depends on time/date).
7. Specificity: 4-point ordinal scale (high specificity: very specific results desired; low specificity: explorative goal).
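To make the scheme concrete, here is a minimal sketch of how a single annotation could be represented in code. This is our own illustration, not the authors' implementation; in particular, we assume the 4-point scales run from 1 (not sensitive) to 4 (highly sensitive).

    from dataclasses import dataclass

    ACTION_TYPES = {"informational", "navigational", "transactional"}  # Broder
    MODI = {"image", "video", "map", "text", "other"}

    @dataclass
    class IntentAnnotation:
        """One searcher's (or assessor's) intent judgment for a single query."""
        query: str
        topic: str                  # an ODP category, e.g. "recreation"
        action_type: str            # dimension 2, validated below
        modus: str                  # dimension 3, validated below
        authority_sensitivity: int  # dimensions 4-7: ordinal values on the
        spatial_sensitivity: int    # 4-point scales described above (here
        time_sensitivity: int       # assumed to run from 1, low, to 4, high)
        specificity: int

        def __post_init__(self):
            assert self.action_type in ACTION_TYPES
            assert self.modus in MODI
            for value in (self.authority_sensitivity, self.spatial_sensitivity,
                          self.time_sensitivity, self.specificity):
                assert 1 <= value <= 4, "ordinal dimensions use a 4-point scale"

    # Example: a location-sensitive query as its searcher might annotate it.
    annotation = IntentAnnotation(
        query="restaurants delft", topic="recreation",
        action_type="informational", modus="map",
        authority_sensitivity=1, spatial_sensitivity=4,
        time_sensitivity=2, specificity=3)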
3. EXPERIMENTS

In order to obtain labeled queries from search engine users, we created a plugin for the Mozilla Firefox web browser. After installation by the user, the plugin locally logs all queries submitted to Google. We asked colleagues (all academic scientists and PhD students) to participate in our experiment. Participants were asked to occasionally (at a self-chosen moment) annotate the queries they had submitted in the previous 48 hours, using a form that presented our intent classification scheme. To guarantee that no sensitive information was involuntarily submitted, participants were allowed to skip any query they did not want to submit.

In total, 11 participants enrolled in the experiment. Together, they annotated 605 queries with their query intent, of which 135 were duplicates. On average, each searcher annotated 55 queries (standard deviation = 73). The three topic categories that were used most frequently in the set of annotated queries were computer, science and recreation.

To obtain labels from external assessors we used the same form as was used by the participants. Four of the authors acted as external assessors; all queries were assessed by at least two assessors.

4. RESULTS

Table 1: Reliability and validity of query intent assessments in terms of Cohen's κ, averaged over the assessor pairs (reliability) and the assessor-searcher pairs (validity). Values of κ >= 0.4 (at least moderate agreement) are marked with *.

  Dimension                      Reliability (stdev)   Validity (stdev)
  Topic                          0.56 (0.19) *         0.42 (0.16) *
  Action type                    0.29 (0.20)           0.09 (0.08)
  Modus                          0.41 (0.14) *         0.22 (0.10)
  Source authority sensitivity   0.05 (0.05)           0.10 (0.03)
  Time sensitivity               0.48 (0.08) *         0.14 (0.04)
  Spatial sensitivity            0.69 (0.07) *         0.41 (0.04) *
  Specificity                    0.26 (0.10)           0.05 (0.09)

In order to answer the question "How reliable is our intent classification scheme as an instrument for measuring search intent?", we calculated the interobserver reliability as the agreement between the external assessors using Cohen's κ. The middle column of Table 1 shows the average agreement over the assessor pairs for each dimension. For only one of the seven dimensions in our classification scheme (spatial sensitivity) was substantial agreement (0.6 or higher) reached. For four of the seven, at least moderate agreement (0.4 or higher) was reached: at least moderately reliable query intent classification is possible for the dimensions topic, modus, time sensitivity and spatial sensitivity.
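The agreement statistic itself is standard; below is a minimal, self-contained sketch of unweighted Cohen's κ for one dimension and one pair of assessors (the averages in Table 1 would then be plain means over pairs). The example labels are invented for illustration.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Unweighted Cohen's kappa for two assessors' labels on one
        dimension, given in the same query order."""
        assert len(labels_a) == len(labels_b) > 0
        n = len(labels_a)
        # Observed agreement: fraction of queries labeled identically.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from the assessors' marginal label frequencies.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Two assessors labeling the same five queries on the modus dimension:
    print(cohens_kappa(["image", "text", "text", "map", "text"],
                       ["image", "text", "other", "map", "text"]))  # ~0.71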
In order to answer the question "How valid are the intent classifications by external assessors?", we compared the intent classifications by the external assessors to the intent classifications by the searchers themselves. We calculated κ-scores per dimension for each assessor-searcher pair. The rightmost column of Table 1 shows the average agreement over the assessor-searcher pairs. The table shows that moderately valid query intent classification is possible on two of the seven dimensions from our classification scheme: topic and spatial sensitivity. The difference between the inter-assessor agreement and the assessor-searcher agreement was significant on all dimensions.

Our experiments suggest that classification of queries into topic categories can be done reliably, even though we had 17 different topics to choose from. This is good news for a future implementation of automatic query classification, because topic plays an important role in query disambiguation and personalisation. The second reliable dimension, spatial sensitivity, is an important dimension for local search: every web search takes place at a physical location, and there are types of queries for which this location is relevant (e.g. the search for restaurants or events). The finding that external assessors can reach moderate agreement with the searcher on this dimension shows the feasibility of recognizing that a query is sensitive to location. The search engine can respond by promoting search results that match the location.

For the implementation of intent classification in a search engine, training data is needed: the features are the query terms (the textual content of the query) and the labels are the values for the dimensions in the classification scheme. Analysis of the queries shows that for many intent dimensions, there is no direct connection between words in the query and the intent of the query. For example, in the 33 queries that were annotated by the searcher with the image modus (e.g. "photosynthesis"; "coen swijnenberg") there were no occurrences of words such as 'image' or 'picture', and only 2 of the 90 queries that were annotated with a high temporal sensitivity contained a time-related query word. This means that for automatic classification, it is difficult to generalize over queries. However, the most likely intent can still be learned for individual queries by following the diversification approach in the ranking of the search results: the engine can learn the probability of intents for specific queries by counting clicks on different types of results. This approach requires a huge number of clicks to be recorded (which is possible for large search engines such as Google), and the long tail of low-frequency queries will not be served.
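The click-counting idea can be sketched as follows. This is our own minimal illustration, not the paper's system: the result-type categories reuse the modus labels, and the add-one smoothing (which falls back to a uniform distribution for unseen long-tail queries, matching the caveat above) is an assumption.

    from collections import defaultdict, Counter

    class ClickIntentModel:
        """Estimates P(modus | query) by counting clicks on result types."""

        def __init__(self):
            self.clicks = defaultdict(Counter)  # query -> {result type: count}

        def record_click(self, query, result_type):
            self.clicks[query.lower()][result_type] += 1

        def intent_distribution(self, query, alpha=1.0,
                                types=("image", "video", "map", "text", "other")):
            # Add-one smoothing; an unseen query gets a uniform distribution,
            # which illustrates why the long tail is not served by this approach.
            counts = self.clicks[query.lower()]
            total = sum(counts.values()) + alpha * len(types)
            return {t: (counts[t] + alpha) / total for t in types}

    model = ClickIntentModel()
    for _ in range(2):
        model.record_click("photosynthesis", "image")
    model.record_click("photosynthesis", "text")
    print(model.intent_distribution("photosynthesis"))  # "image" most probable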
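For the long tail, the supervised setup described above (query terms as features, one label per dimension) remains the alternative. A minimal bag-of-words sketch follows; given the weak word-intent connection just noted, it illustrates the setup rather than a recommended solution, and the toy training data and the choice of scikit-learn are our own.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy data: queries with searcher-provided labels for the modus dimension.
    # One such classifier would be trained per dimension of the scheme.
    queries = ["photosynthesis", "coen swijnenberg", "restaurants delft",
               "train times amsterdam", "cat pictures"]
    modus_labels = ["image", "image", "map", "text", "image"]

    classifier = make_pipeline(CountVectorizer(), LogisticRegression())
    classifier.fit(queries, modus_labels)
    print(classifier.predict(["dog pictures"]))  # generalizes only via shared words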
5. CONCLUSIONS

We found that four of the seven dimensions in our classification scheme could be annotated moderately reliably (κ > 0.4): topic, modus, time sensitivity and spatial sensitivity. An important finding is that queries could not reliably be classified according to the dimension 'action type', which is the original Broder classification. Of the four reliable dimensions, only the annotations on the topic and spatial sensitivity dimensions were valid (κ > 0.4) when compared to the searchers' annotations. This shows that the agreement between external assessors is not a good estimator of the validity of the intent classifications.

In conclusion, we showed that Broder was right to warn that "inferring the user intent from the query is at best an inexact science, but usually a wild guess". Therefore, we encourage the research community to consider, where possible, using query intent classifications by the searchers themselves as test data.

6. REFERENCES

[1] A. Ashkan, C. Clarke, E. Agichtein, and Q. Guo. Classifying and characterizing query intent. In Advances in Information Retrieval, pages 578–586, 2009.
[2] R. Baeza-Yates, L. Calderón-Benavides, and C. González-Caro. The intention behind web queries. In F. Crestani, P. Ferragina, and M. Sanderson, editors, String Processing and Information Retrieval, LNCS 4209, pages 98–109. Springer-Verlag, Berlin Heidelberg, 2006.
[3] A. Broder. A taxonomy of web search. ACM SIGIR Forum, 36:3–10, 2002.
[4] C. González-Caro, L. Calderón-Benavides, R. Baeza-Yates, L. Tansini, and D. Dubhashi. Web queries: The tip of the iceberg of the user's intent. In Workshop on User Modeling for Web Applications, WSDM 2011, 2011.
[5] S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of genre and domain intents. In Information Retrieval Technology, pages 399–409, 2010.
[6] R. White, P. Bennett, and S. Dumais. Predicting short-term interests using activity-based search context. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1009–1018. ACM, 2010.