<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reliability and Validity of Query Intent Assessments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maya Sappelli m.sappelli@cs.ru.nl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten van der Heijden m.vanderheijden@cs.ru.nl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wessel Kraaij w.kraaij@cs.ru.nl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eduard Hoenkamp</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saskia Koldijk</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Suzan Verberne</institution>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Query intent classification</kwd>
        <kwd>User studies</kwd>
        <kwd>Data collection</kwd>
        <kwd>Validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The quality of a search engine critically depends on the
ability to present results that are an adequate response to
the user's query and intent. If the intent (or the most likely
intent) behind a query is known, a search engine can
improve retrieval results by adapting the presented results to
the more specific intent instead of the (underspecified)
query [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Several studies have proposed classification
schemes for query intent. Broder [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] suggested that the
intent of a query can be either informational, navigational or
transactional. He estimated percentages for each of the
categories by presenting AltaVista users with a brief questionnaire
about the purpose of their search after submitting their
query. After manual classification of 1,000 queries he warned
that "inferring the user intent from the query is at best an
inexact science, but usually a wild guess." Later, many
extensions and alternative schemes were proposed, and
more dimensions were added.
      </p>
      <p>
        In many existing intent recognition studies, training and
test data for automatic intent recognition have been created
in the form of annotations by external assessors who are not
the searchers themselves [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">2, 1, 4</xref>
        ]. Post-hoc intent
annotation by external assessors is not ideal; nevertheless, intent
annotations from external judges are widely used in the
community for evaluation or training purposes. Therefore it is
important for the field to get a better understanding of the
quality of this process as an approximation for first-hand
annotation by searchers themselves. Some annotation studies
have investigated the reliability of query intent annotations
by measuring the agreement between two external assessors
on the same query set [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. What these studies do not
measure is the validity of the judgments.
      </p>
      <p>In this paper, we aim to measure the validity of query
intent assessments, i.e., how well an external assessor can
estimate the underlying intent of a searcher's query. We use
a classification scheme to describe search intent.</p>
    </sec>
    <sec id="sec-2">
      <title>2. OUR INTENT CLASSIFICATION SCHEME</title>
      <p>
        We introduce a multi-dimensional classification scheme of
query intent that is inspired by and uses aspects from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our classification scheme consists of the
following dimensions of search intent.
      </p>
      <p>
        1. Topic: categorical, fixed set of categories from the
well-known Open Directory Project (ODP), giving a
general idea of what the query is about.
2. Action type: categorical, consisting of: informational,
navigational and transactional. This is the categorisation
by Broder.
3. Modus: categorical, consisting of: image, video, map,
text and other. This dimension is based on [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
4. Source authority sensitivity: 4-point ordinal scale (high
sensitivity: relevance strongly depends on authority of
source).
5. Spatial sensitivity: 4-point ordinal scale (high
sensitivity: relevance strongly depends on location).
6. Time sensitivity: 4-point ordinal scale (high sensitivity:
relevance strongly depends on time/date).
7. Specificity: 4-point ordinal scale (high specificity: very
specific results desired; low specificity: explorative goal).
      </p>
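      <p>The seven dimensions above can be written down as a single annotation record. The following Python sketch is only an illustration: the class and field names, the coding of the 4-point scales as the integers 1 (low) to 4 (high), and the example query are assumptions made here and are not taken from the annotation form used in the study.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    """Action type dimension (Broder's categorisation)."""
    INFORMATIONAL = "informational"
    NAVIGATIONAL = "navigational"
    TRANSACTIONAL = "transactional"


class Modus(Enum):
    """Modus dimension: the kind of result the searcher wants."""
    IMAGE = "image"
    VIDEO = "video"
    MAP = "map"
    TEXT = "text"
    OTHER = "other"


# Assumed coding of the 4-point ordinal scales: 1 = low, 4 = high.
ORDINAL_SCALE = range(1, 5)

ORDINAL_FIELDS = (
    "source_authority_sensitivity",
    "spatial_sensitivity",
    "time_sensitivity",
    "specificity",
)


@dataclass(frozen=True)
class IntentAnnotation:
    """One annotation of one query on all seven dimensions."""
    query: str
    topic: str  # one ODP category; the full category list is not reproduced here
    action_type: ActionType
    modus: Modus
    source_authority_sensitivity: int
    spatial_sensitivity: int
    time_sensitivity: int
    specificity: int

    def __post_init__(self):
        # Reject values outside the assumed 1..4 coding.
        for name in ORDINAL_FIELDS:
            value = getattr(self, name)
            if value not in ORDINAL_SCALE:
                raise ValueError(f"{name} must be in 1..4, got {value!r}")


# Hypothetical annotation, for illustration only.
example = IntentAnnotation(
    query="restaurants nijmegen",
    topic="recreation",
    action_type=ActionType.INFORMATIONAL,
    modus=Modus.MAP,
    source_authority_sensitivity=2,
    spatial_sensitivity=4,
    time_sensitivity=2,
    specificity=3,
)
```
]]></preformat>
      <p>Keeping the categorical dimensions as enumerations and the ordinal dimensions as bounded integers makes it straightforward to compare a searcher's record with an assessor's record dimension by dimension.</p>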
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS</title>
      <p>In order to obtain labeled queries from search engine users,
we created a plugin for the Mozilla Firefox web browser.
After installation by the user, the plugin locally logs all queries
submitted to Google. We asked colleagues (all academic
scientists and PhD students) to participate in our experiment.
Participants were asked to occasionally (at a self-chosen
moment) annotate the queries they submitted in the last 48
hours, using a form that presented our intent classification
scheme. To guarantee that no sensitive information was
involuntarily submitted, participants were allowed to skip any
query they did not want to submit.</p>
      <p>In total, 11 participants enrolled in the experiment.
Together, they annotated 605 queries with their query intent,
of which 135 were duplicates. On average, each searcher
annotated 55 queries (standard deviation=73). The three topic
categories that were used most frequently in the set of
annotated queries were computer, science and recreation.</p>
      <p>To obtain labels from external assessors we used the same
form as was used by the participants. Four of the authors
acted as external assessors; all queries were assessed by at
least two assessors.</p>
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS</title>
      <p>In order to answer the question "How reliable is our intent
classification scheme as an instrument for measuring search
intent?", we calculated the interobserver reliability as the
agreement between the external assessors using Cohen's κ.
The middle column of Table 1 shows the average agreement
over the assessor pairs for each dimension. For only one of
the seven dimensions from our classification scheme was
substantial agreement (κ of 0.6 or higher) reached. For four of
the seven, at least moderate agreement (κ of 0.4 or higher) was
reached: at least moderately reliable query intent classification
is possible for the dimensions topic, modus, time sensitivity
and spatial sensitivity.</p>
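      <p>For readers who want to reproduce this kind of agreement figure, the sketch below computes unweighted Cohen's κ for one dimension and averages it over assessor pairs. It is an illustration under stated assumptions: the function names and the toy labels are hypothetical, and the text does not say whether a weighted variant of κ was used for the ordinal dimensions, so the unweighted form is shown.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout; kappa is
        # undefined here, and returning 1.0 is a convention assumed in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average kappa over all annotator pairs, using the queries each pair shares.

    `annotations` maps an annotator id to a dict of {query: label}
    for a single dimension.
    """
    scores = []
    for a, b in combinations(sorted(annotations), 2):
        shared = sorted(set(annotations[a]) & set(annotations[b]))
        if shared:
            scores.append(cohens_kappa(
                [annotations[a][q] for q in shared],
                [annotations[b][q] for q in shared],
            ))
    if not scores:
        raise ValueError("no annotator pair shares any query")
    return sum(scores) / len(scores)


# Toy labels on the modus dimension (hypothetical data).
assessor_1 = ["image", "text", "text", "map"]
assessor_2 = ["image", "text", "map", "map"]
kappa = cohens_kappa(assessor_1, assessor_2)
```
]]></preformat>
      <p>In the toy example the two assessors agree on three of four queries (observed agreement 0.75), chance agreement is 5/16, and κ is therefore 7/11, roughly 0.64.</p>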
      <p>In order to answer the question "How valid are the
intent classifications by external assessors?", we compared the
intent classifications by the external assessors to the intent
classifications by the searchers themselves. We calculated
κ-scores per dimension for each assessor–searcher pair. The
rightmost column of Table 1 shows the average agreement
over the assessor–searcher pairs. The table shows that
moderately valid query intent classification is possible on two of
the seven dimensions from our classification scheme: topic
and spatial sensitivity. The difference between the inter-assessor
agreement and the assessor–searcher agreement was
significant on all dimensions.</p>
      <p>Our experiments suggest that classification of queries into
Topic categories can be done reliably, even though we had
17 different topics to choose from. This is good news for a
future implementation of automatic query classification
because topic plays an important role in query disambiguation
and personalisation. The second reliable dimension, Spatial
sensitivity, is an important dimension for local search: every
web search takes place at a physical location, and there are
types of queries for which this location is relevant (e.g. the
search for restaurants or events). The finding that external
assessors can reach a moderate agreement with the searcher
on this dimension shows the feasibility of recognizing that a
query is sensitive to location. The search engine can respond
by promoting search results that match with the location.</p>
      <p>For the implementation of intent classification in a search
engine, training data is needed: the features are the query
terms (the textual content of the query) and the labels are
the values for the dimensions in the classification scheme.
Analysis of the queries shows that for many intent
dimensions, there is no direct connection between words in the
query and the intent of the query. For example, in the
33 queries that were annotated by the searcher with the
image modus (e.g. "photosynthesis"; "coen swijnenberg")
there were no occurrences of words such as 'image' or
'picture', and only 2 of the 90 queries that were annotated with
a high temporal sensitivity contained a time-related query
word. This means that for automatic classification, it is
difficult to generalize over queries. However, the most likely
intent can still be learned for individual queries by
following the diversification approach in the ranking of the search
results: the engine can learn the probability of intents for
specific queries by counting clicks on different types of
results. This approach requires a huge number of clicks to be
recorded (which is possible for large search engines such as
Google), and the long tail of low-frequency queries will not
be served.</p>
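      <p>The click-counting idea can be sketched as a relative-frequency estimate per query. This is a minimal illustration, not a description of any search engine's implementation: the function name, the log format of (query, clicked result type) pairs, the min_clicks threshold and the click data are all assumptions introduced here.</p>
      <preformat><![CDATA[
```python
from collections import Counter, defaultdict


def estimate_intent_probabilities(click_log, min_clicks=1):
    """Estimate, per query, the probability of each result type from clicks.

    `click_log` is an iterable of (query, clicked_result_type) pairs.
    Queries with fewer than `min_clicks` recorded clicks get no estimate,
    which is how the long tail of rare queries ends up unserved.
    """
    counts = defaultdict(Counter)
    for query, result_type in click_log:
        counts[query][result_type] += 1

    estimates = {}
    for query, per_type in counts.items():
        total = sum(per_type.values())
        if total >= min_clicks:
            # Relative click frequency as the probability estimate.
            estimates[query] = {t: n / total for t, n in per_type.items()}
    return estimates


# Hypothetical click log: three image clicks and one text click for one
# query, and a single click for a rare query.
log = (
    [("photosynthesis", "image")] * 3
    + [("photosynthesis", "text")]
    + [("rare query", "map")]
)
estimates = estimate_intent_probabilities(log, min_clicks=2)
```
]]></preformat>
      <p>With this toy log the estimate for "photosynthesis" is 0.75 for image and 0.25 for text, while the rare query receives no estimate because it falls below the threshold.</p>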
    </sec>
    <sec id="sec-5">
      <title>5. CONCLUSIONS</title>
      <p>We found that four of the seven dimensions in our
classification scheme could be annotated moderately reliably
(κ &gt; 0.4): topic, modus, time sensitivity and spatial
sensitivity. An important finding is that queries could not
reliably be classified according to the dimension 'action type',
which is the original Broder classification. Of the four
reliable dimensions, only the annotations on the topic and
spatial sensitivity dimensions were valid (κ &gt; 0.4) when
compared to the searcher's annotations. This shows that
the agreement between external assessors is not a good
estimator of the validity of the intent classifications.</p>
      <p>In conclusion, we showed that Broder was correct with his
warning that "inferring the user intent from the query is at
best an inexact science, but usually a wild guess". Therefore,
we encourage the research community to consider, where
possible, using query intent classifications by the searchers
themselves as test data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , E. Agichtein, and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          .
          <article-title>Classifying and characterizing query intent</article-title>
          .
          <source>Advances in Information Retrieval</source>
          , pages
          <fpage>578</fpage>
          –
          <lpage>586</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Calderon-Benavides</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gonzalez-Caro</surname>
          </string-name>
          .
          <article-title>The Intention Behind Web Queries</article-title>
          . In F. Crestani,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          , and M. Sanderson, editors,
          <source>String Processing and Information Retrieval</source>
          , LNCS
          <volume>4209</volume>
          , pages
          <fpage>98</fpage>
          –
          <lpage>109</lpage>
          , Berlin Heidelberg,
          <year>2006</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Broder</surname>
          </string-name>
          .
          <article-title>A taxonomy of web search</article-title>
          .
          In
          <source>ACM SIGIR Forum</source>
          , volume
          <volume>36</volume>
          , pages
          <fpage>3</fpage>
          –
          <lpage>10</lpage>
          . ACM,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gonzalez-Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Calderon-Benavides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tansini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Dubhashi</surname>
          </string-name>
          .
          <article-title>Web Queries: the Tip of the Iceberg of the User's Intent</article-title>
          . In
          <source>Workshop on User Modeling for Web Applications, WSDM 2011</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sushmita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>Dynamics of genre and domain intents</article-title>
          .
          <source>Information Retrieval Technology</source>
          , pages
          <fpage>399</fpage>
          –
          <lpage>409</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <article-title>Predicting short-term interests using activity-based search context</article-title>
          .
          In
          <source>Proceedings of the 19th ACM International Conference on Information and Knowledge Management</source>
          , pages
          <fpage>1009</fpage>
          –
          <lpage>1018</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>