<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating the Impact of Snippet Highlighting in Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tereza Iofciu</string-name>
          <email>iofciu@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nick Craswell</string-name>
          <email>nickcr@microsoft.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milad Shokouhi</string-name>
          <email>milads@microsoft.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft Bing Search</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>19</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>When viewing a list of search results, users see a snippet of text from each document. For an ambiguous query, the list may contain some documents that match the user's interpretation of the query and some that correspond to a completely different interpretation. We hypothesize that selectively highlighting important words in snippets may help users scan the list for relevant documents. This paper presents a lab experiment where we show the user a top-10 list, instructing them to look for a particular interpretation of an ambiguous query, and track the speed and accuracy of their clicks. We find that under certain conditions, the additional highlighting improves the time to click without decreasing the user's ability to identify relevant documents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>This paper evaluates lists of snippets, in the context of
ambiguous queries. For ambiguous queries, a user may be faced
with some results that are completely off-topic. For
example, when users type the query `house', they may be looking
for information on the US House of Representatives, the TV
series House or real estate. When users type `microsoft' they
may be looking for investment information, products to buy
or technical support. There are multiple interpretations of
the query, and it is unlikely that a user wants all of them.
Therefore snippets should allow users to quickly reject
results that are completely off-topic, and scan towards those
that are valuable. Accordingly, our experiments involve
scanning the result lists of ambiguous queries.</p>
      <p>In particular we consider two types of highlighting for the
words in snippets. Our baseline approach is similar to the
typical interfaces of current web search engines, where
the user's query keywords are highlighted in bold. Our other
method highlights additional words (in yellow) that are not
query words but are important for that particular document.
The baseline method always highlights the same words in
each snippet, while the new approach highlights the
differences between snippets.</p>
      <p>For example, for the query `Cornwall England', where the
query intent is not very clear, a search engine retrieves
general information pages, like Wikipedia pages, but also pages
with tourist information. The baseline highlighting puts
only the words `cornwall' and `england' in bold. Our new
method, in addition, highlights `tourist', `Wikipedia' and
`pictures'. This potentially allows, for example, a user who
is ready to book their holiday to find travel booking sites
more easily. In one experiment the additional highlighting
is automatic, in the other it is manual. In both cases the
hypothesis is that users will be able to scan towards relevant
documents more quickly with the additional highlighting.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        There are many studies in the literature focusing on different
aspects of document representation and summarization in the
context of information retrieval. Some approaches are
evaluated in a task-oriented manner, where speed and accuracy
are compared for different search result representations. A
recent example of `extrinsic' evaluation, with references to
past studies, is [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Alternatively, snippet evaluation can be intrinsic: for
example, measuring whether the summary contains important
n-grams from the document. These measures, such as DUC's
ROUGE (http://berouge.com/), are correlated with extrinsic
measures, and have the advantage of being reusable. The
present study is nonstandard, so we cannot repeat any existing
intrinsic or extrinsic method. Ours is an extrinsic evaluation
concerned with lists of summaries.</p>
      <p>
        Our study is similar to the one presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and later
in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where the importance of query biased summaries for
web search result representation was demonstrated. A
task-oriented evaluation was conducted, similar to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the
participants had to fulfill different types of search tasks. In
the task-oriented studies the users were free to build their
own queries in order to solve the tasks. Similar to our
experimental setup, in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the queries, TREC topics in this
case, and their search results, were fixed throughout
the experiment.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. USER STUDY SETUP</title>
      <p>This paper describes two rounds of experiments. The main
difference between the two is the highlighting method
(manual vs automatic) and the method for selecting ambiguous
queries. However, we made a number of general
improvements in our second experiment.</p>
      <p>In both experiments our experimental subjects followed a
similar procedure. The user is shown an ambiguous query,
along with a `topic description' of how the query should be
interpreted. For example, the query `house' and the
description `information on the TV show'. Then, the user clicks a
link to indicate that they are ready, and we show the
top-10 list for the query (taken from the Microsoft Web search
engine). The user's task is to identify and click a document
that fits the topic description, and then move on to the
next query-topic description. The top-10 results and
snippets are always the same for each query, and query words
are always highlighted in bold. We only vary whether there
is additional highlighting, in yellow, of non-query words.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Manual Experiment Setup</title>
      <p>Our pilot experiment used manual highlighting rather than
any realistic method for automatically highlighting extra
words in snippets. We describe the manual experiment,
although the `automatic highlighting' experiment improves on
it in a number of dimensions.</p>
      <p>
        Selecting the queries. If a query has most of its clicks on
a single URL, it is probably not an ambiguous query. It
is more likely to be navigational [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To select ambiguous
queries we first select queries with skewness smaller than
0.5, from the `torso' of the query distribution (not a head
query, not a tail query). We manually inspected the top-10
lists for 100 of these queries, to identify 50 that seemed to
have results that cover more than one topic, and used these
as our manual experiment query set.
      </p>
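      <p>The paper does not give an exact definition of the skewness measure; a minimal sketch, assuming it is the fraction of a query's clicks that land on its single most-clicked URL, might look like this (the URLs are hypothetical):</p>

```python
from collections import Counter

def click_skew(clicked_urls):
    """Fraction of a query's clicks landing on its single most-clicked URL."""
    counts = Counter(clicked_urls)
    return max(counts.values()) / sum(counts.values())

def is_candidate_ambiguous(clicked_urls, threshold=0.5):
    # Queries dominated by one URL are likely navigational, not ambiguous.
    return click_skew(clicked_urls) < threshold

# Clicks spread over several URLs pass the filter; a navigational query does not.
print(is_candidate_ambiguous(["a.com", "b.com", "c.com", "a.com", "d.com"]))  # True
print(is_candidate_ambiguous(["a.com"] * 9 + ["b.com"]))                      # False
```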
      <p>Query intent. For each of the selected 50 queries, we
developed a topic description. The topic was selected to describe
some aspect of the query's top-10 results. We also judged
the relevance of each result to the topic, and made a second
pass where topics and judgments were checked by a second
assessor.</p>
      <p>Highlighting. Three assessors each viewed the top-10
result snippets and selected `important' words for
highlighting. The result snippets were shown in the order they were
retrieved by the search engine. They did so without knowing
the query's topic description, to avoid any bias towards that
interpretation. In our experiment, we then highlighted any
word or phrase that was selected by two or more assessors.</p>
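      <p>The two-of-three rule above amounts to a simple vote count; a sketch (terms here are single words for brevity, although the assessors could also select phrases):</p>

```python
from collections import Counter

def words_to_highlight(assessor_selections, min_votes=2):
    """Highlight any term selected by at least `min_votes` assessors.

    assessor_selections: one set of selected terms per assessor.
    """
    votes = Counter(term for selection in assessor_selections for term in selection)
    return {term for term, v in votes.items() if v >= min_votes}

picks = [{"tourist", "pictures"}, {"tourist", "wikipedia"}, {"wikipedia", "tourist"}]
print(sorted(words_to_highlight(picks)))  # ['tourist', 'wikipedia']
```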
    </sec>
    <sec id="sec-5">
      <title>3.2 Automatic Experiment Setup</title>
      <p>After the manual experiment, we noticed that some queries
were not really ambiguous (for example `comet 17p holmes').
This is a problem because it led to the development of a
contrived topic, which was confusing to our users and unlikely to
agree with our highlighting. In our second experiment, we
improved our method for selecting ambiguous queries and
introduced an automatic highlighting method.</p>
      <p>Selecting the queries and query intents. To help us
identify ambiguous queries, we developed a distinctiveness
measure for search results based on information from search logs.
Session information connects query q and query q′ if query
q tends to be followed by q′ within user sessions. Click
information connects query q and URL u if we have observed
users clicking on search result u for query q.</p>
      <p>To calculate our distinctiveness score for a query, such as
`adobe' in Figure 1, we assign queries to the top-10 URLs.
The assignment is according to click data; however, we only
include queries that are also connected to the original query
in session data. For example, the query `adobe bricks' has a
click connection with one URL and a session connection with
`adobe', so it is associated with that URL.</p>
      <p>The distinctiveness of a URL is the proportion of its
associated queries that were not assigned to any other URL.
The output of our process is a set of query-URL pairs with
distinctiveness of 0.5 or greater.</p>
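      <p>The assignment and distinctiveness computation described above can be sketched as follows (the query and URL strings are hypothetical; the real system works over large-scale click and session logs):</p>

```python
def assign_queries(top10_urls, click_pairs, session_neighbors):
    """Assign related queries to each top-10 URL.

    click_pairs: (query, url) click connections observed in the logs.
    session_neighbors: queries connected to the original query in session data.
    """
    assigned = {url: set() for url in top10_urls}
    for query, url in click_pairs:
        if url in assigned and query in session_neighbors:
            assigned[url].add(query)
    return assigned

def distinctiveness(url, assigned):
    """Proportion of a URL's queries that were assigned to no other top-10 URL."""
    mine = assigned[url]
    if not mine:
        return 0.0
    others = set().union(*[qs for u, qs in assigned.items() if u != url])
    return len(mine - others) / len(mine)

# `adobe bricks' has a click connection with one URL and a session
# connection with `adobe', so it is associated with that URL.
clicks = {("adobe bricks", "bricks.example"), ("adobe reader", "adobe.com"),
          ("adobe reader", "bricks.example")}
sessions = {"adobe bricks", "adobe reader"}
assigned = assign_queries(["adobe.com", "bricks.example"], clicks, sessions)
print(distinctiveness("bricks.example", assigned))  # 0.5
```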
      <p>For the automatic experiment, 40 pairs of query and
distinct URL were manually selected from 700 candidates. The
query's `topic description' was 5 of the associated click/session
queries, preferring queries with greater numbers of clicks.</p>
      <p>Highlighting. We used three approaches for automatic
highlighting:</p>
      <p>Top query phrase. Using click data only (not session
data) we highlighted the most popular click query that
occurred in the snippet, if any.</p>
      <p>Top URL anchor phrase. If no query phrase was
highlighted, we highlighted the most popular incoming
anchor phrase that occurred in the snippet. Anchor
information came from a large Web search engine.</p>
      <p>Wikipedia disambiguation terms. Where a Wikipedia
disambiguation page existed for a given query, such
as `Cornwall (disambiguation)'
(http://en.wikipedia.org/wiki/Cornwall_(disambiguation)),
all the disambiguating entity names were highlighted in the
query result page.</p>
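      <p>The first two approaches form a simple cascade: try click queries first, then anchor phrases. A sketch, assuming each phrase list is sorted by popularity (most popular first) and matching is case-insensitive:</p>

```python
def choose_highlight(snippet, click_queries, anchor_phrases):
    """Return the most popular click query found in the snippet; failing that,
    the most popular incoming anchor phrase found in the snippet; else None."""
    text = snippet.lower()
    for phrase in click_queries + anchor_phrases:
        if phrase.lower() in text:
            return phrase
    return None

snippet = "Cornwall tourist information, pictures and travel guides."
print(choose_highlight(snippet, ["cornwall hotels", "tourist information"],
                       ["travel guides"]))  # tourist information
```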
      <p>[Figure legend: clicked documents, relevant documents, shallowest relevant document]</p>
      <p>[Figure: automatic highlighting for the query]</p>
      <p>The first two approaches can highlight differently for each
result in the top-10, since each URL has different click data
and incoming anchor text. The third approach was applied
globally to the search results.</p>
    </sec>
    <sec id="sec-6">
      <title>4. EXPERIMENT RESULTS</title>
      <p>In both experiments, each user saw all queries. Half the users
saw additional highlighting on the odd-numbered queries;
the other half saw it on the even-numbered queries. At the
end of the experiments the participants were asked to answer
a questionnaire.</p>
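      <p>One simple way to implement this split, so that every query is seen in both conditions across the two groups (a sketch; the user ids and 1-based query numbers are hypothetical):</p>

```python
def sees_extra_highlighting(user_id, query_number):
    """Group A (even user ids) sees additional highlighting on odd-numbered
    queries; group B (odd user ids) sees it on even-numbered queries."""
    in_group_a = user_id % 2 == 0
    query_is_odd = query_number % 2 == 1
    return query_is_odd == in_group_a

# The two groups see complementary conditions for the same query.
print(sees_extra_highlighting(user_id=2, query_number=1))  # True
print(sees_extra_highlighting(user_id=3, query_number=1))  # False
```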
    </sec>
    <sec id="sec-7">
      <title>4.1 Manual Experiment</title>
      <p>The manual experiment had 16 participants who each
processed 50 queries. We manually judged the relevance of
each top-10 result with respect to the chosen interpretation
(topic). The same top-10 was also used for topic
development (i.e. assigning the desired topic to a query), so upon
judging the top-10 there were always one or more relevant
documents found for the assigned topic. Figure 3 shows that
relevant documents were distributed evenly over ranks, but
users tended to click documents near the top of the list. This
is consistent with our instructions to click the first relevant
document found. It also matches the ranks of the
`shallowest relevant document' for each query, i.e. the first relevant
document to be found in the top-10 retrieved.</p>
      <p>Results indicate that manual highlighting was not useful.
Table 1 shows that users were slower when faced with the
new highlighting, and users delayed longer in cases where
they eventually clicked an irrelevant document. We then
divided our observations into two groups, fast and slow, based
on the time to click. We show the accuracy of clicks in
Table 2. This again indicates that a delay in the manual
highlighting case is associated with making more mistakes.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2 Automatic Experiment</title>
      <p>The automatic experiment had 8 users who each processed
40 queries. Having identified a number of problems in the
manual experiment, we made a number of changes in the
automatic experiment. We employed an automatic
highlighting method and used a new method for identifying
potentially ambiguous queries (see Section 3.2). For each
query, users now click the topic description itself to indicate
that they are ready to see the top-10. This was intended
to reduce the chances of a user ignoring a topic. We also
precomputed and optimized the HTML of top-10 lists, to
make them render on the screen more quickly.</p>
      <p>Highlighting had a much smaller effect in the automatic
experiment than in the manual experiment. In particular,
automatic highlighting did not cause users to become both
slow and inaccurate for some queries. For example, adding
automatic highlighting did not change the click distribution
over ranks (Figure 4). The automatic method highlighted
fewer words than the manual method, and may have been
more consistent.</p>
      <p>In the automatic experiment click accuracy was 0.9,
compared to 0.75 for the manual experiment. In the automatic
experiment, this level of accuracy was maintained with and
without the additional highlighting. A breakdown of
accuracy differences per-query is presented in Figure 5.
Within the automatic experiment, the main effect we
observed was the time taken to click. The baseline
highlighting had a time to click of 13.5 seconds, while the time for
automatic highlighting was 11.2 seconds. Figure 6 shows the
difference in average time on a per-query basis.</p>
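      <p>The reported 17% speed-up follows directly from the two mean times:</p>

```python
baseline, automatic = 13.5, 11.2  # mean seconds until click
improvement = (baseline - automatic) / baseline
print(f"{improvement:.0%}")  # 17%
```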
      <p>[Figure 6: per-query difference in average time to click; legend distinguishes baseline highlighting from automatic highlighting]</p>
      <p>Unlike many summarization experiments, we tested how
easy it was to scan a top-10 list of snippets, rather than
the quality of individual snippets.</p>
      <p>Our manual experiment was set up with a lot of human
effort: manual topic development, manual highlighting of
the snippet words selected by two out of three assessors,
and full relevance judgments of the top-10s. However, we
suspect that some topic descriptions were somewhat
`contrived', having been developed for queries that were not
really ambiguous. This may have confused our users,
who also reported in the post-experiment questionnaire that
there was too much highlighting. Overall, showing manual
highlighting was associated with slower and less accurate
clicks.</p>
      <p>Our automatic experiment used a log analysis method to
identify queries that seem ambiguous, because they have one
distinctive URL in the top-10. Although this set of
query-URL pairs still required manual vetting, we believe it was a
much cleaner set of ambiguous queries. We also introduced
an automatic highlighting method based on click logs,
anchors and Wikipedia disambiguation pages. Finally, we made
two changes to the experimental interface, speeding up
the software and increasing the focus on topic descriptions
by forcing users to click the description before proceeding. In
combination, these changes meant we no longer saw slow
and inaccurate click behavior in the presence of
highlighting. Instead, click accuracy was maintained, while speed
improved by 17%, to about 11.2 seconds per query.</p>
      <p>One drawback of our experiments is that we only used
ambiguous queries, and there was always a manual vetting
procedure during query selection. Therefore we have not
studied the influence of highlighting in general. In future work
we would like to understand the influence of query type on
our experiments, and improve our automatic techniques for
discovering ambiguous queries, since it may be desirable to
highlight differently for different query types. We also intend
to experiment with eye-tracking tools, to measure more
directly the influence of highlighting on user attention.</p>
      <p>[Figure 5: per-query accuracy (automatic) minus accuracy (baseline); axis values 0.25 to 1.0]</p>
    </sec>
    <sec id="sec-9">
      <title>4.3 Questionnaire Results</title>
      <p>At the end of the experiment the participants had to fill in
a questionnaire about the search tasks and their experience
with the experiment. In the manual experiment users were
more likely to say that there was too much yellow
highlighting (the additional highlighting was always yellow).
In both setups more than 60% of the participants reported
having been at least sometimes familiar with the search
topics, and more than 70% found the connection between
the query and the selected intent often understandable.</p>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSION AND FUTURE WORK</title>
      <p>This paper described our experiments in highlighting
important words in search snippets for ambiguous queries.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>Hideo</given-names> <surname>Joho</surname></string-name> and <string-name><given-names>Joemon M.</given-names> <surname>Jose</surname></string-name>. <article-title>Effectiveness of additional representations for the search result presentation on the web</article-title>. <source>Inf. Process. Manage.</source>, <volume>44</volume>(<issue>1</issue>):<fpage>226</fpage>-<lpage>241</lpage>, <year>2008</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>Uichin</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>Zhenyu</given-names> <surname>Liu</surname></string-name>, and <string-name><given-names>Junghoo</given-names> <surname>Cho</surname></string-name>. <article-title>Automatic identification of user goals in web search</article-title>. <source>In WWW '05</source>, New York, USA, <year>2005</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>Anastasios</given-names> <surname>Tombros</surname></string-name> and <string-name><given-names>Mark</given-names> <surname>Sanderson</surname></string-name>. <article-title>Advantages of query biased summaries in information retrieval</article-title>. <source>In SIGIR '98</source>, New York, USA, <year>1998</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>Ryen</given-names> <surname>White</surname></string-name>, <string-name><given-names>Joemon M.</given-names> <surname>Jose</surname></string-name>, and <string-name><given-names>Ian</given-names> <surname>Ruthven</surname></string-name>. <article-title>A task-oriented study on the influencing effects of query-biased summarisation in web searching</article-title>. <source>Inf. Process. Manage.</source>, <volume>39</volume>(<issue>5</issue>):<fpage>707</fpage>-<lpage>733</lpage>, <year>2003</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>Ryen</given-names> <surname>White</surname></string-name>, <string-name><given-names>Ian</given-names> <surname>Ruthven</surname></string-name>, and <string-name><given-names>Joemon M.</given-names> <surname>Jose</surname></string-name>. <article-title>Web document summarisation: A task-oriented evaluation</article-title>. <source>In DEXA '01</source>, Washington, DC, USA, <year>2001</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>