Evaluating the Impact of Snippet Highlighting in Search

Tereza Iofciu                          Nick Craswell and Milad Shokouhi
L3S Research Center                    Microsoft Bing Search
iofciu@L3S.de                          {nickcr,milads}@microsoft.com

Copyright is held by the author/owner(s). SIGIR'09, July 19-23, 2009, Boston, USA.

ABSTRACT
When viewing a list of search results, users see a snippet of text from each document. For an ambiguous query, the list may contain some documents that match the user's interpretation of the query and some that correspond to a completely different interpretation. We hypothesize that selectively highlighting important words in snippets may help users scan the list for relevant documents. This paper presents a lab experiment where we show users a top-10 list, instruct them to look for a particular interpretation of an ambiguous query, and track the speed and accuracy of their clicks. We find that under certain conditions, the additional highlighting improves the time to click without decreasing the user's ability to identify relevant documents.

1. INTRODUCTION
When users view a list of search results they see 'snippets' of text from the retrieved documents. A snippet helps the user decide whether to click, view and potentially make use of a document. A good snippet gives an indication of whether a document seems relevant and deserves a click.

This paper evaluates lists of snippets in the context of ambiguous queries. For ambiguous queries, a user may be faced with some results that are completely off-topic. For example, when users type the query 'house', they may be looking for information on the US House of Representatives, the TV series House, or real estate. When users type 'microsoft' they may be looking for investment information, products to buy, or technical support. There are multiple interpretations of the query, and it is unlikely that a user wants all of them. Snippets should therefore allow users to quickly reject results that are completely off-topic and scan towards those that are valuable. Accordingly, our experiments involve scanning the result lists of ambiguous queries.

In particular we consider two types of highlighting for the words in snippets. Our baseline approach is similar to the typical interfaces of current web search engines, where the user's query keywords are highlighted in bold. Our other method highlights additional words (in yellow) that are not query words but are important for that particular document. The baseline method always highlights the same words in each snippet, while the new approach highlights the differences between snippets.

For example, for the query 'Cornwall England', where the query intent is not very clear, a search engine retrieves general information pages, like Wikipedia pages, but also pages with tourist information. The baseline highlighting puts only the words 'cornwall' and 'england' in bold. Our new method, in addition, highlights 'tourist', 'Wikipedia' and 'pictures'. This potentially allows, for example, a user who is ready to book their holiday to find travel booking sites more easily. In one experiment the additional highlighting is automatic, in the other it is manual. In both cases the hypothesis is that users will be able to scan towards relevant documents more quickly with the additional highlighting.

2. RELATED WORK
There are many studies in the literature focusing on different aspects of document representation and summarization in the context of information retrieval. Some approaches are evaluated in a task-oriented manner, where speed and accuracy are compared for different search result representations. A recent example of 'extrinsic' evaluation, with references to past studies, is [1].

Alternatively, snippet evaluation can be intrinsic: for example, measuring whether the summary contains important n-grams from the document. These measures, such as DUC's ROUGE¹, are correlated with extrinsic measures and have the advantage of being reusable. The present study is non-standard, so we cannot repeat any existing intrinsic or extrinsic method. Ours is an extrinsic evaluation concerned with lists of summaries.

Our study is similar to the one presented in [5] and later in [4], where the importance of query-biased summaries for web search result representation was demonstrated. A task-oriented evaluation was conducted, similar to [1], where the participants had to fulfill different types of search tasks. In the task-oriented studies the users were free to build their own queries in order to solve the tasks. Similar to our experimental setup, in [3] the queries (TREC topics in this case) and their search results were fixed throughout the experiment.

¹ http://berouge.com/

3. USER STUDY SETUP
This paper describes two rounds of experiments. The main difference between the two is the highlighting method (manual vs. automatic) and the method for selecting ambiguous queries. However, we also made a number of general improvements in our second experiment.

In both experiments our experimental subjects followed a similar procedure. The user is shown an ambiguous query, along with a 'topic description' of how the query should be interpreted; for example, the query 'house' and the description 'information on the TV show'. Then the user clicks a link to indicate that they are ready, and we show the top-10 list for the query (taken from the Microsoft Web search engine). The user's task is to identify and click a document that fits the topic description, and then move on to the next query-topic description. The top-10 results and snippets are always the same for each query, and query words are always highlighted in bold. We only vary whether there is additional highlighting, in yellow, of non-query words.

3.1 Manual Experiment Setup
Our pilot experiment used manual highlighting rather than any realistic method for automatically highlighting extra words in snippets. We describe the manual experiment, although the 'automatic highlighting' experiment improves on it in a number of dimensions.

Selecting the queries. If a query has most of its clicks on a single URL, it is probably not an ambiguous query; it is more likely to be navigational [2]. To select ambiguous queries we first select queries with click skewness smaller than 0.5, from the 'torso' of the query distribution (not a head query, not a tail query). We manually inspected the top-10 list for 100 of these queries, identified 50 that seemed to have results covering more than one topic, and used these as our manual experiment queryset.
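As a rough illustration of this filter, the sketch below assumes per-query click logs; the torso-frequency thresholds and all function names are our own, since the paper only specifies the skewness cutoff of 0.5.

from collections import defaultdict

def click_skewness(url_clicks):
    """Fraction of a query's clicks that fall on its most-clicked URL."""
    total = sum(url_clicks.values())
    return max(url_clicks.values()) / total if total else 0.0

def candidate_ambiguous_queries(click_log, min_freq=100, max_freq=100000):
    """click_log: iterable of (query, url) click events."""
    clicks = defaultdict(lambda: defaultdict(int))
    for query, url in click_log:
        clicks[query][url] += 1
    for query, url_clicks in clicks.items():
        freq = sum(url_clicks.values())
        # Keep 'torso' queries (thresholds assumed) whose clicks are
        # spread over several URLs, i.e. skewness below 0.5.
        if min_freq <= freq <= max_freq and click_skewness(url_clicks) < 0.5:
            yield query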

Query intent. For each of the 50 selected queries, we developed a topic description. The topic was selected to describe some aspect of the query's top-10 results. We also judged the relevance of each result to the topic, and made a second pass where topics and judgments were checked by a second assessor.

Highlighting. Three assessors each viewed the top-10 result snippets and selected 'important' words for highlighting. The result snippets were shown in the order they were retrieved by the search engine. The assessors did so without knowing the query's topic description, to avoid any bias towards that interpretation. In our experiment, we then highlighted any word or phrase that was selected by two or more assessors.
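A minimal sketch of this two-of-three agreement rule, assuming one set of selected phrases per assessor as input:

from collections import Counter

def agreed_highlights(assessor_selections, min_votes=2):
    """assessor_selections: one set of selected phrases per assessor."""
    votes = Counter(p for selection in assessor_selections for p in selection)
    return {p for p, n in votes.items() if n >= min_votes}

# Two of the three assessors picked 'tourist', so only it is highlighted:
agreed_highlights([{'tourist', 'pictures'}, {'tourist'}, {'wikipedia'}])
# -> {'tourist'}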
3.2 Automatic Experiment Setup
After the manual experiment, we noticed that some queries were not really ambiguous (for example 'comet 17p holmes'). This is a problem because it led to the development of a contrived topic, which was confusing to our users and unlikely to agree with our highlighting. In our second experiment, we improved our method for selecting ambiguous queries and introduced an automatic highlighting method.

Selecting the queries and query intents. To help us identify ambiguous queries, we developed a distinctiveness measure for search results based on information from search logs. Session information connects query q and query q′ if query q tends to be followed by q′ within user sessions. Click information connects query q and URL u if we have observed users clicking on search result u for query q.

To calculate our distinctiveness score for a query, such as 'adobe' in Figure 1, we assign queries to the top-10 URLs. The assignment is according to click data; however, we only include queries that are also connected to the original query in session data. For example, the query 'adobe bricks' has a click connection with one URL, and a session connection with 'adobe', so it is associated with that URL.

Figure 1: Ambiguous query and intent selection.

The distinctiveness of a URL is the proportion of its associated queries that were not assigned to any other URL. The output of our process is a set of query-URL pairs with distinctiveness of 0.5 or greater, computed as sketched below.

For the automatic experiment, 40 pairs of query and distinct URL were manually selected from 700 candidates. The query's 'topic description' was 5 of the associated click/session queries, preferring queries with greater numbers of clicks.
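The following sketch illustrates the distinctiveness computation; the data structures and names are our own assumptions, since the paper does not specify an implementation.

def distinctiveness_scores(top10_urls, clicked_urls, session_queries):
    """clicked_urls: maps each query to the set of URLs it has click
    connections with. session_queries: the queries session-connected
    to the original ambiguous query."""
    # Assign each session-connected query to the top-10 URLs it was clicked for.
    assigned = {url: set() for url in top10_urls}
    for q in session_queries:
        for url in clicked_urls.get(q, ()):
            if url in assigned:
                assigned[url].add(q)
    scores = {}
    for url, queries in assigned.items():
        other = set().union(*(qs for u, qs in assigned.items() if u != url))
        # Proportion of this URL's queries assigned to no other top-10 URL.
        scores[url] = len(queries - other) / len(queries) if queries else 0.0
    return scores

# Query-URL pairs scoring 0.5 or greater become experiment candidates.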
Highlighting. We used three approaches for automatic highlighting, combined as sketched after this list:

• Top query phrase. Using click data only (not session data), we highlighted the most popular click query that occurred in the snippet, if any.

• Top URL anchor phrase. If no query phrase was highlighted, we highlighted the most popular incoming anchor phrase that occurred in the snippet. Anchor information came from a large Web search engine.

• Wikipedia disambiguation terms. Where a Wikipedia disambiguation page existed for a given query, such as "Cornwall (disambiguation)"², all the disambiguating entity names were highlighted in the query result page.

The first two approaches can highlight differently for each result in the top-10, since each URL has different click data and incoming anchor text. The third approach was applied globally to the search results.

Figure 2 shows an example of automatic highlighting. As always, the additional highlighting gives the highlighted word/phrase a yellow background.
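One plausible implementation of this fallback order is sketched below; the inputs (popularity-ranked click queries and anchor phrases per result, plus disambiguation titles per query) are assumed to be precomputed, and substring matching is our simplification.

def extra_highlights(snippet, click_queries, anchor_phrases, disambig_terms):
    """Phrases to give a yellow background in one result snippet.
    click_queries and anchor_phrases are assumed ordered by popularity."""
    text = snippet.lower()
    # 1. Top query phrase: most popular click query occurring in the snippet.
    for phrase in click_queries:
        if phrase.lower() in text:
            return [phrase]
    # 2. Top URL anchor phrase, used only if no query phrase was highlighted.
    for phrase in anchor_phrases:
        if phrase.lower() in text:
            return [phrase]
    # 3. Wikipedia disambiguation terms, applied across the result page.
    return [t for t in disambig_terms if t.lower() in text]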
Figure 2: Automatic highlighting for the query "Cornwall".

² http://en.wikipedia.org/wiki/Cornwall_(disambiguation)
Figure 3: Relevant results vs. clicked results. (Ratio per result rank 1-10 for clicked documents, relevant documents, and the shallowest relevant document.)

Table 1: Average time until click

  Highlighting   Time (sec) when relevant   Time (sec) when not relevant
  baseline       20.83                      19.19
  manual         23.24                      27.38

Table 2: Probability of clicking a relevant result

  Highlighting   Relevance (when fast)   Relevance (when slow)
  baseline       0.76                    0.79
  manual         0.78                    0.67
4. EXPERIMENT RESULTS
In both experiments, each user saw all queries. Half the users saw additional highlighting on the odd-numbered queries; the other half saw it on the even-numbered queries. At the end of the experiments the participants were asked to answer a questionnaire.

4.1 Manual Experiment
The manual experiment had 16 participants who each processed 50 queries. We manually judged the relevance of each top-10 result with respect to the chosen interpretation (topic). The same top-10 was also used for topic development (i.e. assigning the desired topic to a query), so upon judging the top-10 there were always one or more relevant documents for the assigned topic. Figure 3 shows that relevant documents were distributed evenly over ranks, but users tended to click documents near the top of the list. This is consistent with our instructions to click the first relevant document found. It also matches the ranks of the 'shallowest relevant document' for each query, i.e. the first relevant document to be found in the top-10 retrieved.

Results indicate that manual highlighting was not useful. Table 1 shows that users were slower when faced with the new highlighting, and that users delayed longer in cases where they eventually clicked an irrelevant document. We then divided our observations into two groups, fast and slow, based on the time to click. We show the accuracy of clicks in Table 2. This again indicates that a delay in the manual highlighting case is associated with making more mistakes.

4.2 Automatic Experiment
The automatic experiment had 8 users who each processed 40 queries. Having identified a number of problems in the manual experiment, we made a number of changes in the automatic experiment. Of course we employed an automatic highlighting method and used a new method for identifying potentially ambiguous queries (see Section 3.2). For each query, users now click the topic description itself to indicate that they are ready to see the top-10; this was intended to reduce the chances of a user ignoring a topic. We also precomputed and optimized the HTML of the top-10 lists, to make them render on the screen more quickly.

Highlighting had a much smaller effect in the automatic experiment than in the manual experiment. In particular, automatic highlighting did not cause users to become both slow and inaccurate for some queries. For example, adding automatic highlighting did not change the click distribution over ranks (Figure 4). The automatic method highlighted fewer words than the manual method, and may have been more consistent.

In the automatic experiment click accuracy was 0.9, compared to 0.75 for the manual experiment. In the automatic experiment, this level of accuracy was maintained with and without the additional highlighting. A breakdown of accuracy differences per query is presented in Figure 5.

Within the automatic experiment, the main effect we observed was the time taken to click. The baseline highlighting had a time until click of 13.5 seconds, while the time for automatic highlighting was 11.2 seconds. Figure 6 shows the difference in average time on a per-query basis.
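For concreteness, the measures reported in this section (average time until click, click accuracy, and the fast/slow split of Section 4.1) could be computed as in the sketch below; the click-record format and the median fast/slow threshold are our assumptions, as the paper does not state its threshold.

from statistics import mean, median

def summarize(clicks):
    """clicks: list of dicts with keys 'condition' (e.g. 'baseline' or
    'highlighted'), 'seconds' (time until click) and 'relevant' (0 or 1)."""
    results = {}
    for cond in {c['condition'] for c in clicks}:
        group = [c for c in clicks if c['condition'] == cond]
        cutoff = median(c['seconds'] for c in group)  # assumed fast/slow split
        fast = [c for c in group if c['seconds'] <= cutoff]
        slow = [c for c in group if c['seconds'] > cutoff]
        results[cond] = {
            'avg_time': mean(c['seconds'] for c in group),
            'accuracy': mean(c['relevant'] for c in group),
            'accuracy_fast': mean(c['relevant'] for c in fast),
            'accuracy_slow': mean(c['relevant'] for c in slow) if slow else None,
        }
    return results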
Figure 4: Click histogram, automatic highlighting vs. baseline highlighting. (Click ratio per result rank 1-10 for the two conditions.)

Figure 5: Accuracy of automatic vs. baseline highlighting. (Percentage of queries per value of accuracy(automatic) − accuracy(baseline), from −1.0 to 1.0.)

Figure 6: Time taken for automatic vs. baseline highlighting. (Number of queries per value of time(automatic) − time(baseline), in seconds.)
4.3 Questionnaire Results
At the end of the experiment the participants had to fill in a questionnaire about the search tasks and their experience with the experiment. In the manual experiment users were more likely to say that there was too much yellow highlighting (the additional highlighting was always yellow).

In both setups, more than 60% of the participants reported that they were sometimes familiar with the search topics, and more than 70% found the connection between the query and the selected intent often understandable.

5. CONCLUSION AND FUTURE WORK
This paper described our experiments in highlighting the important words in the search snippets for ambiguous queries. Unlike many summarization experiments, we tested how easy it was to scan a top-10 list of snippets, rather than the quality of individual snippets.

Our manual experiment was set up with a lot of human effort: manual topic development, manual highlighting of the snippet words selected by two out of three assessors, and full relevance judgments of the top-10s. However, we suspect that some topic descriptions were somewhat 'contrived', having been developed for queries that were not really ambiguous. This may have been confusing our users, who also reported in the post-experiment questionnaire that there was too much highlighting. Overall, showing manual highlighting was associated with slower and less accurate clicks.

Our automatic experiment used a log analysis method to identify queries that seem ambiguous, because they have one distinctive URL in the top-10. Although this set of query-URL pairs still required manual vetting, we believe it was a much cleaner set of ambiguous queries. We also introduced an automatic highlighting method based on click logs, anchors and Wikipedia disambiguation pages. Finally, we made two changes to the experimental interface, by speeding up the software and increasing the focus on topic descriptions by requiring users to click the description before proceeding. In combination, these changes led to us no longer seeing slow and inaccurate click behavior in the presence of highlighting. Instead, click accuracy was maintained, while speed improved by 17%, to about 11.2 seconds per query.

One drawback of our experiments is that we only used ambiguous queries, and there was always a manual vetting procedure during query selection. Therefore we have not studied the influence of highlighting in general. In future work we would like to understand the influence of query type on our experiments, and improve our automatic techniques for discovering ambiguous queries, since it may be desirable to highlight differently for different query types. We also intend to experiment with eye-tracking tools, to measure more directly the influence of highlighting on user attention.

6. REFERENCES
[1] Hideo Joho and Joemon M. Jose. Effectiveness of additional representations for the search result presentation on the web. Inf. Process. Manage., 44(1):226–241, 2008.
[2] Uichin Lee, Zhenyu Liu, and Junghoo Cho. Automatic identification of user goals in web search. In WWW '05, New York, USA, 2005. ACM.
[3] Anastasios Tombros and Mark Sanderson. Advantages of query biased summaries in information retrieval. In SIGIR '98, New York, USA, 1998. ACM.
[4] Ryen White, Joemon M. Jose, and Ian Ruthven. A task-oriented study on the influencing effects of query-biased summarisation in web searching. Inf. Process. Manage., 39(5):707–733, 2003.
[5] Ryen White, Ian Ruthven, and Joemon M. Jose. Web document summarisation: A task-oriented evaluation. In DEXA '01, Washington, DC, USA, 2001.