=Paper= {{Paper |id=None |storemode=property |title=Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs |pdfUrl=https://ceur-ws.org/Vol-1033/paper14.pdf |volume=Vol-1033 |dblpUrl=https://dblp.org/rec/conf/eurohcir/PetersenLS13 }} ==Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs== https://ceur-ws.org/Vol-1033/paper14.pdf
Comparative Study of Search Engine Result Visualisation:
             Ranked Lists Versus Graphs

                 Casper Petersen                              Christina Lioma                 Jakob Grue Simonsen
             Dept. of Computer Science                  Dept. of Computer Science             Dept. of Computer Science
             University of Copenhagen                   University of Copenhagen              University of Copenhagen
                    cazz@diku.dk                             c.lioma@diku.dk                    simonsen@diku.dk


ABSTRACT                                                                  tially useful or interesting features about how the retrieved
Typically search engine results (SERs) are presented in a                 data is connected.
ranked list of decreasing estimated relevance to user queries.               We present a user study comparing ranked list vs graph-
While familiar to users, ranked lists do not show inherent                based SER visualisation interfaces. We use a web crawl of
connections between SERs, e.g. whether SERs are hyper-                    ca. 50 million documents in English with associated hyper-
linked or authored by the same source. Such potentially                   link information and 10 participants. We find that ranked
useful connections between SERs can be displayed as graphs.               lists result in overall more accurate and faster searches than
We present a preliminary comparative study of ranked lists                graph displays, but that the latter result in slightly higher re-
vs graph visualisations of SERs. Experiments with TREC                    call. We also find overall higher inter-rater agreement about
web search data and a small user study of 10 participants                 SER relevance when using ranked lists instead of graphs.
show that ranked lists result in more precise and also faster
search sessions than graph visualisations.                                2.   MOTIVATION
                                                                              While traditional IR systems successfully support known-
                                                                          item search [5], what should users do if they want to locate
Categories and Subject Descriptors                                        something from a domain where they have a general interest
H.3.3 [Information Storage and Retrieval]: Information                    but no specific knowledge [8]? Such exploratory searching
Search and Retrieval                                                      comprises a mixture of serendipity, learning, and investiga-
                                                                          tion and is not supported by contemporary IR systems [5],
                                                                          prompting users to “develop coping strategies which involve
Keywords                                                                  [...] the submission of multiple queries and the interactive
Search Engine Result Visualization, Ranked List, Graph                    exploration of the retrieved document space, selectively fol-
                                                                          lowing links and passively obtaining cues about where their
                                                                          next steps lie” [9]. A step towards exploratory search, which
1.    INTRODUCTION                                                        motivates this work, is to make explicit the hyper-linked
   Typically search engine results (SERs) are presented in a              structure of the ordered list used by e.g. Google and Ya-
ranked list of decreasing estimated relevance to user queries.            hoo. Investigation of such a representation does not exist
Drawbacks of ranked lists include showing only a limited                  according to our knowledge, but is comparable to Google’s
view of the information space, not showing how similar the                Knowledge Graph whose aim is to guide users to other rel-
retrieved documents are and/or how the retrieved docu-                    evant information from an initial selection.
ments relate to each other [4, 6]. Such potentially use-
ful information could be displayed to users in the form of
SER graphs; these could present at a glance an overview
                                                                          3.   PREVIOUS WORK
of clusters or isolated documents among the SERs, features                   Earlier work on graph-based SER displays includes Beale
not typically integrated into ranked lists. For instance, di-             et al.’s (1997) visualisation of sequences of queries and their
rected/undirected and weighted/unweighted graphs could                    respective SERs, as well as the work of Shneiderman & Aris
be used to display the direction, causality and strength of               (2006) on modelling semantic search aspects as networks
various relations among SERs. Various graph properties                    (both overviewed in [10]). Treharne et al. (2009) present a
(see [7]), such as the average path length, clustering coef-              critique of ranked list displays side by side a range of other
ficient or degree, could be also displayed, reflecting poten-             types of visualisation, including not only graphs, but also
                                                                          cartesian, categorical, spring and set-based displays [6]. This
                                                                          comparison is analytical rather than empirical. Closest to
                                                                          ours is the work of Donaldson et al. (2008), who experi-
                                                                          mentally compare ranked lists to graph-based displays [2].
                                                                          In their work, graphs model social web information, such
                                                                          as user tags and ratings, in order to facilitate contextual-
Presented at EuroHCIR2013. Copyright c 2013 for the individual papers     ising social media for exploratory web search. They find
by the papers’ authors. Copying permitted only for private and academic   that users seem to prefer a hybrid interface that combines
purposes. This volume is published and copyrighted by its editors.
SIGIR 2013 Dublin, Ireland                                                ranked lists with graph displays. Finally, the hyperlinked
.                                                                         graph representation discussed in the paper allows users to
investigate the result space thereby discovering related and           • The right window shows the ranked list of the top-k
potential relevant information that might otherwise be by-               SERs. The position of the clicked document in the list
passed. Such representation and comparison to a traditional              is clearly marked.
ranked list does not exist according to our knowledge, but           We display the SER graph in a standard force-directed
the idea underpinning the graph representation is compara-        layout [1]. Our graph layout does not allow for other types
ble with Google’s Knowledge Graph as the aim is to guide          of interaction with the graph apart from clicking on it. We
users to other relevant information from an initial selection.    reason that for the simple web search tasks we consider,
                                                                  layouts allowing further interaction may be confusing or
4.    INTERFACE DESIGN                                            time-consuming, and that they may be more suited to other
  This section presents the two different SER visualisations      search tasks, involving for instance decision making, naviga-
used in our study. Our goal is to study the effect of display-    tion and exploration of large information spaces.
ing exactly the same information to the user in two different
ways, using ranked list and graph visualisations, respectively.   4.3     Document Snippets
                                                                     Both the RL and GR interfaces include short query-based
                1   docid                 4
                2   docid
                                                                  summaries of the top-k SERs (snippets). We construct them
                                3
                3   docid                     1                   as follows: We extract from each document a window of ±
                4   docid            2
                5   docid
                                                                  25 terms surrounding the query terms on either side. Let a
                6   docid   5                                     query consist of 3 terms q1 , q2 , q3 . We extract snippets for
                                          6
                     (A)            (B)                           all ordered but not necessarily contiguous sequences of query
                                                                  terms: (q1, q2, q3), (q1 , q2 ), (q1 , q3 ), (q2 , q3 ), (q1 ), (q2 ), (q3 ).
Figure 1: Ranked list (A) and graph (B) representation of the     This way, we match all snippets containing query terms in
top-k documents from a query.                                     the order they appear in the query (not as a bag of words),
                                                                  but we also allow other terms to occur in between query
4.1     Ranked List (RL) Display                                  terms, for instance common modifiers.
   We use a standard ranked list SER display, where docu-            Several snippets can be extracted per document, but only
ments are presented in decreasing order of their estimated        the snippet with the highest TF-IDF score is displayed to
relevance to the user query. The list initially displays only     the user. The TF-IDF of each window is calculated as a
the top-k retrieved document ids (docids) with their asso-        normalised sum of the TF-IDF weights for each term:
ciated rank (see Figure 1 (A)). When clicked upon, each                             |w|                                               
document expands to two mini windows, overlaid to the left                       1 X                                     |C|
                                                                       Ss(D) =          tf (t, D) × log
and right of the list:                                                          |w| t=0                        |D ∈ C : t ∈ D|
     • The left window shows a document snippet containing        where |w| is the number of terms in the window extracted,
       the query terms. The snippet provides a brief sum-         t ∈ w is a term in the window, tf is the term frequency of t
       mary of the document contents that relate to the query     in document D from which the snippet is extracted, C is the
       in order to aid the user to assess document relevance      collection of documents, and Ss(D) is the snippet score for
       prior to viewing the whole document [4]. We describe       document D. Finally, as research has shown that query term
       exactly what the snippet shows and how it is extracted     highlighting can be a useful feature for search interfaces [4],
       in Section 4.3.                                            we highlight all occurrences of query terms in the snippet.
     • The right window shows a graph of the top-k ranked
       SERs (see Section 4.2). The position of the clicked        5.    EVALUATION
       document in the graph is clearly indicated, so users          We recruited 2 participants for a pilot study to calibrate
       can quickly overview its connections, if any, to other     the user interfaces; the results from the pilot study were
       top-k retrieved documents.                                 not subsequently used. For the main study, we recruited 10
                                                                  new participants (9 males, 1 female; average age: 33.05, all
Previously visited documents in the list are colour-marked.       with a background in Computer Science) using convenience
4.2     Graph (GR) Display                                        sampling. Each participant was introduced to the two in-
                                                                  terfaces. Their task was to find and mark as many relevant
   We display a SER graph G = (V, E) as a directed graph
                                                                  documents as possible per query using either interface. For
whose vertices v ∈ V correspond to the top-k retrieved doc-
                                                                  each new query, the SERs could be shown in either interface.
uments, and edges e ∈ E correspond to links (hyperlinks
                                                                  Each experiment lasted 30 minutes.
in our case of web documents) between two vertices. Each
                                                                     Participants did not submit their own queries. The queries
vertex is shown as a shaded circle that displays the rank of
                                                                  were taken from the TREC Web tracks of 2009-2012 (200
its associated document in the middle, see Figure 1 (B). The
                                                                  queries in total). This choice allowed us to provide very fast
size of each vertex is scaled according to its out-degree, so
                                                                  response times to participants (< two seconds, depending
that larger vertex size indicates more outlinks to the other
                                                                  on disk speed), because search results and their associated
top-k documents. Edge direction points towards the out-
                                                                  graphs were pre-computed and cached. Alternatively, run-
linked document. Previously visited documents are colour-
                                                                  ning new queries and plotting their SER graphs on the fly
marked.
                                                                  would result in notably slower response times that would
   When clicked upon, each vertex expands to two mini win-
                                                                  risk dissatisfying participants. However, a drawback in us-
dows, overlaid to the left and right of the graph:
                                                                  ing TREC queries is that participants did not necessarily
     • The left window shows the same document snippet as         have enough context to fully understand the underlying in-
       in the RL display.                                         formation needs and correctly assess document relevance.
                             Ranked List   Graph
             MAP@20            0.4195      0.3211                                             50
                                                                                                                                             Relevant
             MRR               0.4698      0.3948                                             40
                                                                                                                                             Not relevant

             RECALL@20         0.0067      0.0069




                                                                                  Frequency
                                                                                              30

Table 1: Mean Average Precision (MAP), Mean Reciprocal                                        20
Rank (MRR) & Recall of the top 20 results.
                                                                                              10


                                                                                              0
To counter this, we allowed participants to skip queries they                                      0   5   10   15       20        25   30      35          40
                                                                                                                     Click order
were not comfortable with. To avoid bias, skipping a query
                                                                                                                   (a)
was allowed after query terms were displayed, but before the
SERs were displayed.                                                                          40
  We retrieved documents from the TREC ClueWeb09 cat.                                                                                        Relevant
                                                                                                                                             Not relevant

B dataset (ca. 50 million documents crawled from the web                                      30




                                                                                  Frequency
in 2009), using Indri, version 5.2. The experiments were
                                                                                              20
carried out on a 14 inch monitor with a resolution of 1400
x 1050 pixels. We logged which SERs participants marked                                       10

relevant, as well as the participants’ click order and time
spent per SER.                                                                                0
                                                                                                   0        5            10             15                  20
                                                                                                                     Click order


5.1     Findings                                                                                                   (b)
   In total the 10 participants processed 162 queries (89 queries   Figure 2: Click-order and participant relevance assessments for
with the RL interface and 73 with the GR interface) with            the (a) ranked list interface and (b) graph interface
mean µ = 16.2, and standard deviation σ = 7.8. Four queries
(two from each interface) were bypassed (2.5% of all pro-                     Interface                     Min         Max               µ                   σ
                                                                              Ranked List                  1.391       25.476           8.228               4.371
cessed queries).
                                                                              Graph                        3.322       20.963           9.705               3.699
   Table 1 shows retrieval effectiveness per interface, aggre-
gated over all queries for the top k = 20 SERs. The ranked                  Table 2: Time (seconds) spent on each interface.
list is associated with higher, hence better scores than the
graph display for MAP and MRR. MAP is +30.6% better
with ranked lists that with graph displays, meaning that            that for the graph display, the majority of participant clicks
overall a higher amount of relevant SERs is found by the            before the 5th click correspond to non-relevant documents.
participants at higher ranks in the ranked list as opposed          Even though the MRR scores of the graph display indicate
to the graph display. This finding is in agreement with the         that the first relevant document occurs around rank posi-
MRR scores, which indicate that the first SER to be as-             tion 2.5, we see that participants on average click four other
sessed relevant is likely to occur around rank position 2.13        documents before clicking the relevant document at rank
(1/2.13 = 0.469 ≈ 0.4698) with ranked lists, but around             position 2.5. This indicates that in the graph display, par-
rank position 2.55 (1/2.55 = 0.392 ≈ 0.3948) with graph             ticipants click documents not necessarily according to their
displays. Conversely, recall is slightly higher with graph dis-     rank position (indicated in the centre of each vertex), but
plays. In general, higher recall in this case would indicate        rather according to their graph layout or connectivity.
that participants are more likely to find a slightly larger
amount of relevant documents when seeing them as a graph            5.1.2     Time spent
of their hyperlinks. However, the difference in recall between         Table 2 shows statistics about the time participants spent
ranked lists and graphs is very small and can hardly be seen        on each interface. Overall participants spent less time on
as a reliable indication.                                           the ranked list than on the graph display. This observation,
                                                                    combined with the retrieval effectiveness measures shown
5.1.1    Click-order                                                in Table 1, indicates that participants conducted overall
   On average participants clicked on 9.46 entries per query        slightly more precise and faster searches using the ranked
in the ranked list (842 clicks for 89 queries) but only on          lists than using graph displays. The time use also suggests
6.7 entries per query in the graph display (490 clicks for 73       that participants are used to standard ranked list interfaces,
queries). The lower number of clicks in the latter case could       a type of conditioning not easy to control experimentally.
be due to the extra time it might have taken participants
to understand or navigate the graph. This lower number              5.1.3     Inter-participant agreement
of clicks also agrees with the lower MAP scores presented             To investigate how consistent participants were in their
above (if fewer entries were clicked, fewer SERs were as-           assessments, we report the inter-rater agreement using Krip-
sessed, hence fewer relevant documents were found in the            pendorff’s α [3]. Table 3 reports the agreement between the
top ranks).                                                         participants, and Table 4 reports the agreements between
   Figures 2a and 2b plot the order of clicks for the ranked        participants and the TREC preannotated relevance assess-
list and graph interfaces respectively on the x-axis, against       ments per interface. In both cases, only queries annotated
the frequency of clicks on the y-axis. We see that in the           more than once by different participants are included (19
ranked list, the first click of the participant is more often       queries for the ranked list and 11 for the graph SER).
on a relevant document, but in the graph display, the first           The average inter-rater agreements between participants
click is more often on a non-relevant document (as already          vary considerably. For the graph interface, α = 0.04471,
indicated by the MRR scores shown above). We also see               which suggests lack of agreement between raters. On a query
basis, some queries (query 169 and 44) suggest a compara-                       Graph                      Ranked List
                                                                      Query    Raters          α   Query    Raters          α
tively much higher agreement whereas others (e.g. query
                                                                        101      4       0.09559     110      3       0.38654
104 and 184) show a comparatively higher level of disagree-             104      2      -0.17861     119      2      -0.22370
ment. For the ranked list, inter-rater agreement is higher              132      2       0.06561     120      2       0.03146
(α = 0.19813). On a per query basis, quite remarkably,                  169      2       0.33625     129      2       0.05600
query 92 had a perfect agreement between raters, while                  180      2      -0.08949     132      3       0.01689
queries 175 and 129 also exhibited a moderate to high level             184      2      -0.08949     133      2       0.04398
of agreement. However, most queries show only a low to                    3      2      -0.37209     155      2      -0.21067
                                                                         38      2      -0.05006     165      2      -0.25532
moderate level of agreement or disagreement.                             44      2      -0.05861     175      2      -0.07886
   Overall, the lack of agreement may indicate the partici-              54      2      -0.25532     180      2      -0.17861
pants’ confusion in assessing the relevance of SERs to pre-              58      2      -0.22917      51      2      -0.05006
typed queries. This may be aggravated by problems in ren-                 –      –             –      53      2      -0.24694
dering the HTML snippets into text. Some HTML docu-                       –      –             –      74      2      -0.06033
ments were ill-formed, hence their snippets sometimes in-                 –      –             –      80      2      -0.24694
                                                                          –      –             –      81      3      -0.13634
cluded HTML tags or other not always coherent text.                       –      –             –      92      2      -0.21181
   Inter-rater agreements between our participants and the                –      –             –      95      2       0.04582
TREC preannotated relevance assessments show an almost                    –      –             –      96      2      -0.12919
complete lack of agreement. For both interfaces there is                  –      –             –      97      2       0.07813
a weak level of disagreement on average (α = −0.0750 and                      Average α: -0.0750           Average α: -0.0721
α = −0.0721 for the graph and ranked list respectively). On
a per query basis there are only two queries (queries 169 &      Table 4: Inter-rater agreement (α) between participants and
                                                                 TREC assessments for queries assessed by > 1 participant.
110) exhibiting a moderate level of agreement. For most re-
maining queries our participants’ assessments disagree with
the TREC assessments.                                            session in the analysis (e.g. user task, behaviour, satisfac-
                                                                 tion). Future work includes addressing the above limitations
               Graph                      Ranked list            and also testing whether and to what extent these results ap-
     Query    Raters          α   Query    Raters           α    ply when scaling up to wall-sized displays with significantly
       101       4      0.28696     110       3       0.41000
       104       2     -0.21875     119       2       0.00000
                                                                 larger screen real estate.
       132       2     -0.16071     120       2       0.49351
       169       2      0.48000     129       2       0.86022    7.    REFERENCES
       180       2     -0.10031     132       3     -0.08949
       184       2     -0.25806     133       2       0.30108     [1] G. D. Battista, P. Eades, R. Tamassia, and I. G.
         3       2      0.00000     155       2     -0.02632          Tollis. Graph drawing: algorithms for the visualization
        38       2     -0.07519     175       2       0.49351         of graphs. Prentice Hall PTR, 1998.
        44       2      0.49351     180       2     -0.37879      [2] J. J. Donaldson, M. Conover, B. Markines,
        58       2      0.00000      51       2       0.00000         H. Roinestad, and F. Menczer. Visualizing social links
         –       –            –      53       2       0.02151
         –       –            –      74       2     -0.14706          in exploratory search. In HT ’08, pages 213–218, New
         –       –            –      80       2       0.14420         York, NY, USA, 2008. ACM.
         –       –            –      81       3     -0.12919      [3] A. F. Hayes and K. Krippendorff. Answering the call
         –       –            –      92       2       1.00000         for a standard reliability measure for coding data.
         –       –            –      95       2       0.15584         Communication Methods and Measures, 1(1):77–89,
         –       –            –      96       2       0.15584
         –       –            –      97       2       0.30179
                                                                      2007.
             Average α: 0.04471           Average α: 0.19813      [4] M. Hearst. Search user interfaces. Cambridge
                                                                      University Press, 2009.
Table 3: Inter-rater agreement (α) for queries assessed by >1     [5] G. Marchionini. Exploratory search: from finding to
participant. Query is the TREC id of each query.                      understanding. Communications of the ACM,
                                                                      49(4):41–46, 2006.
6.    CONCLUSIONS                                                 [6] K. Treharne and D. M. W. Powers. Search engine
                                                                      result visualisation: Challenges and opportunities. In
  In a small user study, we compared ranked list versus               Information Visualisation, pages 633–638, 2009.
graph-based search engine result (SER) visualisation. Our
                                                                  [7] S. Wasserman and K. Faust. Social network analysis:
motivation was to conduct a preliminary experimental com-
                                                                      methods and applications. Structural analysis in the
parison of the two for the domain of web search, where doc-
                                                                      social sciences. Cambridge University Press, 1994.
ument hyperlinks were used to display them as graphs. We
                                                                  [8] R. W. White, B. Kules, S. M. Drucker, and
found that overall more accurate and faster searches were
                                                                      M. Schraefel. Supporting exploratory search.
done using ranked lists and that inter-user agreement was
                                                                      Communications of the ACM, 49(4):36–39, 2006.
overall higher with ranked lists than with graph displays.
Limitations of this study include: (1) using fixed TREC           [9] R. W. White, G. Muresan, and G. Marchionini.
queries, instead of allowing users to submit their own queries        Workshop on evaluating exploratory search systems.
on the fly; (2) having technical HTML to text rendering               SIGIR Forum, 40(2):52–60, 2006.
problems, resulting in sometimes incoherent document snip-       [10] M. L. Wilson, B. Kules, B. Shneiderman, et al. From
pets; (3) using only 10 users exclusively from Computer Sci-          keyword search to exploration: Designing future
ence, which makes for an overall small and rather biased              search interfaces for the web. Foundations and Trends
user sample; (4) not using the wider context of the search            in Web Science, 2(1):1–97, 2010.