=Paper= {{Paper |id=None |storemode=property |title=Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs |pdfUrl=https://ceur-ws.org/Vol-1033/paper14.pdf |volume=Vol-1033 |dblpUrl=https://dblp.org/rec/conf/eurohcir/PetersenLS13 }} ==Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs== https://ceur-ws.org/Vol-1033/paper14.pdf

Comparative Study of Search Engine Result Visualisation:
Ranked Lists Versus Graphs

Casper Petersen Christina Lioma Jakob Grue Simonsen
Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science
University of Copenhagen University of Copenhagen University of Copenhagen
cazz@diku.dk c.lioma@diku.dk simonsen@diku.dk

ABSTRACT tially useful or interesting features about how the retrieved
Typically search engine results (SERs) are presented in a data is connected.
ranked list of decreasing estimated relevance to user queries. We present a user study comparing ranked list vs graph-
While familiar to users, ranked lists do not show inherent based SER visualisation interfaces. We use a web crawl of
connections between SERs, e.g. whether SERs are hyper- ca. 50 million documents in English with associated hyper-
linked or authored by the same source. Such potentially link information and 10 participants. We find that ranked
useful connections between SERs can be displayed as graphs. lists result in overall more accurate and faster searches than
We present a preliminary comparative study of ranked lists graph displays, but that the latter result in slightly higher re-
vs graph visualisations of SERs. Experiments with TREC call. We also find overall higher inter-rater agreement about
web search data and a small user study of 10 participants SER relevance when using ranked lists instead of graphs.
show that ranked lists result in more precise and also faster
search sessions than graph visualisations. 2. MOTIVATION
While traditional IR systems successfully support known-
item search [5], what should users do if they want to locate
Categories and Subject Descriptors something from a domain where they have a general interest
H.3.3 [Information Storage and Retrieval]: Information but no specific knowledge [8]? Such exploratory searching
Search and Retrieval comprises a mixture of serendipity, learning, and investiga-
tion and is not supported by contemporary IR systems [5],
prompting users to “develop coping strategies which involve
Keywords [...] the submission of multiple queries and the interactive
Search Engine Result Visualization, Ranked List, Graph exploration of the retrieved document space, selectively fol-
lowing links and passively obtaining cues about where their
next steps lie” [9]. A step towards exploratory search, which
1. INTRODUCTION motivates this work, is to make explicit the hyper-linked
Typically search engine results (SERs) are presented in a structure of the ordered list used by e.g. Google and Ya-
ranked list of decreasing estimated relevance to user queries. hoo. Investigation of such a representation does not exist
Drawbacks of ranked lists include showing only a limited according to our knowledge, but is comparable to Google’s
view of the information space, not showing how similar the Knowledge Graph whose aim is to guide users to other rel-
retrieved documents are and/or how the retrieved docu- evant information from an initial selection.
ments relate to each other [4, 6]. Such potentially use-
ful information could be displayed to users in the form of
SER graphs; these could present at a glance an overview
3. PREVIOUS WORK
of clusters or isolated documents among the SERs, features Earlier work on graph-based SER displays includes Beale
not typically integrated into ranked lists. For instance, di- et al.’s (1997) visualisation of sequences of queries and their
rected/undirected and weighted/unweighted graphs could respective SERs, as well as the work of Shneiderman & Aris
be used to display the direction, causality and strength of (2006) on modelling semantic search aspects as networks
various relations among SERs. Various graph properties (both overviewed in [10]). Treharne et al. (2009) present a
(see [7]), such as the average path length, clustering coef- critique of ranked list displays side by side a range of other
ficient or degree, could be also displayed, reflecting poten- types of visualisation, including not only graphs, but also
cartesian, categorical, spring and set-based displays [6]. This
comparison is analytical rather than empirical. Closest to
ours is the work of Donaldson et al. (2008), who experi-
mentally compare ranked lists to graph-based displays [2].
In their work, graphs model social web information, such
as user tags and ratings, in order to facilitate contextual-
Presented at EuroHCIR2013. Copyright c 2013 for the individual papers ising social media for exploratory web search. They find
by the papers’ authors. Copying permitted only for private and academic that users seem to prefer a hybrid interface that combines
purposes. This volume is published and copyrighted by its editors.
SIGIR 2013 Dublin, Ireland ranked lists with graph displays. Finally, the hyperlinked
. graph representation discussed in the paper allows users to
investigate the result space thereby discovering related and • The right window shows the ranked list of the top-k
potential relevant information that might otherwise be by- SERs. The position of the clicked document in the list
passed. Such representation and comparison to a traditional is clearly marked.
ranked list does not exist according to our knowledge, but We display the SER graph in a standard force-directed
the idea underpinning the graph representation is compara- layout [1]. Our graph layout does not allow for other types
ble with Google’s Knowledge Graph as the aim is to guide of interaction with the graph apart from clicking on it. We
users to other relevant information from an initial selection. reason that for the simple web search tasks we consider,
layouts allowing further interaction may be confusing or
4. INTERFACE DESIGN time-consuming, and that they may be more suited to other
This section presents the two different SER visualisations search tasks, involving for instance decision making, naviga-
used in our study. Our goal is to study the effect of display- tion and exploration of large information spaces.
ing exactly the same information to the user in two different
ways, using ranked list and graph visualisations, respectively. 4.3 Document Snippets
Both the RL and GR interfaces include short query-based
1 docid 4
2 docid
summaries of the top-k SERs (snippets). We construct them
3
3 docid 1 as follows: We extract from each document a window of ±
4 docid 2
5 docid
25 terms surrounding the query terms on either side. Let a
6 docid 5 query consist of 3 terms q1 , q2 , q3 . We extract snippets for
6
(A) (B) all ordered but not necessarily contiguous sequences of query
terms: (q1, q2, q3), (q1 , q2 ), (q1 , q3 ), (q2 , q3 ), (q1 ), (q2 ), (q3 ).
Figure 1: Ranked list (A) and graph (B) representation of the This way, we match all snippets containing query terms in
top-k documents from a query. the order they appear in the query (not as a bag of words),
but we also allow other terms to occur in between query
4.1 Ranked List (RL) Display terms, for instance common modifiers.
We use a standard ranked list SER display, where docu- Several snippets can be extracted per document, but only
ments are presented in decreasing order of their estimated the snippet with the highest TF-IDF score is displayed to
relevance to the user query. The list initially displays only the user. The TF-IDF of each window is calculated as a
the top-k retrieved document ids (docids) with their asso- normalised sum of the TF-IDF weights for each term:
ciated rank (see Figure 1 (A)). When clicked upon, each |w|
document expands to two mini windows, overlaid to the left 1 X |C|
Ss(D) = tf (t, D) × log
and right of the list: |w| t=0 |D ∈ C : t ∈ D|
• The left window shows a document snippet containing where |w| is the number of terms in the window extracted,
the query terms. The snippet provides a brief sum- t ∈ w is a term in the window, tf is the term frequency of t
mary of the document contents that relate to the query in document D from which the snippet is extracted, C is the
in order to aid the user to assess document relevance collection of documents, and Ss(D) is the snippet score for
prior to viewing the whole document [4]. We describe document D. Finally, as research has shown that query term
exactly what the snippet shows and how it is extracted highlighting can be a useful feature for search interfaces [4],
in Section 4.3. we highlight all occurrences of query terms in the snippet.
• The right window shows a graph of the top-k ranked
SERs (see Section 4.2). The position of the clicked 5. EVALUATION
document in the graph is clearly indicated, so users We recruited 2 participants for a pilot study to calibrate
can quickly overview its connections, if any, to other the user interfaces; the results from the pilot study were
top-k retrieved documents. not subsequently used. For the main study, we recruited 10
new participants (9 males, 1 female; average age: 33.05, all
Previously visited documents in the list are colour-marked. with a background in Computer Science) using convenience
4.2 Graph (GR) Display sampling. Each participant was introduced to the two in-
terfaces. Their task was to find and mark as many relevant
We display a SER graph G = (V, E) as a directed graph
documents as possible per query using either interface. For
whose vertices v ∈ V correspond to the top-k retrieved doc-
each new query, the SERs could be shown in either interface.
uments, and edges e ∈ E correspond to links (hyperlinks
Each experiment lasted 30 minutes.
in our case of web documents) between two vertices. Each
Participants did not submit their own queries. The queries
vertex is shown as a shaded circle that displays the rank of
were taken from the TREC Web tracks of 2009-2012 (200
its associated document in the middle, see Figure 1 (B). The
queries in total). This choice allowed us to provide very fast
size of each vertex is scaled according to its out-degree, so
response times to participants (< two seconds, depending
that larger vertex size indicates more outlinks to the other
on disk speed), because search results and their associated
top-k documents. Edge direction points towards the out-
graphs were pre-computed and cached. Alternatively, run-
linked document. Previously visited documents are colour-
ning new queries and plotting their SER graphs on the fly
marked.
would result in notably slower response times that would
When clicked upon, each vertex expands to two mini win-
risk dissatisfying participants. However, a drawback in us-
dows, overlaid to the left and right of the graph:
ing TREC queries is that participants did not necessarily
• The left window shows the same document snippet as have enough context to fully understand the underlying in-
in the RL display. formation needs and correctly assess document relevance.
Ranked List Graph
MAP@20 0.4195 0.3211 50
Relevant
MRR 0.4698 0.3948 40
Not relevant

RECALL@20 0.0067 0.0069

Frequency
30

Table 1: Mean Average Precision (MAP), Mean Reciprocal 20
Rank (MRR) & Recall of the top 20 results.
10

0
To counter this, we allowed participants to skip queries they 0 5 10 15 20 25 30 35 40
Click order
were not comfortable with. To avoid bias, skipping a query
(a)
was allowed after query terms were displayed, but before the
SERs were displayed. 40
We retrieved documents from the TREC ClueWeb09 cat. Relevant
Not relevant

B dataset (ca. 50 million documents crawled from the web 30

Frequency
in 2009), using Indri, version 5.2. The experiments were
20
carried out on a 14 inch monitor with a resolution of 1400
x 1050 pixels. We logged which SERs participants marked 10

relevant, as well as the participants’ click order and time
spent per SER. 0
0 5 10 15 20
Click order

5.1 Findings (b)
In total the 10 participants processed 162 queries (89 queries Figure 2: Click-order and participant relevance assessments for
with the RL interface and 73 with the GR interface) with the (a) ranked list interface and (b) graph interface
mean µ = 16.2, and standard deviation σ = 7.8. Four queries
(two from each interface) were bypassed (2.5% of all pro- Interface Min Max µ σ
Ranked List 1.391 25.476 8.228 4.371
cessed queries).
Graph 3.322 20.963 9.705 3.699
Table 1 shows retrieval effectiveness per interface, aggre-
gated over all queries for the top k = 20 SERs. The ranked Table 2: Time (seconds) spent on each interface.
list is associated with higher, hence better scores than the
graph display for MAP and MRR. MAP is +30.6% better
with ranked lists that with graph displays, meaning that that for the graph display, the majority of participant clicks
overall a higher amount of relevant SERs is found by the before the 5th click correspond to non-relevant documents.
participants at higher ranks in the ranked list as opposed Even though the MRR scores of the graph display indicate
to the graph display. This finding is in agreement with the that the first relevant document occurs around rank posi-
MRR scores, which indicate that the first SER to be as- tion 2.5, we see that participants on average click four other
sessed relevant is likely to occur around rank position 2.13 documents before clicking the relevant document at rank
(1/2.13 = 0.469 ≈ 0.4698) with ranked lists, but around position 2.5. This indicates that in the graph display, par-
rank position 2.55 (1/2.55 = 0.392 ≈ 0.3948) with graph ticipants click documents not necessarily according to their
displays. Conversely, recall is slightly higher with graph dis- rank position (indicated in the centre of each vertex), but
plays. In general, higher recall in this case would indicate rather according to their graph layout or connectivity.
that participants are more likely to find a slightly larger
amount of relevant documents when seeing them as a graph 5.1.2 Time spent
of their hyperlinks. However, the difference in recall between Table 2 shows statistics about the time participants spent
ranked lists and graphs is very small and can hardly be seen on each interface. Overall participants spent less time on
as a reliable indication. the ranked list than on the graph display. This observation,
combined with the retrieval effectiveness measures shown
5.1.1 Click-order in Table 1, indicates that participants conducted overall
On average participants clicked on 9.46 entries per query slightly more precise and faster searches using the ranked
in the ranked list (842 clicks for 89 queries) but only on lists than using graph displays. The time use also suggests
6.7 entries per query in the graph display (490 clicks for 73 that participants are used to standard ranked list interfaces,
queries). The lower number of clicks in the latter case could a type of conditioning not easy to control experimentally.
be due to the extra time it might have taken participants
to understand or navigate the graph. This lower number 5.1.3 Inter-participant agreement
of clicks also agrees with the lower MAP scores presented To investigate how consistent participants were in their
above (if fewer entries were clicked, fewer SERs were as- assessments, we report the inter-rater agreement using Krip-
sessed, hence fewer relevant documents were found in the pendorff’s α [3]. Table 3 reports the agreement between the
top ranks). participants, and Table 4 reports the agreements between
Figures 2a and 2b plot the order of clicks for the ranked participants and the TREC preannotated relevance assess-
list and graph interfaces respectively on the x-axis, against ments per interface. In both cases, only queries annotated
the frequency of clicks on the y-axis. We see that in the more than once by different participants are included (19
ranked list, the first click of the participant is more often queries for the ranked list and 11 for the graph SER).
on a relevant document, but in the graph display, the first The average inter-rater agreements between participants
click is more often on a non-relevant document (as already vary considerably. For the graph interface, α = 0.04471,
indicated by the MRR scores shown above). We also see which suggests lack of agreement between raters. On a query
basis, some queries (query 169 and 44) suggest a compara- Graph Ranked List
Query Raters α Query Raters α
tively much higher agreement whereas others (e.g. query
101 4 0.09559 110 3 0.38654
104 and 184) show a comparatively higher level of disagree- 104 2 -0.17861 119 2 -0.22370
ment. For the ranked list, inter-rater agreement is higher 132 2 0.06561 120 2 0.03146
(α = 0.19813). On a per query basis, quite remarkably, 169 2 0.33625 129 2 0.05600
query 92 had a perfect agreement between raters, while 180 2 -0.08949 132 3 0.01689
queries 175 and 129 also exhibited a moderate to high level 184 2 -0.08949 133 2 0.04398
of agreement. However, most queries show only a low to 3 2 -0.37209 155 2 -0.21067
38 2 -0.05006 165 2 -0.25532
moderate level of agreement or disagreement. 44 2 -0.05861 175 2 -0.07886
Overall, the lack of agreement may indicate the partici- 54 2 -0.25532 180 2 -0.17861
pants’ confusion in assessing the relevance of SERs to pre- 58 2 -0.22917 51 2 -0.05006
typed queries. This may be aggravated by problems in ren- – – – 53 2 -0.24694
dering the HTML snippets into text. Some HTML docu- – – – 74 2 -0.06033
ments were ill-formed, hence their snippets sometimes in- – – – 80 2 -0.24694
– – – 81 3 -0.13634
cluded HTML tags or other not always coherent text. – – – 92 2 -0.21181
Inter-rater agreements between our participants and the – – – 95 2 0.04582
TREC preannotated relevance assessments show an almost – – – 96 2 -0.12919
complete lack of agreement. For both interfaces there is – – – 97 2 0.07813
a weak level of disagreement on average (α = −0.0750 and Average α: -0.0750 Average α: -0.0721
α = −0.0721 for the graph and ranked list respectively). On
a per query basis there are only two queries (queries 169 & Table 4: Inter-rater agreement (α) between participants and
TREC assessments for queries assessed by > 1 participant.
110) exhibiting a moderate level of agreement. For most re-
maining queries our participants’ assessments disagree with
the TREC assessments. session in the analysis (e.g. user task, behaviour, satisfac-
tion). Future work includes addressing the above limitations
Graph Ranked list and also testing whether and to what extent these results ap-
Query Raters α Query Raters α ply when scaling up to wall-sized displays with significantly
101 4 0.28696 110 3 0.41000
104 2 -0.21875 119 2 0.00000
larger screen real estate.
132 2 -0.16071 120 2 0.49351
169 2 0.48000 129 2 0.86022 7. REFERENCES
180 2 -0.10031 132 3 -0.08949
184 2 -0.25806 133 2 0.30108 [1] G. D. Battista, P. Eades, R. Tamassia, and I. G.
3 2 0.00000 155 2 -0.02632 Tollis. Graph drawing: algorithms for the visualization
38 2 -0.07519 175 2 0.49351 of graphs. Prentice Hall PTR, 1998.
44 2 0.49351 180 2 -0.37879 [2] J. J. Donaldson, M. Conover, B. Markines,
58 2 0.00000 51 2 0.00000 H. Roinestad, and F. Menczer. Visualizing social links
– – – 53 2 0.02151
– – – 74 2 -0.14706 in exploratory search. In HT ’08, pages 213–218, New
– – – 80 2 0.14420 York, NY, USA, 2008. ACM.
– – – 81 3 -0.12919 [3] A. F. Hayes and K. Krippendorff. Answering the call
– – – 92 2 1.00000 for a standard reliability measure for coding data.
– – – 95 2 0.15584 Communication Methods and Measures, 1(1):77–89,
– – – 96 2 0.15584
– – – 97 2 0.30179
2007.
Average α: 0.04471 Average α: 0.19813 [4] M. Hearst. Search user interfaces. Cambridge
University Press, 2009.
Table 3: Inter-rater agreement (α) for queries assessed by >1 [5] G. Marchionini. Exploratory search: from finding to
participant. Query is the TREC id of each query. understanding. Communications of the ACM,
49(4):41–46, 2006.
6. CONCLUSIONS [6] K. Treharne and D. M. W. Powers. Search engine
result visualisation: Challenges and opportunities. In
In a small user study, we compared ranked list versus Information Visualisation, pages 633–638, 2009.
graph-based search engine result (SER) visualisation. Our
[7] S. Wasserman and K. Faust. Social network analysis:
motivation was to conduct a preliminary experimental com-
methods and applications. Structural analysis in the
parison of the two for the domain of web search, where doc-
social sciences. Cambridge University Press, 1994.
ument hyperlinks were used to display them as graphs. We
[8] R. W. White, B. Kules, S. M. Drucker, and
found that overall more accurate and faster searches were
M. Schraefel. Supporting exploratory search.
done using ranked lists and that inter-user agreement was
Communications of the ACM, 49(4):36–39, 2006.
overall higher with ranked lists than with graph displays.
Limitations of this study include: (1) using fixed TREC [9] R. W. White, G. Muresan, and G. Marchionini.
queries, instead of allowing users to submit their own queries Workshop on evaluating exploratory search systems.
on the fly; (2) having technical HTML to text rendering SIGIR Forum, 40(2):52–60, 2006.
problems, resulting in sometimes incoherent document snip- [10] M. L. Wilson, B. Kules, B. Shneiderman, et al. From
pets; (3) using only 10 users exclusively from Computer Sci- keyword search to exploration: Designing future
ence, which makes for an overall small and rather biased search interfaces for the web. Foundations and Trends
user sample; (4) not using the wider context of the search in Web Science, 2(1):1–97, 2010.