=Paper=
{{Paper
|id=None
|storemode=property
|title=Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs
|pdfUrl=https://ceur-ws.org/Vol-1033/paper14.pdf
|volume=Vol-1033
|dblpUrl=https://dblp.org/rec/conf/eurohcir/PetersenLS13
}}
==Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs==
Casper Petersen, Christina Lioma, Jakob Grue Simonsen
Dept. of Computer Science, University of Copenhagen
cazz@diku.dk, c.lioma@diku.dk, simonsen@diku.dk

''Presented at EuroHCIR2013, SIGIR 2013, Dublin, Ireland. Copyright © 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.''

===Abstract===
Typically, search engine results (SERs) are presented in a ranked list of decreasing estimated relevance to user queries. While familiar to users, ranked lists do not show inherent connections between SERs, e.g. whether SERs are hyperlinked or authored by the same source. Such potentially useful connections between SERs can be displayed as graphs. We present a preliminary comparative study of ranked list vs. graph visualisations of SERs. Experiments with TREC web search data and a small user study of 10 participants show that ranked lists result in more precise and also faster search sessions than graph visualisations.

===Categories and Subject Descriptors===
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

===Keywords===
Search Engine Result Visualization, Ranked List, Graph

===1. Introduction===
Typically, search engine results (SERs) are presented in a ranked list of decreasing estimated relevance to user queries. Drawbacks of ranked lists include showing only a limited view of the information space, and not showing how similar the retrieved documents are and/or how they relate to each other [4, 6]. Such potentially useful information could be displayed to users in the form of SER graphs; these could present at a glance an overview of clusters or isolated documents among the SERs, features not typically integrated into ranked lists. For instance, directed/undirected and weighted/unweighted graphs could be used to display the direction, causality and strength of various relations among SERs. Various graph properties (see [7]), such as the average path length, clustering coefficient or degree, could also be displayed, reflecting potentially useful or interesting features about how the retrieved data is connected.

We present a user study comparing ranked list vs. graph-based SER visualisation interfaces. We use a web crawl of ca. 50 million documents in English with associated hyperlink information, and 10 participants. We find that ranked lists result in overall more accurate and faster searches than graph displays, but that the latter result in slightly higher recall. We also find overall higher inter-rater agreement about SER relevance when using ranked lists instead of graphs.

===2. Motivation===
While traditional IR systems successfully support known-item search [5], what should users do if they want to locate something from a domain in which they have a general interest but no specific knowledge [8]? Such exploratory searching comprises a mixture of serendipity, learning and investigation, and is not supported by contemporary IR systems [5], prompting users to “develop coping strategies which involve [...] the submission of multiple queries and the interactive exploration of the retrieved document space, selectively following links and passively obtaining cues about where their next steps lie” [9]. A step towards exploratory search, which motivates this work, is to make explicit the hyperlinked structure of the ordered list used by e.g. Google and Yahoo. To our knowledge, no investigation of such a representation exists, but it is comparable to Google's Knowledge Graph, whose aim is to guide users to other relevant information from an initial selection.

===3. Previous Work===
Earlier work on graph-based SER displays includes Beale et al.'s (1997) visualisation of sequences of queries and their respective SERs, as well as the work of Shneiderman & Aris (2006) on modelling semantic search aspects as networks (both overviewed in [10]). Treharne et al. (2009) present a critique of ranked list displays alongside a range of other types of visualisation, including not only graphs but also Cartesian, categorical, spring and set-based displays [6]. This comparison is analytical rather than empirical. Closest to ours is the work of Donaldson et al. (2008), who experimentally compare ranked lists to graph-based displays [2]. In their work, graphs model social web information, such as user tags and ratings, in order to facilitate contextualising social media for exploratory web search. They find that users seem to prefer a hybrid interface that combines ranked lists with graph displays. Finally, the hyperlinked graph representation discussed in this paper allows users to investigate the result space, thereby discovering related and potentially relevant information that might otherwise be bypassed. To our knowledge, such a representation and its comparison to a traditional ranked list do not exist, but the idea underpinning the graph representation is comparable with Google's Knowledge Graph, as the aim is to guide users to other relevant information from an initial selection.
===4. Interface Design===
This section presents the two different SER visualisations used in our study. Our goal is to study the effect of displaying exactly the same information to the user in two different ways, using ranked list and graph visualisations, respectively.

''Figure 1: Ranked list (A) and graph (B) representation of the top-k documents from a query. The list (A) shows ranks 1–6 holding docids 4, 3, 1, 2, 6, 5.''

====4.1 Ranked List (RL) Display====
We use a standard ranked list SER display, where documents are presented in decreasing order of their estimated relevance to the user query. The list initially displays only the top-k retrieved document ids (docids) with their associated rank (see Figure 1 (A)). When clicked upon, each document expands to two mini windows, overlaid to the left and right of the list:
* The left window shows a document snippet containing the query terms. The snippet provides a brief summary of the document contents that relate to the query, in order to help the user assess document relevance prior to viewing the whole document [4]. We describe exactly what the snippet shows and how it is extracted in Section 4.3.
* The right window shows a graph of the top-k ranked SERs (see Section 4.2). The position of the clicked document in the graph is clearly indicated, so users can quickly overview its connections, if any, to the other top-k retrieved documents.

Previously visited documents in the list are colour-marked.

====4.2 Graph (GR) Display====
We display a SER graph G = (V, E) as a directed graph whose vertices v ∈ V correspond to the top-k retrieved documents, and whose edges e ∈ E correspond to links (hyperlinks, in our case of web documents) between two vertices. Each vertex is shown as a shaded circle that displays the rank of its associated document in the middle (see Figure 1 (B)). The size of each vertex is scaled according to its out-degree, so that a larger vertex indicates more outlinks to the other top-k documents. Edge direction points towards the outlinked document. Previously visited documents are colour-marked.

When clicked upon, each vertex expands to two mini windows, overlaid to the left and right of the graph:
* The left window shows the same document snippet as in the RL display.
* The right window shows the ranked list of the top-k SERs. The position of the clicked document in the list is clearly marked.

We display the SER graph in a standard force-directed layout [1]. Our graph layout does not allow for any interaction other than clicking on the graph. We reason that, for the simple web search tasks we consider, layouts allowing further interaction may be confusing or time-consuming, and may be better suited to other search tasks involving, for instance, decision making, navigation and exploration of large information spaces.
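At its core, the GR display is a directed graph over the top-k docids, with hyperlink edges and out-degree-scaled vertices. The paper publishes no code, so the following is only a minimal sketch of that underlying data model, with illustrative names of our own; the hyperlinks in `links` are hypothetical.

```python
# Sketch of the SER graph of Section 4.2: vertices are the top-k
# retrieved docids, directed edges are hyperlinks between them, and
# vertex size grows with out-degree. All names are illustrative.

def build_ser_graph(ranked_docids, hyperlinks, base_size=10, scale=5):
    """ranked_docids: top-k docids in rank order (rank 1 first).
    hyperlinks: iterable of (source_docid, target_docid) pairs.
    Returns {docid: {"rank": ..., "out": ..., "size": ...}}."""
    top_k = set(ranked_docids)
    graph = {d: {"rank": r, "out": set()}
             for r, d in enumerate(ranked_docids, start=1)}
    for src, dst in hyperlinks:
        # Keep only links between two distinct top-k documents;
        # edge direction points towards the outlinked document.
        if src in top_k and dst in top_k and src != dst:
            graph[src]["out"].add(dst)
    for node in graph.values():
        node["size"] = base_size + scale * len(node["out"])  # size grows with out-degree
    return graph

# Rank order loosely follows Figure 1 (A): ranks 1..6 hold docids 4,3,1,2,6,5.
docs = ["d4", "d3", "d1", "d2", "d6", "d5"]
links = [("d4", "d3"), ("d4", "d1"), ("d3", "d1"), ("d1", "d99")]
g = build_ser_graph(docs, links)  # d99 is outside the top-k, so that edge is dropped
```

A force-directed placement [1] (e.g. `networkx.spring_layout` in a real implementation) would then position the vertices; rendering is omitted here.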
====4.3 Document Snippets====
Both the RL and GR interfaces include short query-based summaries of the top-k SERs (snippets). We construct them as follows. We extract from each document a window of ±25 terms surrounding the query terms on either side. Let a query consist of 3 terms q1, q2, q3. We extract snippets for all ordered, but not necessarily contiguous, sequences of query terms: (q1, q2, q3), (q1, q2), (q1, q3), (q2, q3), (q1), (q2), (q3). This way, we match all snippets containing query terms in the order in which they appear in the query (not as a bag of words), while also allowing other terms, for instance common modifiers, to occur in between query terms.

Several snippets can be extracted per document, but only the snippet with the highest TF-IDF score is displayed to the user. The TF-IDF of each window is calculated as a normalised sum of the TF-IDF weights of its terms:

<math>S_s(D) = \frac{1}{|w|} \sum_{t \in w} \mathrm{tf}(t, D) \times \log \frac{|C|}{|\{D \in C : t \in D\}|}</math>

where |w| is the number of terms in the extracted window w, t ∈ w is a term in the window, tf is the term frequency of t in the document D from which the snippet is extracted, C is the collection of documents, and S_s(D) is the snippet score for document D. Finally, as research has shown that query term highlighting can be a useful feature for search interfaces [4], we highlight all occurrences of query terms in the snippet.

===5. Evaluation===
We recruited 2 participants for a pilot study to calibrate the user interfaces; the results of the pilot study were not subsequently used. For the main study, we recruited 10 new participants (9 male, 1 female; average age 33.05; all with a background in Computer Science) using convenience sampling. Each participant was introduced to the two interfaces. Their task was to find and mark as many relevant documents as possible per query, using either interface. For each new query, the SERs could be shown in either interface. Each experiment lasted 30 minutes.
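The snippet scoring of Section 4.3 follows directly from its formula. A minimal sketch, with a helper name and toy collection of our own (not the authors' code):

```python
import math

# Sketch of the snippet scoring in Section 4.3: a candidate window w
# is scored by the normalised sum of the TF-IDF weights of its terms.

def snippet_score(window_terms, doc_terms, collection):
    """window_terms: terms of one candidate snippet window w.
    doc_terms: terms of the document D the window came from.
    collection: the document collection C (a list of term lists)."""
    score = 0.0
    for t in window_terms:
        tf = doc_terms.count(t)                    # tf(t, D)
        df = sum(1 for d in collection if t in d)  # |{D in C : t in D}|
        if df:
            score += tf * math.log(len(collection) / df)
    return score / len(window_terms)               # normalise by |w|

# Toy collection; per document, the best-scoring window is displayed.
collection = [
    ["graph", "based", "search", "result", "display"],
    ["ranked", "list", "of", "search", "results"],
    ["social", "media", "news"],
]
doc = collection[0]
windows = [["graph", "based", "search"], ["search", "result"]]
best = max(windows, key=lambda w: snippet_score(w, doc, collection))
# best == ["graph", "based", "search"]: rarer terms outweigh "search"
```

In the paper, only the highest-scoring window per document is shown, with query terms highlighted.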
Participants did not submit their own queries. The queries were taken from the TREC Web tracks of 2009–2012 (200 queries in total). This choice allowed us to provide very fast response times to participants (under two seconds, depending on disk speed), because search results and their associated graphs were pre-computed and cached. Running new queries and plotting their SER graphs on the fly would instead have resulted in notably slower response times, risking participant dissatisfaction. A drawback of using TREC queries, however, is that participants did not necessarily have enough context to fully understand the underlying information needs and correctly assess document relevance. To counter this, we allowed participants to skip queries they were not comfortable with. To avoid bias, skipping a query was allowed after the query terms were displayed, but before the SERs were displayed.

We retrieved documents from the TREC ClueWeb09 cat. B dataset (ca. 50 million documents crawled from the web in 2009) using Indri, version 5.2.

{| class="wikitable"
! !! Ranked List !! Graph
|-
| MAP@20 || 0.4195 || 0.3211
|-
| MRR || 0.4698 || 0.3948
|-
| RECALL@20 || 0.0067 || 0.0069
|}
''Table 1: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Recall of the top 20 results.''
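The measures reported in Table 1 can be computed from per-query binary relevance judgements of the top-k results in rank order. A minimal sketch, with helper names and toy data of our own (not the authors' code):

```python
# Sketch of the effectiveness measures in Table 1, computed from 0/1
# relevance judgements of the top-k results in rank order.

def average_precision(rels, total_relevant):
    """rels: 0/1 relevance of the results, in rank order."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / rank     # precision at each relevant rank
    return ap / total_relevant if total_relevant else 0.0

def reciprocal_rank(rels):
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank     # 1 / rank of the first relevant result
    return 0.0

def recall_at_k(rels, total_relevant):
    return sum(rels) / total_relevant if total_relevant else 0.0

# Toy query: relevant results at ranks 2 and 4; 50 relevant docs exist.
rels = [0, 1, 0, 1, 0]
mrr = reciprocal_rank(rels)       # 0.5: first relevant result at rank 2
rec = recall_at_k(rels, 50)       # 2/50
```

Read in reverse, an MRR of 0.4698 (Table 1) corresponds to the first relevant result occurring around rank 1/0.4698 ≈ 2.13, which is how the MRR scores are interpreted in Section 5.1.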
The experiments were carried out on a 14-inch monitor with a resolution of 1400 × 1050 pixels. We logged which SERs participants marked as relevant, as well as the participants' click order and the time spent per SER.

====5.1 Findings====
In total, the 10 participants processed 162 queries (89 with the RL interface and 73 with the GR interface), with mean µ = 16.2 queries per participant and standard deviation σ = 7.8. Four queries (two from each interface) were bypassed (2.5% of all processed queries).

''Figure 2: Click-order and participant relevance assessments for the (a) ranked list interface and (b) graph interface. Each panel plots click order (x-axis) against click frequency (y-axis), split into relevant and non-relevant clicks.''

{| class="wikitable"
! Interface !! Min !! Max !! µ !! σ
|-
| Ranked List || 1.391 || 25.476 || 8.228 || 4.371
|-
| Graph || 3.322 || 20.963 || 9.705 || 3.699
|}
''Table 2: Time (seconds) spent on each interface.''

Table 1 shows retrieval effectiveness per interface, aggregated over all queries for the top k = 20 SERs. The ranked list is associated with higher, hence better, scores than the graph display for MAP and MRR. MAP is +30.6% better with ranked lists than with graph displays, meaning that overall more relevant SERs are found by participants at higher ranks in the ranked list than in the graph display. This finding is in agreement with the MRR scores, which indicate that the first SER assessed as relevant is likely to occur around rank position 2.13 (1/2.13 = 0.469 ≈ 0.4698) with ranked lists, but around rank position 2.55 (1/2.55 = 0.392 ≈ 0.3948) with graph displays.
Conversely, recall is slightly higher with graph displays. Higher recall here would indicate that participants are more likely to find a slightly larger number of relevant documents when seeing them as a graph of their hyperlinks. However, the difference in recall between ranked lists and graphs is very small and can hardly be seen as a reliable indication.

====5.1.1 Click-order====
On average, participants clicked on 9.46 entries per query in the ranked list (842 clicks for 89 queries) but only on 6.7 entries per query in the graph display (490 clicks for 73 queries). The lower number of clicks in the latter case could be due to the extra time it might have taken participants to understand or navigate the graph. This lower number of clicks also agrees with the lower MAP scores presented above (if fewer entries were clicked, fewer SERs were assessed, hence fewer relevant documents were found in the top ranks).

Figures 2a and 2b plot the order of clicks for the ranked list and graph interfaces, respectively, on the x-axis against the frequency of clicks on the y-axis. We see that in the ranked list the first click of a participant is more often on a relevant document, but that in the graph display the first click is more often on a non-relevant document (as already indicated by the MRR scores shown above). We also see that, for the graph display, the majority of participant clicks before the 5th click correspond to non-relevant documents. Even though the MRR scores of the graph display indicate that the first relevant document occurs around rank position 2.5, participants on average click four other documents before clicking the relevant document at that rank. This indicates that in the graph display participants click documents not necessarily according to their rank position (indicated in the centre of each vertex), but rather according to their graph layout or connectivity.

====5.1.2 Time spent====
Table 2 shows statistics about the time participants spent on each interface. Overall, participants spent less time on the ranked list than on the graph display. This observation, combined with the retrieval effectiveness measures shown in Table 1, indicates that participants conducted overall slightly more precise and faster searches using ranked lists than using graph displays. The time use also suggests that participants are used to standard ranked list interfaces, a type of conditioning that is not easy to control experimentally.

====5.1.3 Inter-participant agreement====
To investigate how consistent participants were in their assessments, we report inter-rater agreement using Krippendorff's α [3]. Table 3 reports the agreement between participants, and Table 4 reports the agreement between participants and the TREC pre-annotated relevance assessments, per interface. In both cases, only queries annotated more than once by different participants are included (19 queries for the ranked list and 11 for the graph SER).

The average inter-rater agreement between participants varies considerably. For the graph interface, α = 0.04471, which suggests a lack of agreement between raters. On a per-query basis, some queries (169 and 44) show comparatively much higher agreement, whereas others (e.g. 104 and 184) show comparatively higher disagreement. For the ranked list, inter-rater agreement is higher (α = 0.19813). On a per-query basis, quite remarkably, query 92 had perfect agreement between raters, while queries 175 and 129 also exhibited a moderate to high level of agreement. However, most queries show only a low to moderate level of agreement or disagreement.

{| class="wikitable"
! colspan="3" | Graph !! colspan="3" | Ranked List
|-
! Query !! Raters !! α !! Query !! Raters !! α
|-
| 101 || 4 || 0.28696 || 110 || 3 || 0.41000
|-
| 104 || 2 || -0.21875 || 119 || 2 || 0.00000
|-
| 132 || 2 || -0.16071 || 120 || 2 || 0.49351
|-
| 169 || 2 || 0.48000 || 129 || 2 || 0.86022
|-
| 180 || 2 || -0.10031 || 132 || 3 || -0.08949
|-
| 184 || 2 || -0.25806 || 133 || 2 || 0.30108
|-
| 3 || 2 || 0.00000 || 155 || 2 || -0.02632
|-
| 38 || 2 || -0.07519 || 175 || 2 || 0.49351
|-
| 44 || 2 || 0.49351 || 180 || 2 || -0.37879
|-
| 58 || 2 || 0.00000 || 51 || 2 || 0.00000
|-
| – || – || – || 53 || 2 || 0.02151
|-
| – || – || – || 74 || 2 || -0.14706
|-
| – || – || – || 80 || 2 || 0.14420
|-
| – || – || – || 81 || 3 || -0.12919
|-
| – || – || – || 92 || 2 || 1.00000
|-
| – || – || – || 95 || 2 || 0.15584
|-
| – || – || – || 96 || 2 || 0.15584
|-
| – || – || – || 97 || 2 || 0.30179
|-
| colspan="3" | Average α: 0.04471 || colspan="3" | Average α: 0.19813
|}
''Table 3: Inter-rater agreement (α) for queries assessed by more than one participant. Query is the TREC id of each query.''

Overall, the lack of agreement may indicate the participants' confusion in assessing the relevance of SERs to pre-typed queries. This may have been aggravated by problems in rendering the HTML snippets into text: some HTML documents were ill-formed, hence their snippets sometimes included HTML tags or other not always coherent text.

Inter-rater agreement between our participants and the TREC pre-annotated relevance assessments shows an almost complete lack of agreement. For both interfaces there is a weak level of disagreement on average (α = -0.0750 and α = -0.0721 for the graph and ranked list, respectively). On a per-query basis, only two queries (169 and 110) exhibit a moderate level of agreement. For most remaining queries, our participants' assessments disagree with the TREC assessments.

{| class="wikitable"
! colspan="3" | Graph !! colspan="3" | Ranked List
|-
! Query !! Raters !! α !! Query !! Raters !! α
|-
| 101 || 4 || 0.09559 || 110 || 3 || 0.38654
|-
| 104 || 2 || -0.17861 || 119 || 2 || -0.22370
|-
| 132 || 2 || 0.06561 || 120 || 2 || 0.03146
|-
| 169 || 2 || 0.33625 || 129 || 2 || 0.05600
|-
| 180 || 2 || -0.08949 || 132 || 3 || 0.01689
|-
| 184 || 2 || -0.08949 || 133 || 2 || 0.04398
|-
| 3 || 2 || -0.37209 || 155 || 2 || -0.21067
|-
| 38 || 2 || -0.05006 || 165 || 2 || -0.25532
|-
| 44 || 2 || -0.05861 || 175 || 2 || -0.07886
|-
| 54 || 2 || -0.25532 || 180 || 2 || -0.17861
|-
| 58 || 2 || -0.22917 || 51 || 2 || -0.05006
|-
| – || – || – || 53 || 2 || -0.24694
|-
| – || – || – || 74 || 2 || -0.06033
|-
| – || – || – || 80 || 2 || -0.24694
|-
| – || – || – || 81 || 3 || -0.13634
|-
| – || – || – || 92 || 2 || -0.21181
|-
| – || – || – || 95 || 2 || 0.04582
|-
| – || – || – || 96 || 2 || -0.12919
|-
| – || – || – || 97 || 2 || 0.07813
|-
| colspan="3" | Average α: -0.0750 || colspan="3" | Average α: -0.0721
|}
''Table 4: Inter-rater agreement (α) between participants and TREC assessments for queries assessed by more than one participant.''

===6. Conclusions===
In a small user study, we compared ranked list versus graph-based search engine result (SER) visualisation. Our motivation was to conduct a preliminary experimental comparison of the two for the domain of web search, where document hyperlinks were used to display the results as graphs. We found that overall more accurate and faster searches were done using ranked lists, and that inter-user agreement was overall higher with ranked lists than with graph displays.

Limitations of this study include: (1) using fixed TREC queries instead of allowing users to submit their own queries on the fly; (2) technical HTML-to-text rendering problems, resulting in sometimes incoherent document snippets; (3) using only 10 users, exclusively from Computer Science, which makes for an overall small and rather biased user sample; and (4) not using the wider context of the search session in the analysis (e.g. user task, behaviour, satisfaction). Future work includes addressing the above limitations and also testing whether, and to what extent, these results apply when scaling up to wall-sized displays with significantly larger screen real estate.

===7. References===
* [1] G. D. Battista, P. Eades, R. Tamassia, and I. G. Tollis. ''Graph Drawing: Algorithms for the Visualization of Graphs''. Prentice Hall PTR, 1998.
* [2] J. J. Donaldson, M. Conover, B. Markines, H. Roinestad, and F. Menczer. Visualizing social links in exploratory search. In ''HT '08'', pages 213–218, New York, NY, USA, 2008. ACM.
* [3] A. F. Hayes and K. Krippendorff. Answering the call for a standard reliability measure for coding data. ''Communication Methods and Measures'', 1(1):77–89, 2007.
* [4] M. Hearst. ''Search User Interfaces''. Cambridge University Press, 2009.
* [5] G. Marchionini. Exploratory search: from finding to understanding. ''Communications of the ACM'', 49(4):41–46, 2006.
* [6] K. Treharne and D. M. W. Powers. Search engine result visualisation: Challenges and opportunities. In ''Information Visualisation'', pages 633–638, 2009.
* [7] S. Wasserman and K. Faust. ''Social Network Analysis: Methods and Applications''. Structural Analysis in the Social Sciences. Cambridge University Press, 1994.
* [8] R. W. White, B. Kules, S. M. Drucker, and M. Schraefel. Supporting exploratory search. ''Communications of the ACM'', 49(4):36–39, 2006.
* [9] R. W. White, G. Muresan, and G. Marchionini. Workshop on evaluating exploratory search systems. ''SIGIR Forum'', 40(2):52–60, 2006.
* [10] M. L. Wilson, B. Kules, B. Shneiderman, et al. From keyword search to exploration: Designing future search interfaces for the web. ''Foundations and Trends in Web Science'', 2(1):1–97, 2010.