=Paper=
{{Paper
|id=None
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-1033/EuroHCIR2013-Proceedings.pdf
|volume=Vol-1033
}}
==Proceedings of the 3rd European Workshop on Human-Computer Interaction and Information Retrieval==
EuroHCIR 2013, 1st August 2013, Dublin, Ireland

Proceedings of the 3rd European Workshop on Human-Computer Interaction and Information Retrieval

A workshop at ACM SIGIR 2013

Preface

EuroHCIR 2013 was organised with the specific goal of better engaging the IR community, who have been underrepresented at previous EuroHCIR conferences. Thus we proposed to have the workshop at the ACM SIGIR conference in Dublin. Research, industry, and position papers were invited, and although very few industry submissions were received, we received a number of research and position papers focusing on the intersection of IR and HCI evaluations, several focusing on adapting the TREC paradigm. Many interesting system and demonstrator papers were also accepted.

Organised by

Max L. Wilson, Mixed Reality Lab, University of Nottingham, UK (max.wilson@nottingham.ac.uk)
Birger Larsen, The Royal School of Library and Information Science, Denmark (blar@iva.dk)
Preben Hansen, Dept. of Computer & Systems Sciences, Stockholm University, Sweden (preben@dsv.su.se)
Tony Russell-Rose, UXLabs, UK (tgr@uxlabs.co.uk)
Kristian Norling, Norling & Co, Sweden (kristian.norling@gmail.com)

Research Papers
Page 3 - Fading Away: Dilution and User Behaviour (Orally Presented)
  Paul Thomas, Falk Scholer, Alistair Moffat
Page 7 - Exploratory Search Missions for TREC Topics (Orally Presented)
  Martin Potthast, Matthias Hagen, Michael Völske, Benno Stein
Page 11 - Interactive Exploration of Geographic Regions with Web-based Keyword Distributions
  Chandan Kumar, Dirk Ahlers, Wilko Heuten, Susanne Boll
Page 15 - Inferring Music Selections for Casual Music Interaction (Orally Presented)
  Daniel Boland, Ross McLachlan, Roderick Murray-Smith
Page 19 - Search or browse? Casual information access to a cultural heritage collection
  Robert Villa, Paul Clough, Mark Hall, Sophie Rutter
Page 23 - Studying Extended Session Histories
  Chaoyu Ye, Martin Porcheron, Max L. Wilson
Page 27 - Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs
  Casper Petersen, Christina Lioma, Jakob Grue Simonsen

Position Papers

Page 31 - Evolving Search User Interfaces (Orally Presented)
  Tatiana Gossen, Marcus Nitsche, Andreas Nürnberger
Page 35 - A Pluggable Work-bench for Creating Interactive IR Interfaces (Orally Presented)
  Mark M. Hall, Spyros Katsaris, Elaine Toms
Page 39 - A Proposal for User-Focused Evaluation and Prediction of Information Seeking Process (Orally Presented)
  Chirag Shah
Page 43 - Directly Evaluating the Cognitive Impact of Search User Interfaces: a Two-Pronged Approach with fNIRS
  Horia A. Maior, Matthew Pike, Max L. Wilson, Sarah Sharples
Page 47 - Dynamics in Search User Interfaces
  Marcus Nitsche, Florian Uhde, Stefan Haun and Andreas Nürnberger

Demo Descriptions
Page 51 - SearchPanel: A browser extension for managing search activity
  Simon Tretter, Gene Golovchinsky, Pernilla Qvarfordt
Page 55 - A System for Perspective-Aware Search
  M. Atif Qureshi, Arjumand Younus, Colm O’Riordan, Gabriella Pasi, Nasir Touheed

Fading Away: Dilution and User Behaviour

Paul Thomas, CSIRO ICT Centre, paul.thomas@csiro.au
Falk Scholer, School of Computer Science and Information Technology, RMIT University, falk.scholer@rmit.edu.au
Alistair Moffat, Department of Computing and Information Systems, The University of Melbourne, ammoffat@unimelb.edu.au

Presented at EuroHCIR2013. Copyright © 2013 for the individual papers by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

[Figure 1: Normalised total click positions across participants and tasks, for full queries (left) and diluted queries (right).]

ABSTRACT
When faced with a poor set of document summaries on the first page of returned search results, a user may respond in various ways: by proceeding on to the next page of results; by entering another query; by switching to another service; or by abandoning their search. We analyse this aspect of searcher behaviour using a commercial search system, comparing a deliberately degraded system to the original one. Our results demonstrate that searchers naturally avoid selecting poor results as answers given the degraded system; however, the depth of the ranking that they view, their query reformulation rate, and the amount of time required to complete search tasks, are all remarkably unchanged.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and software—performance evaluation.

General Terms
Experimentation, measurement.

Keywords
Retrieval experiment, evaluation, system measurement.

1. INTRODUCTION
While carrying out a search, users have a number of tactics available to them. Intuitively, it seems likely that these tactics or behaviours will vary based on the quality of the results that are returned by the retrieval system. For example, other things being equal, a user who cannot find any relevant items on the first page of search results might be more inclined to reformulate their query (by entering another query into the search interface) than a user who has found a large number of relevant items. Possible tactics when using an apparently ineffective system include:

1. Looking further in the results list, visiting pages beyond the first, hoping that the results improve;
2. Submitting another query, hoping for better results;
3. Switching to a different search engine and entering the same query, hoping that it provides better results;
4. Trying to find the information through other techniques, for example by browsing.

We investigate the first two possibilities, reporting on differences in user behaviour when a standard retrieval system is compared to an adjusted system in which results are diluted by inserting non-relevant answers. Our results indicate that searchers remained attentive to the task in the degraded system, and adapted their behaviour to avoid clicking on non-relevant snippets. However, all other aspects of their behaviour were remarkably consistent, including the amount of time spent on tasks; the number of query reformulations undertaken; and their perceptions of search difficulty.

2. METHODS
We designed a user experiment to explore ways in which behaviour changes with retrieval quality. A total of n = 34 participants, comprising staff and students from the Australian National University, carried out six search tasks of differing complexity, covering the remember, analyse and understand tasks of Wu et al. [7] but modified for our context. On commencing a task, users were shown a result page for an initial “starter” query that was constant across users. They were then free to explore the results list, including being able to open documents, to view further results pages, and to enter follow-up queries. Once any document was opened for viewing, participants were asked to indicate whether or not it was relevant to their search task, before returning to the search results listing. The search interface prevented tabbed browsing, and while a document was being viewed it replaced the results page. Participants were not given an explicit time limit for any task, but were told they could move on when they felt ready.

The search results displayed to participants were sourced from the Yahoo! API, and presented in the usual way as an ordered list consisting of query-biased summaries, with ten results per page. No branding from the underlying search service was shown. Without telling our participants, we simulated search systems of two different effectiveness levels by showing results in one of two modes: full, where the ranking obtained from the search service was displayed in its original form; and diluted, where the original results were interleaved with answers from a related but incorrect query [5]. Dilution was operationalised by leveraging the capacity-enhancing (and obfuscatory) power of “management-speak”: the original stakeholder information need was actioned going forward by enhancing it through the win-win inclusion of a jargon competency chosen randomly from a list of outside-the-box strategies, thereby disempowering the results paradigm.
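Stripped of the management-speak, the dilution mode amounts to interleaving decoy results (drawn from a related but incorrect query) into fixed positions of each result page. The following is a minimal sketch of our own, with function and variable names that are not from the paper; the results section reports that the inserted documents occupied the odd ranks 1, 3, 5, 7 and 9:

```python
def dilute(original, decoys, page_size=10):
    """Build one diluted result page: odd (1-based) ranks are filled from
    the related-but-incorrect decoy query, even ranks from the original
    ranking, preserving each list's internal order."""
    o, d = iter(original), iter(decoys)
    return [next(d) if rank % 2 == 1 else next(o)
            for rank in range(1, page_size + 1)]
```

For a ten-result page this consumes the top five results of each ranking, e.g. ranks 1, 3, 5, 7 and 9 from the query “eurovision best practice” and ranks 2, 4, 6, 8 and 10 from “eurovision”.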
For example, if the task was to “find the Eurovision Song Contest home page”, a user’s initial full query might be “eurovision”; whereas in the diluted system half of the results displayed might instead be derived from the query “eurovision best practice”. There were a small number of queries issued for which it was not possible to generate five such results; these 22 out of 5930 page interactions are excluded from the analysis below.

Most interactions with the search system were logged while participants carried out the six search tasks, including: submitted search queries; clicks on snippets in order to open documents for viewing; assessments of document usefulness; and the point of gaze on the screen, captured using an eye tracker. Task order was balanced across the participants and topics so as to minimise the risk of bias; similarly, whether the full or diluted approach was applied for each participant-task combination was pre-determined as part of the experimental design.

3. RESULTS
User behaviour, and the differences caused by the full and diluted query treatments, can be measured in a range of ways.

User click behaviour: The normalised click frequency at each rank position in the answer pages is shown in Figure 1. In the diluted retrieval system the “incorrect but plausible” documents were inserted in positions 1, 3, 5, 7 and 9. The pattern of click behaviour demonstrates that our experimental manipulation was successful: for the full search results, the click distribution follows the expected pattern of users clicking more frequently on items that are higher in the ranked list [1], whereas users of the diluted system were less likely to click answer items in the odd positions. Note that position bias – the propensity for searchers to select items that occur higher in a ranking, possibly because they “trust” the underlying search system [3] – exists in both systems. In particular, all of the odd-numbered rank positions in the diluted system are equally “bad”, but participants still favoured items higher in the ranking.

A second check to confirm that our system dilution had an impact on search effectiveness is to consider the rates at which users saved documents that they viewed (that is, the likelihood that a document was found to be relevant after it was clicked). The mean rate is 0.733 for the full system, compared to 0.597 for the diluted system, a statistically significant difference (t-test, p < 0.05).

While Figure 1 establishes that our user study participants responded differently in terms of rank-specific click behaviour, the high-level aggregated click behaviour across all participants and search tasks was not distinctive: in total (all tasks, and all users) there were 323 clicks for the full system, and 322 for the diluted system. Unsurprisingly this difference is not statistically significant (χ² test, p = 0.97). The number of items that were determined as being useful was also similar in the two conditions: 201 for full, and 214 for diluted (χ² test, p = 0.52). Our participants needed to read a remarkably similar number of documents, and a remarkably similar number of useful documents, to satisfy the (assigned) needs regardless of the search system. Given this difference in click rates, it is reasonable to expect other changes in behaviour, and we consider this below.

Depth of result page viewing: When presented with a search results page, the user chooses which snippets require further evaluation. In line with commercial search engines, our experimental participants were presented with ten answers per page, with the option of accessing subsequent results pages. Faced with a relatively poor quality results list, a plausible strategy for a user who is looking for an answer document is to look further down the results page. Table 1 shows the frequency with which results pages were viewed (that is, the user visited a results page and looked at one or more items on the screen, as recorded using eye-tracking), summed across users and queries. When using the full system, participants moved on to the second page of results for 15 out of 207 issued queries (with a corresponding mean page depth of 1.07), while in the diluted system the second results page was visited for 22 out of the total of 212 queries that were issued (a mean page depth of 1.10). The difference in depth was not significant (χ² test, p = 0.34). No participants viewed results beyond the second page with either system.

            1st results page    2nd results page
  full            207                  15
  diluted         212                  22

Table 1: Total page views, summed across users and topics, for the full and diluted retrieval systems.

[Figure 2: Deepest rank position viewed, averaged across topics and participants, for full queries (left) and diluted queries (right).]

Figures 2 and 3 provide a more detailed view of gaze behaviour, showing the deepest rank position that searchers examined while carrying out a query, and the last rank position that was viewed before finishing the query. The distributions of the lowest rank positions viewed are similar between the full and diluted systems: both show peaks at rank positions 7 (the last item above the fold) and 10 (the last item in each page of search results). The distributions of the last position viewed before finishing a query (which arises when either enough relevant items have been found, or the user types a fresh query) are also broadly similar. However, for the diluted system, rank position 1 has a larger proportion of the probability mass.
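The 2×2 χ² comparisons used in this section can be reproduced from the reported counts alone. The sketch below is our own (the paper does not describe its test implementation) and assumes Yates’ continuity correction, the usual convention for 2×2 tables; applied to the second-page visit counts it gives a p-value close to the reported 0.34:

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-squared test of independence for the 2x2 table [[a, b], [c, d]],
    with Yates' continuity correction (1 degree of freedom)."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for obs, r, k in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        expected = rows[r] * cols[k] / n
        diff = max(abs(obs - expected) - 0.5, 0.0)  # Yates' correction
        chi2 += diff * diff / expected
    p = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi2, 1 d.f.
    return chi2, p

# Second-page visits (Table 1): full 15 of 207 queries, diluted 22 of 212
chi2, p = chi2_2x2(207 - 15, 15, 212 - 22, 22)
```

The same helper applies to the click-count and useful-document comparisons above, using the corresponding 2×2 counts.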
0.30 0.30 0.25 0.25 0.20 0.20 Proportion Proportion 0.15 0.15 0.10 0.10 0.05 0.05 0.00 0.00 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 Rank Rank Figure 3: Final rank position viewed, averaged across topics and participants, for full queries (left) and diluted queries (right). mass. A possible reason is that searchers mentally compare answers ● as they view items in the results list, and most users scan at least the 14 top few items. The diluted system is likely to have a non-relevant 12 ● document in position one, and so reviewing that snippet may serve as a final confirmation, before the user commits to a click on a 10 ● Number of queries deeper-ranked snippet from the underlying full results. ● ● 8 ● ● Query reformulation: A second way in which a user might respond ● 6 to search systems of differing quality is to change the rate at which ● they stop looking through the current set of search results, and 4 instead enter a new query. ● ● 2 The number of queries used by participants when carrying out their search tasks is shown in Figure 4. Overall the number was Full Diluted low for both systems, with a median of 1 and 2 queries (0 and 1 reformulations) for the full and diluted results, respectively. This difference was not statistically significant (Wilcoxon signed-rank Figure 4: Number of queries per task, for full and diluted queries. test, p = 0.46). Ability to identify relevant answers: When a retrieval system serves unhelpful answers, it might be that the ability of the searcher to Time spent on tasks: While depth of viewing and query re-form- identify useful answers is similarly affected. However, based on our ulation do not show significant differences in searcher behaviour, it experiments, the mean rate at which clicked items were saved as could still be the case that using an inferior system makes querying being relevant was 0.787 for the full system and 0.747 for the diluted slower. 
Differences in system quality might alter the time spent system, showing no significant difference (t-test, p = 0.25). Thus by users when viewing and processing result pages. However, the the ability of users to identify relevant answers, once documents average gaze duration when viewing snippets, measured as the sum have been selected for viewing via their snippets, did not differ of fixation durations that occurred in the screen area defined by each between the experimental treatments. search result summary, was 0.586 second for full queries and 0.589 seconds for diluted queries. This difference was not statistically significant (t-test, p = 0.89). Differences could also occur at a higher level of system interac- tion. The mean time that participants spent working on each search unspecified “boss”, with an incentive to find the most good and task, including viewing search result pages, viewing selected doc- fewest bad sources possible [4]; participants were not constrained uments, and making relevance decisions, was 2.70 minutes for the in the amount of time that they could spend on a task. In contrast, full treatment, and 2.54 minutes for the diluted one. This difference our subjects were instructed that they would complete a sequence of was not statistically significant (t-test, p = 0.62). . . . web search tasks and were advised to spend what feels to be an Finally, we consider the interaction between time and query re- appropriate amount of time on each task, until you have collected a formulations. When using the full system, participants entered an set of answer pages that in your opinion allow the information need average of 1.50 queries per minute while completing each task. to be appropriately met. The overall expectations were therefore For the diluted system, the rate was 1.52 queries per minute. The different: in the Smith and Kantor study, participants were given the difference was not significant (t-test, p = 0.95). 
goal of maximising relevance by finding as many good answers as Overall, these results indicate that the quality of the search system possible; in our study, participants were “satisficing”, having been did not affect the rate at which participants were able to process requested to decide for themselves when an appropriate number of information on search results pages, or how much time they spent answers had been found. working on tasks before feeling that they had achieved their goals. Alternatively, it may be that our diluted system, while certainly The only significant difference between the two treatments was the poorer in overall quality (in the sense that non-relevant answers were click distribution, and the rate at which clicked documents were introduced into the ranking), was not poor enough to induce different judged to be useful. behaviour. Smith and Kantor used results typically from the 300th position in Google’s results: even today, these are unreliable for Searcher assessment of task difficulty: After carrying out each the simplest of our topics, and in 2008 will almost certainly have search task, experimental participants were asked to answer two produced a poor result set. Importantly, our diluted system always questions: “How difficult was it to find useful information on this included a few high-ranked results. topic?”, and “How satisfied were you with the overall quality of your Either way, our results raise an important question about how the search experience?”. The 5-point response scale for these questions effectiveness of search systems should be analysed. While some was anchored with the labels “Not at all” (assigned a value of 1) and fine-grained aspects of user clicking behaviour differed between “Extremely” (assigned a value of 5). the full and diluted treatments, the majority of behaviours did not. 
Searchers found the tasks relatively easy to complete: the median This outcome is in line with previous results that found little rela- response rate for the search difficulty question was 2 for both the di- tionship between user behaviour and system quality as measured luted and full systems; this difference was not significant (Wilcoxon by common IR evaluation metrics such as MAP [6]. The ques- test, p = 0.73). Satisfaction levels were also highly consistent be- tion then becomes one of whether even a significant improvement tween the two systems, with a median response level of 4 for both in effectiveness, as measured by some metric, actually results in systems (Wilcoxon test, p = 0.91). Overall, there were no system- improved task performance. In future work, we therefore plan to atic differences in participants’ perceptions of search difficulty or systematically investigate different levels of answer-page dilution, the overall experience resulting from the two different treatments. to establish guidelines for the extent of practical differences that need to be present in search systems for measurable disparities in 4. DISCUSSION AND CONCLUSIONS user behaviour to manifest. We also plan to explore the issue of the impact that specific variations in task instructions have on searcher It seems “obvious” that user behaviour will be influenced by the behaviour through a controlled user study in a work task-based quality of results that returned by a search service. Seeing many framework [2]. poor results near the start of an answer list may influence the user’s decision about whether to continue viewing subsequent answer pages, to enter a new query, or to abandon the search altogether. References Previous work has supported this view. For example, in a study of [1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user inter- 36 users completing 12 search tasks with different search systems, action models for predicting web search result preferences. 
In Proc. Smith and Kantor [4] found that users adapted their behaviour: SIGIR, pages 3–10, Seattle, WA, 2006. when given a consistently degraded search system, they entered [2] P. Borlund. Experimental components for the evaluation of interactive more queries per minute than users of a standard system; similarly, information retrieval systems. Journal of Documentation, 56(1):71–90, a higher detection rate (the ability to identify relevant answers) was 2000. observed for users of degraded systems. [3] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately However our study, in which 34 subjects carried out search tasks interpreting clickthrough data as implicit feedback. In Proc. SIGIR, using an evenly balanced combination of full and diluted search pages 154–161, Salvador, Brazil, 2005. systems, contrasts strongly with that intuition and previous findings. Overall, searchers took around the same amount of time to complete [4] C. Smith and P. Kantor. User adaptation: good results from poor systems. their tasks in both experimental treatments; were able to save a In Proc. SIGIR, pages 147–154, Singapore, 2008. similar number of documents as being relevant; exhibited consistent [5] P. Thomas, T. Jones, and D. Hawking. What deliberately degrading viewing behaviour when looking at the search results lists returned search quality tells us about discount functions. In Proc. SIGIR, pages by the treatments; and did not perceive significant differences in the 1107–1108, Beijing, China, 2011. difficulty of carrying out tasks with both systems. The key difference [6] A. Turpin and F. Scholer. User performance versus precision measures in participant behaviour was their click rate at particular ranks: in for simple web search tasks. In Proc. SIGIR, pages 11–18, Seattle, WA, essence, they successfully avoided poor answers, as demonstrated 2006. by the shift in the click probability mass, shown in Figure 1. 
A possible explanation for the divergence in observed user be- [7] W.-C. Wu, D. Kelly, A. Edwards, and J. Arguello. Grannies, tanning beds, tattoos and NASCAR: Evaluation of search tasks with varying haviour between the two studies may be the context in which the levels of cognitive complexity. In Proc. 4th Information Interaction in searches were carried out. Participants in the Smith and Kantor Context Symp., pages 254–257, Nijmegen, The Netherlands, 2012. study were instructed to “find good information sources” for an Exploratory Search Missions for TREC Topics Martin Potthast Matthias Hagen Michael Völske Benno Stein Bauhaus-Universität Weimar 99421 Weimar, Germany. @uni-weimar.de ABSTRACT crowdsourcing by employing writers whose task was to write long We report on the construction of a new query log corpus that consists essays on given TREC topics, using a ClueWeb09 search engine for of 150 exploratory search missions, each of which corresponds to research. Hence, our corpus forms a strong connection to existing one of the topics used at the TREC Web Tracks 2009–2011. In- evaluation resources that are used frequently in information retrieval. volved in the construction was a group of 12 professional writers, Further, it captures the way how average users perform exploratory hired at the crowdsourcing platform oDesk, who were given the task search today, using state-of-the-art search interfaces. The new cor- to write essays of 5000 words length about these topics, thereby pus is intended to serve as a point of reference for modeling users inducing genuine information needs. The writers used a ClueWeb09 and tasks as well as for comparison with new retrieval models and search engine for their research to ensure reproducibility. Thousands interfaces. Key figures of the corpus are shown in Table 2. of queries, clicks, and relevance judgments were recorded. 
This After a brief review of related work, Section 2 details the corpus paper overviews the research that preceded our endeavors, details construction and Section 3 gives first quantitative and qualitative the corpus construction, gives quantitative and qualitative analyses analyses, concluding with insights into writers’ search behavior. of the data obtained, and provides original insights into the query- 1.1 Related Work ing behavior of writers. With our work we contribute a missing To date, the most comprehensive overview of research on ex- building block in a relevant evaluation setting in order to allow for ploratory search systems is that of White and Roth [19]. More better answers to questions such as: “What is the performance of recent contributions not covered in this body of work include the today’s search engines on exploratory search?” and “How can it be approaches proposed by Morris et al. [13], Bozzon et al. [2], Car- improved?” The corpus will be made publicly available. tright et al. [4], and Bron et al. [3]. Exploratory search is studied also Categories and Subject Descriptors: H.3.3 [Information Search within contextual IR and interactive IR, as well as across disciplines, and Retrieval]: Query formulation including human computer interaction, information visualization, Keywords: Query Log, Exploratory Search, Search Missions and knowledge management. Regarding the evaluation of exploratory search systems, White 1. INTRODUCTION and Roth [19] conclude that “traditional measures of IR perfor- mance based on retrieval accuracy may be inappropriate for the Humans frequently conduct task-based information search, i.e., evaluation of these systems” and that “exploratory search evalua- they interact with search appliances in order to conduct the research tion [...] must include a mixture of naturalistic longitudinal studies” deemed necessary to solve knowledge-intensive tasks. Examples while “[...] 
simulations developed based on interaction logs may include long-lasting interactions which may involve many search serve as a compromise between existing IR evaluation paradigms sessions spread out across several days. Modern web search en- and [...] exploratory search evaluation.” The necessity of user stud- gines, however, are optimized for the diametrically opposed task, ies makes evaluations cumbersome and, above all, expensive. By namely to answer short-term, atomic information needs. Never- providing part of the solution (a decent corpus) for free, we want theless, research has picked up this challenge: in recent years, a to overcome the outlined difficulties. Our corpus compiles a solid number of new solutions for exploratory search have been proposed database of exploratory search behavior, which researchers may use and evaluated. However, most of them involve an overhauling of for comparison purposes as well as for bootstrapping simulations. the entire search experience. We argue that exploratory search tasks Regarding standardized resources to evaluate exploratory search, are already being tackled, after all, and that this fact has not been hardly any have been published up to now. White et al. [18] dedi- sufficiently investigated. Reasons for this shortcoming can be found cated a workshop to evaluating exploratory search systems in which in the lack of publicly available data to be studied. Ideally, for any requirements, methodologies, as well as some tools have been pro- given task that fits the aforementioned description, one would have posed. Yet, later on, White and Roth [19] found out that still no a large set of search interaction logs from a diversity of humans “methodological rigor” has been reached—a situation which has not solving it. Obtaining such data, even for a single task, has not been changed much until today. The departure from traditional evalua- done at scale until now. 
Even search companies, which have access tion methodologies (such as the Cranfield paradigm) and resources to substantial amounts of raw query log data, face difficulties in (especially those employed at TREC) has lead researchers to devise discerning individual exploratory tasks from their logs. ad-hoc evaluations which are mostly incomparable across papers In this paper, we contribute by introducing the first large corpus of and which cannot be reproduced easily. long, exploratory search missions. The corpus was constructed via A potential source of data for the purpose of assessing current Presented at EuroHCIR2013. Copyright c 2013 for the individual papers exploratory search behavior is to detect exploratory search tasks by the papers’ authors. Copying permitted only for private and academic within raw search engine logs, such as the 2006 AOL query log [14]. purposes. This volume is published and copyrighted by its editors.. However, most session detection algorithms deal with short term Used TREC Topics. tasks only and the few algorithms that aim to detect longer search Since the topics from the TREC Web Tracks 2009–2011 were missions still have problems when detecting interesting semantic not amenable for our purpose as is, we rephrased them so that they connections of intertwined search tasks [10, 12, 8]. In this regard, ask for writing an essay instead of searching for facts. Consider for our corpus may be considered the first of its kind. example topic 001 from the TREC Web Track 2009: To justify our choice of an exploratory task, namely that of writing Query. obama family tree an essay about a given TREC topic, we refer to Kules and Capra [11], Description. Find information on President Barack who manually identified exploratory tasks from raw query logs on Obama’s family history, including genealogy, national a small scale, most of which turned out to involve writing on a origins, places and dates of birth, etc. given subject. Egusa et al. 
[6] describe a user study in which they Sub-topic 1. Find the TIME magazine photo essay asked participants to do research for a writing task, however, without “Barack Obama’s Family Tree.” actually writing something. This study is perhaps closest to ours, although the underlying data has not been published. The most Sub-topic 2. Where did Barack Obama’s parents and notable distinction is that we asked our writers to actually write, grandparents come from? thereby creating a much more realistic and demanding state of mind Sub-topic 3. Find biographical information on Barack since their essays had to be delivered on time. Obama’s mother. This topic is rephrased as follows: 2. CORPUS CONSTRUCTION Obama’s family. Write about President Barack Oba- As discussed in the related work, essay writing is considered a ma’s family history, including genealogy, national ori- valid approach to study exploratory search. Two data sets form the gins, places and dates of birth, etc. Where did Barack basis for constructing a respective corpus, namely (1) a set of topics Obama’s parents and grandparents come from? Also to write about and (2) a set of web pages to research about a given include a brief biography of Obama’s mother. topic. With regard to the former, we resort to topics used at TREC, In the example, Sub-topic 1 is considered too specific for our specifically to those from the Web Tracks 2009–2011. With regard purposes while the other sub-topics are retained. TREC Web track to the latter, we employ the ClueWeb09 (and not the “real web in the topics divide into faceted and ambiguous topics. While topics of the wild”). The ClueWeb09 consists of more than one billion documents first kind can be directly rephrased into essay topics, from topics of from ten languages; it comprises a representative cross-section of the the second kind one of the available interpretations is chosen. 
The ClueWeb09 is a widely accepted resource among researchers, and it is used to evaluate the retrieval performance of search engines within several TREC tracks. The connection to TREC will strengthen the compatibility with existing evaluation methodology and allow for unforeseen synergies. Based on the above decisions, our corpus construction steps can be summarized as follows:

1. Rephrasing the 150 topics used at the TREC Web Tracks 2009–2011 so that they invite people to write an essay.
2. Indexing the English portion of the ClueWeb09 (about 0.5 billion documents) using the BM25F retrieval model plus additional features.
3. Developing a search interface that allows for answering queries within milliseconds and that is designed along the lines of commercial search interfaces.
4. Developing a browsing interface for the ClueWeb09, which serves ClueWeb09 pages on demand and which rewrites links on delivered pages so that they point to their corresponding ClueWeb09 pages on our servers.
5. Recruiting 12 professional writers at the crowdsourcing platform oDesk, from a wide range of hourly rates for diversity.
6. Instructing the writers to write essays of at least 5000 words in length (corresponding to an average student's homework assignment) about an open topic among the initial 150, using our search engine and browsing only ClueWeb09 pages.
7. Logging all writers' interactions with the search engine and the ClueWeb09 on a per-topic basis at our site.
8. Double-checking all of the 150 essays for quality.

After the deployment of the search engine and successfully completed usability tests (see Steps 2–4 and 7 above), the actual corpus construction took nine months, from April 2012 through December 2012. The post-processing of the data took another four months, so that this corpus is among the first, late-breaking results from our efforts. However, the outlined experimental setup can obviously serve different lines of research. The remainder of this section presents elements of our setup in greater detail.

A Search Engine for Controlled Experiments.
To give the oDesk writers a familiar search experience while maintaining reproducibility at the same time, we developed a tailored search engine called ChatNoir [15]. Besides ours, the only other public search engine for the ClueWeb09 is hosted at Carnegie Mellon and based on Indri; unfortunately, it is far from our efficiency requirements. Our search engine returns results within a few hundred milliseconds, its interface follows industry standards, and it features an API that allows for user tracking.

ChatNoir is based on the BM25F retrieval model [17], uses the anchor text list provided by Hiemstra and Hauff [9], the PageRanks provided by Carnegie Mellon University (http://boston.lti.cs.cmu.edu/clueWeb09/wiki/tiki-index.php?page=PageRank), and the spam rank list provided by Cormack et al. [5]. ChatNoir also comes with a proximity feature with variable-width buckets, as described by Elsayed et al. [7]. Our choice of retrieval model and ranking features is intended to provide a reasonable baseline performance. It is neither nearly as mature as those of commercial search engines, nor does it compete with the best-performing models proposed at TREC. Yet BM25F is among the most widely accepted models in the information retrieval community, which underlines our goal of reproducibility.

In addition to its retrieval model, ChatNoir implements two search facets: text readability scoring and long text search. The former facet, similar to that provided by Google, scores the readability of a text found on a web page via the well-known Flesch-Kincaid grade level formula: it estimates the number of years of education required in order to understand a given text. This number is mapped onto the three categories "simple", "intermediate", and "expert". The long text search facet omits search results which do not contain at least one continuous paragraph of text that exceeds 300 words. The two facets can be combined with each other. They are meant to support writers who want to reuse text from retrieved search results.
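The readability facet maps a Flesch-Kincaid grade level onto three categories. The grade-level formula itself is standard; the sketch below is illustrative only — the crude syllable counter and the category cut-offs are assumptions, not ChatNoir's actual implementation:

```python
import re

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level: estimated years of education needed to understand a text."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)

    def syllables(word):
        # crude heuristic: count groups of consecutive vowels
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

    n_words = max(1, len(words))
    n_syllables = sum(syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59

def readability_category(grade, simple_max=5.0, expert_min=9.0):
    """Map a grade level onto "simple"/"intermediate"/"expert" (thresholds assumed)."""
    if grade <= simple_max:
        return "simple"
    if grade < expert_min:
        return "intermediate"
    return "expert"
```

A result page would then be labeled, e.g., `readability_category(flesch_kincaid_grade(page_text))`, and filtered against the facet the user selected.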
Especially interesting for this type of writer are result documents containing longer text passages and documents of a specific reading level, such that reusing text from the results still yields an essay with homogeneous readability.

When clicking on a search result, ChatNoir does not link into the real web but redirects into the ClueWeb09. Though the ClueWeb09 provides the original URLs from which the web pages have been obtained, many of these pages may have gone or been updated since. We hence set up an interface that serves web pages from the ClueWeb09 on demand: when accessing a web page, it is pre-processed before being shipped, removing all kinds of automatic referrers and replacing all links to the real web with links to their counterparts inside the ClueWeb09. This way, the ClueWeb09 can be browsed as if surfing the real web, and it becomes possible to track a user's movements. The ClueWeb09 is stored in the HDFS of our 40-node Hadoop cluster, and web pages are fetched with latencies of about 200 ms. ChatNoir's inverted index has been optimized to guarantee fast response times, and it is deployed on the same cluster.

Hired Writers.
Our ideal writer has experience in writing, is capable of writing about a diversity of topics, can complete a text in a timely manner, possesses decent English writing skills, and is well-versed in using the aforementioned technologies. This wish list led us to favor (semi-)professional writers over, for instance, volunteer students recruited at our university. To hire writers, we made use of the crowdsourcing platform oDesk (http://www.odesk.com). Crowdsourcing has quickly become one of the cornerstones for constructing evaluation corpora, which is especially true for paid crowdsourcing. Compared to Amazon's Mechanical Turk [1], which is used more frequently than oDesk, there are virtually no workers at oDesk submitting fake results, due to advanced rating features for workers and employers.

Table 1 gives an overview of the demographics of the writers we hired, based on a questionnaire and their resumes at oDesk. Most of them come from an English-speaking country, and almost all of them speak more than one language, which suggests a reasonably good education. Two thirds of the writers are female, and all of them have years of writing experience. Hourly wages were negotiated individually and range from 3 to 34 US dollars (dependent on skill and country of residence), with an average of about 12 US dollars. In total, we spent 20,468 US dollars to pay the writers.

Table 1: Demographics of the twelve writers employed.
    Age:                 minimum 24, median 37, maximum 65
    Gender:              Female 67%, Male 33%
    Native language(s):  English 67%, Filipino 25%, Hindi 17%
    Academic degree:     Postgraduate 41%, Undergraduate 25%, None 17%, n/a 17%
    Country of origin:   UK 25%, Philippines 25%, USA 17%, India 17%, Australia 8%, South Africa 8%
    Second language(s):  English 33%, French 17%, Afrikaans, Dutch, German, Spanish, Swedish 8% each, None 8%
    Years of writing:    minimum 2, median 8, standard dev. 6, maximum 20
    Search engines used: Google 92%, Bing 33%, Yahoo 25%, Others 8%
    Search frequency:    Daily 83%, Weekly 8%, n/a 8%

3. CORPUS ANALYSIS

This section presents the results of a preliminary corpus analysis that gives an overview of the data and sheds some light onto the search behavior of writers doing research.

Corpus Statistics.
Table 2 shows key figures of the query logs collected, including the absolute numbers of queries, relevance judgments, working days, and working hours, as well as relations among them. On average, each writer wrote 12.5 essays; two wrote only one, and one very prolific writer managed more than 30 essays.

Table 2: Key figures of our exploratory search mission corpus.
    Characteristic        Σ        min    avg    max    stdev
    Writers               12
    Topics                150
    Topics / Writer                1      12.5   33     9.3
    Queries               13 651
    Queries / Topic                4      91.0   616    83.1
    Clicks                16 739
    Clicks / Topic                 12     111.6  443    80.3
    Clicks / Query                 0      0.8    76     2.2
    Sessions              931
    Sessions / Topic               1      12.3   149    18.9
    Days                  201
    Days / Topic                   1      4.9    17     2.7
    Hours                 2 068
    Hours / Writer                 3      129.3  679    167.3
    Hours / Topic                  3      7.5    10     2.5
    Irrelevant            5 962
    Irrelevant / Topic             1      39.8   182    28.7
    Irrelevant / Query             0      0.5    60     1.4
    Relevant              251
    Relevant / Topic               0      1.7    7      1.5
    Relevant / Query               0      0.0    4      0.2
    Key                   1 937
    Key / Topic                    1      12.9   46     7.5
    Key / Query                    0      0.2    22     0.7

From the 13,651 submitted queries, each topic got an average of 91. Note that queries were often submitted twice, requesting more than ten results or using different facets. Typically, about 1.7 results are clicked for consecutive instances of the same query. For comparison, the average number of clicks per query in the aforementioned AOL query log is 2.0. In this regard, the behavior of our writers on individual queries does not seem to differ much from that of the average AOL user in 2006. Most of the clicks we recorded are search result clicks, whereas 2457 of them are browsing clicks on web page links. Among the browsing clicks, 11.3% are clicks on links that point to the same web page (i.e., anchor links using a URL's hash part). The longest click trail observed spanned 51 unique web pages, but most click trails are very short. This is surprising, since we expected a larger proportion of browsing clicks, but it also shows that our writers relied heavily on the search engine. If this behavior generalizes, the need for more advanced support of exploratory search tasks from search engines becomes obvious.

The queries of each writer can be divided into a total of 931 sessions, with an average of 12.3 sessions per topic. Here, a session is defined as a sequence of queries recorded on a given topic which is not divided by a break longer than 30 minutes. Despite other claims in the literature (e.g., in [10]), we argue that, in our case, sessions can be reliably identified by means of a timeout because of our a priori knowledge about which query belongs to which topic (i.e., task). Typically, finishing an essay took 4.9 days, which fits well the definition of exploratory search tasks as being long-lasting.

In their essays, writers referred to web pages they found during their search, citing specific passages and topic-related information used in their texts. This forms an interesting relevance signal which allows us to separate irrelevant from relevant web pages. Slightly different to the terminology of TREC, we consider web pages referred to in an essay as key documents for their respective topic, whereas web pages that are on a click trail leading to a key document are considered relevant.

Figure 1 (a 6×25 grid of curves, rows A–F, columns 1–25): Spectrum of writer search behavior. Each grid cell corresponds to one of the 150 topics and shows a curve of the percentage of submitted queries (y-axis) at times between the first query until the essay was finished (x-axis). The numbers denote the amount of queries submitted. The cells are sorted by area under the curve, from the smallest area in cell A1 to the largest area in cell F25.
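The per-topic, 30-minute-timeout session definition is straightforward to implement; a minimal sketch, in which plain timestamp lists stand in for the corpus' actual log schema:

```python
from datetime import datetime, timedelta

def split_into_sessions(query_times, timeout=timedelta(minutes=30)):
    """Group one writer's queries on one topic into sessions.

    query_times: chronologically sorted datetimes of the queries.
    A new session starts whenever the gap to the previous query
    exceeds the timeout (30 minutes, per the session definition above).
    """
    sessions = []
    for t in query_times:
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)  # continue the current session
        else:
            sessions.append([t])    # gap too long (or first query): new session
    return sessions
```

Because the logs are already partitioned by topic, the timeout never has to disambiguate between intertwined tasks — which is exactly why the paper argues the timeout is reliable here.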
The fact that there are only a few click trails of this kind explains the unusually high number of key documents compared to that of relevant ones. The remaining web pages, which were accessed but discarded by our writers, may be considered irrelevant. The writers' search interactions are made freely available as the Webis-Query-Log-12 (http://www.webis.de/research/corpora). Note that the writing interactions are the focus of our accompanying ACL paper [16] and are contained in the Webis text reuse corpus 2012 (Webis-TRC-12).

Exploring Exploratory Search Missions.
To get an inkling of the wealth of data in our corpus, and of how it may influence the design of exploratory search systems, we analyze the writers' search behavior during essay writing. Figure 1 shows for each of the 150 topics a curve of the percentage of queries at any given time between a writer's first query and an essay's completion. We have normalized the time axis and excluded working breaks of more than five minutes. The curves are organized so as to highlight the spectrum of different search behaviors we have observed: in row A, 70–90% of the queries are submitted toward the end of the writing task, whereas in row F almost all queries are submitted at the beginning. In between, however, sets of queries are often submitted in short "bursts," followed by extended periods of writing, which can be inferred from the plateaus in the curves (e.g., cell C12). Only in some cases (e.g., cell C10) can a linear increase of queries over time be observed for a non-trivial amount of queries, which indicates continuous switching between searching and writing.

From these observations, it can be inferred that query frequency alone is not a good indicator of task completion or of the current stage of a task; rather, different algorithms are required for different mission types. Moreover, exploratory search systems have to deal with a broad subset of the spectrum and be able to make the most of few queries, or be prepared that writers interact with them only a few times. Our ongoing research on this aspect focuses on predicting the type of search mission, since we found it does not simply depend on the writer or on a topic's difficulty as perceived by the writer.

4. SUMMARY

We introduce the first corpus of search missions for the exploratory task of writing. The corpus is of representative scale, comprising 150 different writing tasks and thousands of queries, clicks, and relevance judgments. A preliminary corpus analysis shows the wide variety of search behavior to expect from a writer conducting research online. We expect further insights from a forthcoming in-depth analysis, whereas the results mentioned already demonstrate the utility of our publicly available corpus.

5. REFERENCES
[1] J. Barr and L. F. Cabrera. AI gets a brain. Queue, 4(4):24–29, 2006.
[2] A. Bozzon, M. Brambilla, S. Ceri, and P. Fraternali. Liquid query: multi-domain exploratory search on the web. Proc. of WWW 2010.
[3] M. Bron, J. van Gorp, F. Nack, M. de Rijke, A. Vishneuski, and S. de Leeuw. A subjunctive exploratory search interface to support media studies researchers. Proc. of SIGIR 2012.
[4] M.-A. Cartright, R. White, and E. Horvitz. Intentions and attention in exploratory health search. Proc. of SIGIR 2011.
[5] G. Cormack, M. Smucker, and C. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011.
[6] Y. Egusa, H. Saito, M. Takaku, H. Terai, M. Miwa, and N. Kando. Using a concept map to evaluate exploratory search. Proc. of IIiX 2010.
[7] T. Elsayed, J. Lin, and D. Metzler. When close enough is good enough: approximate positional indexes for efficient ranked retrieval. Proc. of CIKM 2011.
[8] M. Hagen, J. Gommoll, A. Beyer, and B. Stein. From search session detection to search mission detection. Proc. of SIGIR 2012.
[9] D. Hiemstra and C. Hauff. MIREX: MapReduce information retrieval experiments. Tech. Rep. TR-CTIT-10-15, University of Twente, 2010.
[10] R. Jones and K. L. Klinkner. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. Proc. of CIKM 2008.
[11] B. Kules and R. Capra. Creating exploratory tasks for a faceted search interface. Proc. of HCIR 2008.
[12] C. Lucchese, S. Orlando, R. Perego, F. Silvestri, and G. Tolomei. Identifying task-based sessions in search engine query logs. Proc. of WSDM 2011.
[13] D. Morris, M. Ringel Morris, and G. Venolia. SearchBar: a search-centric web history for task resumption and information re-finding. Proc. of CHI 2008.
[14] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. Proc. of Infoscale 2006.
[15] M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, and C. Welsch. ChatNoir: a search engine for the ClueWeb09 corpus. Proc. of SIGIR 2012.
[16] M. Potthast, M. Hagen, M. Völske, and B. Stein. Crowdsourcing interaction logs to understand text reuse from the web. Proc. of ACL 2013.
[17] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. Proc. of CIKM 2004.
[18] R. White, G. Muresan, and G. Marchionini, editors. Proc. of the SIGIR workshop EESS 2006.
[19] R. White and R. Roth. Exploratory search: beyond the query-response paradigm. Morgan & Claypool, 2009.


Interactive Exploration of Geographic Regions with Web-based Keyword Distributions

Chandan Kumar, University of Oldenburg, Oldenburg, Germany (chandan.kumar@uni-oldenburg.de)
Dirk Ahlers, NTNU – Norwegian University of Science and Technology, Trondheim, Norway (dirk.ahlers@idi.ntnu.no)
Wilko Heuten, OFFIS – Institute for Information Technology, Oldenburg, Germany (wilko.heuten@offis.de)
Susanne Boll, University of Oldenburg, Oldenburg, Germany (susanne.boll@uni-oldenburg.de)

ABSTRACT
The most common and visible use of geographic information retrieval (GIR) today is the search for specific points of interest that serve an information need for places to visit. However, in some planning and decision-making processes, the interest lies not in specific places, but rather in the makeup of a certain region. This may be for tourist purposes, to find a new place to live during relocation planning, or to learn more about a city in general. Geospatial Web pages contain rich spatial information about the geo-located facilities that could characterize the atmosphere, composition, and spatial distribution of geographic regions. But current Web-based GIR interfaces only support the sequential search of geo-located facilities and services individually, and give end users little support for the abstracted viewing, analysis, and comparison of urban areas. In this work we propose a system that abstracts from the places and instead generates the makeup of a region based on keywords we extract from the Web pages of the region. We can then use this textual fingerprint to identify and compare other suitable regions which exhibit a similar fingerprint. The developed interface allows the user to get a grid overview, but also to drill in and compare selected regions, as well as adapt the list of ranked keywords.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.2 [Information Interfaces and Presentation]: User Interfaces

Keywords
Geographic information retrieval, Spatial Web, Geographic regions, Keyword distributions, Visualization, Interaction

1. INTRODUCTION

Geospatial search has become a widely accepted search mode offered by many commercial search engines. Their interfaces can easily be used to answer relatively simple requests such as "restaurant in Berlin" on a point-based map interface, which additionally gives extended information about entities [1]. A correspondingly strong research interest has developed in the field of geographic information retrieval, e.g., [2, 17, 15]. However, there are many tasks in which the retrieval of individual pinpointed entities such as facilities, services, businesses, or infrastructure cannot satisfy users' more complex spatial information needs.

To support more complex tasks we propose a new retrieval method based on entities. For example, sometimes the distribution of results on a map can already inform certain views about areas; e.g., a search for "bar" may show a clustering of results that can be used for "eyeballing" a region of nightlife even without sophisticated geospatial analysis. However, as users become more used to local search, more complex search types and supporting analysis are desired that enable a combined view onto the underlying data [10]. Exploration of geographic regions and their characterization was found to be one of the key desires of local search users in our requirement study [11]. A person who is moving to a new area or city would like to find neighborhoods or regions with a makeup similar to their current home. It might not even be the concrete entities, but rather the atmosphere, composition, and spatial distribution making up the "feeling" of a neighborhood that best capture the intention of a user. To assess this similarity of regions we propose a spatial fingerprint (query-by-spatial-example) that acts as an abstracted view onto the same point-based data. We also aim to provide new visual tools for the exploration of geographic regions.
While the necessary multi-dimensional geospatial data is already available, there is no suitable interface to query it, let alone to deal with the multi-criteria complexity. In this paper we describe a visual-interactive GIR system to support the retrieval of relevant geospatial regions and enable users to explore and interact with geospatial data. We propose a new query-by-spatial-example interaction method in which a user-selected region's characteristic is fingerprinted to present similar regions. Users can interactively refine their query to use those characteristics of a region that are most important to them. For a more detailed overview, we use the full text of georeferenced Web pages for queries and analysis. This work goes beyond conventional GIR interfaces as it allows users to interact with aggregated spatial information via spatial queries instead of only textual querying, which is especially important to define regions of interest. We discuss the necessary input, visualization, comparison, refinement, and ranking methods in the remainder of this paper.

2. USING THE GEOSPATIAL WEB TO CHARACTERIZE GEOGRAPHIC REGIONS

The distribution of geo-entities is used to illustrate the characteristics and dynamics of a geographic region. A geo-entity is a real-life entity at a physical location, e.g., a restaurant, theatre, pub, museum, business, or school. To open these entities up for aggregate and multi-criteria region characterization, they need a certain depth of information associated with them. Position information or the name of a place alone is obviously insufficient, so a categorical or textual description is needed. For initial studies [11, 9] we used OpenStreetMap (OSM, http://www.openstreetmap.org/), which uses a tagging system for categories. To better characterize the geo-entities we now use their associated Web pages. The reason for this is the massive increase in the amount of usable data: the Web pages of entities contain a lot more than just basic information and can therefore be used to uncover much more detailed information. This method can also include additional sources such as events happening in the region or user-generated content on third-party pages [2]. We later describe how we identify the most meaningful keywords from the pages for this task.

To actually make the connection from a location to Web pages, we assume that the presence of location references on a page is a strong indication that the page is associated with the entity at that location. We use our geoparser to extract location references and thereby assess the geographical scopes of a page. The geoparser is trained on location references in the form of addresses within the page content. This is a suitable approach for the urban areas we are addressing in this work, because we need a geospatial granularity at the sub-neighborhood level. Knowledge-based identification and verification of the addresses is done against a gazetteer extended with street names, which we fetched from OSM for the major cities of Germany. To retrieve actual pages, we crawled the Web with a geospatially focused crawler [3] based on the geoparser and built a rich geo-index for various cities of Germany, where each city contains several thousand geotagged Web pages with their full textual content.

3. INTERFACE FOR EXPLORATION OF GEOGRAPHIC REGIONS OF INTEREST

We have implemented two main interaction modes in the Web interface, as shown in Figure 1. A user intends to compare multiple geographic regions of Frankfurt (target regions, right in the dual-map view) with respect to a certain relevant region in Berlin (query region, left). The current reference region of interest is specified via a visual query. The user can then either select regions by placing markers onto the map, or alternatively use a grid overview (right side of Figure 1). In both cases, the system computes the relevance of the target regions with respect to the characteristics of the query region.

[Figure 1: Geographic querying and ranking of geographic regions, with user-selected target regions and alternative grid view]
[Figure 2: Keyword-based visual comparison of geographic regions]

3.1 Query-by-spatial-example

Most GIR interfaces use a conventional textual query as the input method to describe the user's information need, or use the currently selected map viewport. We wanted to give users the ability to arbitrarily define their own spatial region of interest. The free definition of the query region is important, as users may not always want a neighborhood that is easily describable by a textual query. We therefore enable querying by spatial example, where users can define the query region by drawing on the map. Figure 1 shows an example of a user-selected region of interest via a polygon query (by mouse clicks and drag) in the city of Berlin.
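A user-drawn polygon ultimately selects the geotagged pages that fall inside it. The paper does not specify which containment test the system uses, so the standard ray-casting point-in-polygon algorithm below is an illustrative assumption, as are the `(page_id, lon, lat)` triples standing in for the geo-index:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices of the drawn region."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does the horizontal ray from (lon, lat) cross edge (i, i+1)?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def pages_in_region(pages, polygon):
    """pages: iterable of (page_id, lon, lat) triples from the geo-index."""
    return [pid for pid, lon, lat in pages if point_in_polygon(lon, lat, polygon)]
```

The selected pages form the compound document of the query region used for ranking in Section 4.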
3.2 Visualization of suitable geographic regions

Users can select several location preferences in the target region that they would like to explore by positioning markers on the map interface. The system defines the targets with a circle around each user-selected location with the same diameter as the reference region polygon. The target regions are then ranked with respect to their similarity with the reference region. Their relevance is shown by the percentage similarity and a heatmap-based relevance visualization. We used a color scheme of different green tones which differed in their transparency: light colors represented low relevance, while dark colors were used to indicate high relevance. The color scheme selection was aided by ColorBrewer (http://colorbrewer2.org).

As an example, Figure 1 shows four user-selected locations on the city map of Frankfurt; the circle regions around these four markers have the same diameter as the query region in Berlin. The target region in the centre of the city is most relevant, with a similarity of 88%, and consequently has the darkest green tone. If a user has not yet formed any preference, we offer an aggregate overview of geo-entities. We partition the map area using a grid raster [14], as we do not intend to restrict user exploration to only selected areas. There could be situations when users look beyond the specific target regions and would like to have an overview of the whole city with respect to a query region. The right side of Figure 1 shows the aggregated ranked view of the grid-based visualization. Each grid cell represents the overall relevance with respect to the query region. The visualization gives a good overview and assessment of relevant regions, which the user can then explore further. Users can select the grid size, which otherwise defaults to the size of the query region. The grid layout is fixed to the city boundaries, as we intend to give an overview of the whole city. In the future we would like to make it more dynamic so that users can shift the grid layout, since a slight variation in grid cell boundaries could alter the relevance results.

3.3 Exploration and interaction with geographic regions via keyword distributions

Interaction models should give end users the opportunity to explore the characteristics of selected regions and to adapt them further to their requirements. We initially show the most relevant keywords of the respective region using a word cloud. The word cloud provides more detailed information on the keyword distribution when the mouse hovers over it. The font size and order of the keywords signify their relevance. Figure 2 shows the comparison of the query region with the most relevant target region via both their keyword distributions. In this case, the distributions of both regions are very similar, leading to the high relevance score for the target region.

Since the keyword characteristics of a query region are derived from georeferenced Web pages, there are situations where a user might not be satisfied with the spatial description and wants to influence the keywords. In the example of Figure 3, a user decides that pubs are more important than restaurants, and that fast food is not an aspect of his lifestyle and should be replaced by education facilities near his new home. In such scenarios users need to interact with and adapt the generated keyword distributions of query regions. We make the word cloud interactive and editable. Users can drag keywords to alter their position and thus their significance. They can also edit, delete, or replace keywords in the word cloud to change the criteria. After modifying the keyword distribution, users can revisualize the target regions to update their ranking. Figure 3 shows this user interaction with the word cloud, including the revisualization of the updated ranking of target regions, which is visibly different from the previous ranking of Figure 2.

[Figure 3: User interaction with the keyword distribution and revisualization]

4. TEXT-BASED CHARACTERIZATION AND RANKING OF GEOGRAPHIC REGIONS

We adapt common IR methods for ranking and similarity measures. In relevance-based language models, the similarity of a document to a query is the probability that a given document would generate the query [12]. To be able to do the same with geographic regions, we add a transitional step: regions are considered as compound documents built from the Web pages of the entities inside them. We can then define the similarity of the document clusters of regions based on the probability that the target region can generate the query region. The Kullback-Leibler divergence is used for comparison [4].

For a geospatial document d, we estimate P(w|d), which is a unigram language model, with the maximum likelihood estimator, simply given by relative counts: P(w|d) = tf(w,d) / |d|, where tf(w,d) is the frequency of word w in the document d and |d| is the length of the document d. A geographic region contains several geospatial documents inside its footprint area. We define a geographic region based on a document cluster D which contains the documents {d1, d2, ..., dk}; the distribution of a particular word w in the geographic region is estimated with its combined probability in the collection:

    P(w|D) = (1/k) · Σ_{i=1..k} P(w|d_i)

The word cloud represents the most prominent keywords of the region with respect to their ranked probability distribution P(w|D). The comparison of regions is done with respect to their probability distributions using the KL-divergence. A target region x is compared to the query region as follows:

    Relevance(Region_x) = Σ_w P(w|D_q) · log( P(w|D_q) / P(w|D_x) )

The computation of this formula involves a sum over all the words that have a non-zero probability according to P(w|D_q). Each region Region_x gets a relevance score according to the comparison of its distribution with that of the query region Region_q. All target regions (user-selected regions or grid-based divisions) are ranked with respect to their relevance score for visualization.
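The two estimators and the ranking formula can be sketched directly. One caveat: the paper does not say how zero probabilities P(w|D_x) = 0 in the target region are handled inside the logarithm, so the small epsilon floor below is an assumption:

```python
import math
from collections import Counter

def doc_model(tokens):
    """Unigram MLE: P(w|d) = tf(w,d) / |d|."""
    counts = Counter(tokens)
    n = max(1, len(tokens))
    return {w: c / n for w, c in counts.items()}

def region_model(docs):
    """P(w|D) = (1/k) * sum_i P(w|d_i) over the k pages inside a region."""
    models = [doc_model(d) for d in docs]
    vocab = set().union(*(m.keys() for m in models))
    return {w: sum(m.get(w, 0.0) for m in models) / len(models) for w in vocab}

def kl_relevance(query_region, target_region, eps=1e-9):
    """KL divergence of the target model from the query model; lower = more similar.

    The sum runs only over words with non-zero probability in the query
    region, as the paper describes; eps avoids log(p/0) for words absent
    from the target region (an assumption, see above).
    """
    return sum(p * math.log(p / max(target_region.get(w, 0.0), eps))
               for w, p in query_region.items() if p > 0)
```

Ranking all target regions then amounts to sorting them by `kl_relevance(query_model, target_model)` in ascending order.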
However, search for multiple categories or other complex tasks is usually not supported. Some non-conventional spatial querying methods have been proposed, e.g., query-by-sketch on a map [6]. Other work uses the density of arbitrary user-supplied keywords to build a query region [8]. Tag clouds have been adapted to maps, exploiting georeferenced tags [16]. Locally characteristic keywords can be extracted for map visualization and to show their spatial extent [19]. None of these approaches makes a larger word cloud available, but only the main terms. Other geovisualization approaches [5, 7] address multi-criteria analysis, but are usually targeted at specific domains and experts. The Inspect system was tailored to geospatial analysts, to visually filter and explore multidimensional data [13]. A multi-criteria evaluation for home buyers was proposed in [18]. The scenario of spatial decision making is similar to ours, but it focused on experts and spatial computation issues rather than interface and visualization aspects.

Our system interface differs in the granularity of information need and representation, i.e., we focus on the ranking of regions, but base it on high-granularity geo-entities that have a very exact location. This ensures that the spatial query does not produce overlap with neighboring regions, and makes the multi-criteria analysis more exact to execute at arbitrary region sizes.

6. CONCLUSIONS AND FUTURE WORK
Most current local search interfaces do not offer adequate support for the exploration and comparison of geographic areas and regions. End users need visual and interactive assistance from GIR systems for an abstracted overview and analysis of geospatial data. We proposed interactive interfaces for the characterization and assessment of relevant geographic regions that enable end users to query, analyze and interact with the rich geospatial data available on the Web in user-selected geographic regions. The relevance of regions is based on the similarity of keyword distributions.

The observation of results shows satisfactory performance in uncovering realistic and meaningful keywords defining the regions. We observed that the characterization and comparison of geographic regions show good results with respect to geo-located facilities and infrastructure of German cities, e.g., clearly distinct characteristics for university, industrial, or party districts. In the future we plan a more formal qualitative and quantitative evaluation of these interfaces, to examine the acceptance of these visualizations with regard to user-centered aspects such as exploration ability, information overload, and cognitive demand. We would also like to explore more advanced interaction methods to enhance the usability of the proposed visualizations.

Additionally, we envision more powerful region similarity measures such as landscape and topological similarity, similarity via social media, and an integration of additional data sources.

Acknowledgments
The authors are grateful to the DFG SPP 1335 'Scalable Visual Analytics' priority program, which funds the project UrbanExplorer. The 2nd author acknowledges funding from the ERCIM "Alain Bensoussan" Fellowship Programme.

7. REFERENCES
[1] D. Ahlers. Local Web Search Examined. In Web Search Engine Research. Emerald, 2012.
[2] D. Ahlers and S. Boll. Location-based Web search. In The Geospatial Web. Springer, 2007.
[3] D. Ahlers and S. Boll. Adaptive Geospatially Focused Crawling. In CIKM '09, 2009.
[4] T. M. Cover and J. A. Thomas. Elements of Information Theory, 1991.
[5] J. Dykes, A. M. MacEachren, and M.-J. Kraak. Exploring Geovisualization. Elsevier, 2005.
[6] M. J. Egenhofer. Query processing in spatial-query-by-sketch. J. Vis. Lang. Comput., 8, 1997.
[7] R. Greene et al. GIS-based multiple-criteria decision analysis. Geography Compass, 5(6), 2011.
[8] A. Henrich and V. Lüdecke. Measuring Similarity of Geographic Regions for Geographic Information Retrieval. In ECIR '09, 2009.
[9] C. Kumar, W. Heuten, and S. Boll. Visual interfaces to support spatial decision making in geographic information retrieval. In CD-ARES 2013, to appear.
[10] C. Kumar, W. Heuten, and S. Boll. Geovisualization for end user decision support: Easy and effective exploration of urban areas. In GeoViz Hamburg 2013: Interactive Maps That Help People Think, 2013.
[11] C. Kumar, B. Poppinga, D. Haeuser, W. Heuten, and S. Boll. Geovisual interfaces to find suitable urban regions for citizens: A user-centered requirement study. In UbiComp '13 Adjunct, 2013, to appear.
[12] V. Lavrenko and W. B. Croft. Relevance-based language models. In SIGIR '01. ACM, 2001.
[13] S.-J. Lee et al. Inspect: a dynamic visual query system for geospatial information exploration. In SPIE, 2003.
[14] A. M. MacEachren and D. DiBiase. Animated maps of aggregate data: Conceptual and practical problems. CaGIS, 18(4), 1991.
[15] A. Markowetz et al. Design and Implementation of a Geographic Search Engine. In WebDB 2005, 2005.
[16] D.-Q. Nguyen and H. Schumann. Taggram: Exploring geo-data on maps through a tag cloud-based visualization. In IV '10, 2010.
[17] R. S. Purves et al. The design and implementation of SPIRIT: a spatially aware search engine for information retrieval on the internet. IJGIS, 21(7), 2007.
[18] C. Rinner and A. Heppleston. The spatial dimensions of multi-criteria evaluation – case study of a home buyer's spatial decision support system. In Geographic Information Science, 2006.
[19] B. Thomee and A. Rae. Uncovering locally characterizing regions within geotagged data. In WWW '13, 2013.
Inferring Music Selections for Casual Music Interaction

Daniel Boland, Ross McLachlan, Roderick Murray-Smith
University of Glasgow, United Kingdom
daniel@dcs.gla.ac.uk, r.mclachlan.1@research.gla.ac.uk, rod@dcs.gla.ac.uk

ABSTRACT
We present two novel music interaction systems developed for casual exploratory search. In casual search scenarios, users have an ill-defined information need and it is not clear how to determine relevance. We apply Bayesian inference using evidence of listening intent in these cases, allowing a belief over a music collection to be inferred. The first system using this approach allows users to retrieve music by subjectively tapping a song's rhythm. The second system enables users to browse their music collection using a radio-like interaction that spans from casual mood-setting through to explicit music selection. These systems embrace the uncertainty of the information need to infer the user's intended music selection in casual music interactions.

Categories and Subject Descriptors
H.5.2 [Information interfaces]: User Interfaces

General Terms
Design, Human Factors, Theory

Presented at EuroHCIR2013. Copyright © 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1. INTRODUCTION
When interacting with a music system, listeners are faced with selecting songs from increasingly large music collections. With services like Spotify, these libraries can include many songs the user has never heard of. This retrieval is often a hedonic activity and may not serve a particular information need. Users do not always have a song in mind and are often just interested in setting a mood or finding something 'good enough' [9]. This type of casual search has recently been identified as not being well supported within the IR literature [14]. In particular, the concept of relevance becomes nebulous where the information need is not well defined. By inferring a belief over a music collection using the likelihood of a user's input, we implement interactions which incorporate this uncertainty. These interactions can account for subjectivity and span from casual, serendipitous listening through to highly engaged music selection.

Music listeners are not always fully engaged with the selection of music, as evidenced by the success of the shuffle playback feature. Large libraries of music such as Spotify are available, but users often just want background music, not a specific song out of millions. In these casual search scenarios, users often satisfice, i.e. search for something which is 'good enough' [11]. As this information need is poorly defined, so too is relevance, placing these interactions outside of typical Information Retrieval approaches.

2. UNCERTAIN MUSIC SELECTION
By asking 'What would this user do?', we can develop a likelihood model of user input within an interaction. With Bayes' theorem, this allows for an uncertain belief over a music space to be inferred. Users can provide evidence of their listening intent as part of a casual music interaction, without needing to be fully engaged in the music retrieval. This is an explicitly user-centered approach, focusing on how a user will interact with the system. Both of the systems discussed here have been iteratively developed by comparing real user behaviour against that predicted by the user input models. We present two novel music retrieval systems which explore two challenges with this approach: i) how to correctly interpret evidence which may be subjective, and ii) how to allow users to set their current level of engagement:

i) 'Query by Tapping' is a music retrieval technique where users tap the rhythm of a song in order to retrieve it [1]. As part of a user-centred development process, we identified that rhythmic queries are often subjective, and so developed a model of rhythmic input which captures some of this subjective behaviour. This allows the system to be trained to the user's tapping style, giving significant improvements over previous efforts at rhythmic music retrieval.

ii) FineTuner is a prototype of a radio-like music interface that enables users to retrieve music at a level of engagement suited to their current information need. Users navigate their music collection using a dial, with the system using prior knowledge of the user to inform the music selection. A pressure sensor enables users to assert varying levels of control over the system – with no pressure, users can casually tune in to sections of their music collection to hear recommended music with common characteristics. As pressure is applied, the user is able to make increasingly specific selections from the collection. The inferred music selection is conditioned upon the asserted control, allowing for the seamless transition from casual mood-setting to engaged music interaction.

Figure 1: Users construct queries by sampling from preferred instruments. User 1 prefers Vocals and Guitar whereas User 2 prefers Drums and Bass.

3. MODELLING SUBJECTIVITY
In this section we describe our efforts to model the subjectivity of rhythmic queries, yielding a query-by-tapping system for casual music retrieval which can be trained to users to account for their subjective querying style. After training the system, a user can tap a rhythm to re-order their music collection by rhythmic similarity to their query. The top 20 highly ranked results are listed on-screen as a music playlist, from which the user can also then select a specific song. Query by tapping provides an example of a casual music interaction which suffers from subjective queries. In mobile music-listening contexts, it can often be inconvenient for users to remove their mobile device from their pocket or bag and engage with it to select music. Tapping music as a querying technique is depicted in Figure 2. Tapping a rhythm is already a common act, and rhythm is a universal aspect of music [13]. In an exploratory design session where users were asked to provide rhythmic queries, it became apparent that users differed in querying style. We describe this subjective behaviour and our approach to modelling it in previous work [1]. One of the key aspects of the model is that users have preferences for which instruments they tap to, as depicted in Figure 1.

In order to assign a belief to the songs in the music collection given a rhythmic query, we compare the query to those predicted by the user input model. This comparison is done using the edit distance from string comparison methods, scaling the mismatch penalty to the time differences between the rhythmic sequences [5].

3.1 Query By Tapping
'Query by Tapping' has received some consideration in the Music Information Retrieval community. The term was introduced in [7], which demonstrated that rhythm alone can be used to retrieve musical works, with their system yielding a top-10 ranking for the desired result 51% of the time. Their work is limited, however, in considering only monophonic rhythms, i.e. the rhythm from only one instrument, as opposed to polyphonic rhythms comprising multiple instruments. Their music corpus consists of MIDI representations of tunes such as "You are my sunshine", which is hardly analogous to real-world retrieval of popular music. Rhythmic interaction has been recognised in HCI [8, 15], with [4] introducing rhythmic queries as a replacement for hot-keys. In [2], tempo is used as a rhythmic input for exploring a music collection – indicating that users enjoyed such a method of interaction. The consideration of human factors is also an emerging trend in Music Information Retrieval [12]. Our work draws upon both these themes, being the first QBT system to adapt to users. A number of key techniques for QBT are introduced in [5], which describes rhythm as a sequence of time intervals between notes – termed inter-onset intervals (IOIs). They identify the need for such intervals to be defined relative to each other, to avoid the user having to exactly recreate the music's tempo.

In previous implementations of QBT, each IOI is defined relative to the preceding one [5]. This sequential dependency compounds user errors in reproducing a rhythm, as an erroneous IOI value will also distort the following one.

Figure 2: Users are able to select music by simply tapping a rhythm or tempo on the device, enabling a casual eyes-free music interaction.

Figure 3: Percentage of queries yielding a highly ranked result (in the top 20, i.e. 6.7%) plotted against query length in seconds.

Figure 4: As the user asserts control, the distribution of predicted input for a given song becomes narrower. This adds weight to the input, meaning a belief is inferred over fewer songs and the view zooms in.

The approach to rhythmic interaction in [4], however, used k-means clustering to classify taps and IOIs into three classes based on duration.
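The two ingredients just described, duration-class clustering of IOIs as in [4] and an edit distance whose mismatch penalty scales with the time difference between intervals as in [5], can be sketched as follows. This is a minimal illustration under our own simplifying assumptions (a tiny one-dimensional k-means, absolute-time IOIs), not the authors' code.

```python
def iois(tap_times):
    """Inter-onset intervals: time gaps between successive taps."""
    return [b - a for a, b in zip(tap_times, tap_times[1:])]

def classify(intervals, n_classes=3, iters=20):
    """Cluster IOIs into duration classes (short/medium/long) with a
    small 1-D k-means, as in the clustering approach of [4]."""
    lo, hi = min(intervals), max(intervals)
    centres = [lo + (hi - lo) * i / (n_classes - 1) for i in range(n_classes)]
    for _ in range(iters):
        groups = [[] for _ in centres]
        for x in intervals:
            i = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            groups[i].append(x)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return [min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            for x in intervals]

def rhythm_distance(q, s):
    """Edit distance between two IOI sequences, with the substitution
    penalty scaled by the time difference, in the spirit of [5]."""
    m, n = len(q), len(s)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + q[i - 1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + s[j - 1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + q[i - 1],                    # deletion
                          d[i][j - 1] + s[j - 1],                    # insertion
                          d[i - 1][j - 1] + abs(q[i - 1] - s[j - 1]))  # substitution
    return d[m][n]
```

A collection can then be re-ordered by ascending `rhythm_distance` between the query's IOIs and each song's predicted IOIs; the class labels from `classify` allow coarser, tempo-tolerant matching.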
The clustering-based approach avoids the sequential error, but loses a great deal of detail in the rhythmic query, and so we explore a hybrid approach.

3.2 Evaluation
The most important metric for the system to be usable was whether a rhythmic input produced an on-screen (top 20) result. We asked eight participants to provide queries for songs selected from a corpus of 300 songs for which we had complete note onset data. Participants listened to the songs first to ensure familiarity, and were asked to provide training queries for each song. These training queries were used to train the generative model using leave-one-out cross-validation. We use a state-of-the-art onset detection algorithm (based on measuring spectral flux [10]) as a baseline which does not account for subjectivity. Performance typically improves with query length, as seen in Figure 3. Higher rankings are achieved for all query lengths when using the generative model. Interestingly, queries over 10 seconds lead to a rapid fall-off in performance, possibly due to errors accumulating beyond the initial query the user had in mind, or due to users becoming bored.

4. MODELLING ENGAGEMENT
We consider casual search interactions as spanning a range of levels of engagement. How much a user is willing to engage with a system and provide evidence of their listening intent will undoubtedly vary with listening context. An interaction which is fixedly casual would be as problematic as one which requires a user's full attention, with users unable to take control when they wish to. An example of this would be old analogue radios – whilst they offer a simple music interaction, users have limited control over what they hear. Previous work by Hopmann et al. sought to bring the benefits of interaction with vintage analog radio to modern digital music collections [6]; however, their work also required explicit selection (a fixed level of engagement).

We explore how the inference of listening intent can be conditioned upon the user's level of engagement, with the music interaction spanning from casual mood-setting through to specific song selection. While it would be desirable to bring the simplicity of radio-like interaction to modern music collections, mapping a modern music collection to a dial such as in Figure 5 would require prolonged scrolling. An alternative would be to instead support scrolling through an overview of the music space; however, this removes granularity of control from the user, leaving them unable to select specific items. We developed a radio-like system called FineTuner that allows users to navigate their music, which is arranged along a mood axis. Users can 'tune in' to a mood to hear recommended songs based on their listening history. FineTuner allows the user to assert control over the music recommendation by applying pressure to a sensor. This enables users to seamlessly transition from a casual style of interaction, akin to a radio, to controlling styles such as specifying a particular sub-area of interest in a music space, or even selecting individual songs. FineTuner provides a single interaction which supports casual search through to fully engaged retrieval.

Figure 5: Users share control over an intelligent radio system, using a knob and pressure sensor.

4.1 Varying Engagement
Our system enables both casual and engaged forms of interaction, giving users varying degrees of control over the selection of music. In casual interactions where users apply less pressure, the system can become more autonomous – making inferences from prior evidence about what the user intended. This handover of control was termed the 'H-metaphor' by Flemisch et al., where it was likened to riding a horse – as the rider asserts less control, the horse behaves more autonomously [3]. By allowing users to make selections from the general to the specific, the system supports both specific selections and satisficing. Users can make broad and uncertain general selections to casually describe what they want to listen to. However, they can also assert more control over the system and force it to play a specific song. Control is asserted by applying force to a pressure sensor.

As the user begins an interaction, they have not applied pressure and therefore are not asserting control over the system. The inferred selection is thus broad, covering an entire region of their collection, and is biased towards popular tracks (fig. 4a). The music in the inferred selection is visualised by randomly sampling tracks from it and drawing beams from the dial position to the album art. The user may press in the knob to accept the selection, and the sampled track is played. At low levels of assertion it is likely that most tracks played would be highly popular tracks. This behaviour is a design assumption; users may want the system to use other prior evidence. When the user applies pressure, the system interprets this as an assertion of control. The inferred selection is smaller and the spread of beams becomes narrower; the album art visualisation zooms in to show the smaller selection (fig. 4b). This selection is a combination of evidence from the dial position with prior evidence, i.e. their last.fm music history. When users fully assert control (max. pressure), they navigate the collection album by album (fig. 4c) and can make exact selections. By varying the pressure, users seamlessly move through this continuous range of control.

The smooth change in engagement is achieved using a simple model of user input. We assume that in an engaged interaction, users will point precisely at the song of interest (as in fig. 4c). For more casual selection, we assume that users will point in the general area (mood) of the music they want, modelled using a normal distribution (as in fig. 4b). As less pressure is applied, the distribution is widened, leading to less precise selection and a greater role for a prior belief over the music collection, such as listening history.

5. SUMMARY
The scenarios explored here involve casual music retrieval, where users have an ill-defined information need and browse for hedonic purposes or to satisfice a music selection. In these cases, considering what input a user would provide for target songs and inferring selections is an intuitive approach which avoids the issue of defining relevance. We show two music interactions which support the uncertain selection of music, inferred from casual user input such as tapping a rhythm or turning a radio dial.

We have shown that modelling user input for inferring music selection can address issues of subjectivity by taking a user-centered approach to model development. The model can be iterated by comparing its predictions against actual user behaviour. Accounting for this subjectivity can yield significant improvements in retrieval performance, as well as creating a more personalised search experience. A key feature of the second system, FineTuner, is its ability to span seamlessly from casual search scenarios, such as satisficing, through to more explicit selections of music. By conditioning the inference upon the user's level of engagement, we are able to interpret the same input space (in this case the dial) according to the current context.

Our approach to casual music interaction empowers the user to enjoy their music while expending as much or as little effort in the retrieval as they wish, providing queries in their own subjective style. Instead of focusing solely on optimising the retrieval process, we consider it equally important to design retrieval systems which suit how the user currently wants to interact. By considering how users might provide casual evidence of their listening intent, we achieve music interactions as simple as tapping a beat or tuning a radio.

6. ACKNOWLEDGMENTS
We are grateful for support from Bang & Olufsen and the Danish Council for Strategic Research.

7. REFERENCES
[1] Boland, D., and Murray-Smith, R. Finding My Beat: Personalised Rhythmic Filtering for Mobile Music Interaction. In MobileHCI 2013 (2013).
[2] Crossan, A., and Murray-Smith, R. Rhythmic Interaction for Song Filtering on a Mobile Device. Haptics and Audio Interaction Design (2006), 45–55.
[3] Flemisch, O., Adams, A., Conway, S. R., Goodrich, K. H., Palmer, M. T., and Schutte, P. C. The H-Metaphor as a Guideline for Vehicle Automation and Interaction. NASA/TM-2003-212672, 2003.
[4] Ghomi, E., Faure, G., Huot, S., and Chapuis, O. Using rhythmic patterns as an input method. In Proc. CHI (2012), 1253–1262.
[5] Hanna, P. Query by tapping system based on alignment algorithm. In Proc. ICASSP (2009), 1881–1884.
[6] Hopmann, M., Vexo, F., Gutierrez, M., and Thalmann, D. Vintage Radio Interface: Analog Control for Digital Collections. In CHI 2012: Case Study (2012).
[7] Jang, J., Lee, H., and Yeh, C.-H. Query by Tapping: A New Paradigm for Content-based Music Retrieval from Acoustic Input. In Proc. PCM (2001).
[8] Lantz, V., and Murray-Smith, R. Rhythmic interaction with a mobile device. In Proc. NordiCHI, ACM (2004), 97–100.
[9] Laplante, A., and Downie, J. S. Everyday life music information-seeking behaviour of young adults, 2006.
[10] Masri, P. Computer modelling of sound for transformation and synthesis of musical signals. PhD thesis, University of Bristol, 1996.
[11] Scheibehenne, B., Greifeneder, R., and Todd, P. M. What Moderates the Too-Much-Choice Effect? Journal of Psychology & Marketing 26(3) (2009), 229–253.
[12] Stober, S., and Nürnberger, A. Towards user-adaptive structuring and organization of music collections. In Adaptive Multimedia Retrieval: Identifying, Summarizing, and Recommending Image and Music (2010), 53–65.
[13] Trehub, S. E. Human processing predispositions and musical universals. In The Origins of Music, N. L. Wallin, B. Merker, and S. Brown, Eds. MIT Press, 2000, ch. 23, 427–448.
[14] Wilson, M. L., and Elsweiler, D. Casual-leisure Searching: the Exploratory Search scenarios that break our current models. In HCIR 2010 (2010).
[15] Wobbrock, J. O. Tapsongs: tapping rhythm-based passwords on a single binary sensor. In Proc. UIST (2009), 93–96.
Accounting for this subjectivity can yield passwords on a single binary sensor. In Proc. UIST (2009), significant improvements in retrieval performance as well as 93–96. Search or browse? Casual information access to a cultural heritage collection Robert Villa, Paul Clough, Mark Hall, Sophie Rutter Information School University of Sheffield Sheffield, UK S1 4DP {r.villa, p.d.clough, m.mhall, sarutter1} @sheffield.ac.uk ABSTRACT The work reported here is based on initial results from the Public access to cultural heritage collections is a challenging and Interactive CHiC (Cultural Heritage in CLEF) track of CLEF1 as ongoing research issue, not least due to the range of different run at Sheffield University. The interactive CHiC track is based reasons a user may want to access materials. For example, for a on the CHiC Europeana data set as used in 2011 and 2012 [1]. An virtual museum website users may vary from professionals or early prototype of an evaluation framework was used [2] which experts, to interested members of the public visiting on a whim. In allowed the interactive experiment to be semi-automated. In this this paper, we are interested in the latter user: a user who visits a work, our focus is on how users explored the collection and in cultural heritage website without a clear goal or information need particular how search and browse were used in this exploration. in mind. In the user study reported here, carried out within the We consider three research questions: context of the interactive task at CLEF (interactive CHiC), 20 RQ1. How do participants initiate their exploration? participants explored a subset of Europeana with no explicit task provided using a custom-built interface that offered both search RQ2. Do participants use browse or search in their exploration and browse functionalities. Results suggest that browsing is used of the collection? considerably more by the majority of users when compared to text RQ3. 
How do participants decide to search or browse, when search (all participants used the category browser before carrying given no explicit task? out a text search). This highlights the need for cultural heritage search interfaces to provide browsing functionality in addition to With RQ1 we are particularly interested whether users start their conventional text search if they wish to support casual search exploration by browsing categories, or by search. RQ2 then tasks. considers how users access the collection over their whole session. For RQ3 we will present some initial qualitative data from our lab-based interactive study, where the aim is to identify General Terms reasons for the use of either the search or browse functions. Design, Experimentation, Human Factors. 2. PREVIOUS WORK Keywords A general review of museum informatics is provided in [3], Cultural heritage, virtual museums, information access. although the more specific area of museum visitor studies, investigating why and how individuals visit museums, has a long 1. INTRODUCTION history [4]. More recent work has focused on visitors to digital Providing public access to cultural heritage is an ongoing and museums [5-7]. In [6] the information seeking behavior of challenging area of research. Previous work suggests that visitors cultural heritage experts was studied through interviews, finding to online cultural heritage collections (e.g. virtual museum that complex information gathering was required for the majority visitors) are not necessarily motivated by an explicit task, and that of search tasks. In contrast [7] studied virtual museum visitors, interacting with cultural heritage collections is exploratory in inspired by the work of [8] and [9] which suggest that museum nature [8, 9]. Recent work in the area of ‘casual search’ [10] has visitors are exploratory in their information seeking. 
[7] found that search occurred far more often than browse behavior for three of the four tasks used in the study, the exception being an open and broad task where browsing occurred to a greater degree.

Museum visitors can, in some respects, be considered as examples of "casual leisure" searchers, as outlined in [10], where examples were found of "need-less" browsing (based on a diary study and an analysis of Tweets, both outside the domain of cultural heritage). This work also investigated situations where users are driven by the pleasure of the search process itself, rather than an explicit information need. Darby and Clough [11] investigated the information seeking behavior of genealogists, with an emphasis on the behavior of amateurs and hobbyists rather than professionals. In [12] a review of three digital library projects is carried out from the point of view of Ingwersen and Järvelin's Information Seeking and Retrieval framework [13]. Similar to [10], it points out that information behavior by end users may be the "end in itself".

The study reported here uses a conventional lab-based protocol. However, unlike in previous work such as [7], the participants were not given an explicit task: the underlying aim being to model a situation closer to that investigated in [10], where there is no explicit information need.

The focus for this paper is how individuals explore a cultural heritage collection when given no task. The results may be used both to contrast with studies which have used explicit tasks, and to motivate changes to cultural heritage systems to better support a diverse range of user tasks.

Presented at EuroHCIR2013. Copyright © 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

1 http://www.promise-noe.eu/unlocking-culture

3. INTERACTIVE CHiC
A screenshot of the CHiC interactive system is shown in Figure 1. The interface is split into five main areas, clockwise from left to right: a category browser, search box, item display, bookbag, and search results. The search box operates in the conventional manner, allowing free text queries, with search results displayed as a grid below. When a result is clicked, it is displayed in the "item display" on the right. This information will typically include a small thumbnail, a textual description, and the item's associated metadata. Metadata is clickable: e.g. if an item is listed as being owned by the British Library, clicking on the field will search for British Library objects. At the bottom of the item display is a "more like this" area, which displays the images of up to eight similar objects, which can be viewed three at a time.

Figure 1: Screenshot of the Interactive CHiC interface

On the left of the interface is the "category browser", which allows the user to browse the Europeana collection through a hierarchy of categories. This hierarchy is automatically generated, and is based on the work of [14]. The technique combines the Wikipedia category hierarchy with topics derived from Wikipedia articles, into which items are mapped. When a category is clicked, the main results are updated to list the category contents. Small right arrows beside each non-leaf category allow the viewing of sub-categories. The user can therefore search and browse the collection in three main ways: using a text query, selecting a category, or selecting item metadata or "more like this".

On the bottom right of the interface is the bookbag, into which items can be placed. Book-bagged items are kept listed on the display, and can be removed and redisplayed as required.

The underlying search system is based on Apache Solr (http://lucene.apache.org/solr/), which provides the text search, spelling checker, and the "more like this" suggestions (determined using Solr's standard more-like-this functionality). The data set used was the same as that used in interactive CHiC, a dump of the Europeana data set (http://www.europeana.eu/).

4. EXPERIMENTAL SETUP
The search and browse interface was embedded into an IR evaluation system, which automatically administered pre- and post-questionnaires and displayed the experimental system. All data reported here is from an in-lab study. This allowed a follow-up interview to be carried out, during which each participant reviewed his or her search session. To enable this reviewing, Morae screen recording software was used to record the user's activity, and during the interview an audio recording was made of the user's comments.

An important aspect of the interactive CHiC experimental design was that no explicit task was provided to users. Instead, the instructions asked the user to explore freely as they wished, until they were bored. Users were informed after they had been active for 10 minutes, and could then continue for a further 5 minutes if they wished, at which point they would be asked to stop (these timings were carried out by hand, and were approximate). Once this was finished, the user's search session was replayed to them, and an interview conducted to investigate the user's search process. Participants were paid 10 pounds for taking part.

In total 20 participants were recruited for the study, 11 male and 9 female. Eight participants were in the 18-25 year age band, nine in the 26-35 band, and the other three in the 36-45 band. The majority were students (13), with 5 employed, one unemployed, and one "other". Thirteen had completed a higher education degree, while six were currently studying for an undergraduate degree.

5. RESULTS
5.1 Initiation of exploration
RQ1 asks how users initiate their exploration of the collection. To investigate this, we first looked at how users started their session, and in particular their searching. For example, did they select a category or enter a query?

Over the whole data set, four different actions were used by participants to initiate their session (Table 1, column 2). For the majority of users, the first action was to select one of the categories (15 out of the 20 users). It should be noted that the interface, on startup, showed a set of default results to all users. For three users, the first action was to display one of these default results; another user clicked "next page" to view the next page of default results; and the final user's first action was to bookmark one of the default result items.

We also investigated the logs to find each user's first search or browse action, which could be one of: category select, text query, or metadata/"more like this" select. As shown in Table 1 (column 3), for all users this was a category select. In addition to counting the first actions, we also investigated how long each user spent before either clicking the interface or starting a new search/browse using the three previously listed methods. These results are shown in Table 2, along with the overall length of each session.

Table 1: Number of users whose first action / first search or browse action was the action in column one.

  Action                    #Users first action   #Users first search/browse action
  Category select           15                    20
  Display item              3                     -
  Next search result page   1                     -
  Add to bookbag            1                     -

Table 2: Time to first action, time to first search/browse action, and overall session time (all times in seconds)

                        Min    1st Qu.   Median   Mean    3rd Qu.   Max
  First action          7.00   19.00     25.00    30.50   38.75     90.00
  First search/browse   7.00   22.75     38.00    57.50   81.75     204.0
  Total time            129    631.8     783.5    787.8   918.0     1544

There was considerable variance in the length of time users spent on the task. The median time taken by users was 783.5 seconds (just over 13 minutes), with an interquartile range of 286.2 seconds (approximately 4 minutes 45 seconds). The minimum time was 129 seconds, and the maximum 1544 seconds (over 25 minutes). Most users spent some time at the start of their session before either clicking on an interface element (median time 25 seconds) or initiating a search (median 38 seconds).
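The timing summaries in Table 2 are standard five-number summaries plus the mean. As a minimal sketch (our illustration, not the authors' analysis code), and using a hypothetical list of times in seconds rather than the study's logs, they can be computed with Python's standard library:

```python
# Sketch only: Table 2-style summary statistics for a list of session times.
# The input data below is hypothetical, not the data from the study.
import statistics

def summarise(times):
    """Return min, quartiles, median, mean, and max for times in seconds."""
    q1, median, q3 = statistics.quantiles(times, n=4)  # the three quartile cut points
    return {
        "min": min(times),
        "1st_qu": q1,
        "median": median,
        "mean": statistics.mean(times),
        "3rd_qu": q3,
        "max": max(times),
    }

times_to_first_action = [7, 19, 25, 31, 39, 90]  # hypothetical values, in seconds
print(summarise(times_to_first_action))
```

Note that `statistics.quantiles` defaults to the "exclusive" method, so quartile values may differ slightly from those produced by other statistics packages.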
5.2 Search vs. browse
RQ2 asks whether participants use search or browse. Figure 2 presents query and category counts across all users (i.e. counts of how often either text queries were executed or categories selected). Item selects and the "more like this" functionality are not included here, due to the relative rarity of these events (across the whole data set this functionality was used only 15 times, by 7 different users).

Figure 2: Comparison of query and category select counts

A non-parametric Wilcoxon rank-sum test indicated that there was a significant difference between queries executed and categories selected (W = 50.5, p ≤ 0.001). As can be seen from the boxplots, categories were selected far more often than queries were entered, the median number of queries executed being 2, compared to a median of 11 for category selects. All but three users selected more categories than they executed queries, and 8 users did not enter a text query at all.

A similar situation exists when the time spent querying vs. browsing categories is estimated (Figure 3). Such times were estimated by starting a timer when a query or category was selected, and taking all activity between this point and the next query or category select as the user either "querying" or "browsing categories". As might be expected, the trend is similar to that of Figure 2, with users spending more time browsing categories than executing queries. All but five participants spent more time browsing using the categories than they spent querying.

Figure 3: Estimated time querying vs. browsing by category

5.3 "How did you start?"
In addition to the quantitative data above, two questions were asked of users in the post-session interview: "How did you start?" and "Why did you choose to start with a [category/search query]?" The latter question was altered depending on how the user initiated their exploration. While some users started by examining the results, all users chose the category browser over the search box to initiate searches.

The responses to the first question ("How did you start?") mentioned the category browser explicitly in 8 of the 12 answers. In most of these cases this was linked to exploring the interface. For example, participant P3 stated:

  "I was drawn to the middle then decided to look around at the interface. I decided to look at categories first, picked politics"

Similarly, participant P10 stated:

  "I just looked round to see what I could use to explore things. The category browser looked like the most likely candidates because it had descriptions of stuff."

As well as being influenced by the interface, responses from some users suggest that prior interests also played a part. For example:

  "I just look at the layout of the website and then found that I had a category browser so I went to what I study actually, and I study languages and I try to find something interesting." [P8]

  "There is no particular task and so I started from browse to see which information is more interesting to me." [P1]

The design of the interface, with a relatively small search box, appears also to have had an effect on the choices of at least two of the users, as indicated by responses to the second question. Participants P2 and P4 stated:

  "Because I only saw that [category]. I didn't see the search until a bit later on." [P2]

  "I didn't really see this one at first [the search box] it was a bit obscure." [P4]

For many users, however, the fact that the category browser allowed easy exploration appeared to be the key, with some users making connections to physical museums. For example:

  "If I was going to a museum I would look at the categories [museum sections] that are of most interest to me: arts, old stuff and so this is why I was looking for Mona Lisa." [P5]

The lack of an explicit task was mentioned by some, and search was explicitly commented on by two users; e.g., P7 stated "When I wanted to find something specific I went to the search box."

6. DISCUSSION
RQ1 asks how participants initiate their exploration of the collection. From Table 1 it can be seen that all 20 participants started their exploration using the category browser, rather than a text search. Indeed, the first action for the majority of users (75%) was to select a category. Qualitative data from Section 5.3 backs this up, with 8 out of the 12 participants for whom text transcripts are available explicitly mentioning the category browser as a way of starting their exploration. Looking at Table 2, it can be seen that there is typically a short delay until participants started their browsing (median 38 seconds, interquartile range of 59). This delay is consistent with participants' comments, which suggested that many first spent some time orienting themselves to the interface before starting (e.g. P10 from Section 5.3).

Moving to RQ2 and RQ3, which asked whether participants have a preference for browse or search and why, it is clear from Figure 2 and Figure 3 that there is a general preference for browsing: e.g. from Figure 3, the median estimated time spent browsing using the categories was 524 seconds (IQR 399), compared to 77 seconds (IQR 394) for text queries. Looking at the participant comments, the lack of any explicit task would appear to have played a part in this preference (e.g. the P1 and P5 quotes from Section 5.3). In addition, the design of the interface, with a relatively small text search box at the top, appeared to also play a part, with some users pointing out that they did not see the search box until later in their session (e.g. P2 and P4).

7. CONCLUSIONS AND FUTURE WORK
The preliminary results reported here suggest that providing browse functionality for cultural heritage collections is important for users arriving without a specific information need, as may be typical in casual search. For the majority of users, this preference for category browsing held for the session as a whole, with all but 5 users spending more time browsing than keyword searching. Initial analysis of the qualitative interview data backs up the quantitative interface results, with the majority of the currently analysed user transcripts explicitly mentioning the category browser. The results presented here are preliminary. Future work will expand on the analysis presented here, covering both the qualitative and the quantitative results. However, these initial results provide evidence of the importance of providing browse functionality for cultural heritage collections, and Europeana in particular.

Acknowledgements: This work was supported by the EU projects PROMISE (no. 258191) and PATHS (no. 270082).

8. REFERENCES
[1] Gäde, M., Ferro, N., and Lestari Paramita, M. 2011. CHiC 2011 – Cultural Heritage in CLEF: From Use Cases to Evaluation in Practice for Multilingual Information Access to Cultural Heritage. In Petras, V., Forner, P., and Clough, P., editors, CLEF 2011 Labs and Workshops, Italy.
[2] Hall, M. and Toms, E. 2013. Building a Common Framework for IIR Evaluation. In Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, 4th International Conference of the CLEF Initiative.
[3] Marty, P. F., Rayward, W. B. and Twidale, M. B. 2003. Museum informatics. Ann. Rev. Info. Sci. Tech., 37, 259–294.
[4] Booth, B. 1998. Understanding the Information Needs of Visitors to Museums. Museum Management and Curatorship, 17(2).
[5] White, L., Gilliland-Swetland, A., and Chandler, R. 2004. We're Building It, Will They Use It? The MOAC II Evaluation Project. In Museums and the Web (MW2004). http://www.museumsandtheweb.com/mw2004/papers/g-swetland/g-swetland.html
[6] Amin, A., van Ossenbruggen, J., Hardman, L. and van Nispen, A. 2008. Understanding cultural heritage experts' information seeking needs. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '08). ACM, New York, NY, USA, 39-47.
[7] Skov, M. and Ingwersen, P. 2008. Exploring information seeking behaviour in a digital museum context. In Proceedings of the Second International Symposium on Information Interaction in Context (IIiX '08). ACM, New York, NY, USA, 110-115.
[8] Black, G. 2005. The Engaging Museum. London: Routledge.
[9] Treinen, H. 1993. What does the visitor want from a museum? Mass media aspects of museology. In S. Bicknell and G. Farmelo (Eds.), Museum Visitor Studies in the 90s. London: Science Museum, 86-93.
[10] Wilson, M. L. and Elsweiler, D. 2010. Casual-leisure Searching: the Exploratory Search scenarios that break our current models. In 4th International Workshop on Human-Computer Interaction and Information Retrieval, Aug 22 2010, New Brunswick, NJ, 28-31.
[11] Darby, P. and Clough, P. 2013. Investigating the information-seeking behaviour of genealogists and family historians. Journal of Information Science, 39, 73-84.
[12] Butterworth, R. and Davis Perkins, V. 2006. Using the information seeking and retrieval framework to analyse non-professional information use. In Proceedings of the 1st International Conference on Information Interaction in Context (IIiX). ACM, New York, NY, USA, 162-168.
[13] Ingwersen, P. and Järvelin, K. 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht, The Netherlands: Springer.
[14] Fernando, S., Hall, M. M., Agirre, E., Soroa, A., Clough, P. and Stevenson, M. 2012. Comparing taxonomies for organising collections of documents. In Proceedings of COLING 2012: Technical Papers, 879-894.
Studying Extended Session Histories

Chaoyu Ye, Martin Porcheron, Max L. Wilson
Mixed Reality Lab, University of Nottingham, UK
psxcy1@nottingham.ac.uk, me@mporcheron.com, max.wilson@nottingham.ac.uk

ABSTRACT
While there is an increasing amount of interest in evaluating and supporting longer "search sessions", the majority of research has focused on analysing large volumes of logs and dividing sessions according to obvious gaps between entries. Although such approaches have produced interesting insights into some different types of longer sessions, this paper describes the early results of an investigation into sessions as experienced by the searcher. During interviews, participants reviewed their own search histories, presented their views of "sessions", and discussed their actual sessions. We present preliminary findings around a) how users understand sessions, b) how these sessions are characterised, and c) how sessions relate to each other temporally.

Categories and Subject Descriptors
H5.2 [Information interfaces and presentation]: User Interfaces – Graphical user interfaces.

Keywords
HCIR, Interactive, Information Retrieval, Sessions

1. INTRODUCTION
Information Retrieval (IR) specialists are becoming increasingly concerned with users who continue to search beyond a few queries or a few minutes (the recent NII Shonan event and the forthcoming Dagstuhl seminar are both, for example, focused on this topic). Although Information Retrieval, and even Interactive IR, evaluations are well known, research is recognising situations where people continue to search after finding seemingly useful results [13]. Some might be in a larger session involving several related subtopics, while others may continue to search for entertaining videos until they struggle to find "good" results [3, 1]. Consequently, researchers are interested in how to evaluate, measure, and ultimately better support searchers who continue to search for extended sessions.

Most research into extended search sessions, described in detail below, has focused on analysing search engine logs [1, 4, 8] by dividing the logs using obvious periods of inactivity and either qualitatively [1] or quantitatively [4, 8] characterising them. Some research has investigated human web behaviour and user goals qualitatively through interviews; our research, however, has focused on using such methods to better understand real extended search sessions. This paper begins by summarising the literature on sessions, and then describes our research methods and preliminary findings about extended search sessions.

2. UNDERSTANDING "SESSIONS"
Although investigations into web sessions date back around 20 years (e.g. [2]), the concept of a session still lacks a clear definition. A number of researchers have generated diverse definitions of a session using different delimiters, such as cutoff time, query context, or even the status of the browser windows (e.g. [7]). In 1995, Catledge and Pitkow used a "timeout", the time between two adjacent activities, to divide users' web activities into sessions, and found that a 25.5 minute timeout was best [2]. Their research, however, was focused on general web activity rather than search sessions, but their 25.5 minute timeout has been used by many others. He and Göker later aimed to find the optimal interval that would divide large sessions whilst not affecting smaller sessions [4]. Their analysis found that optimal timeout values vary between 10 and 15 minutes.

In 2006, Spink et al. [11] defined a session as the entire series of queries submitted by a user during one interaction with a search engine, where one session may consist of a single topic or multiple topics. Their approach focused on topic changes rather than temporal breaks, yet it is perhaps unclear how they determined "one interaction" with a search engine.

A clear definition has also been cited as an important challenge in other research. While focusing on "revisitation" behaviour, Jhaveri and Räihä [6] and Tauscher and Greenberg [12] found it challenging to differentiate between in-session revisitation and post-session revisitation, for which a clear detection of session boundaries would be useful.

When focusing on searching rather than web sessions, some use the concept of a "query session". Nettleton et al. defined a query session as at least one query made to a search engine, together with the results which were clicked on, and other user behaviours as well [8]. They also evaluated "session quality" based on the number of clicks, hold time, and ranking of selected documents, and used these measures to help determine the differences between sessions.

To summarise the different approaches used to define sessions, Jansen et al. provided a summary of the three most representative strategies [5], as shown in Table 1. As IP addresses and cookies were utilised to identify a user, the most frequent strategies involve temporal cutoffs and topic change.

Table 1: Session Dividing Strategies; Jansen et al. [5]

  Approach   Session Constraints
  1          IP, cookie
  2          IP, cookie, and temporal cutoff
  3          IP, cookie, and content change

The methods summarised in Table 1 are primarily focused on temporal and topical boundaries, but other research has shown clear challenges to these strategies. Mackay and Watters, in 2008, examined tasks that frequently occur as multi-session tasks, where something thematically consistent occurs over multiple sessions [7]. Moreover, research into the web, browsers, and browser tabs has found that some users often keep web pages spread out over time, especially in information gathering tasks, e.g. [10]. These situations indicate that the logged web behaviour may differ significantly from the actual behaviours and intentions of the searchers. This research focuses on the searcher's experience of web sessions, such that others may continue to develop strategies for more accurately dividing large scale logs into sessions.
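The timeout-based strategies above reduce to a simple segmentation rule: close a session whenever the gap between two adjacent log entries exceeds a cutoff, e.g. Catledge and Pitkow's 25.5 minutes [2]. A minimal sketch in Python — our illustration, not code from any of the cited studies, using hypothetical timestamps:

```python
# Sketch only: timeout-based session segmentation of one user's activity log.
def split_sessions(timestamps, timeout_secs=25.5 * 60):
    """timestamps: sorted times (seconds) of logged activities -> list of sessions."""
    sessions = []
    current = []
    for t in timestamps:
        if current and t - current[-1] > timeout_secs:
            sessions.append(current)  # gap exceeds the cutoff: close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical log: two activities 30 minutes apart fall into separate sessions.
log = [0, 60, 120, 120 + 30 * 60]
print(len(split_sessions(log)))  # 2
```

Changing `timeout_secs` to the 10-15 minute range of He and Göker [4] only alters where the boundaries fall; the topical and multi-session challenges discussed above are exactly what such a rule cannot capture.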
Moreover, research into web, browser, In addition, the reasons for leading to non-success and dif- and browser-tabs, has found that some users often keep web ficulty can be investigated via the card sorting of difficulty, pages spread out over time, especially in the information and the di↵erence of user’s web behaviour in di↵erent envi- gathering tasks, e.g. [10]. These situations indicate that ronments can also be examined by the sorting of location. the logged web behaviour may di↵er significantly from the The entire interview was audio recorded, and physical copies actual behaviours and intentions of the searchers. This re- of the card sorts were kept for analysis. search focuses on the searcher’s experience of web sessions, This paper describes our preliminary analysis of the first such that others may continue to develop strategies for more phase of the study, which involved 11 interviews. Phase two, accurately dividing large scale logs into sessions. which is still under way, involves a slightly refined methodol- ogy to capture more information about topics that emerged from the initial analysis described below. A more compre- 3. EXPERIMENT DESIGN hensive analysis of both phases will be published later. To understand and characterise real extended search ses- sions, we employed similar interview methods to Sellen et 4. PRELIMINARY FINDINGS al. [10]. Participants were engaged in a 90-120 minute inter- Based on our preliminary investigation, some potentially view about their own search behaviour. To ground the inter- interesting results relating to perceived duration, time of views in real data, participants focused on printouts of their day, and use of queries were found. We considered each of own web history, and we used the card sorting technique [9] these below according to two aspects: activity goal and ac- to probe their mental models of sessions. The procedure was tivity context. 
For activity goal, we used Sellen et al’s [10] approved by the school ethics board and pilot tested. 6 categories: ‘finding’, ‘information gathering’, ‘browsing’, Participants began by providing their web history and ‘transaction’, ‘communication’, and ‘housekeeping’. This they were advised to edit their history in advance should approach did not include any email, so this was added as a they wish to keep some logged activities private2 . These logs 7th category. For activity context, we applied Elseweiler et were gathered by importing their search histories to Firefox al’s [3] comparison between work and non-work (leisure) ac- (if not already there), and creating an XML export using tivities, involving: ‘work’, ‘serious-leisure’, ‘project-leisure’, “History Export 0.4”3 . This log was then structured and and ‘casual-leisure’. At this early stage in the project, the preliminarily processed using a) automatic methods to find primary author performed the classification individually based search URLs, and b) manual investigation to find possible on corresponding examples given in the referenced work. sessions to discuss in the interview. After providing demo- graphic information, participants spent around 20 minutes 4.1 Defining Sessions examining the structured printout of their history, using a There were 216 sessions in total and 19.6 sessions per pen to mark sessions. These sessions, unless duplicates of person have been studied thus far, as shown as Table 2. prior sessions, were written onto separate cards for later sort- Amongst these, 94 were longer than 5 minutes, 99 featured ing until around 20 cards were produced. Each card had search and only 9 sessions were unsuccessful. a number, a title, activity purpose, included history items from the history list and also whether it has been completed Table 2: All Session Information successfully or not; an example is shown in Figure 1. 
Parti- Session Long Ses- Unsuccess Search Ses- Query The remainder of the interview involved first open, and cipant No. Sion No. Session No. Sion No. No. then closed card sorting. Open card sorting allowed the 1 18 9 1 13 45 2 30 14 0 11 34 participants to classify and group the sessions according to 3 20 12 1 12 101 4 20 8 1 9 22 their own ideas, whilst closed card sorting allowed us to 5 16 10 0 6 17 make sure the following dimensions were considered: pur- 6 7 26 30 6 5 0 1 16 0 27 0 pose, for whom, with whom, location, duration, difficulty, 8 17 7 1 12 74 9 10 6 0 6 18 importance, frequency, and priority. This exercise was to 10 10 8 4 4 57 11 19 9 0 10 23 help explore the session feature in a more detailed way. For Total 216 94 9 99 418 example, studying frequency helps to find out the most fre- Avg. 19.6 8.5 0.8 9 38 quent sessions and elicit the pattern of user’s web activity. All participants mentioned that activities with the same 2 Although this means we have likely missed common search purpose and subject should be grouped into one session, as sessions, like the lengthy adult sessions observed by Bailey shown in Table 3. In addition, 8 of the 11 suggested that et al [1], it was considered an important ethical provision. 
similar tasks happened in di↵erent time periods should be 3 addons.mozilla.org/en-us/firefox/addon/history-export/ classified as a single session, rather than them being tem- Table 3: Session Delimiters Summary Table 5: Duration Categories Parti- Type of Differ time-> Group Detail Topic Emotion cipant Source Differ Session Sessions defined as Long Defined Long 1 + + - - by Participant 2 + - + - Session whose actual duration 3 + - - - Long is >= 5 mins 4 + - - - Session defined as Long and its 5 + - - - Actual Long actual duration is >= 5 mins 6 + - + - Session defined as Long but its 7 + - - + Over-estimated actual duration is less than 5 mins 8 + - + - Session defined as Short 9 + - - - Defined Short by participant 10 + - - - Session whose actual duration 11 + - - - Short is less than 5 mins Session defined as short and its Actual Short actual duration is less than 5 mins Session defined as Short but its Under-estimated actual duration is >= 5 mins porally connected. Some participants said that they always kept the browser windows open when doing long-term tasks. Table 6: Duration, by Acitivity Goal Finally, 1 participant advised that they care about the emo- tion involved within these web activities, even when they Defined Long Over-esti Defined Short Under-esti Finding 24 17 (70.8%) 36 3 (8.3%) were doing the same task, such as “buying a pair shoes”. In Info-gathering 35 15 (42.9%) 7 4 (57.1%) Browsing 28 17 (60.7%) 5 0 particular, this participant indicated that one topically con- Transaction 4 2 (50.0%) 5 2 (40.0%) sistent session should be divided between two disappoint- Communication Housekeeping 9 0 3 (33.3%) 0 5 1 0 1 (100.0%) ingly unproductive and excitingly productive phases. Email 7 6 (85.7%) 7 0 Firstly, considering activity goals given in Table 6, the number of ‘information-gathering’ sessions defined as long was 5 times as that of those ‘defined short’, as was the same with ‘browsing’. 
On the contrary, the number of ‘finding’ sessions defined as short was 1.5 times the number defined as long. Overall, nearly 70% of ‘finding’, 42% of ‘information- gathering’, 60.7% of ‘browsing’, 50% of ‘transaction’, and (a) Acitivty Goal (b) Activity Context 85.5% of ‘email’ sessions defined as long were overestimated by users. Moreover, under-estimation occurred with ‘find- Figure 2: Session Categories ing’, ‘information-gathering’, and ‘housekeeping’ although Finally, besides the pre-defined dimensions, participants over-estimation was more frequent with ‘finding’, ‘browsing’, also came up with some unique sorting dimensions as shown ‘communication’, and ‘email’ sessions. in Table 4, and these may benefit in exploring the session’s delimiters and features in new perspectives. Table 7: Duration, by Activity Context Defined Long Over-est. Defined Short Under-est. Table 4: Unique Dimensions Work 38 22 (57.9%) 31 2 (6.5%) Serious-Leisure 8 2 (25%) 1 0 Project-Leisure 22 15 (68.2%) 23 5 (21.7%) Unique Dimensions Casual-Leisure 39 21 (53.8%) 11 3 (27.2%) Google it or Go to Website directly Content contributor National Certain topic or not University related or not Based on old knowledge or brand new Table 7 above shows that the number of ‘casual-leisure’ Amusement Preference sessions defined as long was as 3 times as that those ‘defined Result Satisfaction Eyes Ears Needed Security short’ and that 57.9% of ‘work’, 68.2% of ‘project-leisure’, and 53.8% of ‘casual-leisure’ sessions defined as long were over-estimated by users with lower levels of under-estimation 4.2 Duration occurring. This encouraged a further study on the feature As duration is one of the targeted dimensions, all par- of each kind of web activity to determine the main cause for ticipants were asked for their own definition of what con- an incorrectly perceived length. stitutes a “long session”. 
45% of participants defined the session where the duration is more than 5 minutes, whereas 4.3 Time of Day 27% went with over 30 minutes, 18% more than 1 hour, and Figure 3 shows that most the ‘information-gathering’, ‘find- 1 participant chose over 2 hours. ing’ and ‘housekeeping’ sessions seem to occur between 10:00 Because participants first defined what they considered and 16:00 whilst more ‘browsing’, ‘email’, and ‘communica- to be a long session, and then later sorted their sessions tion’ activities were done between 22:00 and 0:00, which into length categories, we investigated the di↵erence be- was labelled “before bed time”. Additionally, there is a tween sessions that met their definition of long, and ones peak around 14:00, in which more ‘finding’ and ‘information- they remembered as being long during the card sorts. Par- gathering’ happened rather than other kinds of sessions. Fi- ticipants frequently grouped ‘defined short’ sessions as long nally, at 23:00, general ‘browsing’ is most prevalent. and vice-versa. Consequently, we investigated both ‘overes- Figure 4 shows that most of the ‘serious-leisure’ sessions timated’ and ‘under-estimated’ sessions in addition to ‘de- occurred between 18:00 and 22:00. Most of the ‘work’ ac- fined long’, ‘long’, ‘actual long’, ‘defined short‘, ‘short’, and tivities happened between 11:00 and 18:00, which seems to ‘actual short’ as given in Table 5. fit in within a typical working day. In the time ‘before bed’, minutes, but many had inaccurate recollections of the length of sessions. Long sessions were typically a mix of casual and serious leisure that often involved information gathering and browsing behaviour, while the majority of work related ses- sions were typically short. We also noticed that some of these activities may also be related to certain times of the day. 
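The time-of-day comparisons behind Figures 3 and 4 amount to counting sessions per start hour and category. A minimal sketch (our illustration, with hypothetical session records rather than the study's data):

```python
# Sketch only: count sessions per (start hour, category), as tabulated
# for the time-of-day figures. The session records below are hypothetical.
from collections import Counter
from datetime import datetime

def sessions_per_hour(sessions):
    """sessions: (start_datetime, category) pairs -> Counter keyed by (hour, category)."""
    return Counter((start.hour, category) for start, category in sessions)

sessions = [
    (datetime(2013, 5, 1, 14, 5), "finding"),
    (datetime(2013, 5, 1, 14, 40), "information-gathering"),
    (datetime(2013, 5, 1, 23, 10), "browsing"),
]
counts = sessions_per_hour(sessions)
print(counts[(14, "finding")])  # 1
```

Summing such counts per hour across categories gives the per-hour totals plotted in the figures.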
All of the findings will be further explored after phase two of the study, but early insights suggest that real ex- tended search sessions could be more accurately modelled based on additional factors such as: time of day, activity goal, activity context, and number of queries. Figure 3: Time of Day, by Activity Goal 6. REFERENCES the most frequent activity is ‘casual-leisure’. [1] P. Bailey, L. Chen, S. Grosenick, L. Jiang, Y. Li, P. Reinholdtsen, C. Salada, H. Wang, and S. Wong. User task understanding: a web search engine perspective. In NII Shonan Meeting on Whole-Session Evaluation of Interactive Information Retrieval Systems, Kanagawa, Japan, October 2012. [2] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the World-Wide web. Computer Figure 4: Time of Day, by Activity Context Networks and ISDN Systems, 27(6):1065–1073, 1995. Combined with the two comparisons above, there seems to [3] D. Elsweiler, M. L. Wilson, and B. K. Lunn. be some overlap between ‘information-gathering’, ‘finding’, Understanding casual-leisure information behaviour. ‘housekeeping’ and ‘work’. There was also some overlap be- In A. Spink and J. Heinstrom, editors, Library and tween ‘browsing’ and ‘casual-leisure’. Furthermore, these Information Science, pages 211–241. Emerald Group tend to suggest that there may be some patterns for user’s Publishing Limited, 2011. web activity in their daily life. [4] D. He and A. Göker. Detecting session boundaries 4.4 Search Queries from Web user logs. Methodology, pages 57–66, 2000. In Figure 5 below, sessions with more search queries tend [5] B. J. Jansen, A. Spink, C. Blakely, and S. Koshman. to be classified as ‘defined long’, ‘long’, and ‘actual long’ Defining a session on Web search engines. JASIST, than those with fewer queries. An interesting observation is 58(6):862–871, 2007. that what the user defined as a long session features a rela- [6] N. Jhaveri and K.-J. Räihä. 
The advantages of a tively low average number of search queries compared with cross-session web workspace. In CHI2005 Ext. ‘long’ and ‘actual long’ sessions. Equally, sessions defined as Abstracts, page 1949. ACM Press, 2005. ‘short’ by the user actually feature relatively more queries [7] B. Mackay and C. Watters. Exploring Multi-session compared to ‘short’ and ‘actual short’. This may indicate Web Tasks. Time, pages 1187–1196, 2008. that the user did not consider the number of queries per- [8] D. Nettleton, L. Calderon-benavides, and formed when defining the duration of sessions and failed to R. Baeza-yates. Baezayates, analysis of web search realise the e↵ect of this behaviour. engine query sessions. In Proc. WebKDD 2006, 2006. [9] G. Rugg and P. McGeorge. The sorting techniques: a tutorial paper on card sorts, picture sorts and item sorts. Expert Systems, 14(2):80–93, 1997. [10] A. J. Sellen, R. Murphy, and K. L. Shaw. How knowledge workers use the web. In Proc. CHI2002, pages 227–234. ACM Press. [11] A. Spink, M. Park, B. J. Jansen, and J. Pedersen. Multitasking during Web search sessions. IP&M, 42(1):264–275, 2006. [12] L. Tauscher and S. Greenberg. How people revisit web pages: empirical findings and implications for the Figure 5: Average Number of Search Queries design of history systems. IJHCS, 47(1):97–137, 1997. [13] E. G. Toms, R. Villa, and L. McCay-Peet. How is a search system used in work task completion? Journal 5. CONCLUSIONS of Information Science, 39(1):15–25, 2013. Although this paper only describes a preliminary analysis of over 200 sessions from 11 participants, we have begun to see some potentially interesting early findings. Initially, par- ticipants varied greatly in their opinions about their own ses- sions, with some matching topical divisions, some temporal divisions, and some a combination of the two. 
Comparative Study of Search Engine Result Visualisation: Ranked Lists Versus Graphs

Casper Petersen, Christina Lioma, Jakob Grue Simonsen
Dept. of Computer Science, University of Copenhagen
cazz@diku.dk, c.lioma@diku.dk, simonsen@diku.dk

Presented at EuroHCIR2013. Copyright © 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. SIGIR 2013, Dublin, Ireland.

ABSTRACT

Typically, search engine results (SERs) are presented in a ranked list of decreasing estimated relevance to user queries. While familiar to users, ranked lists do not show inherent connections between SERs, e.g. whether SERs are hyperlinked or authored by the same source. Such potentially useful connections between SERs can be displayed as graphs. We present a preliminary comparative study of ranked lists vs graph visualisations of SERs. Experiments with TREC web search data and a small user study of 10 participants show that ranked lists result in more precise and also faster search sessions than graph visualisations.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords

Search Engine Result Visualization, Ranked List, Graph

1. INTRODUCTION

Typically, search engine results (SERs) are presented in a ranked list of decreasing estimated relevance to user queries. Drawbacks of ranked lists include showing only a limited view of the information space, and not showing how similar the retrieved documents are and/or how the retrieved documents relate to each other [4, 6]. Such potentially useful information could be displayed to users in the form of SER graphs; these could present at a glance an overview of clusters or isolated documents among the SERs, features not typically integrated into ranked lists. For instance, directed/undirected and weighted/unweighted graphs could be used to display the direction, causality and strength of various relations among SERs. Various graph properties (see [7]), such as the average path length, clustering coefficient or degree, could also be displayed, reflecting potentially useful or interesting features about how the retrieved data is connected.

We present a user study comparing ranked list vs graph-based SER visualisation interfaces. We use a web crawl of ca. 50 million documents in English with associated hyperlink information, and 10 participants. We find that ranked lists result in overall more accurate and faster searches than graph displays, but that the latter result in slightly higher recall. We also find overall higher inter-rater agreement about SER relevance when using ranked lists instead of graphs.

2. MOTIVATION

While traditional IR systems successfully support known-item search [5], what should users do if they want to locate something from a domain where they have a general interest but no specific knowledge [8]? Such exploratory searching comprises a mixture of serendipity, learning, and investigation, and is not supported by contemporary IR systems [5], prompting users to "develop coping strategies which involve [...] the submission of multiple queries and the interactive exploration of the retrieved document space, selectively following links and passively obtaining cues about where their next steps lie" [9]. A step towards exploratory search, which motivates this work, is to make explicit the hyperlinked structure of the ordered list used by e.g. Google and Yahoo. To our knowledge, no investigation of such a representation exists, but it is comparable to Google's Knowledge Graph, whose aim is to guide users to other relevant information from an initial selection.

3. PREVIOUS WORK

Earlier work on graph-based SER displays includes Beale et al.'s (1997) visualisation of sequences of queries and their respective SERs, as well as the work of Shneiderman & Aris (2006) on modelling semantic search aspects as networks (both overviewed in [10]). Treharne et al. (2009) present a critique of ranked list displays side by side with a range of other types of visualisation, including not only graphs, but also cartesian, categorical, spring and set-based displays [6]. This comparison is analytical rather than empirical. Closest to ours is the work of Donaldson et al. (2008), who experimentally compare ranked lists to graph-based displays [2]. In their work, graphs model social web information, such as user tags and ratings, in order to facilitate contextualising social media for exploratory web search. They find that users seem to prefer a hybrid interface that combines ranked lists with graph displays. Finally, the hyperlinked graph representation discussed in this paper allows users to investigate the result space, thereby discovering related and potentially relevant information that might otherwise be bypassed. To our knowledge, such a representation and its comparison to a traditional ranked list do not exist, but the idea underpinning the graph representation is comparable with Google's Knowledge Graph, as the aim is to guide users to other relevant information from an initial selection.
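The hyperlink-graph idea motivated above can be made concrete with a short sketch. The following is our own illustration, not the authors' code; `topk_docids` and `outlinks` are hypothetical inputs (per-document outlink lists such as those available in a web crawl):

```python
def build_ser_graph(topk_docids, outlinks):
    """Directed graph over the top-k results: an edge (u, v) exists when
    document u hyperlinks to document v and both are in the top k.
    `outlinks` maps docid -> iterable of linked docids (hypothetical input).
    """
    nodes = set(topk_docids)
    # Keep only hyperlinks whose target is itself a top-k result.
    edges = [(u, v) for u in topk_docids
             for v in outlinks.get(u, ()) if v in nodes and v != u]
    # Out-degree per vertex; in the display this could drive vertex size.
    out_degree = {u: 0 for u in nodes}
    for u, _v in edges:
        out_degree[u] += 1
    return edges, out_degree
```

A force-directed layout library could then render the returned edge list, with edge direction pointing towards the outlinked document.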
4. INTERFACE DESIGN

This section presents the two different SER visualisations used in our study. Our goal is to study the effect of displaying exactly the same information to the user in two different ways, using ranked list and graph visualisations, respectively.

Figure 1: Ranked list (A) and graph (B) representation of the top-k documents from a query.

4.1 Ranked List (RL) Display

We use a standard ranked list SER display, where documents are presented in decreasing order of their estimated relevance to the user query. The list initially displays only the top-k retrieved document ids (docids) with their associated rank (see Figure 1 (A)). When clicked upon, each document expands to two mini windows, overlaid to the left and right of the list:

- The left window shows a document snippet containing the query terms. The snippet provides a brief summary of the document contents that relate to the query, in order to aid the user in assessing document relevance prior to viewing the whole document [4]. We describe exactly what the snippet shows and how it is extracted in Section 4.3.
- The right window shows a graph of the top-k ranked SERs (see Section 4.2). The position of the clicked document in the graph is clearly indicated, so users can quickly overview its connections, if any, to other top-k retrieved documents.

Previously visited documents in the list are colour-marked.

4.2 Graph (GR) Display

We display a SER graph G = (V, E) as a directed graph whose vertices v ∈ V correspond to the top-k retrieved documents, and whose edges e ∈ E correspond to links (hyperlinks, in our case of web documents) between two vertices. Each vertex is shown as a shaded circle that displays the rank of its associated document in the middle, see Figure 1 (B). The size of each vertex is scaled according to its out-degree, so that a larger vertex indicates more outlinks to the other top-k documents. Edge direction points towards the outlinked document. Previously visited documents are colour-marked.

When clicked upon, each vertex expands to two mini windows, overlaid to the left and right of the graph:

- The left window shows the same document snippet as in the RL display.
- The right window shows the ranked list of the top-k SERs. The position of the clicked document in the list is clearly marked.

We display the SER graph in a standard force-directed layout [1]. Our graph layout does not allow for other types of interaction with the graph apart from clicking on it. We reason that for the simple web search tasks we consider, layouts allowing further interaction may be confusing or time-consuming, and that they may be better suited to other search tasks, involving for instance decision making, navigation and exploration of large information spaces.

4.3 Document Snippets

Both the RL and GR interfaces include short query-based summaries of the top-k SERs (snippets). We construct them as follows: we extract from each document a window of ±25 terms surrounding the query terms on either side. Let a query consist of 3 terms q1, q2, q3. We extract snippets for all ordered but not necessarily contiguous sequences of query terms: (q1, q2, q3), (q1, q2), (q1, q3), (q2, q3), (q1), (q2), (q3). This way, we match all snippets containing query terms in the order they appear in the query (not as a bag of words), but we also allow other terms to occur in between query terms, for instance common modifiers.

Several snippets can be extracted per document, but only the snippet with the highest TF-IDF score is displayed to the user. The TF-IDF score of each window is calculated as a normalised sum of the TF-IDF weights of its terms:

  Ss(D) = (1/|w|) · Σ_{t ∈ w} tf(t, D) × log( |C| / |{D ∈ C : t ∈ D}| )

where |w| is the number of terms in the extracted window, t ∈ w is a term in the window, tf(t, D) is the term frequency of t in the document D from which the snippet is extracted, C is the collection of documents, and Ss(D) is the snippet score for document D. Finally, as research has shown that query term highlighting can be a useful feature for search interfaces [4], we highlight all occurrences of query terms in the snippet.
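The snippet scoring just described can be sketched in a few lines. This is an illustrative re-implementation of the normalised TF-IDF formula, not the authors' code; `doc_freq` (document frequencies) and the pre-extracted candidate `windows` are hypothetical inputs:

```python
import math
from collections import Counter

def snippet_score(window, doc_terms, doc_freq, num_docs):
    """Normalised TF-IDF score of one candidate snippet window,
    following Ss(D) in Section 4.3 (illustrative sketch)."""
    tf = Counter(doc_terms)  # term frequencies in the source document
    total = 0.0
    for t in window:
        df = doc_freq.get(t, 0)
        if df:  # terms unseen in the collection contribute nothing
            total += tf[t] * math.log(num_docs / df)
    return total / len(window)

def best_snippet(windows, doc_terms, doc_freq, num_docs):
    """Pick the highest-scoring window; only this snippet is displayed."""
    return max(windows,
               key=lambda w: snippet_score(w, doc_terms, doc_freq, num_docs))
```

As in the paper, a window dominated by common terms (low IDF) loses to one containing rarer query terms.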
5. EVALUATION

We recruited 2 participants for a pilot study to calibrate the user interfaces; the results from the pilot study were not subsequently used. For the main study, we recruited 10 new participants (9 males, 1 female; average age: 33.05; all with a background in Computer Science) using convenience sampling. Each participant was introduced to the two interfaces. Their task was to find and mark as many relevant documents as possible per query using either interface. For each new query, the SERs could be shown in either interface. Each experiment lasted 30 minutes.

Participants did not submit their own queries. The queries were taken from the TREC Web tracks of 2009-2012 (200 queries in total). This choice allowed us to provide very fast response times to participants (< two seconds, depending on disk speed), because search results and their associated graphs were pre-computed and cached. Alternatively, running new queries and plotting their SER graphs on the fly would result in notably slower response times that would risk dissatisfying participants. However, a drawback of using TREC queries is that participants did not necessarily have enough context to fully understand the underlying information needs and correctly assess document relevance. To counter this, we allowed participants to skip queries they were not comfortable with. To avoid bias, skipping a query was allowed after the query terms were displayed, but before the SERs were displayed.

We retrieved documents from the TREC ClueWeb09 cat. B dataset (ca. 50 million documents crawled from the web in 2009), using Indri, version 5.2. The experiments were carried out on a 14 inch monitor with a resolution of 1400 x 1050 pixels. We logged which SERs participants marked relevant, as well as the participants' click order and time spent per SER.

5.1 Findings

In total, the 10 participants processed 162 queries (89 queries with the RL interface and 73 with the GR interface), with mean µ = 16.2 and standard deviation σ = 7.8. Four queries (two from each interface) were bypassed (2.5% of all processed queries).

Table 1 shows retrieval effectiveness per interface, aggregated over all queries for the top k = 20 SERs.

Table 1: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) & Recall of the top 20 results.

            Ranked List   Graph
  MAP@20    0.4195        0.3211
  MRR       0.4698        0.3948
  RECALL@20 0.0067        0.0069

The ranked list is associated with higher, hence better, scores than the graph display for MAP and MRR. MAP is +30.6% better with ranked lists than with graph displays, meaning that overall a higher number of relevant SERs is found by the participants at higher ranks in the ranked list as opposed to the graph display. This finding is in agreement with the MRR scores, which indicate that the first SER to be assessed relevant is likely to occur around rank position 2.13 (1/2.13 = 0.469 ≈ 0.4698) with ranked lists, but around rank position 2.55 (1/2.55 = 0.392 ≈ 0.3948) with graph displays. Conversely, recall is slightly higher with graph displays. In general, higher recall in this case would indicate that participants are more likely to find a slightly larger number of relevant documents when seeing them as a graph of their hyperlinks. However, the difference in recall between ranked lists and graphs is very small and can hardly be seen as a reliable indication.

5.1.1 Click-order

On average, participants clicked on 9.46 entries per query in the ranked list (842 clicks for 89 queries), but only on 6.7 entries per query in the graph display (490 clicks for 73 queries). The lower number of clicks in the latter case could be due to the extra time it might have taken participants to understand or navigate the graph. This lower number of clicks also agrees with the lower MAP scores presented above (if fewer entries were clicked, fewer SERs were assessed, hence fewer relevant documents were found in the top ranks).

Figures 2a and 2b plot the order of clicks for the ranked list and graph interfaces respectively on the x-axis, against the frequency of clicks on the y-axis. We see that in the ranked list, the first click of the participant is more often on a relevant document, but in the graph display, the first click is more often on a non-relevant document (as already indicated by the MRR scores shown above). We also see that for the graph display, the majority of participant clicks before the 5th click correspond to non-relevant documents. Even though the MRR scores of the graph display indicate that the first relevant document occurs around rank position 2.5, we see that participants on average click four other documents before clicking the relevant document at rank position 2.5. This indicates that in the graph display, participants click documents not necessarily according to their rank position (indicated in the centre of each vertex), but rather according to their graph layout or connectivity.

Figure 2: Click-order and participant relevance assessments for the (a) ranked list interface and (b) graph interface.

5.1.2 Time spent

Table 2 shows statistics about the time participants spent on each interface.

Table 2: Time (seconds) spent on each interface.

  Interface     Min     Max      µ       σ
  Ranked List   1.391   25.476   8.228   4.371
  Graph         3.322   20.963   9.705   3.699

Overall, participants spent less time on the ranked list than on the graph display. This observation, combined with the retrieval effectiveness measures shown in Table 1, indicates that participants conducted overall slightly more precise and faster searches using ranked lists than using graph displays. The time use also suggests that participants are used to standard ranked list interfaces, a type of conditioning not easy to control experimentally.

5.1.3 Inter-participant agreement

To investigate how consistent participants were in their assessments, we report the inter-rater agreement using Krippendorff's α [3]. Table 3 reports the agreement between participants, and Table 4 reports the agreement between participants and the TREC preannotated relevance assessments, per interface. In both cases, only queries annotated more than once by different participants are included (19 queries for the ranked list and 11 for the graph SER).

The average inter-rater agreements between participants vary considerably. For the graph interface, α = 0.04471, which suggests a lack of agreement between raters. On a query basis, some queries (queries 169 and 44) show a comparatively much higher agreement, whereas others (e.g. queries 104 and 184) show a comparatively higher level of disagreement. For the ranked list, inter-rater agreement is higher (α = 0.19813). On a per-query basis, quite remarkably, query 92 had perfect agreement between raters, while queries 175 and 129 also exhibited a moderate to high level of agreement. However, most queries show only a low to moderate level of agreement or disagreement.

Overall, the lack of agreement may indicate the participants' confusion in assessing the relevance of SERs to pre-typed queries.
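For reference, nominal-data Krippendorff's α as reported here can be computed with the standard coincidence-matrix formulation. This is our own minimal sketch, not the authors' code; each unit is one query's list of relevance labels, and units with fewer than two ratings are skipped, matching the paper's filtering:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal-data Krippendorff's alpha (illustrative sketch).

    `units` is a list of rating lists, one per rated item; items with
    fewer than two ratings are ignored.
    """
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Each ordered within-unit pair contributes 1/(m-1).
        for a, b in permutations(range(m), 2):
            coincidence[(ratings[a], ratings[b])] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _k), w in coincidence.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in coincidence.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    if expected == 0:
        return 1.0  # no category variation at all
    # alpha = 1 - D_o / D_e, with both sides reduced to this ratio.
    return 1.0 - (n - 1) * observed / expected
```

Negative values indicate systematic disagreement beyond chance, which is how the sub-zero entries in the per-query tables should be read.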
This may be aggravated by problems in rendering the HTML snippets into text. Some HTML documents were ill-formed, hence their snippets sometimes included HTML tags or other not always coherent text.

Inter-rater agreements between our participants and the TREC preannotated relevance assessments show an almost complete lack of agreement. For both interfaces there is a weak level of disagreement on average (α = −0.0750 and α = −0.0721 for the graph and ranked list, respectively). On a per-query basis there are only two queries (169 & 110) exhibiting a moderate level of agreement. For most remaining queries our participants' assessments disagree with the TREC assessments.

Table 3: Inter-rater agreement (α) for queries assessed by >1 participant. Query is the TREC id of each query.

  Graph                      Ranked List
  Query  Raters  α           Query  Raters  α
  101    4       0.28696     110    3       0.41000
  104    2      -0.21875     119    2       0.00000
  132    2      -0.16071     120    2       0.49351
  169    2       0.48000     129    2       0.86022
  180    2      -0.10031     132    3      -0.08949
  184    2      -0.25806     133    2       0.30108
  3      2       0.00000     155    2      -0.02632
  38     2      -0.07519     175    2       0.49351
  44     2       0.49351     180    2      -0.37879
  58     2       0.00000     51     2       0.00000
  -      -       -           53     2       0.02151
  -      -       -           74     2      -0.14706
  -      -       -           80     2       0.14420
  -      -       -           81     3      -0.12919
  -      -       -           92     2       1.00000
  -      -       -           95     2       0.15584
  -      -       -           96     2       0.15584
  -      -       -           97     2       0.30179
  Average α: 0.04471         Average α: 0.19813

Table 4: Inter-rater agreement (α) between participants and TREC assessments for queries assessed by >1 participant.

  Graph                      Ranked List
  Query  Raters  α           Query  Raters  α
  101    4       0.09559     110    3       0.38654
  104    2      -0.17861     119    2      -0.22370
  132    2       0.06561     120    2       0.03146
  169    2       0.33625     129    2       0.05600
  180    2      -0.08949     132    3       0.01689
  184    2      -0.08949     133    2       0.04398
  3      2      -0.37209     155    2      -0.21067
  38     2      -0.05006     165    2      -0.25532
  44     2      -0.05861     175    2      -0.07886
  54     2      -0.25532     180    2      -0.17861
  58     2      -0.22917     51     2      -0.05006
  -      -       -           53     2      -0.24694
  -      -       -           74     2      -0.06033
  -      -       -           80     2      -0.24694
  -      -       -           81     3      -0.13634
  -      -       -           92     2      -0.21181
  -      -       -           95     2       0.04582
  -      -       -           96     2      -0.12919
  -      -       -           97     2       0.07813
  Average α: -0.0750         Average α: -0.0721

6. CONCLUSIONS

In a small user study, we compared ranked list versus graph-based search engine result (SER) visualisation. Our motivation was to conduct a preliminary experimental comparison of the two for the domain of web search, where document hyperlinks were used to display SERs as graphs. We found that overall more accurate and faster searches were done using ranked lists, and that inter-user agreement was overall higher with ranked lists than with graph displays.

Limitations of this study include: (1) using fixed TREC queries, instead of allowing users to submit their own queries on the fly; (2) technical HTML-to-text rendering problems, resulting in sometimes incoherent document snippets; (3) using only 10 users, exclusively from Computer Science, which makes for an overall small and rather biased user sample; (4) not using the wider context of the search session in the analysis (e.g. user task, behaviour, satisfaction). Future work includes addressing the above limitations and also testing whether and to what extent these results apply when scaling up to wall-sized displays with significantly larger screen real estate.

7. REFERENCES

[1] G. D. Battista, P. Eades, R. Tamassia, and I. G. Tollis. Graph drawing: algorithms for the visualization of graphs. Prentice Hall PTR, 1998.
[2] J. J. Donaldson, M. Conover, B. Markines, H. Roinestad, and F. Menczer. Visualizing social links in exploratory search. In HT '08, pages 213-218, New York, NY, USA, 2008. ACM.
[3] A. F. Hayes and K. Krippendorff. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77-89, 2007.
[4] M. Hearst. Search user interfaces. Cambridge University Press, 2009.
[5] G. Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41-46, 2006.
[6] K. Treharne and D. M. W. Powers. Search engine result visualisation: Challenges and opportunities. In Information Visualisation, pages 633-638, 2009.
[7] S. Wasserman and K. Faust. Social network analysis: methods and applications. Structural analysis in the social sciences. Cambridge University Press, 1994.
[8] R. W. White, B. Kules, S. M. Drucker, and M. Schraefel. Supporting exploratory search. Communications of the ACM, 49(4):36-39, 2006.
[9] R. W. White, G. Muresan, and G. Marchionini. Workshop on evaluating exploratory search systems. SIGIR Forum, 40(2):52-60, 2006.
[10] M. L. Wilson, B. Kules, B. Shneiderman, et al. From keyword search to exploration: Designing future search interfaces for the web. Foundations and Trends in Web Science, 2(1):1-97, 2010.

Evolving Search User Interfaces

Tatiana Gossen, Marcus Nitsche, Andreas Nürnberger
Data & Knowledge Engineering Group, Faculty of Computer Science
Otto von Guericke University Magdeburg, Germany
http://www.dke.ovgu.de/

ABSTRACT

When designing search user interfaces (SUIs), there is a need to target specific user groups. The cognitive abilities, fine motor skills, emotional maturity and knowledge of a sixty-year-old man, a fourteen-year-old teenager and a seven-year-old child differ strongly. These abilities influence the decisions made in the user interface (UI) design process of SUIs. Therefore, SUIs are usually designed and optimized for a certain user group. However, especially for young and elderly users, the design requirements change rapidly due to fast changes in users' abilities, so that a flexible modification of the SUI is needed.
In this positional paper we introduce the sonalisation of a SUI in a limited way: Users can choose a colour concept of an evolving search user interface (ESUI). It adapts the scheme or change the settings of the browser to influence some pa- UI dynamically based on the derived capabilities of the user inter- rameters like font size. Some search engines also detect the type acting with it. We elaborate on user characteristics that change over of device the user is currently using – e.g. a desktop computer or a time and discuss how each of them can influence the SUI design us- mobile phone – and present an adequate UI. ing an example of a girl growing from six to fourteen. We discuss Current research concentrates on designing SUIs for specific user the ways to detect current user characteristics. We also support our groups, e.g. for children [4, 6, 10] or elderly people [1, 2]. These idea of an ESUI with a user study and present its first results. SUIs are optimized and adapted to general user group character- istics. However, especially young and elderly users undergo fast changes in cognitive, fine motor and other abilities. Thus, design Keywords requirements change rapidly as well and a flexible modification of Search User Interface, Human Computer Interaction, Adaptivity, the SUI is needed. Therefore, we suggest to provide users with Context Support, Information Retrieval. an evolving search user interface (ESUI) that adapts to individual user’s characteristics and allows for changes not only in properties Categories and Subject Descriptors (e.g., colour) of UI elements but also influences the UI elements themselves and their positioning. Some UI elements are continu- H.5.2 [Information Interfaces and Presentation]: User Interfaces. ously adaptable (e.g. font size, button size, space required for UI elements), whereas others are only discretely adaptable (e.g. type General Terms of results visualization). 
Not only SUI properties, but also the com- Design, Human Factors. plexity of search results is continuously adaptable and can be used as a personalisation mechanism for users of all age groups. 1. INTRODUCTION Search user interfaces [8] are an integral part of our lives. Most 2. ESUI VISION common known SUIs come in the form of web search engines with In this section we share our vision of an ESUI. In general, we an audience of hundreds of millions of people1 all over the world. suggest to use a mapping function and adapt the SUI using it, in- 1 stead of building a SUI for a specific user group. Using a generic Google, for example, has over 170 million unique visi- model of an adaptive system, as discussed in [14], we depict the tors per month, only in the U.S. http://www.nielsen. com/us/en/newswire/2013/january-2013--top-u-s\ model of an ESUI as following (see Fig. 1). We have a set of user characteristics (or skills) on one side. In the ideal case, the sys- tem detects the skills automatically, e.g. based on user’s interaction with the information retrieval system (user’s queries, selected re- sults, etc.). On the other side, there is a set of options to adapt the SUI, e.g. using different UI elements for querying or visualisation of results. In between, an adaptation component contains a set of logic rules to map the user’ skills to the specific UI elements of the Presented at EuroHCIR2013. Copyright c 2013 for the individual papers ESUI. by the papers’ authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. --entertainment-sites-and-we-brands.html support and a resulting feeling of success [5]. Therefore, they re- quire support to increase their confidence. In general, reading and writing skills of adults are better than those of children. Knowl- edge is gathered during life. Thus, elderly people posses a larger knowledge base than adults, and adults have usually more knowl- edge than children. 
Figure 1: Model of an ESUI.

2.1 Mapping Function
The function between the user skill space and the options to adapt the UI elements of the SUI has to be found. We suggest using knowledge about human development, e.g. from the medical, cognitive and psychosocial science fields, to specify the user skill space. The results of user studies about users' search behaviour and SUI design preferences can provide recommendations for UI elements. As far as the research provides information about the studied age group, we can use the age group as a connector between the skill space and the UI elements. Note that we use age groups in the sense of a more abstract category defining a set of specific capabilities acquired while growing up. A lot of research has already been done and can be used, e.g. [2, 4, 7]. In addition, if the set of adaptable UI elements is defined, we can evaluate the mapping function by letting users from different age groups put the UI elements of a SUI together (similar to end user programming).

2.2 Evolving Skills
In order to allow a SUI to evolve together with a user, we first have to determine those characteristics that vary from user to user and change during his life (or due to some circumstances like diseases). For example, a discussion of the skills of young users is given in [7]. We suggest considering cognitive skills, information processing rates, fine motor skills, different kinds of perception, knowledge base, emotional state, and reading and writing skills.
In the following, a brief summary of current research results in human development science is given. Human cognitive development occurs in a sequential order in which later knowledge, abilities and skills build upon the previously acquired ones [12]. Cognitive abilities of users in those stages differ; for example, before the last (formal operational) stage they are unable to think logically and to understand abstract concepts. Again, not only age but also some diseases or accelerated cognitive development cause cognitive abilities, i.e. skills to gain, use and retain knowledge, to differ from user to user. Information processing capabilities change during life. Children's information processing is slower than that of adults [11]. Therefore, children have a limited cognitive recall. It is widely agreed that elderly people have a decline in intellectual skills which affects the aggregation of new information [15]. Fine motor skills are influenced by information processing rates [9]. Therefore, young children's performance in pointing movements, e.g. using a mouse, is lower than that of adults. Perception of color can also change while aging: color discrimination is more difficult for elderly people. Elderly people also have problems with hearing [3]. Children are immature in the emotional domain and, especially at the age of six to twelve, require additional emotional support. We believe that the discussed characteristics can affect the design of SUIs. However, further research should be done in this direction.

2.3 Detection of User Abilities
An ESUI can provide a specific SUI for a specific user given the knowledge of his specific abilities. A simple case is an adaptable SUI, where a user manually adjusts the search user interface to his personal needs and tasks. An adaptable SUI may also provide several standard settings for a specific user selection to explore the options (e.g. young user, adult user, elderly user). More interesting and challenging is the case of an adaptive SUI, where a system automatically detects the abilities of a user and provides him with an appropriate SUI. Concepts for the automatic detection of a user's abilities have been studied in the past. We can use the age of a registered and logged-in user. However, the age provides only an approximation of a user's capabilities. For an individual user, an appropriate mapping to the age group has to be found, e.g. using psychological tests presented in the form of games. Those games can be used to derive the quality of a user's fine motor skills as well. Furthermore, we can use the user history from log files, specifically issued queries (their topic and specific spelling errors) and accessed documents. However, research is required to determine how to adapt a SUI in a way users would accept the changes.

3. DESIGN IDEAS
When designing an ESUI, we first have to define the components of a SUI that should be adapted. We consider three main components. The first component is the input, i.e. UI elements which allow a user to transform his information need into a machine-understandable format. This component is traditionally represented by an input field and a search button. Other variants are a menu with different categories, or voice input. The second component is the output of an information retrieval (IR) system. The output consists of UI elements that provide an overview of the retrieved search results. There can be different kinds of output, e.g. a vertical list of snippets (Fig. 2a), tiles (Fig. 2c) or coverflow (Fig. 2b). The third is a management component. Management covers UI elements that support users in information processing and retaining. Examples of management UI elements are bookmark management components or other history mechanisms like breadcrumbs. Historically, management UI elements are not part of an SUI, but recent research [6] shows that users are highly motivated to use elements of management. Besides these main components, there also exist general properties of UI elements that might affect all three categories, e.g. font size or color. We propose to adapt these three main components of a SUI and its general UI properties to the user's skills.

3.1 Use Cases
In order to demonstrate the proposed ESUI, we consider a young girl called Jenny who is growing older. We show how the input and output of a SUI can be adapted to changes in Jenny's abilities.
Use Case 1: Jenny is six years old. She has started to learn reading, but she has difficulties with writing. Jenny's active vocabulary is limited to 5,000 words. She cannot yet think in abstract categories and is not able to process much information. Due to her limited writing abilities, Jenny is not able to use an input field and write a query. She is learning to read; therefore, she can use a menu with different categories which are supported by images. In order to search for any information, Jenny can draw her query (Fig. 3a). Jenny's fine motor skills are not fully developed yet: she has difficulties using interactions like scrolling. She also cannot process much information at once. Therefore, the coverflow result visualisation (Fig. 2b) fits her abilities best. Coverflow allows her to concentrate on one item at a time; thus, her cognitive load is reduced. Jenny can interact with it using simple point-and-click interactions. An integrated text-to-speech reader supports Jenny by reading the results to her.
Use Case 2: Jenny is nine years old. Jenny can read and write short stories with just a few spelling errors. Jenny has some difficulties with typing using a keyboard: she "hunts and pecks" on the keyboard for the correct keys. This increases the number of spelling errors and also slows down the process. Jenny is frustrated because the system does not understand her well. Thus, a standard keyword input field does not fit Jenny's abilities well. Jenny still cannot think in abstract categories and process a lot of information, but her language skills have improved and her vocabulary size has increased. Therefore, she can use voice input to search for information. A menu with different categories in addition to voice input can inspire Jenny to search for some new information. However, these categories should match her cognitive abilities (Fig. 3b). Jenny can already manage different interaction techniques and is able to process more information than the six-year-old Jenny. Therefore, a list of snippets (Fig. 2a) is an adequate output visualization. It does not require as much cognitive recall as tiles, but allows more result items to be processed at a time than coverflow does.
Use Case 3: Jenny is 14 years old. Jenny's writing skills are further developed, with use of correct grammar, punctuation and spelling. She is learning to think logically about abstract concepts. Her vocabulary size is about 20,000 words. She chats a lot with her friends, which results in fast typing skills using a keyboard. Therefore, Jenny is able to use keyword-oriented search input supported by spelling correction and suggestion mechanisms. A SUI can still support Jenny in finding the "right" keywords, for example using a query cloud² (Fig. 3c). Jenny can already manage different interaction techniques and is able to process more information than the nine-year-old Jenny. Therefore, coverflow and a vertical list visualisation would probably restrain her performance, whereas tiles (Fig. 2c) allow Jenny a better overview of the results.

² Similar to the quinturakids.com search engine, accessed on 02.05.2013.

Figure 2: Different kinds of output of an information retrieval system: a) a vertical list of snippets offers a fast overview of several results at once; b) a coverflow view of results offers an attractive animation while browsing, uses a familiar book metaphor, and clearly separates the central element from the rest; c) tiles of search results offer a fast overview of several results at once and a user makes only small jumps while reading within results; however, the ordering of results is not as clear as in a list.

Figure 3: Different kinds of input of an information retrieval system: a) an ESUI enables six-year-old Jenny to draw her query; b) an ESUI supports nine-year-old Jenny with voice input and several pre-defined categories; c) an ESUI enables fourteen-year-old Jenny to use keyword-based input supported by an adaptive query cloud.

4. USER STUDY
In order to demonstrate the idea of an ESUI, we conducted a user study to compare users' preferences in the visualization of different UI elements of a SUI. Specifically, our hypothesis was that users from different age groups would prefer different UI elements and different general UI properties. We built a SUI that can be personalized, i.e. users can choose the input and output and tune general UI properties. In this paper we present our first results, i.e. users' preferences in results visualization. Our SUI allows users to choose between a vertical list of snippets, tiles (Fig. 4b) and coverflow (Fig. 4a). In our experiment we demonstrated these three output types. The subjects interacted with the search system to get a better feeling for it and were encouraged to solve a simple search task using their preferred SUI setup. 44 subjects participated in the study: 27 children and 17 adults. The children were between eight and ten years old (8.9 on average), 19 girls and 8 boys from the third (18 subjects) and fourth (9 subjects) grade. The adults were between 22 and 53 years old (29.2 on average), five women and 12 men. Nine of them were students in computer science and four worked in the IT sector. The results for the output are presented in Fig. 5. The majority of the children preferred the coverflow results visualization, whereas the adults had a weak tendency towards tiles. These results can be explained by the fact that, on average, children cannot process as much information as adults. Thus, it is easier for children to use coverflow. Coverflow offers an animation while browsing that is attractive for children. Many adults told us that they prefer tiles because many results can be compared at once, so tiles offer a good overview of the results.

Figure 4: Different kinds of result visualization: a) ESUI with coverflow result visualization; b) ESUI with tiles result visualization.

Figure 5: Study results: which type of visualization children and adults prefer.

5. CONCLUSION
In this position paper we introduced the concept of an evolving search user interface that adapts itself to the abilities of a particular user. Instead of building a SUI for a specific user group, we use a mapping function between user skills and the UI elements of a search system in order to adapt it dynamically, allowing the user to perform his search process in a more efficient way. We considered different abilities of a user, e.g. his cognitive skills, knowledge, and reading and writing skills, that change during life. Furthermore, we proposed to adapt the three main components of a SUI, i.e. input, output and management, and its general UI properties to the user's skills. A key component of an ESUI is the mapping function between the user skill space and the UI elements of a SUI, which has to be found. We elaborated on ways to learn this function. In order for an ESUI to be adaptive, ways to detect user abilities are required. We pointed in several directions how the detection can be done.

6. ACKNOWLEDGEMENTS
The work presented here was partly supported by the German Ministry of Education and Science (BMBF) within the ViERforES II project, contract no. 01IM10002B.

7. REFERENCES
[1] A. Aula. User study on older adults' use of the web and search engines. Universal Access in the Information Society, 4(1):67–81, 2005.
[2] A. Aula and M. Käki. Less is more in web search interfaces for older adults. First Monday, 10(7-4), 2005.
[3] J. E. Birren and K. W. Schaie. Handbook of the Psychology of Aging, volume 2. Gulf Professional Publishing, 2001.
[4] C. Eickhoff, L. Azzopardi, D. Hiemstra, F. de Jong, A. de Vries, D. Dowie, S. Duarte, R. Glassey, K. Gyllstrom, F. Kruisinga, et al. EmSe: Initial evaluation of a child-friendly medical search system. In IIiX Symposium, 2012.
[5] E. Erikson. Childhood and Society. W. W. Norton & Company, 1963.
[6] T. Gossen, M. Nitsche, and A. Nürnberger. Knowledge Journey: A web search interface for young users. In Proc. of the Sixth Symposium on HCIR, 2012.
[7] T. Gossen and A. Nürnberger. Specifics of information retrieval for young users: A survey. Information Processing & Management, 49(4):739–756, 2013.
[8] M. Hearst. Search User Interfaces. Cambridge University Press, 2009.
[9] J. Hourcade, B. Bederson, A. Druin, and F. Guimbretière. Differences in pointing task performance between preschool children and adults using mice. ACM Transactions on Computer-Human Interaction, 11(4):357–386, 2004.
[10] M. Jansen, W. Bos, P. van der Vet, T. Huibers, and D. Hiemstra. TeddIR: tangible information retrieval for children. In Proc. of the 9th Int. Conf. on Interaction Design and Children, pages 282–285. ACM, 2010.
[11] R. Kail. Developmental change in speed of processing during childhood and adolescence. Psychological Bulletin, 109(3):490, 1991.
[12] J. Ormrod and K. Davis. Human Learning. Merrill, 1999.
[13] B. Steichen, H. Ashman, and V. Wade. A comparative survey of personalised information retrieval and adaptive hypermedia techniques. Information Processing & Management, 2012.
[14] S. Stober and A. Nürnberger. Adaptive music retrieval – a state of the art. Multimedia Tools and Applications, pages 1–28, 2012.
[15] I. Stuart-Hamilton. Intellectual Changes in Late Life. John Wiley & Sons, 1996.

A Pluggable Work-bench for Creating Interactive IR Interfaces
Mark M.
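To make the mapping-function idea from Sections 2.1 and 3 concrete, the toy lookup below hard-codes one possible mapping from abstract age groups to the three SUI components (input, output, management) plus a general UI property. The paper deliberately leaves the actual function open, so every name, grouping, and component label here is our own illustrative assumption, not the authors' implementation:

```python
# Illustrative sketch only: the paper does not define a concrete data model,
# so the groupings and component labels below are assumptions for exposition.

def select_sui_components(age_group):
    """Map an abstract age group (Section 2.1) to the three SUI
    components of Section 3: input, output, management, plus a
    general UI property (font size)."""
    mapping = {
        # Young users: limited reading/writing, low information-processing rate.
        "young": {"input": "image_menu+drawing", "output": "coverflow",
                  "management": None, "font_size": "large"},
        # Adolescent users: keyboard skills and abstract thinking develop.
        "adolescent": {"input": "keyword+query_cloud", "output": "tiles",
                       "management": "bookmarks", "font_size": "medium"},
        # Adult users: full skill set assumed.
        "adult": {"input": "keyword+suggestions", "output": "tiles",
                  "management": "bookmarks+history", "font_size": "medium"},
    }
    return mapping[age_group]

print(select_sui_components("young")["output"])  # coverflow
```

A learned mapping function would replace this static table, e.g. by inferring the skill profile from the detection signals discussed in Section 2.3.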
Hall, Spyros Katsaris, Elaine Toms
Sheffield University, S1 4DP, Sheffield, UK
m.mhall@sheffield.ac.uk, evolve.sheffieldis@gmail.com, e.toms@sheffield.ac.uk

ABSTRACT
Information Retrieval (IR) has benefited from standard evaluation practices and re-usable software components that enable comparability between systems and experiments. However, Interactive IR (IIR) has had only very limited benefit from these developments, in part because experiments are still built using bespoke components and interfaces. In this paper we propose a flexible workbench for constructing IIR interfaces that will standardise aspects of the IIR experiment process to improve the comparability and reproducibility of IIR experiments.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces

Keywords
evaluation, framework, standardisation

1. MOTIVATION
Information Retrieval (IR) has benefited from standard evaluation practices and re-usable software components. The Cranfield-style evaluation methodology enabled evaluation programmes such as TREC, INEX, or CLEF. At the same time, the provision of re-usable software components such as Lucene¹, Terrier², Heritrix³, or Nutch⁴ has enabled IR researchers to focus on the development of those components directly related to their research. However, Interactive IR (IIR) has had only very limited benefit from these developments.
Typically, IIR research is still conducted using a single system in a laboratory setting in which a researcher observed and interacted with a participant [5], usually using a bespoke IIR interface. Developing and running such experiments is a time-consuming, resource-exhaustive and labour-intensive process [6]. As a result of this bespoke approach, the comparability of IIR experiments and their results suffers. Where studies of the same activities show divergent results, it is difficult to determine whether the differences are due to the specific aspect of IIR under investigation, or simply due to different participant samples or small differences in how the non-investigated user-interface (UI) components were implemented. The bespoke nature also makes it harder to replicate studies, as publications frequently do not contain sufficient detail to exactly replicate the experiment.
In [3] we have proposed a flexible, standardised IIR evaluation framework that aims to address the issues created by variations in the experimental processes and by how context information is acquired from the participants. However, the framework makes no provisions towards providing standardised IIR components that would improve the comparability of the experiment itself, the ease of setting up the experiment, and the ease of reproducibility.
A number of attempts at developing a configurable, re-usable IIR evaluation system have been made in the past. In 2004, Toms, Freund and Li designed and implemented the WiIRE (Web-based Interactive Information Retrieval) system [6], which devised an experimental workflow process that took the participant through a variety of questionnaires and the search interface. Used in the TREC 11 Interactive Track, it was built using Microsoft Office desktop technologies, severely limiting its capabilities. The system was re-created for the web and successfully used in INEX 2007 [7], but lacked flexibility in setup and data extraction. More recently, SCAMP (Search ConfigurAtor for experiMenting with PuppyIR) [4] was developed to assess IR systems, but does not include the range of IIR research designs that are typically used. A heavy-weight solution is PIIRExS⁵ [1], which supports the researcher through the whole process from setting up the experiment to analysis, providing greater support but also a steeper learning curve. These approaches highlight the difficulty of balancing the two main constraints that limit a system's wide-spread use:
• sufficient flexibility to support the wide range of IIR interfaces and experiments;
• sufficiently simple to implement that it does not increase the resource commitment required to set up the experiment.

¹ https://lucene.apache.org/
² http://terrier.org/
³ https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
⁴ http://nutch.apache.org/
⁵ http://sourceforge.net/projects/piirexs

Presented at EuroHCIR2013. Copyright 2013 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.

2. DESIGN
To achieve the goal of developing a system that fulfils these requirements, we propose a system design that is based around a very lean core into which the researcher can plug the IIR components they wish to include in their experiment. We have implemented this design in our web-based evaluation framework (fig. 1), which complements the larger IIR experiment support system presented in [3]. To achieve maximum flexibility, the system was designed using a message-passing architecture that consists of the following four components:
• Web Frontend: handles the interface between the participant's browser and the evaluation workbench and is implemented using a combination of client-side and server-side functionality.
• Message Bus: handles the inter-component communication and forms the core of the system. It is responsible for passing messages from the Web Frontend to the IIR components configured to be listening for those messages and also for passing messages directly between the components.
• Session: handles loading and saving the components' current state for a specific participant, hiding the complexities of web-application state from the individual components.
• Logging: provides a standardised logging interface that allows the components to easily attach logging information to the UI event generated by the participant.

Figure 1: The evaluation workbench consists of the four core modules, into which the IIR components used in the experiment are plugged.

When the researcher sets up the workbench for their experiment, they can freely configure which components to use, how to lay them out, and which components to connect to which other components. Based on this configuration the Web Frontend generates the initial user interface that is shown to the participants. Then, when the participant interacts with a UI element (fig. 2), the resulting UI event is handled by the Web Frontend, which generates a message based on the UI event. This message is passed to the Message Bus, which uses the configuration provided by the researcher to determine which components to deliver the message to. The components that are listening for that message update their own Session state based on the message and then mark themselves as changed. After message processing has been completed for all components, the Web Frontend updates the UI for each of the changed components.

Figure 2: The workbench's main workflow starts with the generation of the initial UI and then waits for the participant to generate a UI event. The event is processed, the affected component's state and UI are updated, and the workbench goes back to waiting for the next UI event. A powerful aspect of the workflow is that components can generate their own messages when they receive a message.

An example of the configuration used to set up an experiment is shown in figure 3 (from the experiment in figure 4), specifying the configuration of the "search results" component:

[SearchResults]
handler = application.components.SearchResults
name = search_results
layout = grid-9 vgrid-expand
connect = search_box:query

Figure 3: Configuration for a Standard Results List component, showing how the component's layout (9 grid-cells wide and vertically expanding) and connections to other components (to the "search box" component via the query message) are specified.

It specifies that the component should be displayed 9 grid-cells wide (the application layout uses a 12-by-12 cell grid) and should expand vertically to use as much space as is available. The component is configured to be connected to the "search box" component via the "query" message. It is this ability to freely plug components together that, we believe, makes the framework sufficiently flexible to support the wide range of IIR experiments, while remaining simple to set up and use.

3. STANDARD COMPONENTS
The core system provides only the framework into which the IIR components can be plugged. This allows the researcher to build any custom IIR UI they wish to test, while at the same time being able to take advantage of the standardised session and log handling functionality. As IIR UIs frequently include required elements that are not the focus of the study the researcher wishes to undertake, an optional set of default components for core IR UI elements is provided to reduce set-up time. This has the additional advantage that, as their behaviour is consistent across experiments, the comparability of experiments using the framework is improved.

3.1 Search Box
The Search Box component ([8], p. 49; "Formulate Query Interface", [2], p. 76) provides a standard search box. When the participant enters text and clicks on the "Search" button, it generates a query message, which is usually connected to a Standard Results List.

3.2 Standard Results List
The Standard Results List component ([8], p. 50; "Examine Results Interface", [2], p. 77) provides a default 10-item listing of search results. The Standard Results List includes support for displaying snippets ([8], p. 51) and what Wilson calls "Usable Information" ([8], p. 51) for each result document. Unlike the other standard components, which can be used out of the box, the Standard Results List has to be extended by the researcher in order to be able to access the search engine used to power the UI.

3.3 Pagination
The Pagination component ([8], p. 70) displays a configurable number of pages around the current search-results page. In response to user interaction it sends a start message with the rank of the first document to paginate to.

3.4 Category Browsing
The Category Browsing component ([8], p. 54) provides a hierarchical category structure that the participant can use to explore a collection. Clicking on a category sends a query message with the category's identifier.

3.5 Saved Documents
The Saved Documents component provides an area where the participant can save things that they have found interesting, to support them in their current task. Documents are added through a save_document message. The Saved Documents component supports an optional tagging feature enabling the participant to tag the document with values specified by the researcher. This can be used to let the participant specify why they have chosen that document or how much it helps them in their current task.

3.6 Task
The Task component provides a static display of the task information to show to the user. Two versions of this component are provided: one that displays a static text set in the configuration, and one that can fetch a task description from the database, based on a parameter passed to it.

4. APPLICATION
The evaluation work-bench has so far been used to build two IIR experiments, very different in their nature, clearly demonstrating the work-bench's flexibility.
The first experiment (fig. 4) re-uses the standard Task, Search Box, Pagination, and Saved Documents components, and extends the Standard Results List to work with the specific search backend. This set-up re-creates what is essentially a relatively standard search UI configuration, which is being used to investigate query session behaviour.
The second experiment (fig. 5) demonstrates a much richer interface, with more modifications to the components and an experiment-specific component. It re-uses the Task and Category Browsing components; extends the default Search Box, Pagination, Standard Results List, and Saved Documents components; and adds a new Item View component. The message-passing nature of the system made it possible to quickly integrate the new component, so that when the participant clicks on a meta-data facet in the Item View, a query message is sent to the Standard Results List to find items with the same bit of meta-data. The interface was used to investigate un-directed exploration behaviour in a large digital cultural heritage collection.

5. WHERE TO GO NEXT?
The stated aim of this paper was to present a novel, pluggable, extensible, and configurable IIR interface work-bench that supports our wider aim of improving IIR experiment comparability. The work-bench is sufficiently flexible to support the wide range of web-based IIR experiments that are undertaken, while being sufficiently simple and light-weight to encourage wide-spread use.
To enable this wide-spread use, the system has been released under an open-source license⁶. We are also moving to engage with the wider research community to determine to what degree the work-bench satisfies their needs for an evaluation system and what needs to be done to achieve the wide-spread use needed to improve IIR experiment comparability.

⁶ https://bitbucket.org/mhall/pyire

6. ACKNOWLEDGEMENTS
The research leading to these results was supported by the Network of Excellence co-funded by the 7th Framework Program of the European Commission, grant agreement no. 258191.

7. REFERENCES
[1] R. Bierig, M. Cole, J. Gwizdka, N. J. Belkin, J. Liu, C. Liu, J. Zhang, and X. Zhang. An experiment and analysis system framework for the evaluation of contextual relationships. In CIRSE 2010, page 5, 2010.
[2] C. Chua. A user interface guide for web search systems. In Proceedings of the 24th Australian Computer-Human Interaction Conference, OzCHI '12, pages 76–84, New York, NY, USA, 2012. ACM.
[3] M. M. Hall and E. G. Toms. Building a common framework for IIR evaluation. In Information Access Evaluation meets Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative, CLEF 2013, 2013.
[4] G. Renaud and L. Azzopardi. SCAMP: a tool for conducting interactive information retrieval experiments. In Proceedings of the 4th Information Interaction in Context Symposium, pages 286–289. ACM, 2012.
[5] J. Tague-Sutcliffe. The pragmatics of information retrieval experimentation, revisited. Information Processing & Management, 28(4):467–490, 1992.
[6] E. G. Toms, L. Freund, and C. Li. WiIRE: the web interactive information retrieval experimentation system prototype. Information Processing & Management, 40(4):655–675, 2004.
[7] E. G. Toms, H. O'Brien, T. Mackenzie, C. Jordan, L. Freund, S. Toze, E. Dawe, and A. Macnutt. Task effects on interactive search: The query factor. In Focused Access to XML Documents, pages 359–372. Springer, 2008.
[8] M. L. Wilson. Search User Interface Design, volume 20. Morgan & Claypool Publishers, 2011.
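A configuration in the style of the [SearchResults] example in figure 3 could be parsed into message-routing rules roughly as follows. The actual pyire implementation is not shown in the paper, so this is only an assumption-laden sketch of the idea, not the real code:

```python
# Sketch: turn an INI-style component configuration (as in figure 3) into a
# routing table for the Message Bus. The routing-table shape and the meaning
# assigned to "connect = source:message" are our assumptions.
import configparser

CONFIG = """
[SearchResults]
handler = application.components.SearchResults
name = search_results
layout = grid-9 vgrid-expand
connect = search_box:query
"""

def build_routing(config_text):
    """Return {(source_component, message_name): [listener_names]}."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    routing = {}
    for section in parser.sections():
        name = parser[section]["name"]
        # Each "connect" entry means: this component listens for
        # <message> events originating from <source_component>.
        for entry in parser[section].get("connect", "").split():
            source, message = entry.split(":")
            routing.setdefault((source, message), []).append(name)
    return routing

routes = build_routing(CONFIG)
print(routes)  # {('search_box', 'query'): ['search_results']}
```

On each UI event, a bus built around such a table would look up the (source, message) pair and deliver the message to every listed listener, which then updates its session state and marks itself as changed.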
Figure 4: Screenshot showing an experiment with a very basic configuration consisting of the Task, Search Box, Pagination, Standard Results List, and Saved Documents components. This is being used to investigate query behaviour for tasks that require query reformulations.

Figure 5: Screenshot showing an experiment that makes heavy use of the customisation options offered by the workbench. This configuration was used to investigate un-directed exploration in a digital cultural heritage collection.

A Proposal for User-Focused Evaluation and Prediction of Information Seeking Process
Chirag Shah
School of Communication & Information (SC&I), Rutgers University
4 Huntington St, New Brunswick, NJ 08901, USA
chirags@rutgers.edu

ABSTRACT
One of the ways IR systems help searchers is by predicting or assuming what could be useful for their information needs based on analyzing information objects (documents, queries) and finding other related objects that may be relevant. Such approaches often ignore the underlying search process of information seeking, thus forgoing opportunities for making process-based recommendations. To overcome this limitation, we are proposing a new approach that analyzes a searcher's current processes to forecast his likelihood of achieving a certain level of success in the future. Specifically, we propose a machine-learning based method to dynamically evaluate and predict search performance several time-steps ahead at each given time point of the search process during an exploratory search task. Our prediction method uses a collection of features extracted solely from the search process, such as dwell time, query entropy and relevance judgment, in order to evaluate whether it will lead to low or high performance in the future. Experiments that simulate the effects of switching search paths show a significant number of subpar search processes improving after the recommended switch. In effect, the work reported here provides a new framework for evaluating search processes and predicting search performance. Importantly, this approach is based on user processes and independent of any IR system, allowing for wider applicability that ranges from searching to recommendations.

Categories and Subject Descriptors
H.3 [INFORMATION STORAGE AND RETRIEVAL] H.3.3 Information Search and Retrieval: Search process; H.3.4 Systems and Software: Performance evaluation (efficiency and effectiveness)

General Terms
Measurement, Performance, Experimentation

Keywords
Exploratory search, Evaluation, Performance prediction

1 INTRODUCTION
IR evaluations are often concerned with explaining factors relating to user or system performance after the search and retrieval are conducted [20]. Most recommender systems, however, operate with the objective of suggesting objects that could be useful to a user based on his/her or others' past actions [2][19]. We commenced our investigation by broadly asking how we could take valuable lessons from both IR evaluations and recommender systems to not only evaluate an ongoing search process, but also predict how well it will unfold and suggest a better path to the searcher if it is likely to underperform. The motivation behind this investigation was based on the following assumptions and realizations grounded in the literature.
1. The underlying rational processes involved in information search are reflected in the actions users take while searching. These actions include entering search queries, skimming the results, as well as selecting and collecting useful information [8][14][15].
2. A searcher's performance is a function of these actions performed during a search episode [7][22].
With these assumptions, we propose to quantify a search process using various user actions, and to use this quantification for user performance (henceforth, 'search performance' or 'performance') prediction as well as search process recommendations.

2 BACKGROUND
Past research on predictive models that relates to the approach we describe in this paper can be grouped into two main categories: (1) behavioral studies and (2) IR approaches. In both cases, however, the focus has been on end products instead of on the process required to produce them.
As far as the behavioral studies go, research has been conducted to explore user models that help anticipate specific aspects of the search process. One goal in this context has been the determination of whether a search process will be completed in a single session or in multiple sessions. For example, Agichtein et al. [3] investigated different patterns that can be identified in tasks that require multiple sessions. As a result, the authors devised an algorithm capable of predicting whether users will continue or abandon the task. Similar work is described in Diriye et al. [6], which focuses on predicting and understanding why and when users abandon Web searches. To address this problem, the authors studied features such as queries and interactions with result pages. Based on this approach, the authors were able to determine reasons for search abandonment such as accidental causes (e.g. the Web browser crashing), satisfaction levels, and query suggestions, among others.
There have also been attempts to understand users' past behaviors in order to predict future ones in similar conditions. For example, Adar et al. [1] visually explored behavioral aspects using large-scale datasets containing queries and other information objects produced by users. The authors were able to identify different behavioral patterns that seem to appear consistently in different datasets. While not directly related to performance prediction, this work focused on attributes of the search process instead of on final products derived from it.
Research like that described above often relies on historic data from large populations and on the use of trend and seasonal components, which are used to model the long-term direction and periodicity patterns of time series [17]. For example, some have explored seasonal aspects of Web search (e.g. weekly, monthly, or annual behaviors) that provide useful information to predict and suggest queries [5].
From an IR perspective, Radinski et al. [18] explored models to predict users' behaviors in a population in order to improve results from IR systems. The authors also developed a learning algorithm capable of selecting an appropriate predictive model depending on the situation and time. As described by the authors, applications of this approach could range from click predictions to query-URL predictions. In contrast to this approach, the method presented in this paper considers both population trends and an individual user's behavior.
In a similar track, several works have been conducted on query performance prediction, focusing on developing techniques that help IR systems anticipate whether or not a query will be effective in providing results that satisfy users' needs [4][10][11]. For example, Gao et al. [10] found that features derived from search results and interaction features offer better prediction results than a prediction baseline defined in terms of query features. Results from this study have direct implications for individual users by aiding the auto-evaluation process of IR systems.
In information search, users may be unaware of their individual performance when solving an information search task. For instance, Shah & Marchionini [23] showed how a lack of awareness about different objects involved in searching (queries, visited pages, bookmarks) could result in a mistaken perception of search performance during an exploratory search task. Even if an IR system is highly effective, users may run into multiple query formulations and the evaluation of several pages before finding what they need. This process, which can be related to search strategies, implies effort and time that is usually underestimated by the users themselves. In this sense, instead of predicting end products (i.e., overall performance), we predict several steps ahead with the aim of aiding users' search process awareness and performance trends.
Unlike previous works in IR, we are not proposing to use time series analyses or seasonal components of historic data. Instead, we investigate predictive models based on machine learning (ML) techniques, namely SVM, logistic regression, and Naïve Bayes, which are trained over a set of features such as time, number of queries, and page dwell time. In contrast to most IR evaluations, our method focuses on user processes. Also, unlike most recommender systems, our approach could output alternative strategies instead of similar/relevant products to help the searcher. In essence, the work reported here takes several lessons from traditional IR evaluations, recommender system designs, and weather/stock forecasting to come up with a new approach for evaluating and predicting search performance.
In the next section we provide a detailed description of our method, feature selection, and the measures we used in order to create ML-based predictive models.

3 METHOD
In order to analyze the search processes followed by different users/teams, we assume that the underlying dynamics of the search processes are expressed by a collection of activities that take place from the beginning to the end of the search processes. The first part of our method is a feature extraction step in which we extract a wide array of features relating to webpages, queries and snippets saved from the search processes for each unit of time t. This step is performed in order to evaluate how well we could use those features to capture the underlying dynamics, which would lead to recognizing whether a search process is going to lead to high or low performance at the future time steps t+n (n=1,2,…,N), where N is the furthest time step.
The decision to include or exclude a feature was based on the literature (e.g., [7]) as well as our past experience [22] with representing and evaluating search objects and processes. Each feature is extracted for each user or team, u, up to time t from the search processes, and they are explained in detail as follows.
• Total coverage (u,t): The total number of distinct webpages visited by a user (u) up to time t. This feature captures the webpage-based activity performed by a user and provides a measure of how much distinct information has been found by the user up to this time.
• Useful coverage (u,t): The total number of distinct webpages in which a user spent at least 30 seconds, up to time t.
This measure evaluates out of the total pages he/she the approach we introduce in this paper is oriented toward has visited how many of them were useful in finding predictions at different times in order to increase the level of relevant information leading to satisfaction with their awareness of users about their own search process. Similar to context in completing the exploratory task [9][22][25]. weather forecast, this information could help users to be aware • Number of queries (u,t): Total number of unique queries of possible trends based on past and current behavior. executed by a user up to time t. This feature implicitly For a more recent discussion on IR evaluations and their relates to how much effort and cognitive thinking a user has shortcomings, see [12]. To the best of our knowledge, search put in to this task. process performance prediction at different times from a user • Number of saved snippets (u,t): Total number of snippets perspective has not been explored. Similar approaches can be saved by user u up to time t. This measures the amount of found in weather and stock market studies. For example, using information that the user thought that might be relevant in machine learning approaches such as Support Vector Machine the future to complete the task and needed to be (SVM), some models have been implemented to predict the remembered. In other words, this feature is an indication of trends of two different daily stock price indices using NASDAQ explicit relevance judgments made by the user. and Korean Stock prices [13][16]. In a similar fashion, our • Length of Query (u,q,t): Length of each query(q) executed approach is oriented to forecast users’ search performance N- by a user u based on the character count of the query up to time t. This feature captures how the user imposed the queries and how long they were at different times of the above mentioned criteria and threshold and used as the output search process. 
class labels to be used in the n-step ahead prediction model. If a • Number of tokens in each query (u,q,t): This is the count of class label at n-step ahead was correctly predicted based on the tokens/words in each query(q) executed by user u up to features extracted up to time t from the classification model it time t. This query based measure takes into account how was considered as correctly classified and if not as misclassified. specific a user was in defining the query. By inspecting the 4 EXPERIMENTS datasets, we realized that queries with a less number of In order to evaluate whether users who are predicted to perform tokens tend to get general results. On the other hand, at low performance in the future based on the current search composed queries with multiple terms are related to more process, could benefit from this analysis to improve their search specific searchers. We also observed that typically the users process, we conducted some simple simulation analysis. started with general queries with few words at the beginning of the search process but then went into more We considered the individual user search processes as a detailed queries to find more specific information later. For collection of search paths, where each search path is defined as all these reasons, we found it to be useful to capture the the search process from the time a user issued a query up to the number of token used in a query. time user issued another quite different query. This was found • Query entropy (u,q,t): This measures the information out using generalized Levenshtein (edit) distance, which is a content in a given query (q), by finding the expected value commonly used distance metric for measuring the distance of information contained in a query. We used the widely between two character sequences. 
If the Levenshtein (edit) recognized notion of Shannon entropy [24] in Information distance between two subsequent queries were greater than 2 Theory to calculate the information content of a query. We (assuming less than 2 was when there were changes in the calculated the number of unique characters appearing in queries due to simple spelling mistakes or refining of the query), each of the queries, which represent the observed counts of we considered the search process from the former query to the the random variable. This was used as the input to Shannon next query as a single search path. entropy calculation and we used to the maximum- Following this method, we found the first search path of each likelihood method to calculate the entropy. Query entropy user and based on the features extracted up to the end of the first feature has been used in the past to predict goodness of a search path, and based on the classification model learnt from query for making query expansion decision [21]. that corresponding n-step ahead prediction we predicted whether The method used to assess the search performance of a user is the user is going to have low/high performance at the end of the described below. We define a measure called Efficiency (u,t), for session. If the user was going to have low performance, then out each user u up to time t in order to predict whether a given of the users who predicted to have high performance, we looked search process is going to yield in high/low performance in the at which high performing user has the lowest Levenshtein (edit) future We first define Effectiveness of user u up to time t as the distance between the queries issued by low performing user ratio of useful coverage and total coverage (both defined within the first search path and considered it as a pair of users, earlier). A similar measure was used in [7] and [22]. whom we are going to use in the simulation. 
Then, for each low performing user and high performing user that was matched, we Useful coverage(u,t) switched the search process of low performing user at the end of Effectiveness(u,t) = Total coverage(u,t) the first search path with the high performing user’s search path (1) up to t=T minutes, where T is the total number of minutes for a We then calculated Efficiency as defined in Equation 2. session. Then we evaluated by switching the search process Effectiveness(u, t) early during the overall process whether it would benefit each Efficiency(u,t) = low performing user to improve their performance. We found NumberofQueries(u,t) (2) that we were able to move most of the underperforming search In other words, Efficiency is defined as the Effectiveness processes to higher performance by early detection and obtained per query, or how effective a query is in terms of switching, while keeping the higher performing processes achieving a certain level of useful coverage. unharmed. The performance for each user u at each time t was classified in These simulations provide verification that by realizing early to the two classes; high performance and low performance based during the search process whether a user is going to perform on the following criteria: well or not, one could recommend better search processes/strategies for that user which would lead to uplifting Class = { low high ;if ;else Efficiency(u, t) ≥ Efficiency(u, t) (3) the search performance of a previously destined to low performing user. 5 CONCLUSION Using various user studies data available to us, we constructed When it comes to prediction, information retrieval and filtering feature matrices which consist of all aforementioned features for systems are primarily focused on objects while assessing what each minute of time t for all the users in each dataset, and and if something could help the users. 
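The measures above can be sketched in code. The following Python fragment is an illustrative reconstruction, not the authors' implementation: the function and variable names are our own, and the class-label threshold is assumed to be the group-average Efficiency per Equation 3. It segments a query stream into search paths using Levenshtein distance with the paper's threshold of 2, computes the Query entropy feature, and evaluates Effectiveness, Efficiency, and high/low labels per Equations 1-3.

```python
from collections import Counter
from math import log2

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (ca != cb)) # substitution
        prev = cur
    return prev[-1]

def segment_search_paths(queries, threshold=2):
    """Start a new search path when the edit distance between consecutive
    queries exceeds the threshold (minor edits such as spelling fixes or
    refinements stay within the same path)."""
    paths, current = [], []
    for q in queries:
        if current and levenshtein(current[-1], q) > threshold:
            paths.append(current)
            current = []
        current.append(q)
    if current:
        paths.append(current)
    return paths

def query_entropy(q: str) -> float:
    """Maximum-likelihood Shannon entropy over a query's character counts,
    as in the Query entropy feature."""
    if not q:
        return 0.0
    n = len(q)
    return -sum((c / n) * log2(c / n) for c in Counter(q).values())

def efficiency(useful_coverage, total_coverage, num_queries):
    """Equations 1 and 2: Effectiveness per query."""
    if not total_coverage or not num_queries:
        return 0.0
    effectiveness = useful_coverage / total_coverage  # Eq. 1
    return effectiveness / num_queries                # Eq. 2

def class_labels(efficiencies):
    """Equation 3 (assuming a group-mean threshold): label each user
    high/low against the average Efficiency at this time step."""
    mean = sum(efficiencies) / len(efficiencies)
    return ["high" if e >= mean else "low" for e in efficiencies]
```

In a full pipeline, these per-minute values would be stacked into the feature vector fed to the SVM, logistic regression, or Naïve Bayes classifiers described above.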
5 CONCLUSION
When it comes to prediction, information retrieval and filtering systems are primarily focused on objects when assessing whether and how something could help users. These approaches are often system-dependent, even though the process of information seeking is usually user-specific. Personalization and recommendation are frequently exercised as methods to address user-specific IR and filtering, but they are still limited to comparing and recommending objects, without focusing on the underlying IR processes carried out by the searchers. We presented a new approach to address these shortcomings. We began by asking whether we could model a user's search process based on the actions he/she performs during an exploratory search task and forecast how well that process will do in the future. This was based on the realization that an information seeker's search goal/task can be mapped out as a series of actions, and that the sequence of actions or choices the searcher makes, and especially the search path he/she takes, affects how well he/she will do. Thus, in contrast to approaches that measure the goodness of search products (e.g., documents, queries) as a way to evaluate overall search effectiveness, we measured the likelihood of an existing search process producing good results.

Here we presented simulations to demonstrate what could happen if one can make process-based predictions, but one could also develop an actual recommender system using the proposed method. Another potential application of such a prediction-based method would be to use it in IR systems to make users aware of what their future performance will be, based on their current and past search process. The system could identify, at an early stage of the process, that a user will have low performance if he continues in this manner, and what could be done to provide suggestions to improve overall performance.

Given that the proposed technique is independent of any specific kind of system, and solely focused on user-based processes, it will presumably be easy to apply to a variety of IR systems and situations, irrespective of retrieval, ranking, or recommendation algorithms. Finally, while we have used datasets borrowed from previous user studies, one could easily apply the proposed method to Web logs, TREC data, and other forms of datasets with various user actions recorded over time.

6 ACKNOWLEDGEMENTS
The work reported here is supported by the Institute of Museum and Library Services (IMLS) Cyber Synergy project as well as IMLS grant # RE-04-12-0105-12. The author is also grateful to his PhD students Chathra Hendahewa and Roberto Gonzalez-Ibanez for their valuable contributions to this work.

7 REFERENCES
[1] Adar, E., Weld, D. S., Bershad, B. N., & Gribble, S. D. (2007). Why we search: visualizing and predicting user behavior. In Proceedings of the World Wide Web (WWW) Conference 2007.
[2] Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734–749.
[3] Agichtein, E., White, R. W., Dumais, S. T., & Bennett, P. N. (2012). Search interrupted: Understanding and predicting search task continuation. In Proceedings of SIGIR 2012.
[4] Cronen-Townsend, S., Zhou, Y., & Croft, B. (2002). Predicting query performance. In Proceedings of SIGIR 2002.
[5] Dignum, S., Kruschwitz, U., Fasli, M., Yunhyong, K., Dawei, S., Beresi, U. C., & De Roeck, A. (2010). Incorporating seasonality into search suggestions derived from intranet query logs. In Proceedings of IEEE/WIC/ACM WI-IAT 2010, vol. 1, pp. 425–430.
[6] Diriye, A., White, R. W., Buscher, G., & Dumais, S. T. (2012). Leaving so soon? Understanding and predicting web search abandonment rationales. In Proceedings of CIKM 2012.
[7] González-Ibáñez, R., Shah, C., & White, R. W. (2012). Pseudo-collaboration as a method to perform selective algorithmic mediation in collaborative IR systems. In Proceedings of the 75th Annual Meeting of the Association for Information Science and Technology (ASIS&T). Baltimore, MD, USA.
[8] Gwizdka, J. (2008). Cognitive load on web search tasks. Workshop on Cognition and the Web: Information Processing, Comprehension, and Learning. Granada, Spain. Available from http://eprints.rclis.org/14162/1/GwizdkaJ_WCW2008_short_paper_finalp.pdf
[9] Fox, S., Karnawat, K., Mydland, M., Dumais, S., & White, T. (2005). Evaluating implicit measures to improve web search. ACM TOIS, 23(2), 147–168.
[10] Gao, Q., White, R., Dumais, S. T., Wang, S., & Anderson, B. (2010). Predicting query performance using query, result and interaction features. In Proceedings of RIAO 2010.
[11] He, B., & Ounis, I. (2006). Query performance prediction. Information Systems, 31(7), 585–594.
[12] Järvelin, K. (2012). IR research: systems, interaction, evaluation and theories. ACM SIGIR Forum, 45(2), 17.
[13] Kyoung-jae, K. (2003). Financial time series forecasting using support vector machines. Neurocomputing, 55(1–2), 307–319.
[14] Liu, C., Gwizdka, J., Liu, J., Xu, T., & Belkin, N. J. (2010). Analysis and evaluation of query reformulations in different task types. Proceedings of the American Society for Information Science and Technology, 47(17). Available from http://dl.acm.org/citation.cfm?id=1920331.1920356
[15] Liu, J., Gwizdka, J., Liu, C., & Belkin, N. J. (2010). Predicting task difficulty for different task types. Proceedings of the American Society for Information Science and Technology, 47(16). Available from http://dl.acm.org/citation.cfm?id=1920331.1920355
[16] Ming-Chi, L. (2009). Using support vector machine with a hybrid feature selection method to the stock trend prediction. Expert Systems with Applications, 36(8), 10896–10904.
[17] Ord, J., Hyndman, R., Koehler, A., & Snyder, R. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer.
[18] Radinski, K., Svore, K., Dumais, S. T., Teevan, J., Horvitz, E., & Bocharov, A. (2012). Modeling and predicting behavioral dynamics on the Web. In Proceedings of WWW 2012.
[19] Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58.
[20] Saracevic, T. (1995). Evaluation of evaluation in information retrieval. In Proceedings of SIGIR 1995 (pp. 138–146).
[21] Shah, C., & Croft, W. B. (2004). Evaluating high accuracy retrieval techniques. In Proceedings of SIGIR 2004 (pp. 2–9). Sheffield, UK.
[22] Shah, C., & Gonzalez-Ibanez, R. (2011). Evaluating the synergic effect of collaboration in information seeking. In Proceedings of SIGIR 2011 (pp. 913–922). Beijing, China.
[23] Shah, C., & Marchionini, G. (2010). Awareness in collaborative information seeking. Journal of the American Society for Information Science and Technology (JASIST), 61(10), 1970–1986.
[24] Shannon, C. E., & Weaver, W. (1963). The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.
[25] White, R. W., & Huang, J. (2010). Assessing the scenic route: Measuring the value of search trails in web logs. In Proceedings of SIGIR 2010. Geneva, Switzerland.

Directly Evaluating the Cognitive Impact of Search User Interfaces: a Two-Pronged Approach with fNIRS

Horia A. Maior¹,², Matthew Pike¹, Max L. Wilson¹, Sarah Sharples³
¹Mixed Reality Lab, ²Horizon DTC, ³Human Factors – School of Engineering
University of Nottingham, UK
{psxhama,psxmp8,max.wilson,sarah.sharples}@nottingham.ac.uk

ABSTRACT
Recent research has pointed towards further understanding of the cognitive processes involved in interactive information retrieval, with most papers using secondary measures of cognition to do so. Our own research is focused on using direct measures of cognitive workload, using brain sensing techniques with fNIRS. Amongst the various brain sensing technologies, fNIRS is most conducive to ecologically valid user studies, as it is less affected by body movement and can be worn while using a computer at a desk. This paper describes our two-pronged approach, focusing on a) moving fNIRS research beyond simple psychological tests towards actual interactive IR tasks, and b) evaluating real search user interfaces.
Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: Evaluation/methodology, theory and methods

Keywords
Functional near-infrared spectroscopy (fNIRS), Brain-computer interface (BCI), Human cognition, Information processing system, Multiple resource model, Limited resource model

1. INTRODUCTION
The cognitive aspects of Information Retrieval (IR) have repeatedly received focus over time, from Ingwersen's Cognitive Model [11] to recent analyses of cognitive workload during search tasks [2, 10]. The recurring interest is in what users think about at different task stages, and how much mental workload is involved. The benefit of knowing more about the searcher's cognitive state would come from providing better support for their needs, with Wilson et al. suggesting that better designed Search User Interfaces (SUIs) could reduce unnecessary workload on the user [23].

Although some prior work (e.g. [2]) has used indirect techniques to analyse workload during search tasks, the decreasing cost of brain sensing hardware has meant that more recent research is using more objective techniques. Pike et al. [17] and Gwizdka et al. [10] used EEG technology, while Moshfeghi et al. used fMRI to measure workload when making relevance judgements [15]. Each of these technologies has known limitations for studying actual interactive IR behaviour, with EEG being highly affected by even tiny body movement, and fMRI requiring users to lie in a tunnel void of any metal objects. Recent Human-Computer Interaction research has listed the benefits of fNIRS brain sensing techniques, which are less affected by body movement and can be more easily used in ecologically valid study conditions.

Functional Near Infrared Spectroscopy (fNIRS) is an emerging neuroimaging technique that is non-invasive, portable, inexpensive, and suitable for periods of extended monitoring. fNIRS measures the hemodynamic response: the delivery of blood to active neuronal tissues. fNIRS is designed to be placed directly upon a participant's scalp, typically targeting the prefrontal cortex. This paper describes our two-pronged approach to using fNIRS to study the cognitive workload created by SUIs, focused on a) task analysis and b) SUI analysis.

2. RELATED WORK
Understanding the cognitive aspects of interactive searching (as well as interaction in general) has been a long-standing goal for researchers in the field of Interactive IR. In the 1970s, Bates suggested that searchers employ both search tactics and idea tactics [7]. In an attempt to explain an individual's path during IR, Bates' "Berrypicking" model [8] argued that search will vary as the user recognises information and has new ideas and questions.

In the main cognitive evolution of information seeking research, Ingwersen proposed a cognitive model of IR [11], in which the searcher's understanding of the document collection, system, and task determines which path a search will take. The model again put the user's cognition at the central point of interest. More recently, Joho [12] argued that the cognitive effects typically observed in Psychology could provide a potential building block of theoretical development for evaluating interactive IR. Back et al. [2], for example, examined the cognitive demands on users during the relevance judgement phase, suggesting that the amount of workload involved was the reason searchers rarely provided relevance judgements in previous work. Using a secondary measure, the Stroop task, Gwizdka [10] mapped varying levels of workload at multiple stages of search.

More recently, researchers have focused on objectively measuring interactive IR phases. In line with Back et al.'s work, Moshfeghi et al. measured workload during relevance assessments by asking people to make judgements while lying in an fMRI machine [15]. As making relevance judgements can be performed without directly interacting with a computer, this made the use of an fMRI machine more realistic. Using more commercialised tools, Anderson [1] used an EEG sensor to compare visualization techniques in terms of the burden they place on a viewer's cognitive resources. Similarly, Pike et al. [17] developed a prototype tool named CUES that was capable of collecting a variety of data, including EEG, whilst a user interacted with a website. Pike et al. used this to monitor aspects such as frustration and concentration, but their work demonstrated the variability of EEG data across the several minutes involved in an interactive IR task.

Using fNIRS, as introduced above, Peck [16] performed a similar study of different visualisation techniques, while a system called Brainput [18] was able to identify and correlate brain activity patterns among users during multitasking studies, and intervene when it sensed workload exceeding a certain level. Our work intends to build upon these HCI studies, to study interactive IR tasks and SUIs in more ecologically valid user study situations.

3. RESEARCH PATHS
Pike et al. [17] highlighted the challenges of using brain sensing technologies to evaluate IIR tasks: tasks have different stages, behaviour quickly diverges after the first interaction (and thus is hard to compare), and brain measurements vary dramatically over time. In order to address these challenges, we have initiated two clear research paths, both utilising fNIRS technology: 1) evaluating the cognitive aspects of interactive IR tasks, and 2) methods to evaluate the design of SUIs.

The aim of the first path is to move beyond using fNIRS to measure workload in simplistic psychology memory tasks (like Peck et al. [16]), towards being able to break down real search tasks into primary components. This implies three considerations:

• Collected data would be meaningless if it is not related to existing knowledge. Therefore, to interpret sensed fNIRS data we use proposed theories and models.

• It is known that fNIRS can sense cognition information [19, 16] related to so-called working memory (if placed on the forehead). Assuming this is correct, we are using models of working memory.

• The proposed models will help us interpret the data sensed with fNIRS and gain a better understanding of the cognitive impact of various complex tasks (such as IR).

Such a technique would allow researchers to analyse data by stage, and find effective points of comparison during several minutes of continuous measurements. The second path is focused on identifying which aspects of working memory are affected by different features of SUIs, such that researchers can objectively evaluate the effect of different SUI design decisions. A combination of both paths works towards being able to proactively evaluate how SUIs support searchers.

4. PATH 1: WORKLOAD MODELS
To understand the cognitive aspects of IIR, it is essential to learn about users' capabilities and limitations in terms of their cognition: how people perceive, think, remember, and process information. This path of research focuses on existing models from Cognitive Psychology and Human Factors, models that conceptualize and highlight aspects that typically describe or influence elements of human cognition.

One important part of cognition during interactive searching involves human memory systems. There are two different types of memory [21]: working memory (sometimes called short-term memory) and long-term memory. Wickens describes working memory as the temporary holding of information that is "active", while long-term memory involves the unlimited, passive storage of information that is not currently in working memory.

Working memory. Working memory, proposed by Baddeley and Hitch (1974) [6], refers to a specific system in the brain which "provides temporary storage and manipulation of information..." [3]. Working memory [6, 4, 5] processes information in two forms, verbal and spatial, and has four main components (Figure 1):

• A central executive managing attention, acting as a supervisory system and controlling the information from and to its "slave systems".

• A visuo-spatial sketchpad holding information in an analogue spatial form (e.g. colours, shapes, maps, etc.), specialised in learning by means of visuospatial imagery.

• A phonological loop holding verbal information in an acoustical form (e.g. numbers, words, etc.), specialised in learning and remembering information using repetition.

• An episodic buffer dedicated to linking verbal and spatial information in chronological order. It is also assumed to have links to long-term memory.

Figure 1: Baddeley's Working Memory Model

Information processing system. As humans, we are exposed to large amounts of information via our sensory systems. One of our strengths is in selecting information from our environment, perceiving it, processing it, and creating a response. We can therefore use this understanding of brain activity to identify which elements of an interactive IR environment need to be considered when measuring brain activity, and how we can reduce, rather than increase, a user's mental workload via interface and system design.

Wickens' Information Processing Model [21] aims to illustrate how elements of the human information processing system, such as attention, perception, memory, decision making, and response selection, interconnect. We are interested in observing how and when these elements interconnect during IR. He describes three different 'stages' (see the STAGES dimension in Figure 2) at which information is transformed: a perception stage, a processing or cognition stage, and a response stage, the first two being processes involved in cognition. The first stage involves perceiving information gathered by our senses, providing meaning and interpretation of what is being sensed. The second stage represents the step where we manipulate and "think about" the perceived information. This part of the information processing system takes place in working memory and consists of a wide variety of mental activities. In relation to IR, it is interesting to observe how elements of cognition, such as rehearsal of information, planning the search strategy, and deciding on the search keywords, interconnect.

Multiple Resource Model. One model of mental workload that has been widely accepted in Human Factors is Wickens' Multiple Resource Model [20] (Figure 2). The elements of this model overlap with the needs and considerations of evaluating complex tasks (such as IR). He describes the aspects of human cognition and multiple resource theory in four dimensions:

• The STAGES dimension refers to the three main stages of the information processing system (Wickens, 2004 [21]).

• The MODALITIES dimension indicates that auditory and visual perception have different sources.

• The CODES dimension refers to the types of memory encodings, which can be spatial or verbal.

• The VISUAL PROCESSING dimension refers to a nested dimension within visual resources, distinguishing between focal vision (reading text) and ambient vision (orientation and movement).

Figure 2: The 4-D multiple resource model [20]

Our aim is to understand how these elements link together and compose more complex components/tasks. Additionally, we want to consider how complex tasks (such as a search task) can be divided into primary components according to the models described. This will help identify possible problems in SUI design, as well as indicating possible solutions (implications suggested by Wickens [21]):

• Minimize the working memory load of the SUI system and consider working memory limits in instructions;

• Provide more visual echoes (cues) of different types during IR (verbal vs spatial);

• Exploit chunking (Miller, 1956 [14]) in various ways: physical size, meaningful size, superiority of letters over numbers, etc.;

• Minimize confusability;

• Avoid unnecessary zeros in codes to be remembered;

• Encourage regular use of information to increase frequency and redundancy;

• Encourage verbalization or reproduction of information that needs to be reproduced in the future;

• Carefully design information to be remembered.

Resources vs Demands. One other model of interest is the limited resource model [22], describing the relationship between the demands of a task, the resources allocated to the task, and the impact on performance.

Figure 3: Resources available vs task demands, and the impact on performance [22]

The graph in Figure 3 represents the limited resource model. The X-axis represents the resources demanded by the primary task; moving to the right along the axis, the resources demanded by the primary task increase. The axis on the left indicates the resources being used, as well as the maximum available resources (if we think of working memory as limited in capacity). The axis on the right indicates the performance of the primary task (the dotted line on the graph). The key element of this model is the concept of a limited set of resources which, if exceeded, has a negative impact on performance. However, the model does not distinguish between resource modalities; we therefore propose to use both the limited and multiple resource models to inform our work.

5. PATH 2: SUI EVALUATION
Relating quantitative data from brain sensing devices to feedback about SUI designs is one of our ultimate goals in conducting this research. SUIs are inherently information rich and thus affect both visual (results page layout) and verbal (text-based results) memory. Detecting a change in either verbal or spatial working memory would help determine whether a workload difference was caused by the SUI design (spatial) or by the amount of information the design provides (verbal). Our first, in-progress study has stimulated each memory type in different tasks: verbal memory was tested by performing an n-back [13] number memory task, whereas spatial memory was tested using an n-back visual block matrix task. Other studies have also looked at each type of memory and confirmed fNIRS' ability to detect changes in hemodynamic responses accordingly [9].

In addition to developing an understanding of the extent to which we can monitor different types of memory, our initial study also sought to measure the effect of artefacts on the fNIRS data. Controlling the environment and human-derived sources of noise is a potentially difficult factor to manage without affecting the ecological validity of a study. Solovey et al. [19] showed that fNIRS is relatively resilient to motion-derived artefacts when compared to EEG [17], for example, but it still requires some consideration by researchers conducting studies. In our own experience, we found that asking participants to remain as still as possible was fairly successful. We are additionally looking at possible methods for correcting motion-derived artefacts using an external gyroscope connected to the participant.

[3] A. Baddeley. Working memory. Science, 255(5044):556–559, 1992.
[4] A. Baddeley. The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11):417–423, 2000.
[5] A. D. Baddeley. Is working memory still working? European Psychologist, 7(2):85–97, 2002.
[6] A. D. Baddeley and G. Hitch. Working memory.
The Designing tasks for experiments that measure cognitive ef- psychology of learning and motivation, 8:47–89, 1974. fect via a brain sensor require careful consideration in order [7] M. J. Bates. Idea tactics. JASIST, 30(5):280–289, to ensure that results can be attributed to a cause. Thank- 1979. fully this problem space has been well explored in the field [8] M. J. Bates. The design of browsing and berrypicking of Psychology and we are able to adapt the approaches de- techniques for the online search interface. Online scribed in the literature to suit our task type requirements. Information Review, 13(5):407–424, 1989. A primary example of this adaptation is demonstrated by Peck et al [16], where 2 data visualisations techniques were [9] X. Cui, S. Bray, D. M. Bryant, G. H. Glover, and compared using a methodology based loosely on the n-back A. L. Reiss. A quantitative comparison of NIRS and task - a widely used psychology task that is designed to in- fMRI across multiple cognitive tasks. Neuroimage, crease load on working memory. 54(4):2808–2821, 2011. Additionally, we are interested in exploring standard search [10] J. Gwizdka. Distribution of cognitive load in web studies (without following a psychological study layout) and search. JASIST, 61(11):2167–2187, 2010. seeing whether interesting states can be detected. Solovey [11] P. Ingwersen. Cognitive perspectives of information et al [18] performed a similar function by utilising a ma- retrieval interaction: elements of a cognitive IR chine learning algorithm that had classified “states of inter- theory. Journal of documentation, 52(1):3–50, 1996. est” prior to performing a task. [12] H. Joho. Cognitive e↵ects in information seeking and Using a similar approach, we could evaluate a SUI to de- retrieval. In Proc. CIRSE2009, 2009. termine whether a particular change in layout has a positive [13] W. K. Kirchner. Age di↵erences in short-term or negative impact on visual memory. 
Alternatively, to test retention of rapidly changing information. Journal of the relevance of a results page (which would be dependant experimental psychology, 55(4):352, 1958. on the textual results), we could analyse the e↵ects on verbal [14] G. Miller. The magical number seven, plus or minus memory between 2 varied results pages, we could then re- two: Some limits on our capacity for processing flect these changes to the Wickens Multiple Resource Model information. The psychological review, 63:81–97, 1956. [20]. We are also working towards enabling the interpreta- [15] Y. Moshfeghi, L. R. Pinto, F. E. Pollick, and J. M. tion of data within the context of complex multimodal tasks Jose. Understanding Relevance: An fMRI Study. In to further extending our knowledge of the processes involved Proc. ECIR2013, pages 14–25. Springer, 2013. during IR and how they interact and e↵ect one another. [16] E. M. Peck, B. F. Yuksel, A. Ottley, R. J. Jacob, and R. Chang. Using fNIRS Brain Sensing to Evaluate 6. SUMMARY Information Visualization Interfaces. In Proc. This paper has aimed to summarise our two-pronged ap- CHI2013. ACM, 2013. proach towards actually evaluating the design of search user [17] M. Pike, M. L. Wilson, A. Divoli, and A. Medelyan. interfaces, in realistic ecologically valid study conditions, us- CUES: Cognitive Usability Evaluation System. In ing fNIRS technology. The approach first involves braking EuroHCIR2012, pages 51–54, 2012. down interactive IR tasks into how they e↵ect the di↵er- [18] E. Solovey, P. Schermerhorn, M. Scheutz, A. Sassaroli, ent elements of working memory, and second understanding S. Fantini, and R. Jacob. Brainput: enhancing how SUIs are processed by di↵erent parts of working mem- interactive systems with streaming fNIRs brain input. ory. Our two paths of research will build towards a stage In Proc. CHI2012, pages 2193–2202. ACM, 2012. where we can combine them and objectively evaluate cogni- [19] E. T. Solovey, A. Girouard, K. 
Chauncey, L. M. tive workload involved in interactive IR. We believe that this Hirshfield, A. Sassaroli, F. Zheng, S. Fantini, and R. J. research will provide a novel new direction that SUI’s and Jacob. Using fNIRS brain sensing in realistic HCI indeed HCI in a broader sense can benefit from. The asso- settings: experiments and guidelines. In Proc. ciation of physical recordings in ecological valid settings, to UIST2009, pages 157–166. ACM, 2009. an existing theoretical model, provides a new measure from [20] C. D. Wickens. Multiple resources and mental which future SUI development and evaluation could benefit. workload. The Journal of the Human Factors and Ergonomics Society, 50(3):449–455, 2008. 7. REFERENCES [21] C. D. Wickens, S. E. Gordon, and Y. Liu. An introduction to human factors engineering. Pearson [1] E. W. Anderson, K. Potter, L. Matzen, J. Shepherd, Prentice Hall Upper Saddle River, 2004. G. Preston, and C. Silva. A user study of visualization [22] J. R. Wilson and E. N. Corlett. Evaluation of human e↵ectiveness using EEG and cognitive load. Computer work. CRC Press, 2005. Graphics Forum, 30(3):791–800, 2011. [23] M. L. Wilson. Evaluating the cognitive impact of [2] J. Back and C. Oppenheim. A model of cognitive load search user interface design decisions. EuroHCIR for IR: implications for user relevance feedback 2011, pages 27–30, 2011. interaction. Information Research, 6(2):6–2, 2001. Dynamics in Search User Interfaces Marcus Nitsche, Florian Uhde, Stefan Haun and Andreas Nürnberger Otto von Guericke University, Magdeburg, Germany {marcus.nitsche, stefan.haun, andreas.nuernberger}@ovgu.de, florian.uhde@st.ovgu.de ABSTRACT knowledge available online. Therefore, a proficient tool to anal- Searching the WWW has become an important task in today’s in- yse the structure of the web and to provide guidance to specific formation society. Nevertheless, users will mostly find static search sources of information is needed. 
This task is accomplished by user interfaces (SUIs) with results being only calculated and shown modern search engines like Google2 , Bing3 , Yahoo4 and other lo- after the user triggers a button. This procedure is against the idea cal or topic centred search engines. By the increase of computa- of flow and dynamic development of a natural search process. The tional power in smart phones and wider access to online resources main difficulty of good SUI design is to solve the conflict between the demand for these search tools has risen and the quality of the good usability and presentation of relevant information. Serving a search terms has changed. Instead of single-query-searches, users UI for every task and every user group is especially hard because tend to request complex answers5 , trying to learn about topics in of varying requirements. Dynamic search user interface elements deep. While the need for information and the expectations of users allow the user to manage desired information fluently. They offer increased, matching the broader knowledge base contained in the the possibility to add individual meta information, like tags, to the Internet in the last few years. About 300 Mio. websites were added search process and enrich it thereby. in 20116 . Search engines mainly remain the same. This leads to the fact that a “significant design challenge for web search engine de- Keywords velopers is to develop functionality that accommodates the wide va- Search User Interface, User Experience, Exploratory Search. riety of skills and information needs of a diverse user population” [1]. Therefore, this paper proposes the concept of using dynamic elements in SUIs, that focus on fluent work flow characteristics, a Categories and Subject Descriptors high grade of interactivity and an adequate answer-time-behaviour. H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval.; H.5.2 [Information Interfaces and Presentation]: User Interfaces. 2. 
INFORMATION GATHERING Looking at users’ habits in search, they no longer perform sim- ple lookup searches. There is an increasing need to answer com- General Terms plex information needs. Therefore, we mainly consider informa- Design, Human Factors, Management. tion gathering processes, searches where users are not familiar with the domain. Users need to refine search queries, branch out into 1. MOTIVATION other queries to gain additional understanding and collect results to Since the launch of the WWW, users accumulated a vast amount of merge them into a single topic. This kind of search process is called information. With broadband technologies becoming a part of ev- exploratory search and is contrary to a known-item search task as eryday life1 the WWW offers a great opportunity in terms of learn- stated in [2]. Exploratory search processes “depend on selection, ing and education. University courses, for instance, are available navigation, and trial-and-error tactics, which in turn facilitate in- online and nearly every topic is handled somewhere in the great creasing expectations to use the Web as a source for learning and amount of blogs, Q&A pages, fora, web pages or databases. Yet exploratory discovery” [3]. Search tasks are fragmented, consist- there is no map, no guide leading through this vast amount of in- ing of single queries and search requests. The search requests may formation. Users need to search for information, to locate the bits yield additional data or parts of the final information which in the fitting to their specific information need, indexing the amount of end form the information requested by the user. While perform- 1 ing such a complex search task, a pattern called berry picking [4] http://www.internetworldstats.com/images/ can be observed. 
While reading through a source of data, looking world2012pr.gif, 02.05.2013 for qualified information the user discovers new traces leading to other sources, which have to be handled one after the next. By re- 2 http://www.google.com, 02.05.2013 3 http://www.bing.com, 02.05.2013 4 http://www.yahoo.com, 02.05.2013 5 see the 2009 HitWise study for more details: http: //image.exct.net/lib/fefc1774726706/d/1/ Presented at EuroHCIR2013. Copyright c 2013 for the individual papers SearchEngines_Jan09.pdf, 10.07.2013 6 by the papers’ authors. Copying permitted only for private and academic http://royal.pingdom.com/2012/01/17/ purposes. This volume is published and copyrighted by its editors. internet-2011-in-numbers/, 02.05.2013 fining the search and gaining deeper information the user satisfies synonyms. By adding and linking those parts the user constructs the initial need for it. These different traces span a map in the end, a boolean query which will be submitted to the Google search en- representing the whole search and its processing. When someone gine. Boolify was built for children and elderly. Tests in a third is learning about something this map is refined and expanded. The grade technology class showed that children without any knowl- learner may track back to a certain node and deepen the understand- edge of boolean queries were able to construct complex queries ing about it by adding new queries, and therefore new branches. Or just by pulling them together piece by piece9 . A similar approach he may discard a whole part of the map because it turned out that was implemented at SortFix10 . This tool offers the user the “abil- the contained information was not relevant to him. When the user ity to drag and drop search terms in between several buckets” [6] is satisfied with the gained information this map is encapsulated to in- and exclude them in the query. With a Standy Bucket users and represents the whole development of this complex information. 
are “able to keep track of all [their] inspirations and alternative According to this concept the result is not a single object. It is a set search words off to the side, ready to be dragged and dropped into of sources, representing the learning process for a specific user. your search box if needed.” [6] Another possible use of dynamic interface elements is the weighting of search terms based on their Looking at the current process of information gathering in the In- font size as used at SearchCloud.net11 . The ranked keywords are ternet there are only two places. The Internet itself, containing the shown in a Tag Cloud like manner and additionally the site shows, pool of existing information, in an unstructured form and a mental based on the ranking, “the calculated relevance score for each [re- model about the information (space) that is constructed. This sys- sult]” [6]. Not only the query building process can be enchanted tem may work perfect when dealing with short, exact search queries by dynamic elements, also the presentation of the result can benefit like postal code New York City, but when it comes to complex in- from it. Dynamic side loading can provide the user a lens like view formation needs, where the user needs to access a lot of information to parts of the result where keywords occur. Microsoft’s WaveLens and generate more detailed search queries while looming through “[...] fetches a longer sample for the page containing your key- pages this system reaches it boundaries. The user might retrieve words, without you having to download it.” [8] Microsoft Research only partial facts. For example, if the user needs explanation of a shows that in a study using WaveLens, presenting the participants term used in its initial query. 
The user is now in need of another with a normal interface and two versions of WaveLens’ UI (instant place, where he can store information, reorder it and put it into the zoom and dynamic zoom), “participants were not only slower with context of other information pieces. the normal view than the other two, but they were more than twice as likely to give up” [9]. Another way of result presentation was shown at SearchMe12 : “Fragmentation into multiple sites, domains 3. STATE OF THE ART and identities becomes a huge distraction. User don’t know which Looking at Google, the most used search engine today [5], the user site to visit for which purpose, and the lack of consistent, intuitive interface of a modern search engine is mostly static. Google’s fea- inter-site search and navigation makes it hard to find content [..]” tures include some dynamic elements like real time search. For [6]. All these dynamic features can be used as a mask over tradi- example “[..] Google Suggest which interactively displays sugges- tional SUIs to extend them. By hiding the dynamic part, dynamic tions in a drop-down list as the searcher types in each character of elements can be added to an existing search engine and let the user his/her query. The suggestions are based on similar queries submit- make a choice which part should be shown and used. The proposed ted by other users.” [1] Dynamic previews of results will be offered concept is similar to Byström & Hansen’s approach in [19]. when clicking on the double arrow beside a result. But the core of the interface has not changed a lot since its launch in 19977 . While Issues. Comparing the state of the art with the process of infor- adopting fast to new information sources like Facebook and Twitter, mation gathering some issues appear, which may be resolved or at Google discarded the adoption of new HCI methods in favour of a least damped by using of dynamic elements. While collecting in- clean, slim interface. 
With increasing touch support on the devices, formation pieces for solving complex questions the user discovers a richer user interface can be designed to provide the user with new sources, containing more information. These sources may not immediate feedback and allows haptic interaction with the search form a linear search process every time. Sometimes there will be a process. Some mobile clients take advance of the additional in- split and the user needs to decide which trace to follow first. This formation available, like the iOS search client, which switches to issue is also noted in [10]. Today’s search engines offer only little voice queries when the phone is lifted to the head, but there is no support for this. The user needs to save web pages to favourites or full extension of Google’s search services. While Google is an ad- organize them himself for later reading. Searching different terms equate tool for short queries and queries calling for a direct answer, one by one allows users to follow new pages like traces through features for deep research on complex topics are missing. the Internet. By connecting these traces and setting them into re- lation the user can retrieve the whole information needed to cover One way to integrate dynamic elements into existing SUI infras- his query. Most modern search engines discard this feature, it is tructure is to build an overlay. Thereby, dynamic UI utilize existing, again something the user needs to do by himself. This leads to well known search engines and provide a benefit by enriching them. another more general problem, the enclosing of search queries. This approach is shown in the Boolify8 search engine, which pro- Google for example handles every search term as a new opera- vides a dynamic drag and drop interface on top of Google’s search tion. Data is stored, but contains only general information about engine. 
This engine is relatively new and was build to promote the the user, queries are not related to each other and therefore miss- understanding of boolean queries. Users build a query by drag- ging jigsaw like parts onto a search surface. These parts contain 9 http://ed-tech-axis.blogspot.de/2009/03/ words (general or exact) and linkers like AND and OR. Additional boolified.htm, 02.05.2013 10 parts have been added to provide search on a specific page or for SortFix.com, offline since 11/2011, Firefox plugin: https://addons.mozilla.org/en-us/firefox/ 7 http://www.google.com/about/company/ addon/sortfix-Extension, 02.05.2013 11 history/, 02.05.2013 http://searchcloud.net/, 02.05.2013 8 12 http://www.boolify.org/, 02.05.2013 http://www.searchme.com, offline since 2009 holds an array of parameters, which is used to evaluate every item. Possible criteria are Accuracy, Clarity, Currency and Source Nov- elty. These and more criteria are mentioned and explained in [14]. When a user reorders items to fit his preferences the search engine may use the information provided by this ranking to weight the ex- isting parameters to yield better results in the future. The engine will be able to present results ranked according to the user’s prefer- ence. This can be done for all users and also search process wide, as some search tasks require documents and papers while others may Figure 1: Data flow while refining during search. focus on web pages or media. This addition to classical user inter- faces can make great use of the up-trend for touch based devices, in 2012 89% of mobile phones and smart-books support touch [15]. ing its broader context. But when learning about a complex topic Designing the SUI responsive to touch and gesture is maybe one refining the search query is more important to the user. 
In the iter- of the most natural solutions for human computer interaction and ation of search processes, to narrow down the mass of information adds an amount of possible actions based on gestures. and to tap new sources, the searcher needs to rewrite and modify the query, to link it to other related search tasks. Building a con- Workbench. The workbench targets the issue of loosing informa- nection between parts of information and evaluating it against each tion while switching between different searches. It adds a third other is a core principle of learning. This leaves the user targeting place to the proposed search process, located outside of the search a broader, intense search, in the need to build a custom solution to scope but still related to it. The user may drop queries here to keep extract knowledge and manage it. This is strictly against the guide- them throughout the whole search process. When entering a query, line for online interfaces which suggests to “[..] not require users indicators show how relevant items on the bench are. This allows to remember information from place to place on a Web site” [11] as the user to classify new results in terms of integrity towards already this is a distraction from the main process of searching and destroys selected snippets. The workbench acts as a buffer between search the interaction flow triggered by the search process. queries, adding a broader context to every entry. Like a frame, it contains information exclusively attached to the current search pro- 4. COMPOSING A DYNAMIC SUI cess, leading to the possibility of customization and user centred search environments. When the user switches between queries he The proposed approach shows a design based on today’s search en- can immediately determine how well the new results fit into already gines, enriched with dynamic UI elements to provide a plus for the selected items. This allows identifying false positive as well as ex- user. 
The design includes principles to form web based learning ap- ploratory search [16] results. Users may just enter queries that lead plications [12] to focus on the completion of complex search tasks. to a peripheral topic and check the indicators whether the result is By adding dynamic elements internal states can be visualized for relevant to his initial information. the user to give a better overview about the current position in the search process. Furthermore it will allow the serialization of search Tag Cloud. The tag cloud is another feature to guide the user in the processes and to step in at every point of the process later on. As search process. As shown in [17] a tag cloud supported retrieval stated in Beyond Box Search “different interfaces (or at least dif- system can increase the find rate of adjacent data nodes by nearly ferent forms of interaction) should be available to match different 15%. When adding an item to the workbench its most relevant tags search goals” and “[t]he interface should facilitate the selection of are extracted and visualized in the tag cloud. It is able to show how appropriate context for the search” [13]. Both of this quality mea- often a tag occurs and how different tags are related to each other. surements should be regarded when conceptualizing a SUI. The When entering a new search query the tag cloud displays the rele- first point will be covered by a modular UI, the user may move, vant tags and reorders the cloud to revolve around the current tags. hide and scale elements to fit his current need. The second point By combining distance and size of the entered tag with their direct is strongly bounded to the use of dynamic items in the UI design. neighbours the user can directly spot how homogeneous its current By giving immediate feedback to the user it is easier to classify query is in terms of the whole process. The tag cloud can also use the current results. 
The context of the whole search process will the existing tags to show the user other closely related tags and sug- be persistent over multiple search queries and provide a method of gest query refinement based on tag proximity. Colours can indicate accumulation parts of the search process into a single object. the state a tag is currently in. A possible color scheme for western culture can be based on the three colors used in traffic lights. The Four features are proposed and explained in this paper, showing a concept of three-coloured traffic lights also work for color-blind use-case for dynamic search interfaces and giving a suggestion how people, since they do have a given position. Therefore, we also use this can be accomplished. Together these features build up a mid second coding paradigm: form. A green triangle is proposed for instance to accumulate into a bigger context for a search process. tags resulting from the current query, which are contained in the This clipboard (Fig. 1) reshapes the search process and provide the overall tag cloud spanned by the workbench. An orange circle in- place to store information between search queries. Instead of trying dicates a warning for tags, either in the current query result or the to accumulate knowledge and information directly the user is able bench, which are not related to the rest of the cloud. A red square is to construct a solution of the search query in this buffer and save it avoided for the reason that uncontained tags may not be bad, they as a complete collection of the information retrieval process. can lead to a new direction or add a reasonable value to the whole search process. The tags are scaled depending on their frequency. Reordering. Giving users the opportunity to reorder and therefore When the user selects any item from the bench or the search re- to rate a search result is an important step towards dynamics in sult the corresponding tags are centred. The other tags are located SUIs. 
Every result is handled as a single item and can be picked based on their coherence with the selected tags; closer means the by the user and dropped in another place. The other items reorder tag is in a direct relation to the selected item. A user can quickly fluently, giving user feedback while the user moves on. The SUI Starting as overlays and additional feature of existing search en- gines may develop and emerge into independent solutions. Acknowledgement Part of the work is funded by the German Ministry of Education and Science (BMBF) within the ViERforES II project (01IM10002B). Figure 2: Search map, representing the search process. 6. REFERENCES [1] Sandvig, J. C., Deepinder B.: User Perceptions of Search check the integrity of his search process by looking at the tag cloud. Enhancements in Web Search. In: J. of Comp. Inform. Syst. A slim, packed cloud means the results are all related to each other, 52, no. 2, 2011. an open, wide cloud indicates a broad result field, covering many aspects. False positives may be filtered out, when enough items ex- [2] White, R. W., Marchionini, G.: A Study of Real-Time Query ist, as they stick out the rest of the cloud. Expansion Effectiveness. In: SIGIR Forum 39, 2006. [3] Marchionini, G.: Exploratory Search: From Finding to Search Map Support. The search map (Fig. 2) acts as a representa- Understanding. In: Comm. of the ACM 49, 4.2006. tion of the whole search process, by storing every query and follow- [4] Bates, Marcia J.: The design of browsing and berrypicking ing up querying and visualize it in a chronological order. The user techniques for the online search interface. Univers. of Calif. may select single nodes in the map to get into the state of search at L.A., 1989. process at this moment and refine it. The map provides a kind of top [5] Purcell, K., Brenner, J., Rainie, L.: Search Engine Use 2012. view to the path of the search and shows where the user branched In: Pew Internet & American Life Project, 2013. 
The map allows the user to cut off nodes and whole branches if they are not needed any more to fulfil the information need. As it contains every action and some data of the current search process, the search map can be serialized and stored to retrieve the search process later on. With this map at hand, a user can save whole search tasks just like he saves favourite web pages. He can step back into the process at any time and reconstruct the whole learning process, or correct parts of the search which have proven to be incorrect. This kind of storytelling helps to visualize the given data, "[...] lead to findings, which prompt actions [...] [and] can indicate the need to forage for new data." [18] The search map [7] features two ways of expanding. The user may follow a result to expand it vertically: the result is added as a new node and resides in the map until it is processed further. When the user selects an existing node, he steps back to the vertical position of this node and can now branch out horizontally. This deals with an issue of berrypicking [4], where new sources have to be processed one by one. While not abolishing this, the search map provides a visual representation to simulate parallelism. The map also allows scoping of the analysis by creating a horizontal or vertical bound. Only tags and items inside this bound will be considered; the rest is greyed out. This allows the user to dig deep into a certain topic (small vertical bounds) or to create a better understanding of a certain term and add more results to a certain query (horizontal bounds). This can help the user to concentrate on smaller pieces of a big search process and to narrow down problems one by one.

5. CONCLUSION
This paper has shown certain design flaws of today's search engines and has proposed dynamic design principles to counter them. The application of the envisioned elements can extend a search engine towards software capable of supporting complex research tasks. With the current up-trend of online learning, this unlocks a new way of using them. The surplus resides not only in the dynamic and vivid interface; it prepares a whole new tier of online search solutions. The process of learning can be preserved and shared with others. One can come back at any time, jump right into the saved search process, and reconstruct the development of certain knowledge. With this tool chain at hand, learning becomes a social and integrative part of the WWW. The next step in deploying dynamic elements into search user interfaces is prototyping them: design snippets need to be tested for usability and acceptance in the real world.

Acknowledgement. Part of the work is funded by the German Ministry of Education and Science (BMBF) within the ViERforES II project (01IM10002B).

6. REFERENCES
[1] Sandvig, J. C., Deepinder, B.: User Perceptions of Search Enhancements in Web Search. In: J. of Comp. Inform. Syst. 52, no. 2, 2011.
[2] White, R. W., Marchionini, G.: A Study of Real-Time Query Expansion Effectiveness. In: SIGIR Forum 39, 2006.
[3] Marchionini, G.: Exploratory Search: From Finding to Understanding. In: Comm. of the ACM 49, 4, 2006.
[4] Bates, Marcia J.: The design of browsing and berrypicking techniques for the online search interface. Univers. of Calif. L.A., 1989.
[5] Purcell, K., Brenner, J., Rainie, L.: Search Engine Use 2012. In: Pew Internet & American Life Project, 2013.
[6] Bates, M. E.: Make Mine Interactive. Vol. 31, Issue 10, p. 63, 12/2008.
[7] Heer, J., Viégas, F. B., Wattenberg, M.: Voyagers and voyeurs: Supporting asynchronous collaborative visualization. In: Commun. of the ACM 52, No. 1, pp. 87–97, ACM, New York, NY, USA, 01/2009.
[8] MS Research: Cutting Edge. New Scientist 181, no. 2434, 2004.
[9] Paek, T., Dumais, S., Logan, R.: WaveLens: A new view onto Internet search results. In: Proc. of the SIGCHI Conference on Human Factors in Computing Systems (CHI '04), pp. 727–734, 2004.
[10] Morville, P., Callender, J.: Search Patterns: Design for Discovery. O'Reilly, 2010.
[11] U.S. Department of Health and Human Services: Research-Based Web Design and Usability Guidelines. Washington, D.C.: GPO, n.d.
[12] Jayasimman, L., Nisha Jebaseeli, A., Prakashraj, E. G., Charles, J.: Dynamic User Interface Based on Cognitive Approach in Web Based Learning. In: Int. J. of CS Iss. (IJCSI), 2011.
[13] Buck, S., Nicholas, J.: Beyond the search box. Reference & User Services 51(3), pp. 235–245, 2012.
[14] Beresi, U. C., Kim, Y., Song, D., Ruthven, I.: Why did you pick that? Visualising relevance criteria in exploratory search. In: Int. J. on Dig. Lib. 11(2), pp. 59–74, 2010.
[15] Lee, D.: The State of the Touch-Screen Panel Market in 2011. Walker Mobile, LLC, SID Information Display Magazine, 3/2011.
[16] White, R. W., Kules, B., Drucker, S. M., schraefel, m. c.: Supporting Exploratory Search. In: Comm. of the ACM 49, 4, 2006.
[17] Trattner, C.: QUERYCLOUD: Automatically linking related documents via search query (Tags) Clouds. In: Proc. of the IADIS Int. Conf. on WWW/Internet, 2010.
[18] Mackinlay, J. D.: Technical Perspective: Finding and Telling Stories with Data. In: Comm. of the ACM 52, 2009.
[19] Byström, K., Hansen, P.: Conceptual framework for tasks in information studies: Book Reviews. In: J. Am. Soc. Inf. Sci. Technol., Vol. 56, 10, pp. 1050–1061, John Wiley & Sons, Inc., New York, NY, USA, 2005.

SearchPanel: A browser extension for managing search activity

Simon Tretter, University of Amsterdam, Amsterdam, The Netherlands, s.tretter@gmail.com
Gene Golovchinsky, FX Palo Alto Laboratory, Inc., 3174 Porter Drive, Palo Alto, CA, gene@fxpal.com
Pernilla Qvarfordt, FX Palo Alto Laboratory, Inc., 3174 Porter Drive, Palo Alto, CA, pernilla@fxpal.com

ABSTRACT
People often use more than one query when searching for information; they also revisit search results to re-find information. These tasks are not well-supported by search interfaces and web browsers. We designed and built a Chrome browser extension that helps people manage their ongoing information seeking. The extension combines document and process metadata into an interactive representation of the retrieved documents that can be used for sense-making, for navigation, and for re-finding documents.
Process metadata capture how many times a document was retrieved, whether it was viewed before, etc.; this kind of information can help searchers to remember, understand, and plan their search processes. The browser plugin enhances the searcher's ability to use process metadata to understand their search results and to plan subsequent activity by displaying surrogates for the current set of retrieved documents.
We represent prior retrieval state, whether a document was opened, and whether it was bookmarked in an integrated overview that appears at the side of the browser window. We also make it possible for searchers to examine multiple documents without returning to the search results or using multiple tabs. The remainder of this paper is organized as follows: we review the relevant related work, describe the browser extension, and conclude with a discussion of the design space.

1. INTRODUCTION
Broder et al. [3] proposed a taxonomy of web search that included transactional and navigational searches in addition to the more traditional (from an IR perspective) informational searches. To this taxonomy we might add re-finding [17][5], the task of locating a previously-found document. From a theoretical perspective, it is not clear whether re-finding is a different kind of search activity or an orthogonal dimension. Regardless, while major web search engines offer simple and efficient interfaces for navigational and transactional searches, relatively little support is available for more complex informational search or re-finding.
These seemingly neglected activities are not unimportant, however: Teevan et al. [17] reported that 39% of queries are re-finding queries; furthermore, 20-30% of searches represent open-ended informational needs [13]. Relatedly, Qvarfordt et al. [11] found query overlap rates of 50-60% in exploratory search, and suggested that awareness of this overlap may be useful in supporting more efficient searching behavior. Thus we decided to explore ways in which searchers' interactions with search engines could be enhanced to support these more complex information-seeking tasks.
We created a web browser extension that enriches common web search engine interfaces and addresses important deficits with respect to open-ended (exploratory) search and re-finding. Our extension helps users find the right document or documents by visualizing metadata of the retrieved pages.
Following Golovchinsky et al. [7] we distinguish document metadata from process metadata. Document metadata – dates of publication, titles, hosting web sites, etc. – are basic characteristics of documents that are independent of the means by which these documents were retrieved. Process metadata, on the other hand, characterize aspects of documents in relation to the searcher's activity.

2. RELATED WORK
There are two broad categories of related work: the management of search history and the representation of search results. Re-finding has received increasing attention recently. While the browser implements some history mechanisms, these are typically not well-suited to users' needs [15]. Elsweiler and Ruthven [5] described different patterns of re-finding; Teevan [16] proposed a mechanism for merging previously-found and newly-retrieved documents. More explicit management of search history has also been investigated in the literature; see [7] for a succinct summary.
Information overload due to large numbers of results is a common problem in information seeking [2]. This problem can be addressed in a variety of ways. MetaSpider [4] uses a 2D map to display and classify retrieved documents. Grokker [8] uses nested circular and rectangular shapes to present results and also shows them in a hierarchically grouped way. Sparkler [12] uses a star plot for the result presentation, where every star represents a document.
One potential issue with the systems above is that the overall organization of the interface itself may induce usability problems. Complex interfaces allow more individual settings to be specified by a user, but simple interfaces allow a broader spectrum of users to use them. This tradeoff is not trivial to handle, and, as we see nowadays, most Web search interfaces tend to be quite simple.
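The document/process metadata distinction used throughout the paper can be made concrete with a small sketch; the field names below are illustrative assumptions, not SearchPanel's actual data model:

```python
# Hypothetical sketch of a result surrogate combining document metadata
# (properties of the document itself) with process metadata (properties of
# the document in relation to the searcher's activity).
from dataclasses import dataclass, field

@dataclass
class DocumentMetadata:
    # Independent of how the document was retrieved.
    title: str
    url: str
    site: str
    published: str = ""

@dataclass
class ProcessMetadata:
    # Tied to the searcher's activity in this and earlier sessions.
    times_retrieved: int = 0
    viewed: bool = False
    bookmarked: bool = False

@dataclass
class ResultSurrogate:
    doc: DocumentMetadata
    proc: ProcessMetadata = field(default_factory=ProcessMetadata)

s = ResultSurrogate(DocumentMetadata("Exploratory search",
                                     "http://example.org", "example.org"))
s.proc.times_retrieved += 1   # updated as the searcher interacts
s.proc.viewed = True
```

A surrogate like this is what an integrated overview could render for each retrieved document.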
Pro- Supporting the searcher’s decision making process can be cess metadata, on the other hand, characterize aspects of crucial for e↵ective search performance for complex infor- mation needs. This support can take the form of enhanced Presented at EuroHCIR2013. Copyright c 2013 for the individual papers by the papers’ authors. Copying permitted only for private and academic surrogates for documents. One type of information often purposes. This volume is published and copyrighted by its editors. used for this purpose is document metadata (author, date, images of the document, etc.). Even et al. [6] has shown Table 1: Design space: Activities and supporting features that the decision making process can be highly improved by related to document and process metadata. ”Doc” refers to adding process metadata (in our case information that is re- document metadata and ”Proc” to process metadata. lated to the search process) to the user interface. Research has shown that presenting simple tasks in a slightly di↵er- Activity Feature Doc Proc ent way may help the user to understand how the search Search perform search yes no is performing and what can be done to gain better results switch engine no yes [18]. One common example of incorporating process meta- results list yes no data in web browsers is the practice of changing the color of visit status no yes a traversed link anchor. visualize no. of visits no yes Spoerri [14] showed that users can benefit from di↵erent Navigation access results - - or additional visualizations of web search results. However, mark current result path no yes none of the techniques above have been integrated by major identify results: preview yes no search engines into their main interfaces. In some cases, ex- snippet tension developers have enhanced the user experience of web identify results: favicon yes no search. 
Examples include: SearchPreview[9] that fetches Organization bookmarking no yes screen shots of the result pages and shows them directly organize bookmarks no yes next to the each search result. Bettersearch[1] is a Firefox extension that performs a similar task, but also enriches the When searchers find useful web pages, they may wish to result page with more features and links. For example, this save those documents for future access. More specialized extention allows users to open a result in a new tab, or adds search engines sometimes support this capability directly, links to a search result to quickly show the web page on the but it is most often supported only by the browser’s book- ”Wayback Machine”1 . WebSearch Pro [10] is also a Firefox marking capability. extension that adds the ability to look up a text by high- We can consider these search and sense-making activities lighting it on a page. Another feature is drag&drop zones in light of the kinds of information required to satisfy them. to search for things directly from any website. In particular, Table 1 shows when document and process metadata might be pertinent for the di↵erent categories of search activities. A representation of the number of visits 3. BROWSER EXTENSION to a retrieved result (process metadata) could be used by a To compensate for the deficiencies of SERPs we created a searcher to decide how to interact with that result. In a re- browser extension called SearchPanel. This extension com- finding sub task, for example, searchers might want to ignore bines document and process metadata in a visual represen- newly-found documents or pages that were not opened. tation of search results to help people manage their infor- The purpose of the search panel is to complement the mation seeking. We chose the browser extension approach SERP and to be available when exploring search results; we rather than creating a proxy for several reasons. 
While both offer the potential of parsing and augmenting SERP and document pages, a browser extension has some advantages. It scales better with respect to storing user history data. It ensures a higher level of data privacy, since data that might potentially reveal user interests (e.g., query keywords, selected URLs, etc.) can be logged as hashed values. Finally, it has access to bookmarks and local browsing history.

3.1 Design space
When performing search tasks, searchers may need different kinds of information to support their information seeking. We represent the design space as consisting of three categories of activities: search activity, navigation activity, and organization activity.
Historically, web UI support for the search process, or search activity, has been focused on query formulation and understanding the current query. Web browsers offer limited support for comparing the current result set with earlier activity by marking the visited status of documents.
When engaged with a search task, users need to shift their attention between the SERP and the retrieved pages. In some cases, the searcher does not find the desired information in a retrieved document, but rather in links to other documents containing relevant information. This navigation activity can be an important part of the information seeking process.

3.2 Implementation
SearchPanel displays automatically on the right side of the browser window when it is enabled (Figure 1). The right side of the content page has been chosen because this location is frequently free of document content. In cases of overlap, its vertical position can be adjusted manually to accommodate page content that may be occluded.
SearchPanel displays immediately after a search has been performed on a supported web search engine (currently Google, Google Scholar, Yahoo, Bing and Microsoft Academic Search). SearchPanel remains visible even if the searcher follows links from retrieved documents. In addition, searchers can return directly to the original query, or re-run it on a different search engine.
A short tutorial page is displayed at installation, and can also be reached through the option menu. This page also allows logging (see 3.2.4) to be disabled, and can be used to delete the recorded history.

1 The Wayback Machine is a service that provides access to archived and historical versions of web sites.

Figure 1: SearchPanel control annotated to show important aspects. 1: search engine selector; 2: bar representing a newly-found page; 3: favicon representing the site from which the page was retrieved; 4: bar representing a page that has been visited; 5: highlighted bar based on cursor position; 6: bookmark indicator; 7: currently-selected page.
Figure 2: Highlighting of snippet on the SERP when mousing over SearchPanel.
Figure 3: Snippets of other pages are shown on a document page when mousing over SearchPanel.

3.2.1 Document metadata
SearchPanel displays several kinds of document metadata. Documents are represented by bars arranged in order corresponding to the retrieved list; clicking on a bar is equivalent to clicking on a link on the SERP.
Almost all websites have icons (favicons) to help re-identify the web page quickly; these icons are shown to the right of the bar (see Figure 1, item 3). A tooltip with the title of the document is added to each bar as well. We considered identifying other metadata such as document MIME type, but that would incur the overhead of a separate HTTP request for each document; at least initially, we chose not to pursue this strategy.

3.2.2 Process metadata
Process metadata is also incorporated into SearchPanel. First, the icon of the search engine that ran the search is highlighted in the top bar (item 1). Other icons represent available comparable search engines; clicking on one of these icons re-runs the query with the selected search engine. Search engines are grouped into two categories (web search and academic research) and only the relevant ones are shown. The current selection (highlighted with a black border) links back to the search result page if the user navigates to one of the retrieved documents.
Each bar can have one of three different colors, depending on the link history. If a link has never been retrieved before, the state of the link is "new" and the color will be teal. Results that have been retrieved by prior queries but have not been clicked on are colored blue. Visited links are colored violet. The local browser history is examined to retrieve the link status; this allows us to incorporate page views that occurred before SearchPanel was installed.
Each bar's length reflects the frequency of retrieval of the corresponding page: the more frequently a page has been retrieved, the shorter the bar gets (item 3). The retrieval history is stored locally in the browser for privacy reasons and can be deleted through SearchPanel's option page.
In SearchPanel, the bookmarking function serves two purposes (item 6 in Figure 1). First, searchers can click on the star to bookmark the corresponding page. Second, previously bookmarked documents in the SERP will show a yellow star next to them. This allows a web page to be re-found more quickly, as the user does not need to navigate to a document to know if they have previously bookmarked it.

3.2.3 Navigational support
The selection indicator (see item 7 in Figure 1) indicates the currently-selected result page. If a link on a result page is clicked, the page indicator will stay on the last retrieved document page to indicate that navigation started with it. Hovering over a result highlights the associated bar (item 5), and also highlights the corresponding snippet in the SERP (Figure 2); the SERP is scrolled as necessary to bring the highlighted snippet into view. Conversely, when the mouse is over a snippet on the SERP, the related bar jiggles left-right to reinforce the connection between the two.
When the user navigates off the SERP to a search result, SearchPanel remains active. Clicking on bars navigates among the retrieved documents, bypassing the intermediate step of reloading the search results. When the mouse is over a bar in SearchPanel, the SERP snippet of that result will be shown. This can be seen in Figure 3, where a preview of the Wolfram Alpha snippet is shown. If the snippet is not available, a tooltip with the document title is shown instead. Both of these features should make it easier and more efficient to navigate the search results without necessarily creating a large number of tabs in the process.

3.2.4 Logging
The extension was created to study people's information seeking behaviors. The goal of the project is to understand how people use the web when looking for information, in order to improve their search experience. Therefore logging of user activity was necessary. To encapsulate it from the basic functionality, it was designed as a plugin that can be connected to or disconnected from SearchPanel. It collects information related to the use of SearchPanel for the purposes of statistical analysis of patterns of behavior.
To maximize searchers' privacy, no personally-identifying information is saved. Queries and found URLs are recorded as MD5-hashed values only.
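The hashed logging scheme described here might look roughly like the following sketch (function and field names are assumptions, not SearchPanel's actual code): recurring queries can be matched by comparing hashes, while the logged values themselves do not reveal their content.

```python
# Sketch of privacy-preserving event logging: queries and URLs are stored
# only as MD5 hashes, so recurring items can still be matched.
import hashlib
import time

def md5_hex(text):
    """Return the MD5 hash of a string as a 32-character hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def log_click(query, url, source):
    # 'source' records where the click happened, e.g. "SearchPanel" or "SERP".
    return {
        "time": time.time(),
        "source": source,
        "query_hash": md5_hex(query),
        "url_hash": md5_hex(url),
    }

a = log_click("exploratory search", "http://example.org/a", "SERP")
b = log_click("exploratory search", "http://example.org/b", "SearchPanel")
same_query = a["query_hash"] == b["query_hash"]   # recurring query detected
```

The same comparison works for re-retrieved URLs, which is all the analysis needs.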
This allows us to identify recurring queries and documents without being able to read the content of the query or to observe which pages people view. Specifically, the following information is recorded:
• The IP address and the time the event was logged
• When a search result was clicked and where this happened (SearchPanel or SERP)
• Hash strings that represent the queries and found web pages
• Time spent with the mouse on different interface parts (SearchPanel vs. SERP)
• Various actions related to the extension (adding bookmarks by clicking the star, moving it, etc.)

4. NEXT STEPS
After an in-house pilot deployment, SearchPanel has been made available through the Google Chrome store. The goal of the deployment is to understand whether the extension helps people with their search tasks, and to assess the relative utility of document vs. process metadata. We also expect to collect a dataset that characterizes people's browsing and searching behaviors in terms of patterns of retrieval and re-retrieval, search result navigation, etc.

5. CONCLUSIONS
Web search engines are used for many different kinds of search tasks. While navigational and transactional uses of search engines are well-supported by current interfaces and algorithms, searchers are left to their own devices for more open-ended information seeking and re-finding. We created a Google Chrome browser extension to help people manage their search activity. We explored the design space of document and process metadata related to the wide range of activities searchers may engage in during information seeking. The extension keeps track of retrieval, page visits, and bookmarking, and integrates traces of these activities with document metadata to give people a more complete impression of their search activity. An upcoming deployment will explore the effect that this extension has on how people interact with search results.

6. REFERENCES
[1] ABAKUS. Bettersearch, a Firefox addon for enhancing search engines. http://mybettersearch.com/, 2010. [Online; accessed 06/06/2013].
[2] Baeza-Yates, R., Ribeiro-Neto, B., et al. Modern information retrieval, vol. 463. ACM Press, New York, 1999.
[3] Broder, A. A taxonomy of web search. SIGIR Forum 36, 2 (Sept. 2002), 3–10.
[4] Chen, H., Fan, H., Chau, M., and Zeng, D. MetaSpider: Meta-searching and categorization on the web. Journal of the American Society for Information Science and Technology 52, 13 (2001), 1134–1147.
[5] Elsweiler, D., and Ruthven, I. Towards task-based personal information management evaluations. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007), SIGIR '07, ACM, pp. 23–30.
[6] Even, A., Shankaranarayanan, G., and Watts, S. Enhancing decision making with process metadata: Theoretical framework, research tool, and exploratory examination. In System Sciences, 2006. HICSS '06. Proceedings of the 39th Annual Hawaii International Conference on (2006), vol. 8, IEEE, pp. 209a–209a.
[7] Golovchinsky, G., Diriye, A., and Dunnigan, T. The future is in the past: designing for exploratory search. In Proceedings of the 4th Information Interaction in Context Symposium (New York, NY, USA, 2012), IIIX '12, ACM, pp. 52–61.
[8] Hong-li, Q. A novel visual search engine: Grokker. Journal of Library and Information Sciences in Agriculture 8 (2008), 047.
[9] KG, P. U. & C. SearchPreview, the browser extension previously known as GooglePreview. http://searchpreview.de/, 2013. [Online; accessed 06/06/2013].
[10] Martijn. Web Search Pro, search the web the way you like... http://websearchpro.captaincaveman.nl, 2012. [Online; accessed 06/06/2013].
[11] Qvarfordt, P., Golovchinsky, G., Dunnigan, T., and Agapie, E. Looking ahead: Query preview in exploratory search. In Proceedings of the 36th international ACM SIGIR conference on Research and development in Information Retrieval (New York, NY, USA, 2013), SIGIR '13, ACM.
[12] Roberts, J., Boukhelifa, N., and Rodgers, P. Multiform glyph based web search result visualization. In Information Visualisation, 2002. Proceedings. Sixth International Conference on (2002), IEEE, pp. 549–554.
[13] Rose, D. E., and Levinson, D. Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web (2004), ACM, pp. 13–19.
[14] Spoerri, A. How visual query tools can support users searching the internet. In Information Visualisation, 2004. IV 2004. Proceedings. Eighth International Conference on (2004), IEEE, pp. 329–334.
[15] Tauscher, L., and Greenberg, S. How people revisit web pages: empirical findings and implications for the design of history systems. Int. J. Hum.-Comput. Stud. 47, 1 (July 1997), 97–137.
[16] Teevan, J. The Re:Search engine: simultaneous support for finding and re-finding. In Proceedings of the 20th annual ACM symposium on User interface software and technology (New York, NY, USA, 2007), UIST '07, ACM, pp. 23–32.
[17] Teevan, J., Adar, E., Jones, R., and Potts, M. A. S. Information re-retrieval: repeat queries in Yahoo's logs. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007), SIGIR '07, ACM, pp. 151–158.
[18] Wang, T. D., Deshpande, A., and Shneiderman, B. A temporal pattern search algorithm for personal history event visualization. Knowledge and Data Engineering, IEEE Transactions on 24, 5 (2012), 799–812.

A System for Perspective-Aware Search

M. Atif Qureshi*†!, Arjumand Younus*†!, Colm O'Riordan*, Gabriella Pasi!, Nasir Touheed†
* Computational Intelligence Research Group, Information Technology, National University of Ireland, Galway, Ireland
! Information Retrieval Lab, Informatics, Systems and Communication, University of Milan Bicocca, Milan, Italy
† Web Science Research Group, Faculty of Computer Science, Institute of Business Administration, Karachi, Pakistan
muhammad.qureshi,arjumand.younus@nuigalway.ie, colm.oriordan@nuigalway.ie, pasi@disco.unimib.it, ntouheed@iba.edu.pk

ABSTRACT
Traditional search engines fail to capture the notion of "perspective" in their search results and at times present results skewed towards a particular topic. In most of these cases even query reformulation fails to retrieve the desired search results, and the underlying reason for such failure is often bias within the document collection itself (e.g., news articles). A perspective-aware search interface enabling users to look into search results for some "perspective" terms may be of great use for certain information needs. In this paper we describe such a system.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors; H.3.3 [Information Search and Retrieval]: Search process

General Terms
Human Factors, Performance

Keywords
Perspective, Wikipedia, Bias

1. INTRODUCTION AND RELATED WORK
It is often the case that when using a search engine for information seeking, users have an underlying intent [1]. Traditional search interfaces fail to capture the user intent for certain topics and at times return results that may be skewed towards a certain perspective. Here, perspective as defined by the Oxford Dictionary refers to a "point of view"1 within the search results that may or may not be what the user is looking for. We explain further through the following motivating examples:
• Consider the case of a user who wishes to find out more about a certain event (say, a bomb attack in a certain region). The search results returned contain a majority of news reports blaming Islam, relating it with terrorism in most of the cases. This prompts the user to explicitly evaluate how much Islam is related to terrorism in the returned search results.
• Consider the case of a user who wishes to find out about the roles and rights of women in Islam, but the search engine returns articles that contain a high amount of terms highlighting oppression against women instead of women's rights and roles. In this case the user is prompted to check the correlation between women and oppression within the search results that have been returned.
Note that the perspective given by most search results (Islam in motivating example (1) and oppression in motivating example (2)) may or may not be aligned with the user's query intent. In case the search results are not aligned with his/her query intent, he/she may be interested in observing the amount of perspective tendencies in various news reports.
This paper proposes the concept of a "perspective-aware" search interface that enables the user to explicitly analyse search results for information from a particular perspective with respect to an issued query. To the best of our knowledge, previous research within Human-Computer Interaction and Information Retrieval has not captured the notion of "perspective" within the information retrieval process. Early research related to Interactive Information Retrieval by Belkin [2] and Ingwersen [6] suggests the integration of cognitive aspects within the information retrieval process; in line with this suggestion we argue for incorporating the essential cognitive element of "perspectives"2 within the search engine interface.
Recently the information retrieval community has turned its attention to diversification of search results, which aims to tackle the issue of query ambiguity on the user side [8]. However, even when formulating a non-ambiguous query, users may have an intent that influences the perspective from which the query terms can be interpreted in a text.
The search results returned contain a ma- which the query terms can be interpreted in a text; in case of jority of news reports blaming Islam relating it with 2 1 According to Wikipedia the definition of perspective states This may also be seen as topic drifts within a document. the following: “Perspective in theory of cognition is the Presented at EuroHCIR2013. Copyright ! c 2013 for the individual pa- choice of a context or a reference (or the result of this choice) pers by the papers’ authors. Copying permitted only for pri- from which to sense, categorize, measure or codify experi- vate and academic purposes. This volume is published and copy- ence, cohesively forming a coherent belief, typically for com- righted by its editors.. paring with another.” Figure 1: Entry Point of Perspective-Aware Search Interface the entry point of the interface which resembles the standard type-keywords-in-entry-form interface with the augmenta- tion of an additional input text box for entry of perspective terms. The underlying perspective detection algorithm makes use of the encyclopedic structure in Wikipedia; more specifi- cally the knowledge encoded in Wikipedia’s graph structure is utilized for the discovery of various perspectives in docu- ments returned by the search engine. Wikipedia is organized into categories in a taxonomy-like3 structure (see Figure 2). Each Wikipedia category can have an arbitrary number of subcategories as well as being mentioned inside an arbitrary number of supercategories (e.g., category C4 in Figure 1 is a subcategory of C2 and C3 , and a supercategory of C5 , C6 and C7 .) Furthermore, in Wikipedia each article can belong to an arbitrary number of categories, where each category is Figure 2: Wikipedia Category Graph Structure along a kind of semantic tag for that article [11]. 
As an example, with Wikipedia Articles in Figure 2, article A1 belongs to categories C1 and C10 , article A2 belongs to categories C3 and C4 , while article A3 belongs to categories C4 and C7 . It can be seen that the perspective mismatch between the user intent and the doc- articles and the Wikipedia Category Graph are interlinked uments returned in first positions by a search engine, users and our system makes use of these interlinks for the detec- may find the retrieved results annoying or subjective to a tion of a certain perspective within a document retrieved by non-agreed perspective [7]. One may argue that a query re- the search engine. formulation technique could be employed to tackle this prob- lem [5]; e.g. considering the motivating example (2), the user could issue a reformulated query such as “roles and rights of 2.1 Underlying Algorithm women in islam”. However, for some topics query reformu- The underlying perspective detection algorithm within our lation may fail to retrieve the desired search results, and the system requires the perspective term/phrase to match the underlying reason for such failure is often the bias within the title of a Wikipedia article. This may seem to impose a cog- document collection itself (e.g., news articles) [10]. Under nitive load on the user at search time. However, this is not such a scenario it would be interesting to provide a search the case: as shown in Figure 3 the entered text automati- interface that would enable the users to look into the search cally turns green when a certain user-specified perspective results for some “perspective” terms and we describe such a term matches the title of a Wikipedia article, and symmet- system in this paper. rically the entered text automatically turns red in case of a mismatch. Once the perspective term is entered correctly the system 2. 
Figure 3: Automatic Text Color Changing to Test Match of Perspective Term with Wikipedia Article Title

Once the perspective term is entered correctly, the system fetches the Wikipedia article corresponding to the perspective term, referred to as the Seed Perspective Article (PAseed), along with the categories to which it belongs; we use PC0⁴ to refer to these categories. After fetching the Wikipedia categories in PC0, the system retrieves sub-categories of PC0 until depth 2, i.e., PC1 and PC2⁵; collectively, these categories related to PAseed are referred to as PC (where PC is the union of PC0, PC1 and PC2). Next, the set of all articles within the Wikipedia category set PC is retrieved, and we refer to this set as the Expanded Perspective Article Set (PAexpanded). The system then retrieves all categories associated with the set PAexpanded, which we refer to as WC; note that PC is a subset of WC. Finally, the intersection between PC and WC is retrieved, which is a set of categories representative of the domain of the perspective term originally input by the user; we refer to this set of representative categories as RC.

⁴ These are basically perspective categories at depth zero.
⁵ These are basically perspective categories at depth one and two.
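The set construction described above can be sketched as follows, assuming the category graph is available as plain mappings. The function name and the dict-based graph model are illustrative stand-ins for the authors' pre-indexed Wikipedia API; the sample data mirrors the article/category example of Figure 2.

```python
# Sketch of the category-set construction (PC, PA_expanded, WC, RC).
# The dict-based graph access is a hypothetical stand-in for the paper's
# custom pre-indexed Wikipedia API.

def build_perspective_sets(seed_article, article_cats, subcats):
    """article_cats: article -> set of categories it belongs to;
    subcats: category -> set of its direct sub-categories."""
    pc0 = set(article_cats[seed_article])               # categories of PA_seed
    pc1 = {s for c in pc0 for s in subcats.get(c, ())}  # sub-categories, depth 1
    pc2 = {s for c in pc1 for s in subcats.get(c, ())}  # sub-categories, depth 2
    pc = pc0 | pc1 | pc2                                # PC = PC0 u PC1 u PC2
    # PA_expanded: all articles belonging to any category in PC
    pa_expanded = {a for a, cats in article_cats.items() if cats & pc}
    # WC: all categories associated with the articles in PA_expanded
    wc = {c for a in pa_expanded for c in article_cats[a]}
    rc = pc & wc                                        # RC: representative categories
    return pc, pa_expanded, wc, rc

# Usage, on a tiny graph mirroring Figure 2 (A2 as the seed article):
article_cats = {"A1": {"C1", "C10"}, "A2": {"C3", "C4"}, "A3": {"C4", "C7"}}
subcats = {"C2": {"C4"}, "C3": {"C4"}, "C4": {"C5", "C6", "C7"}}
pc, pa_expanded, wc, rc = build_perspective_sets("A2", article_cats, subcats)
```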
After building the Wikipedia category sets defined above⁶, i.e., PC, RC and WC, we match variable-length n-grams within a document against the articles in the set PAexpanded, and we check the cardinality of RC and WC. The cardinality scores, along with the n-gram frequencies, are used to compute a perspective score for each document.

⁶ The set-building phase is performed through a custom Wikipedia API that has pre-indexed the Wikipedia data and is hence computationally fast. For details see http://www3.it.nuigalway.ie/cirg/prj/WikiMadeEasy.html
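The paper names the ingredients of the score (n-gram matches against PAexpanded article titles, and the cardinalities of RC and WC) but not the exact formula. The sketch below shows one plausible combination; the |RC|/|WC| ratio, the four-level thresholds, and all function names are assumptions made purely for illustration.

```python
# Illustrative perspective scoring. The combination of n-gram frequency
# with the |RC|/|WC| cardinality ratio, and the level thresholds, are
# assumptions; the paper does not specify the exact formula.

def ngrams(tokens, max_n=4):
    """Yield all 1..max_n-grams of a token list as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def perspective_score(doc_text, pa_expanded_titles, rc, wc):
    """Count document n-grams matching PA_expanded article titles and
    weight the count by the |RC|/|WC| cardinality ratio."""
    tokens = doc_text.lower().split()
    titles = {t.lower() for t in pa_expanded_titles}
    matches = sum(1 for g in ngrams(tokens) if g in titles)
    ratio = len(rc) / len(wc) if wc else 0.0
    return matches * ratio

def adherence_level(score):
    """Map a raw score to the paper's four levels; thresholds are illustrative."""
    if score == 0:
        return "Neutral"
    if score < 2:
        return "Low"
    if score < 5:
        return "Medium"
    return "High"

# Usage: two title matches and a ratio of 1.0 give a score of 2.0 ("Medium").
titles = {"The War on Terrorism", "Osama bin Laden", "Ayman al-Zawahiri"}
rc = wc = {"C3", "C4", "C7"}
score = perspective_score("the war on terrorism and osama bin laden",
                          titles, rc, wc)
```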
2.2 Search Results Presentation

The perspective scores computed in Section 2.1 are displayed within the search results; based on the perspective score a document receives, we define four levels of perspective adherence: a) High, b) Medium, c) Low, and d) Neutral. Moreover, for documents with high, medium and low scores we also report the top-scoring perspective terms that were extracted using the Wikipedia graph structure, as explained previously. A sample search corresponding to the search query "india pakistan relations" and the perspective term "terrorism" is shown in Figure 4. As evident from the top search result, there is a high perspective of terrorism within the returned document, and the perspective terms that our algorithm fetches are as follows: a) the war on terrorism, b) ayman al zawahiri, and c) osama bin laden.

Figure 4: Search Results within Perspective-Aware Search

3. DISCUSSION

There have been many efforts in information retrieval research to present to users information regarding the relationship between the query and the answer set, and between the query and the document collection. Capturing this information during the retrieval process provides the user with much valuable information (e.g., whether a term is overly specific, or whether a term is ambiguous). Various attempts have been made to tackle this problem, ranging from the definition of snippets, to approaches that cluster search results (Clusty.com), to the presentation of diversified search results in the first positions of the ranked list offered to the users. Recently there has been a resurgence of interest in defining visualization techniques for search results that offer an effective and more informative alternative to the usual and scarcely informative ranked lists. Pioneering visualization systems are represented by TileBars [4] and InfoCrystal [9]; these attempts have aimed to provide the user with more information than that provided by the traditional ranked list.

This additional information can help the user in their search task (e.g., allowing them to navigate the collection more easily, or providing evidence to allow the user to reformulate their query more efficiently). Our proposed system, although related in that we also attempt to give the user an insight into the answer set and its relation to the query, differs in a fundamental manner. Our system, we posit, allows the user to gain insight into the answer set and its relation to the query, but moreover allows the user to gain an insight into a perspective inherent in the answer set. Our system uses an external and collectively created knowledge resource (which is less likely to be biased in a given direction) to obtain extra terms to represent the perspective of interest to the user. This knowledge (the perspective term and related terms) does not modify the query (as would an additional query term), but is instead used to highlight the presence of a perspective in the answer set.

In this paper we have proposed a novel approach for capturing the relationship between a user's query and the returned answer set. We do not rely on evidence in the document collection or the query stream, but instead extract terms from an external source of evidence to help users quickly see the presence of a particular perspective in the document collection and answer set.

4. FUTURE WORK

Having built the system and undertaken preliminary user evaluations⁷, we aim to undertake a complete and systematic review of the approach. This will comprise a number of separate user evaluation tasks. The initial experiments will involve comparing our search approach with and without the perspective-aware component over a number of tasks, to see if the additional context and information provided by our perspective-aware system aids users in a range of information-seeking tasks. Our second planned experiment will focus on people seeking information from newspaper articles, a domain wherein a degree of bias often exists. We wish to explore the users' experience with regard to any perceived bias in the considered corpora.

⁷ The preliminary user evaluations are not reported in this paper.

5. REFERENCES

[1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM '09, pages 5–14, 2009.
[2] N. Belkin. Cognitive models and information transfer. Social Science Information Studies, 4(2–3):111–129, 1984.
[3] M. A. Hearst. 'Natural' search user interfaces. Commun. ACM, 54(11):60–67, Nov. 2011.
[4] M. A. Hearst and J. O. Pedersen. Visualizing information retrieval results: a demonstration of the TileBar interface. In Conference Companion on Human Factors in Computing Systems, pages 394–395, 1996.
[5] J. Huang and E. N. Efthimiadis. Analyzing and evaluating query reformulation strategies in web search logs. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 77–86, 2009.
[6] P. Ingwersen. Cognitive perspectives of information retrieval interaction: Elements of a cognitive IR theory. Journal of Documentation, 52(1):3–50, 1996.
[7] B. J. Jansen, D. L. Booth, and A. Spink. Determining the informational, navigational, and transactional intent of web queries. Inf. Process. Manage., 44(3):1251–1266, May 2008.
[8] R. L. Santos, C. Macdonald, and I. Ounis. Intent-aware search result diversification. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 595–604, 2011.
[9] A. Spoerri. InfoCrystal: A visual tool for information retrieval & management. In Proceedings of the Second International Conference on Information and Knowledge Management, pages 11–20, 1993.
[10] A. Younus, M. A. Qureshi, S. K. Kingrani, M. Saeed, N. Touheed, C. O'Riordan, and G. Pasi. Investigating bias in traditional media through social media. In Proceedings of the 21st International Conference Companion on World Wide Web, WWW '12 Companion, pages 643–644, 2012.
[11] T. Zesch and I. Gurevych. Analysis of the Wikipedia category graph for NLP applications. In Proceedings of the TextGraphs-2 Workshop (NAACL-HLT), 2007.