Improving search experience on distributed leisure events

Richard Schaller, Computer Science (AI Group), Uni of Erlangen-Nuremberg, richard.schaller@cs.fau.de
Morgan Harvey, Computer Science (AI Group), Uni of Erlangen-Nuremberg, morgan.harvey@cs.fau.de
David Elsweiler, I:IMSK, University of Regensburg, david@elsweiler.co.uk

ABSTRACT
This paper examines how simple changes to a search system can influence the user's experience when using the system. In previous work, we evaluated user behaviour with a search tool designed to help people discover events, distributed over a city, that are of interest to them personally. We established, contrary to our expectations, that users mostly searched for events they already knew about, made many spelling errors and often achieved poor search performance. Taking these findings as inspiration, we made changes to how the system works. In this paper, we describe and motivate the changes and present a naturalistic log-based study (n=860) to examine their effect on user search behaviour.

1. INTRODUCTION AND MOTIVATION
When studying search behaviour, most published work focuses on analysis in the context of work tasks. Such tasks are not necessarily related to work but rather involve people performing a sequence of activities in order to accomplish a goal [5]. A work task has a recognisable start and end, may consist of a series of sub-tasks, and results in a meaningful product [2]. Models developed in this tradition therefore tend to assume that people look for information to close a gap in knowledge [1] which prevents them from completing their current task.

In contrast, we want to analyse search in the context of leisure activities, where there is no clear focus on a concrete work task. Elsweiler and colleagues [4] proposed a model for what they refer to as casual-leisure search, which deviates from standard work-based models. According to their model, in casual-leisure situations users are not focused on accomplishing a task but rather aim to be entertained or to pass time. These needs are influenced by the user's emotional state, physical state or social context. Additionally, such needs differ from work tasks in that the emotions induced by the content found, or even by the search process itself, are weighted more heavily than the raw informational content.

In [7, 8], Schaller et al. analysed mobile search behaviour in the context of a distributed event and compared the search characteristics and performance observed in a naturalistic user study to those of mobile web search. A main finding is that the analysed queries were much shorter than those of mobile web search. Also, most queries, in contrast to web search, were for known items: predominantly event names. Users made a large number of spelling mistakes, perhaps due to the environmental context (e.g. typing on a bumpy bus) or to unfamiliarity with the correct spelling of the known items. This is probably one of the causes of the poor search performance, with over 40% of searches being unsuccessful. A major conclusion of that paper is that people used the system as a tool for filtering and not for searching for new things. The paper gives suggestions for better tailoring the search system towards the observed user behaviour.

In this paper we build upon these results and explore how the proposed changes to the system influenced search characteristics and performance, with the overall aim of improving the search experience. We first describe the design of a study to answer this question. We report analyses of interaction logs from a variant of the search system that includes the proposed changes and was tested on an event similar to the one used in [8]: the Long Night of Music 2012, as opposed to the Long Night of Museums 2011, both located in Munich. We acknowledge possible doubts as to the comparability of these Nights due to their different topics; however, we address these doubts by showing that user behaviour is similar enough to make valid comparisons of system changes. We then show differences and similarities in search behaviour between the original and the improved search system and finally draw conclusions as to the effectiveness of the analysed changes.

2. DISTRIBUTED EVENTS
A distributed event is a collection of single events taking place over the same time period and sharing the same general theme. One such event is the Long Night of Munich Museums¹ (LNMuseums), an annual cultural event organised in the city of Munich², which was the context of the study performed in [8]. In addition to a diverse range of small and large museums, other cultural venues, such as the Hofbräuhaus and the botanical garden, open their doors for one evening in October. Many venues organise special activities and exhibitions not otherwise available. A similar distributed event is the Long Night of Music (LNMusic), which also takes place in Munich. Besides pubs, discotheques and clubs, some cultural venues such as churches and museums also take part, which leads to some overlap between the events offered on the two nights.

Visitors to a Long Night include both locals and tourists and represent a broad range of age groups and social backgrounds. In 2011 an estimated 20,000 people visited a total of 176 events at 91 distinct locations at the LNMuseums, including exhibitions, galleries and interactive events. The LNMusic is about the same size, with 206 events at 123 locations and approximately 20,000 visitors. Events on both nights take place all over the city, mostly in the city centre, but some, such as the Museum of the MTU Aero Engines and the Potato Museum, are located in the suburbs. Special bus tours are set up to transport visitors between events.

Events can be discovered by means of a booklet that is distributed for free by the organisers and contains descriptions of all events in the order they lie along the bus tours. This booklet is necessarily large (110 A6 pages per Long Night) and can be difficult to navigate.

¹ Name in German: Lange Nacht der Münchner Museen
² The event is organised by Münchner Kultur GmbH (http://www.muenchner.de/museumsnacht/)

Presented at EuroHCIR2012. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors.
3. SYSTEM
An Android app was developed in [8] to help visitors of the Long Night find events of interest to them personally. Once they have found and selected the events they would most like to visit, the system can create a time plan for the evening. The plan takes into account constraints such as the start and end times of events, the travel time between events, and public transport routes and schedules. If the user chooses more events than fit into the available time, the system tries to maximise the number of scheduled events by leaving out those requiring long travel times. The user can also customise the plan manually by adding, removing and re-ordering events. Based on the created plan, the application can guide the user between the chosen events using a map display and textual instructions. Figure 1 shows screenshots of the app³ as used on the LNMuseums and LNMusic.

Figure 1: The search screen with a query (left) and the map screen with the planned route (right)

The user has four ways to find events he would like to visit:
• receive recommendations based on a pre-defined profile and a collaborative filtering algorithm built into the app;
• browse events by bus route;
• browse events by genre or type;
• submit free-text queries, which search over the names and descriptions of the events.

As described in [8], the search functionality was implemented in Lucene⁴, and documents were represented by the titles and descriptions from the Long Night booklet. Lucene was extended to perform a search based on topics. First, the event descriptions and titles were tokenised and stemmed. Then, to match topically similar words, each token was mapped to one or more topic groups (these groups are taken from [3]). This way, terms such as "dinner" and "food" are mapped to the same groups, so an event description containing one of these words can be found by a query containing the other. To speed up interaction with the system, queries were submitted after each typed character (search-as-you-type). The result list shows the name and nearest bus stop of each retrieved event.

³ A video demo of the application can be found on YouTube (http://www.youtube.com/watch?v=qy1F8fZbowo)
⁴ Lucene version 3.1 (http://lucene.apache.org)
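The topic-based matching can be sketched in a few lines of Python. This is not the Lucene extension from [8]. The `TOPIC_GROUPS` table, its group IDs and the function names are hypothetical stand-ins for the Dornseiff groups [3], and stemming is omitted.

```python
import re

# Hypothetical topic groups standing in for the Dornseiff categories [3].
# Group IDs and word lists are illustrative only.
TOPIC_GROUPS = {
    "dinner": {"G_FOOD", "G_EVENING"},
    "food": {"G_FOOD"},
    "wine": {"G_FOOD"},
    "concert": {"G_MUSIC"},
    "jazz": {"G_MUSIC"},
}

def tokenize(text):
    """Lower-case the text and split it into word tokens (no stemming here)."""
    return re.findall(r"\w+", text.lower())

def topics(text):
    """Union of the topic groups of all tokens; unknown tokens contribute nothing."""
    groups = set()
    for token in tokenize(text):
        groups |= TOPIC_GROUPS.get(token, set())
    return groups

def topic_search(query, events):
    """Names of events whose title or description shares a topic group with the query."""
    query_topics = topics(query)
    return [name for name, description in events.items()
            if query_topics & topics(name + " " + description)]

events = {
    "Wine tasting": "Local food and wine in the courtyard",
    "Organ night": "A concert in the church",
}
print(topic_search("dinner", events))  # ['Wine tasting']
```

A query for "dinner" finds the wine tasting even though neither its title nor its description contains that word. In a full system the lookup table would cover the whole vocabulary, and tokens would be stemmed before the lookup.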
4. SEARCH SYSTEM CHANGES
Based on insight gained from user interactions with the original search system, as described in [8], it was determined that the following improvements could be made:

• Grep-Like Search:
Since users used the system mainly as a filtering tool, the search-as-you-type feature might have led users to give up early. The system tried to match whole (or stemmed) words, so after typing the first few characters of a word, but before finishing it, the user faced an empty result list. In our new system, used and evaluated on the LNMusic, we extended the search system with a grep-like feature which also matches parts of words and not just complete words. For example, if a user is looking for the event "Lenbach", it is sufficient to type just "Lenb". As a result, users are presented with an empty list of search results less often.

• Fuzzy Search:
The user interactions showed that a large number of spelling errors were being made. This was presumably due to environmental factors, e.g. typing on a bumpy bus, or to the high number of named entities whose spelling people are not familiar with. In either case, the system was adapted to better support the user by performing a fuzzy search, as proposed in [8]. We used the Lucene fuzzy search mechanism, which uses the Levenshtein edit distance to match terms that differ by only a few characters. If a user is looking for the event "Lenbach", it will be found even if "Lembach" is typed by mistake.

Both changes aim to help users find the events they are interested in more quickly and easily, and to improve the overall search experience.
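A small, self-contained Python sketch can illustrate how the original whole-word matching differs from the two changes. It does not reproduce Lucene's query parsing or scoring. The function names and the `max_edits` threshold are assumptions: Lucene 3.x parameterises its fuzzy query by a minimum similarity rather than a fixed number of edits. Whitespace tokenisation also replaces the real analyser.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

def words(title):
    return title.lower().split()

def whole_word_match(query, title):
    """Original behaviour (simplified): the query must equal a complete word."""
    return query.lower() in words(title)

def grep_like_match(query, title):
    """Grep-like behaviour: the query may occur anywhere inside a word."""
    return any(query.lower() in w for w in words(title))

def fuzzy_match(query, title, max_edits=1):
    """Fuzzy behaviour: a word within max_edits edits of the query also matches."""
    return any(levenshtein(query.lower(), w) <= max_edits for w in words(title))

print(whole_word_match("Lenb", "Lenbach"))    # False
print(grep_like_match("Lenb", "Lenbach"))     # True
print(fuzzy_match("Lembach", "Lenbach"))      # True
```

In a search-as-you-type setting, the grep-like check keeps results on screen while a word is still incomplete, and the fuzzy check absorbs a single mistyped letter. A real system would also need to rank the additional matches.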
Because the same naturalistic study was used to analyse other parts of the system, there were other small changes that are unrelated to the search system analysis. First, a recommender tab was added, increasing the number of tabs; previously, recommendations had been combined with the list of already selected events in one tab. Second, the position of the recommender tab was tested in an A/B test. Both changes are beyond the scope of this paper on search behaviour. They should not have any significant influence on it, as the layout of the search tab itself was not changed.

5. DATA COLLECTION
We examined the search behaviour of users by recording user interactions with our refined app at the LNMusic2012 in the same manner as with the original app at the LNMuseums2011. Again, the app was available for download from the Google Play Store and advertised on the official Long Night of Music web page. In total, the application was downloaded approximately 1,000 times, and 860 users allowed us to record their interaction data (in [8], approx. 500 downloads and 391 users are reported). We recorded all interactions with the application, including:
• submitted queries and result click-throughs;
• all interactions with the browsing and recommendation interfaces;
• tours generated and modifications to tours;
• all ratings submitted for events.
Users interacted with the system for 11.79 minutes⁵ on average (median 6.46 minutes). 57.3% of users interacted for more than 5 minutes and 17.9% for more than 30 minutes.

⁵ These figures were calculated by summing the time periods for which a user was active, discounting times where the system reported no interactions for more than 15 seconds. We further discounted any interaction sequence that contains gaps of non-interaction longer than 30 minutes, as these are likely due to logging problems caused by running out of power, connection problems, app crashes, etc.
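The first rule in footnote 5 sums only the gaps of at most 15 seconds between consecutive interactions. A minimal sketch of that rule follows, assuming that a longer gap is excluded entirely. The function name and the timestamps are made up for illustration. The 30-minute rule is not modelled, because the footnote does not specify how a discarded interaction sequence is delimited.

```python
from datetime import datetime, timedelta

# Gaps longer than this count as inactivity (assumption: such a gap is dropped entirely).
IDLE_LIMIT = timedelta(seconds=15)

def active_time(timestamps):
    """Sum the gaps between consecutive interactions that do not exceed IDLE_LIMIT."""
    ordered = sorted(timestamps)
    total = timedelta()
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        if gap <= IDLE_LIMIT:
            total += gap
    return total

start = datetime(2012, 1, 1, 20, 0, 0)  # arbitrary reference time
log = [start + timedelta(seconds=s) for s in (0, 5, 12, 120, 123)]
print(active_time(log))  # 0:00:15  (5 s + 7 s + 3 s; the 108 s pause is dropped)
```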
Since queries were submitted after every typed character, the recorded queries must be pre-processed to establish those that users actually intended to submit. For example, if the user wanted to search for "food", the system logged "f", "fo" and "foo" as well as "food". Furthermore, to submit a new query, the user first had to remove the old search terms from the search box. This again produced all prefixes, this time in decreasing length.
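The study identified intended queries manually. The structure of such logs can nevertheless be illustrated with a simple automatic heuristic that keeps only the "peaks" of a typing sequence. A peak is a string that is neither extended by the next logged string nor produced by deleting characters from the previous one. This is a hypothetical sketch, not the procedure used in the study. It misses, for example, a query that the user reaches by deleting characters and then leaves as is.

```python
def intended_queries(logged):
    """Heuristic: keep logged strings that are neither extended by the next string
    nor obtained by deleting characters from the previous one."""
    kept = []
    for i, query in enumerate(logged):
        if not query:
            continue  # an empty search box is never a query
        nxt = logged[i + 1] if i + 1 < len(logged) else None
        prev = logged[i - 1] if i > 0 else None
        extended_next = nxt is not None and len(nxt) > len(query) and nxt.startswith(query)
        deleted_from_prev = prev is not None and len(prev) > len(query) and prev.startswith(query)
        if not extended_next and not deleted_from_prev:
            kept.append(query)
    return kept

log = ["f", "fo", "foo", "food", "foo", "fo", "f", "", "j", "ja", "jaz", "jazz"]
print(intended_queries(log))  # ['food', 'jazz']
```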
As in [8], we manually judged whether queries were intended or not. Three assessors separately annotated each of the 12,500 logged queries as either intended or not intended. Inter-assessor agreement was very high: Fleiss' kappa was 0.915, and 89.8% of the queries labelled as intended by at least one assessor were also labelled as intended by at least one other. This process resulted in a final list of 1,434 search queries. This list is used in the following analyses and compared against the results reported in [8], which are based on 801 search queries.
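Fleiss' kappa, used to quantify agreement between the three assessors, compares the observed pairwise agreement per item with the agreement expected by chance from the overall category proportions. A plain-Python sketch follows. The function name and the toy data are illustrative, and it assumes every item is rated by the same number of raters.

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of raters who assigned item i to category j.
    Assumes every item was rated by the same number of raters (>= 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of assignments that went to each category.
    p = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Per-item agreement: share of rater pairs that agree on that item.
    agreement = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                 for row in counts]
    p_bar = sum(agreement) / n_items     # mean observed agreement
    p_e = sum(pj * pj for pj in p)       # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)     # undefined if all ratings fall in one category

# Toy data: 3 assessors, categories [intended, not intended], 4 logged queries.
toy = [[2, 1], [3, 0], [0, 3], [1, 2]]
print(round(fleiss_kappa(toy), 3))  # 0.333
```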
6. IS A COMPARISON OF NIGHTS FAIR?
Undoubtedly, the best way of comparing two versions of a system is to run experiments under the same external conditions. Unfortunately, this was not possible with our app for the LNMusic. We work together with the organisers to provide a real system for "productive" use and cannot experiment with arbitrary system variations. The alternative would be a lab study. However, we consider this less preferable, given that we wish to record interaction with the system in a real-world (i.e. non-simulated) setting. We therefore examined whether data obtained from different events can be fairly compared.

Our experience with past Long Nights has shown that user behaviour is largely independent of the actual type of Long Night. In [7], interviews with visitors of two different Long Nights revealed that, besides the topic of an event, other characteristics play a crucial role, such as novelty, the time and location of the event, or the possibility to take part in it. It is precisely this that sets a system for casual-leisure activities apart from a system for solving a work task [4].

In this section we give more insight into why our study on the LNMusic2012 and the study presented in [8] on the LNMuseums2011 are comparable. First of all, both distributed events have a lot in common, although their topics differ. They take place in the same city and are organised by the same company, and hence have the same advertising, booklet format and price; even the special bus routes provided are partially the same. We also noticed that some of the events on the LNMusic were also available on the LNMuseums. To quantify how many events the two nights have in common, we counted how many LNMuseums events had an event on the LNMusic at the same location, organised by the same museum, church, bar, etc. Surprisingly, 21.6% of the LNMuseums2011 events had a matching event on the LNMusic2012. The topic of the matched events might differ, but as mentioned above, the topic is only one of many aspects that matter to visitors when choosing an event.

Secondly, we looked into the app usage itself. When the app is first started, we ask users to fill out a short questionnaire. Among other things, we ask for the user's age group: below 18, 18 to 29, 30 to 39, 40 to 49, 50 to 59, and above 60 years old. Answering these questions was optional, but 246 users on the LNMuseums2011 and 495 users on the LNMusic2012 chose to do so. Based on these data, we compared the age distributions of users on both nights with a χ²-test, omitting the first and last age groups due to their small sample sizes. The test yielded χ² = 1.459 and p = 0.6918, indicating no significant difference between the app users of both nights.

For the two studies to be comparable, it is also important that system usage is similar on both nights. As described in Section 3, there were multiple ways (different tabs) of accessing the events, and we looked into which of these tabs users were interested in. We define a tab session as starting when a user switches to a tab and ending when he switches to another tab. Table 1 shows the share of tab sessions per tab. Based on these data, a χ²-test yields χ² = 5.387 and p = 0.1456, again indicating no significant difference between users of both nights.

             by Tour   by Genre   Search   Rec.+Rated
LNMuseums    28.0%     14.5%      15.4%    42.0%
LNMusic      26.6%     15.6%      15.1%    42.7%

Table 1: Tab sessions per tab, as a share of all tab sessions on each night
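Both χ²-tests compare two nights over four categories. That is a 2 × 4 contingency table of counts with (2 − 1)(4 − 1) = 3 degrees of freedom. The sketch below computes Pearson's statistic and uses the closed-form upper tail of the χ² distribution with 3 degrees of freedom, which is valid only for that case. The function names and the toy table are illustrative. The test must be run on raw counts, which the paper does not report; the percentages in Table 1 alone are not sufficient.

```python
import math

def chi2_statistic(table):
    """Pearson's chi-square statistic for a contingency table of raw counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_sf_df3(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Toy 2 x 4 table (made-up counts): identical proportions on both nights.
toy = [[10, 20, 30, 40],
       [20, 40, 60, 80]]
print(chi2_statistic(toy), chi2_sf_df3(chi2_statistic(toy)))  # 0.0 1.0
# Reported statistic for the age comparison:
print(round(chi2_sf_df3(1.459), 3))  # 0.692
```

The last line reproduces the reported p = 0.6918 from the reported χ² = 1.459. This is consistent with the age test having 3 degrees of freedom.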
that 56.5% of the interactions performed by users wouldn’t In conclusion we believe a comparison of the app user have been possible, simply because the events wouldn’t be behaviour on both nights is appropriate, given the circum- in the results. stances and difficulty of obtaining real-world user data of This analysis has revealed indicators of an improved search such apps. experience which means that the changes proposed in [8] are useful in the context of distributed events assistance. 7. INFLUENCES OF THE CHANGES 8. DISCUSSION AND CONCLUSIONS In [8] many statistics on query characteristics and query In this paper we analysed the changes in query behaviour performance are given based on analysis of search logs, a of users due to modifications of a search system used on common technique in the literature [6]. In this section we re- distributed events. We first describe two studies performed calculate these statistics based on the data logged with the during two such events, including a description of the system new system and compare behaviour with both app variants used and what changes to the search system were tested. As to determine what user behaviour changes the search system the LNMuseums and LNMusic have different topics we then modifications caused. To do so we consider a number of showed that a comparison of user behaviour between both different indicators of an improved search experience. nights is sensible and worthwhile. With this preparatory The average length of a search query on the LNMusic2012 work we then analysed users’ search behaviour by compar- was 5.6 characters (σ = 3.36) and 1.14 terms (σ = 0.41). ing search characteristics and search performance. Overall, This is much shorter than what is reported [8] for the LN- users typed much shorter search queries, especially in the Museums2011: 8.9 characters (σ = 5.31) and 1.21 terms case of a successful search. Also comparing query perfor- (σ = 0.52). 
In conclusion, we believe that comparing app user behaviour across the two nights is appropriate, given the circumstances and the difficulty of obtaining real-world user data for such apps.

7. INFLUENCES OF THE CHANGES
In [8], many statistics on query characteristics and query performance are given, based on analysis of search logs, a common technique in the literature [6]. In this section we re-calculate these statistics based on the data logged with the new system and compare behaviour across both app variants. The aim is to determine which changes in user behaviour the search system modifications caused. To do so, we consider a number of different indicators of an improved search experience.

The average length of a search query on the LNMusic2012 was 5.6 characters (σ = 3.36) and 1.14 terms (σ = 0.41). This is much shorter than what is reported in [8] for the LNMuseums2011: 8.9 characters (σ = 5.31) and 1.21 terms (σ = 0.52). A Z-test between both nights reveals that this difference is highly significant for both metrics (p < 0.01 in both cases). It seems that the grep-like searching, which also matches partial words, has led people to stop typing much earlier.
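The Z-test can be re-run from the reported summary statistics alone. The sketch below assumes large, independent samples and takes n as the number of intended queries on each night (801 and 1,434). The paper does not state the exact sample sizes used in the test, so these are assumptions.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic and two-sided p-value for a difference in means (large samples)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p

# Reported means and standard deviations; n = intended queries per night (assumed).
z_chars, p_chars = two_sample_z(8.9, 5.31, 801, 5.6, 3.36, 1434)
z_terms, p_terms = two_sample_z(1.21, 0.52, 801, 1.14, 0.41, 1434)
print(f"characters: z = {z_chars:.1f}")  # z = 15.9
print(f"terms:      z = {z_terms:.1f}")  # z = 3.3
```

Both p-values fall below 0.01, consistent with the reported result. The margin is much smaller for terms (z of roughly 3.3) than for characters.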
We assume that there are two main reasons why users stopped typing: either they had found what they were looking for (successful search), or they gave up because they could not find it (unsuccessful search). Users of our system had three options to interact with the entries in the result list:
• view the details of an event;
• mark an event as a candidate for tour inclusion;
• add the event to a pre-existing tour.
We consider any of the three as an indicator of search success, and the lack thereof as an indicator of an unsuccessful search. Good abandonment was not considered, since the result list contains no information beyond the event name and nearest bus stop.

The length of successful queries on the LNMusic2012 was 5.39 characters (σ = 3.27) and 1.12 terms (σ = 0.38). This is highly significantly (p < 0.01) shorter than reported in [8]: 9.90 characters (σ = 5.42) and 1.26 terms (σ = 0.57). In other words, users had to type about 45% fewer characters on average to find the events they were interested in. Unsuccessful queries, on the other hand, were slightly shorter in characters than in [8], 6.39 (σ = 3.56) as opposed to 7.47 (σ = 4.80), but slightly longer in terms: 1.20 (σ = 0.49) as opposed to 1.13 (σ = 0.42).

Of the 1,434 queries entered on the LNMusic, 76.7% resulted in an interaction of the user with an event, meaning they were successful. The remaining 23.3% were unsuccessful, a much better rate than the 40.3% unsuccessful searches in [8]. This decrease is highly significant (p < 0.01) and demonstrates that the improved search system was better able to assist users in finding the events they were looking for.

In [8], a large proportion (59.75%) of unsuccessful search queries had an empty result list. With the improved search system this was only the case for 12.57% of unsuccessful queries. But how successful were these "added" entries? We looked at all 2,157 interactions with events (viewing, rating, selecting) on the result lists of the LNMusic2012 app, which were created by the improved search system. We then ran the corresponding search queries through the old search system and counted whether these events would have been on the result list had the original search system been used. Only 938 event interactions would have been possible with the previous search system. This means that 56.5% of the interactions performed by users would not have been possible, simply because the events would not have been in the results.

This analysis has revealed indicators of an improved search experience, which means that the changes proposed in [8] are useful in the context of distributed-event assistance.
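The replay analysis re-runs each logged query against the old system and checks whether the event the user interacted with would have appeared in the results. In the sketch below, `old_search` is a hypothetical stand-in for the original Lucene-based system that uses simplified whole-word matching, and the toy log is made up. The last line only re-derives the reported share from the reported counts.

```python
def share_not_reachable(interactions, old_search):
    """interactions: (query, event_id) pairs logged with the new system.
    old_search: callable returning the set of event IDs the old system would list.
    Returns the share of interactions whose event the old system would not have listed."""
    reachable = sum(1 for query, event_id in interactions
                    if event_id in old_search(query))
    return 1 - reachable / len(interactions)

TITLES = {1: "Lenbach", 2: "Jazz Club"}  # toy index

def old_search(query):
    """Hypothetical old behaviour: whole-word matches only."""
    return {eid for eid, title in TITLES.items() if query.lower() in title.lower().split()}

log = [("lenbach", 1), ("lenb", 1), ("jazz", 2), ("jaz", 2)]
print(share_not_reachable(log, old_search))  # 0.5
print(f"{1 - 938 / 2157:.1%}")               # 56.5%
```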
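8. DISCUSSION AND CONCLUSIONS
In this paper we analysed changes in users' query behaviour caused by modifications to a search system used at distributed events. We first described two studies performed during two such events, including a description of the system used and of the changes to the search system that were tested. As the LNMuseums and LNMusic have different topics, we then showed that a comparison of user behaviour between both nights is sensible and worthwhile. Building on this preparatory work, we analysed users' search behaviour by comparing search characteristics and search performance. Overall, users typed much shorter search queries, especially in the case of a successful search. Comparing query performance also revealed a much higher success rate, with the proportion of unsuccessful searches falling from 40.3% to 23.3%. Finally, we compared both search systems running on the same search queries. This showed that less than half of the interactions with events would have been possible with the old system.

The search system as it is now is designed for users to find events they already know about in advance. But how can users be assisted in finding events that are new to them? How can we better support the discovery of serendipitous events? Since users seldom used the search system for that purpose, a second tool, such as a recommender, is necessary. The user could then decide whether to look for a concrete event he already knows of or to be inspired by the system. If users understand and accept such a split into two "orthogonal" tools, it is worth investigating and could point the way to much better assistance systems for distributed events.

Acknowledgments
This work was supported by the Embedded Systems Initiative (http://www.esi-anwendungszentrum.de).

9. REFERENCES
[1] N. J. Belkin, R. N. Oddy, and H. M. Brooks. ASK for information retrieval: Part I. Background and theory. Journal of Documentation, 38(2):61–71, 1982.
[2] K. Byström. Task complexity, information types and information sources: Examination of relationships. PhD thesis, University of Tampere, Dep. of Inf. Studies, 1999.
[3] F. Dornseiff. Der deutsche Wortschatz nach Sachgruppen. De Gruyter, Berlin, New York, 2004.
[4] D. Elsweiler, M. L. Wilson, and B. Kirkegaard Lunn. Understanding casual-leisure information behaviour. In New Directions in Information Behaviour. Emerald, 2011.
[5] P. Hansen. User interface design for IR interaction: a task-oriented approach. In CoLIS 3, pages 191–205, 1999.
[6] B. J. Jansen and A. Spink. How are we searching the world wide web? A comparison of nine search engine transaction logs. Information Processing & Management, 42(1):248–26, 2006.
[7] R. Schaller, M. Harvey, and D. Elsweiler. Entertainment on the go: Finding things to do and see while visiting distributed events. In Proceedings of IIiX, 2012.
[8] R. Schaller, M. Harvey, and D. Elsweiler. Out and about on museums night: Investigating mobile search behaviour for leisure events. In Proc. of Searching4Fun Workshop, ECIR, 2012.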