Contextual evaluation of mobile search

Ourdia Bouidghaghen (bouidgha@irit.fr), Lynda Tamine (lechani@irit.fr), Mariam Daoud (daoud@irit.fr), Cécile Laffaire (laffaire@irit.fr)
IRIT, Paul Sabatier University, 118, Route de Narbonne, Toulouse, France

ABSTRACT
We discuss the evaluation of our context-based personalized mobile search approach using a methodology that combines two evaluation approaches: context simulation and user study. Our personalized approach exploits context-aware user profiles through a personalized score in order to re-rank the initial search results obtained from a standard search system. We use Yahoo!'s open search web services platform BOSS (http://developer.yahoo.com/search/boss/) as a baseline. The context simulation allows us to simulate user locations and the user interests related to them. The user study involves real users who give relevance judgments on the top 20 documents returned by Yahoo and by our approach, through an assessment tool available on the web platform OSIRIM (https://osirim.irit.fr, developed at the IRIT lab). The experimental results show the effectiveness of our personalized approach according to the proposed evaluation protocol.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Relevance feedback

Keywords
mobile search, context, user profile, evaluation protocol

1. INTRODUCTION
The proliferation of mobile technologies (PDAs, mobile phones, ...) and, with them, of mobile users has moved the static world of classical and Web IR towards a constantly changing, context-based world. The notion of context, roughly described as the situation the user is in, is exploited in the development of new IR systems. Even when they consider only a small number of contextual features (location, time and interests), such systems face a new challenge for IR: how can these contextual data enhance user satisfaction? Another important issue is how to evaluate the strategies and techniques involved in these new systems. It is commonly accepted that the traditional evaluation methodologies used in the TREC, CLEF and INEX campaigns are not always suitable for considering the contextual dimensions of the information access process. Indeed, laboratory-based or system-oriented evaluation is challenged by the presence of contextual dimensions, such as the user profile or the environment, which significantly affect the relevance judgments or usefulness ratings made by the end user [17]. To alleviate such limitations, contextual evaluation methodologies have been proposed that support simulated user profiles through contextual simulations [16] or real evaluation scenarios through user studies [5].

As an initial approach that nevertheless allows meaningful observations, we present here an evaluation protocol that aims to evaluate empirically the performance of a novel context-based personalized mobile search system. For this purpose, we compare retrieval performance without and with personalization. We compare our approach to the results obtained from the Yahoo BOSS web search service, which does not itself implement any personalization capability. This paper discusses the methodology adopted and presents the results obtained. We first briefly survey IR evaluation methodologies in mobile contexts (Sect. 2). We then present our approach for mobile search personalization and introduce our contextual IR evaluation protocol (Sect. 3). Finally, we conclude and give perspectives for future work.

2. EVALUATION OF IR IN MOBILE CONTEXTS
Context-awareness in mobile IR focuses on context models including user profiles and environmental data (time, location, nearby persons, device and networks).
The state of the art shows that significant theoretical and technological progress has been achieved in this area over the last few years, encouraged by the growing interest in co-located human-human communications and large-scale location-based applications [10, 15]. In the development of an IR system for mobile environments, evaluation plays an important role, as it makes it possible to measure the effectiveness of the system and to better understand problems from both the system and the user interaction points of view. However, evaluation remains challenging for two main reasons [4, 11]: 1) environmental data should be available and several usage scenarios should be evaluated across them; 2) evaluation, if present, concerns a specific application (e.g., a tourist guide), so generalization to a wide range of information access applications is difficult. Both user-centered and benchmark evaluation approaches are adopted. However, as mobile IR systems are strictly related to users and their environment, user-centered evaluation, either live (user studies [3, 14, 8]) or in the laboratory (context-simulation frameworks [4, 9]), seems to be the most natural one. In [8] for example, a user-centered, iterative and progressive evaluation has been adopted, combining IR evaluation methods with human-computer interaction development techniques. The authors mainly consider the following guidelines: involve the right participants, who are either current or likely future users; choose the right situations, considering the different aspects of the environment; set relevant tasks that make participants seek information and that are in accordance with the situations that have been identified; use relevant evaluation approaches and measures according to the different sub-goals (effectiveness, usability) within the overall evaluation objective. The main limitations of user studies are that the experiments are not repeatable and that they induce extra costs. Within the mobile IR field, a benchmark evaluation has been used in [13, 12], where the authors demonstrated the efficacy of the benchmark approach for evaluating an early stage of their system.

Appears in the Proceedings of The 2nd International Workshop on Contextual Information Access, Seeking and Retrieval Evaluation (CIRSE 2010), March 28, 2010, Milton Keynes, UK. http://www.irit.fr/CIRSE/ Copyright owned by the authors.

3. EVALUATION OF OUR CONTEXT-BASED PERSONALIZED SEARCH
In this section, we first introduce our context-based personalized approach for mobile search; we then present the evaluation protocol devised for it.

3.1 Situation-aware user profile
Our context-aware approach to personalizing search results for mobile users [2] aims to adapt search results to the user's interests in a certain situation. A user $U$ is represented by a set of situations with their corresponding user profiles (interests), denoted $U = \{(S^i, G^i)\}$, where $S^i$ is a situation and $G^i$ its corresponding user profile. A situation $S^i$ refers to the geographical and/or temporal context of the user when submitting a query to the search engine. User profiles are built over each identified situation by combining graph-based query profiles. A query profile $G_q^s$ is built by exploiting the documents $D_r^s$ clicked by the user among those returned for the query $q^s$ submitted at time $s$. First, a keyword query context $K^s$ is calculated as the centroid of the documents in $D_r^s$:

$$K^s(t) = \frac{1}{|D_r^s|} \sum_{d \in D_r^s} w_{td} \tag{1}$$

$K^s$ is matched with each concept $c_j$ of the ODP (Open Directory Project, http://www.dmoz.org) ontology, represented by a single term vector $\vec{c_j}$, using the cosine similarity measure. The scores of the obtained concepts are propagated over the semantic links as explained in [6]. We select the most weighted graph of concepts to represent the query profile $G_q^s$ at time $s$. The user profile $G_0^i$, within each identified situation $S^i$, is initialized by the profile of the first query submitted by the user in the situation $S^i$. It is updated by combining it with the query profile $G_q^{s+1}$ of a new query for the same situation, submitted at time $s+1$. A case-based reasoning approach [1] is adopted for selecting a profile $G^{opt}$ to use for personalization in a new situation, by exploiting a similarity measure between situations as explained in [2]. Personalization is achieved by re-ranking the search results of queries related to the same search situation. The search results are re-ranked by combining, for each retrieved document $d_k$, the original score returned by the system, $score_o(q^*, d_k)$, and a personalized score $score_c(d_k, G^{opt})$, obtaining a final score $score_f(d_k)$ as follows:

$$score_f(d_k) = \gamma * score_o(q^*, d_k) + (1 - \gamma) * score_c(d_k, G^{opt}) \tag{2}$$

where $\gamma$ ranges from 0 to 1. The contributions of the personalized and original scores can be balanced by varying the value of $\gamma$. The personalized score $score_c(d_k, G^{opt})$ is computed using the cosine similarity measure between the result $d_k$ and the top-ranked concepts $C^{opt}$ of the user profile, as follows:

$$score_c(d_k, G^{opt}) = \sum_{c_j \in C^{opt}} sw(c_j) * \cos(\vec{d_k}, \vec{c_j}) \tag{3}$$

where $sw(c_j)$ is the similarity weight of the concept $c_j$ in the user profile $G^{opt}$.

3.2 Evaluation of contextual personalization
In the absence of a standard evaluation framework, a formal evaluation of contextualization techniques may require a significant amount of extra feedback from users in order to measure how much better a retrieval system performs with the proposed techniques than without them. In this case, the standard evaluation measures from the IR field require the availability of manual content ratings with respect to query relevance and specific user preference (i.e., constrained to the context of the user's search). To this end, we built a testbed consisting of a search space corpus, a set of queries and a set of hypothetical context situations. A user study was conducted in which participants were asked to provide ratings, in a blind test, for two retrieval scenarios: 1) the top 20 documents returned by Yahoo BOSS; 2) the top 20 documents returned by our personalized approach. In the following, we describe our experimental data sets and our evaluation protocol.

3.2.1 Contexts and Queries
Since the contextualization techniques are applied over time, we have defined a set of six short use cases as part of the evaluation setup. Each use case is composed of a set of queries within a given geographical context and a narrative describing the relevance of a document with regard to a query and a geographical context. We have simulated a set of six geographical contexts, each defined by a location type (zoo, music store, cinema, library, garden and museum). We have created a set of 25 distinct queries, with 5 queries belonging to each geographical context. Since mobile search queries are known to be short (and thus ambiguous), our queries are generally short (query length ≤ 3). Some of them are consequently ambiguous (e.g., "jaguar") and are tested within different geographical contexts (e.g., the query "water lilies" is tested within the two contexts "garden" and "museum"), for a total of 30 queries within the six contexts. Our goal was to verify whether taking geographical contexts and user profiles into account can enhance the performance of the search engine in responding to such ambiguous queries. Table 1 gives an example of the use case for the context "museum".
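To make the re-ranking step of Section 3.1 more concrete, the following Python sketch illustrates how Equations (2) and (3) could be combined. It is only an illustration under stated assumptions, not the implementation used in the experiments: documents and concepts are assumed to be sparse term-weight vectors stored as dictionaries, the function names (`cosine`, `personalized_score`, `rerank`) are hypothetical, and the toy profile, documents and scores are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def personalized_score(doc_vec, profile):
    """Equation (3): sum over profile concepts of sw(c_j) * cos(d_k, c_j).

    `profile` is assumed to be a list of (sw, concept_vector) pairs holding
    the top-ranked concepts of the selected profile.
    """
    return sum(sw * cosine(doc_vec, concept_vec) for sw, concept_vec in profile)


def rerank(results, profile, gamma):
    """Equation (2): blend original and personalized scores, then sort.

    `results` is assumed to be a list of (doc_id, original_score, doc_vector).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    scored = [
        (doc_id,
         gamma * original + (1.0 - gamma) * personalized_score(doc_vec, profile))
        for doc_id, original, doc_vec in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one art-related concept in the profile and two
# results for an ambiguous query such as "water lilies".
profile = [(1.0, {"painting": 1.0, "monet": 1.0})]
results = [
    ("d1", 0.9, {"pond": 1.0, "plant": 1.0}),      # botanical page
    ("d2", 0.5, {"monet": 1.0, "painting": 1.0}),  # art page
]

print(rerank(results, profile, gamma=0.0))  # personalized score only: d2 first
print(rerank(results, profile, gamma=1.0))  # original score only: d1 first
```

With `gamma = 0` the ranking depends on the personalized score alone, while `gamma = 1` reproduces the original ranking; intermediate values trade one off against the other, provided the two scores are on comparable scales (an assumption of this sketch).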
Table 1: An example of the use case "museum"

Context   QueryID   Query terms             Narrative
museum    M17       da Vinci                A document is relevant if it speaks about the painter da Vinci and/or his paintings
museum    M23       sunflowers              A document is relevant if it speaks about the painting sunflowers and/or its painter Van Gogh and/or his paintings
museum    M24       woman with a parasol    A document is relevant if it speaks about the painting woman with a parasol and/or its painter Claude Monet and/or his paintings
museum    M25       Edgar Degas             A document is relevant if it speaks about the painter Edgar Degas and/or his paintings
museum    M21       water lilies            A document is relevant if it speaks about the painting water lilies and/or its painter Claude Monet and/or his paintings

3.2.2 Document collection
The document collection consists of a set of about 3750 web pages retrieved from the web by Yahoo BOSS in response to our set of queries. It is built by collecting the first 150 retrieved documents per query (25 distinct queries × 150 documents).

3.2.3 User profile
The user profiles are integrated into the evaluation strategy by a simulation algorithm that generates them using hypothetical user interactions for each query. They are constructed based on manual judgments of the tuples for all the documents in the collection. The profiles built in this way simulate user click-through data.

3.2.4 Evaluation protocol
Our experimental design consists of evaluating the effectiveness of our personalized approach when using the user profile in the IR model over a sequence of user contexts. Since Yahoo BOSS does not provide an initial score for the documents in its result list, the re-ranking is based only on the personalized score (i.e., $\gamma = 0$ in Equation 2). The evaluation scenario is based on k-fold cross-validation, as in [7], and proceeds as follows:

• for each use case, divide the query set into k equally sized subsets, using k−1 subsets as training data for learning the user interests and the remaining subset as a test set;

• for each query in the training set, an automatic process generates the associated profile based on its top n relevant documents listed in the manually constructed relevance judgments file;

• update the user profile concept weights across the queries in the training set and use the resulting profile for re-ranking the search results of the queries in the test set.

In order to evaluate the performance of our proposed approach, a user study was conducted to compare the top 20 results of our approach with those of Yahoo BOSS. Using an assessment tool available on the web platform OSIRIM, the six users who participated in the experiment were asked to judge each tuple within the top 20 results of both our approach and Yahoo BOSS. Participants were unaware of which system they were judging. Relevance judgments were made using a three-level relevance scale: relevant, partially relevant, or not relevant.

3.3 Results and Discussion
We evaluate the effectiveness of the personalized search over the six use cases and compare the results obtained to the initial ones from Yahoo BOSS. To better estimate the quality of the search results at the top of the ranked list (since mobile users are unlikely to scroll through long lists of retrieved items), we compute the DCG@10 for all the queries. Figure 1 compares the effectiveness of the initial Yahoo search lists with that of the re-ranked lists obtained by our approach over all the queries. We observe that, in general, our approach enhances the initial DCG@10 obtained by the standard search and improves the quality of the top search results.

Figure 1: DCG@10 comparison between our personalized ...

We have also computed the percentage of improvement of personalized search over the standard search at different cut-off points (P@5, P@10, P@15 and P@20), averaged over all the queries. Results are presented in Table 2.

Table 2: Average top-n precision comparison between our personalized search and Yahoo BOSS over all queries

Average precision over all queries at:
               P@5       P@10      P@15      P@20
Yahoo BOSS     0.37      0.39      0.38      0.36
Our model      0.70      0.64      0.59      0.55
Improvement    87.50%    63.56%    53.49%    50.92%

The results show that personalized search achieves higher retrieval precision for almost all the queries in the six simulated contexts. In terms of average precision at the different cut-off points, the personalized search performs best, achieving an improvement over Yahoo BOSS of 87.50% at P@5, 63.56% at P@10, 53.49% at P@15 and 50.92% at P@20. However, the precision improvement varies between queries; Figure 2 gives an example of this variation for the queries of the context "museum". This is probably due to differences in the degree of ambiguity of the queries, which cannot be explained by the difference in query length alone. In fact, it also depends on the contents of the documents present in the collection.

Figure 2: Improvement at P@5, P@10, P@15 and P@20 for the queries of the context "museum"

4. CONCLUSION
In this paper we have presented our evaluation protocol for a context-aware personalization approach for mobile search. It is based on a combination of context simulation and user study. More precisely, on the one hand we exploit context simulation to create user contexts and profiles. On the other hand, we exploit Yahoo's BOSS web search service and real user judgments, collected through a user study, to evaluate the search effectiveness of our approach compared to a standard search. We evaluated our approach according to the proposed evaluation protocol and showed that it is effective. In future work, we plan to extend this protocol by using real user data from a search engine log file. The aim of this extension is to test the effectiveness of the personalized search on real mobile search contexts and click-through data available in the log file.

5. ACKNOWLEDGMENTS
The authors acknowledge the support of the project QUAERO, directed by the OSEO agency, France, and thank the PhD students at IRIT for their participation in the experiment.

6. REFERENCES
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 1994.
[2] O. Bouidghaghen, L. Tamine-Lechani, and M. Boughanem. Dynamically personalizing search results for mobile users. In Proc. of Flexible Query Answering Systems, pages 293–298, 2009.
[3] N. O. Bouvin, B. G. Christensen, K. Grønbæk, and F. A. Hansen. Hycon: a framework for context-aware mobile hypermedia. Hypermedia, 9(1):59–88, 2003.
[4] M. Bylund and F. Espinoza. Testing and demonstrating context-aware services with quake iii arena. Communications of the ACM, 45(1), 2002.
[5] V. Challam, S. Gauch, and A. Chandramouli. Contextual search using ontology-based user profiles. In Proceedings of RIAO 2007, 2007.
[6] M. Daoud, L. Tamine, M. Boughanem, and B. Chebaro. A session based personalized search using an ontological user profile. In ACM Symposium on Applied Computing (SAC), pages 1031–1035, 2009.
[7] M. Daoud, L. Tamine-Lechani, and M. Boughanem. Using a concept-based user context for search personalization. In Proc. of the 2008 Internat. Conf. of Data Mining and Knowledge Engineering, 2008.
[8] A. Göker and H. I. Myrhaug. Evaluation of a mobile information system in context. Information Processing and Management, 44(1):39–65, 2008.
[9] F. Gui, M. Adjouadi, and N. Rishe. A contextualized and personalized approach for mobile search. In 2009 Internat. Conf. on Advanced Information Networking and Applications Workshops, pages 966–971.
[10] R. Iqbal, J. Sturm, O. Kulyk, J. Wang, and J. Terken. User-centred design and evaluation of ubiquitous services. In Proc. of the 23rd annual internat. conf. on Design of communication, pages 138–145, 2005.
[11] J. Kjeldskov and C. Graham. A review of mobile hci research method. In Human-Computer Interaction with Mobile Devices and Services-5th Internat. Symposium, Mobile HCI 2003 proceedings, 2003.
[12] D. Menegon, S. Mizzaro, E. Nazzi, and L. Vassena. Benchmark evaluation of context-aware web search. In Proc. of ECIR 2009 Workshop on Contextual Information Access, Seeking and Retrieval Evaluation.
[13] S. Mizzaro, E. Nazzi, and L. Vassena. Retrieval of context-aware applications on mobile devices: how to evaluate? In Proc. of IIiX'08, pages 65–71, 2008.
[14] C. Panayiotou, M. Andreou, G. Samaras, and A. Pitsillides. Time based personalization for the moving user. In Proc. of the International Conference on Mobile Business (ICMB'05), pages 128–136, 2005.
[15] W. Schwinger, C. Grün, B. Pröll, W. Retschitzegger, and A. Schauerhuber. Context-awareness in mobile tourism guides - a comprehensive survey. Technical Report, Johannes Kepler University Linz, IFS/TK, 2005.
[16] A. Sieg, B. Mobasher, and R. Burke. Web search personalization with ontological user profiles. In Proc. of the 16th ACM conference on information and knowledge management, pages 525–534, 2007.
[17] L. Tamine-Lechani, M. Boughanem, and M. Daoud. Evaluation of contextual information retrieval effectiveness: Overview of issues and research. Knowledge and Information Systems, Springer, 2009.