=Paper=
{{Paper
|id=None
|storemode=property
|title=Contextual Evaluation of Mobile Search 
|pdfUrl=https://ceur-ws.org/Vol-569/paper4.pdf
|volume=Vol-569
}}
==Contextual Evaluation of Mobile Search ==
<pdf width="1500px">https://ceur-ws.org/Vol-569/paper4.pdf</pdf>
<pre>
                        Contextual evaluation of mobile search

Ourdia Bouidghaghen (bouidgha@irit.fr)
Lynda Tamine (lechani@irit.fr)
Mariam Daoud (daoud@irit.fr)
Cécile Laffaire (laffaire@irit.fr)

IRIT, Paul Sabatier University
118, Route de Narbonne
Toulouse, France

ABSTRACT
We discuss the evaluation of our context-based personalized mobile search
approach using a methodology that combines two evaluation approaches: context
simulation and a user study. Our personalized approach exploits context-aware
user profiles, through a personalized score, to re-rank the initial search
results obtained from a standard search system. We use BOSS¹, Yahoo!'s open
search web services platform, as a baseline. The context simulation allows us
to simulate user locations and the related user interests. The user study
involves real users who give their relevance judgments on the top 20 documents
returned by Yahoo BOSS and by our approach, through an assessment tool
available on the OSIRIM² web platform. The experimental results show the
effectiveness of our personalized approach according to the proposed
evaluation protocol.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Relevance feedback

Keywords
mobile search, context, user profile, evaluation protocol

1. INTRODUCTION
The proliferation of mobile technologies such as PDAs and mobile phones, and
with them of mobile users, has moved the static world of classical and Web IR
towards an ever-changing, context-based world. The notion of context, roughly
described as the situation the user is in, is exploited in the development of
new IR systems. Starting from a small number of contextual features (location,
time and interests), such systems face a new IR challenge: how can these
contextual data enhance user satisfaction? Another important issue is how to
evaluate the strategies and techniques involved in these new systems. It is
commonly accepted that the traditional evaluation methodologies used in the
TREC, CLEF and INEX campaigns are not always suitable for taking into account
the contextual dimensions of the information access process. Indeed,
laboratory-based or system-oriented evaluation is challenged by contextual
dimensions, such as the user profile or the environment, which significantly
affect the relevance judgments or usefulness ratings made by the end user
[17]. To alleviate such limitations, contextual evaluation methodologies have
been proposed that rely either on simulated user profiles through contextual
simulations [16] or on real evaluation scenarios through user studies [5].

As an initial approach that nonetheless allows meaningful observations, we
present here an evaluation protocol for empirically assessing the performance
of a novel context-based personalized mobile search system. To this end, we
compare retrieval performance without and with personalization: our approach
is compared with the results obtained from the Yahoo BOSS web search service,
which does not itself implement any personalization capability. This paper
discusses the methodology adopted and presents the results obtained. We first
briefly survey IR evaluation methodologies in mobile contexts (Sec. 2). We then
present our approach to mobile search personalization and introduce our
contextual IR evaluation protocol (Sec. 3). Finally, we conclude and give
perspectives for future work.

¹ http://developer.yahoo.com/search/boss/
² https://osirim.irit.fr, developed at the IRIT lab

Appears in the Proceedings of The 2nd International Workshop on Contextual
Information Access, Seeking and Retrieval Evaluation (CIRSE 2010), March 28,
2010, Milton Keynes, UK.
http://www.irit.fr/CIRSE/
Copyright owned by the authors.

2. EVALUATION OF IR IN MOBILE CONTEXTS
Context-awareness in mobile IR focuses on context models that include user
profiles and environmental data (time, location, nearby persons, device and
networks). The state of the art shows that significant theoretical and
technological progress has been made in this area over the last few years,
encouraged by the growing interest in co-located human-human communication and
large-scale location-based applications [10, 15]. In the development of an IR
system for mobile environments, evaluation plays an important role, as it makes
it possible to measure the effectiveness of the system and to better understand
problems from both the system and the
user interaction point of view. However, evaluation remains challenging,
mainly for the following reasons [4, 11]: 1) environmental data must be
available, and several usage scenarios must be evaluated across them; 2)
evaluation, when present, concerns a specific application (e.g., a tourist
guide), so generalization to a wide range of information access applications
is difficult. Both user-centered and benchmark evaluation approaches have been
adopted. However, as mobile IR systems are closely tied to users and their
environment, user-centered evaluation, either live (user studies [3, 14, 8])
or in the laboratory (context-simulation frameworks [4, 9]), seems the most
natural choice. In [8], for example, a user-centered, iterative and
progressive evaluation was adopted, combining IR evaluation methods with
human-computer interaction development techniques. The authors mainly follow
these guidelines: involve the right participants, that is, current or likely
future users; choose the right situations, considering the different aspects
of the environment; set relevant tasks that make participants seek information
and that are consistent with the identified situations; and use evaluation
approaches and measures suited to the different sub-goals (effectiveness,
usability) within the overall evaluation objective. The main limitations of
user studies are that experiments are not repeatable and that they incur extra
costs. Within the mobile IR field, a benchmark evaluation was used in [13, 12],
whose authors demonstrated the efficacy of the benchmark approach for
evaluating an early stage of their system.

3. EVALUATION OF OUR CONTEXT-BASED PERSONALIZED SEARCH
In this section, we first introduce our context-based personalized approach to
mobile search; we then present the evaluation protocol designed for it.

3.1 Situation-aware user profile
Our context-aware approach to personalizing search results for mobile users
[2] aims to adapt search results to the user's interests in a given situation.
A user U is represented by a set of situations with their corresponding user
profiles (interests), denoted U = {(S^i, G^i)}, where S^i is a situation and
G^i its corresponding user profile. A situation S^i refers to the geographical
and/or temporal context of the user when submitting a query to the search
engine. User profiles are built for each identified situation by combining
graph-based query profiles. A query profile G_q^s is built from the documents
D_r^s that the user clicked among those returned for the query q^s submitted
at time s. First, a keyword query context K^s is calculated as the centroid of
the documents in D_r^s, where w_td is the weight of term t in document d:

    K^s(t) = \frac{1}{|D_r^s|} \sum_{d \in D_r^s} w_{td}                        (1)

K^s is matched with each concept c_j of the ODP³ ontology, represented by a
single term vector c_j, using the cosine similarity measure. The scores of the
obtained concepts are propagated over the semantic links as explained in [6].
We select the most heavily weighted graph of concepts to represent the query
profile G_q^s at time s. The user profile G_0^i within each identified
situation S^i is initialized with the profile of the first query submitted by
the user in situation S^i. It is updated by combining it with the query
profile G_q^{s+1} of a new query submitted in the same situation at time s+1.
A case-based reasoning approach [1] is adopted to select the profile G^opt used
for personalization in a new situation, by exploiting a similarity measure
between situations as explained in [2]. Personalization is achieved by
re-ranking the search results of queries related to the same search situation.
For each retrieved document d_k, the original score returned by the system,
score_o(q*, d_k), is combined with a personalized score score_c(d_k, G^opt) to
obtain a final score score_f(d_k):

    \mathrm{score}_f(d_k) = \gamma \cdot \mathrm{score}_o(q^*, d_k)
                            + (1 - \gamma) \cdot \mathrm{score}_c(d_k, G^{opt})  (2)

where γ ranges from 0 to 1. Varying γ controls the relative weight of the
original and personalized scores: γ = 1 keeps the original score only, while
γ = 0 relies on the personalized score alone. The personalized score
score_c(d_k, G^opt) is computed using the cosine similarity between the result
d_k and the top-ranked concepts C^opt of the user profile:

    \mathrm{score}_c(d_k, G^{opt}) = \sum_{c_j \in C^{opt}}
                                     \mathrm{sw}(c_j) \cdot \cos(\vec{d_k}, \vec{c_j})  (3)

where sw(c_j) is the similarity weight of the concept c_j in the user profile
G^opt.

³ The Open Directory Project (ODP): http://www.dmoz.org

3.2 Evaluation of contextual personalization
In the absence of a standard evaluation framework, a formal evaluation of
contextualization techniques may require a significant amount of extra
feedback from users in order to measure how much better a retrieval system
performs with the proposed techniques than without them. In this case, the
standard IR evaluation measures require manual content ratings with respect to
both query relevance and the specific user preference (i.e., constrained to
the context of the user's search). To this end, we built a testbed consisting
of a search space corpus, a set of queries and a set of hypothetical context
situations. A user study was conducted in which participants were asked to
provide ratings, in a blind test, for two retrieval scenarios: 1) the top 20
documents returned by Yahoo BOSS, and 2) the top 20 documents returned by our
personalized approach. In the following, we describe our experimental data
sets and our evaluation protocol.

3.2.1 Contexts and Queries
Since the contextualization techniques are applied over time, we defined a set
of six short use cases as part of the evaluation setup. Each use case consists
of a set of queries within a given geographical context and, for each query, a
narrative describing when a document is relevant to that query in that
geographical context. We simulated six geographical contexts, each defined by a
location type (zoo, music store, cinema, library, garden and museum). We
created 25 distinct queries, with 5 queries belonging to each geographical
context. Since mobile search queries are known to be short (and thus
ambiguous), our queries are generally short (at most 3 terms), and some of
them are therefore ambiguous (e.g., "jaguar") and are tested within different
geographical contexts (e.g., the query "water lilies" is tested within both
the "garden" and "museum" contexts), giving a total of 30 queries across the
six contexts. Our goal was to verify whether taking geographical contexts and
user profiles into account can enhance the performance of the search engine on
such ambiguous queries. Table 1 gives an example of the use case for the
museum context.
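The construction of the keyword query context in Equation (1) of Section 3.1,
and its matching against ontology concepts by cosine similarity, can be
sketched as below. This is a minimal sketch under stated assumptions, not the
authors' implementation: term-weight vectors are assumed to be sparse
dictionaries, the concept names and vectors are hypothetical stand-ins for ODP
concepts, and the propagation of concept scores over semantic links [6] is
left out.

```python
import math
from collections import defaultdict
from typing import Dict, List

# Assumption: term-weight vectors are sparse dicts {term: weight}.
Vector = Dict[str, float]


def keyword_query_context(clicked_docs: List[Vector]) -> Vector:
    """Eq. (1): K^s(t) = (1 / |D_r^s|) * sum over d in D_r^s of w_td."""
    if not clicked_docs:
        return {}
    totals = defaultdict(float)
    for doc in clicked_docs:
        for term, weight in doc.items():
            totals[term] += weight
    n = len(clicked_docs)
    return {term: total / n for term, total in totals.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either has zero norm)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_concepts(k_s: Vector, concepts: Dict[str, Vector]) -> Dict[str, float]:
    """Score each concept c_j by cos(K^s, c_j) and keep the non-zero scores.

    Score propagation over the ontology's semantic links is not modelled here.
    """
    scores = {name: cosine(k_s, vec) for name, vec in concepts.items()}
    return {name: s for name, s in scores.items() if s > 0.0}


# Toy usage: two clicked documents and two hypothetical ODP-like concepts.
clicked = [{"monet": 1.0, "painting": 0.5}, {"painting": 1.0}]
k_s = keyword_query_context(clicked)  # {"monet": 0.5, "painting": 0.75}
concepts = {
    "Arts/Painting": {"painting": 1.0, "monet": 1.0},
    "Recreation/Gardens": {"garden": 1.0, "lily": 1.0},
}
print(match_concepts(k_s, concepts))  # only "Arts/Painting" has a non-zero score
```

Keeping vectors sparse matches the setting where each document and each
concept uses only a small part of the vocabulary.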

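The re-ranking step defined by Equations (2) and (3) of Section 3.1 can be
sketched in the same way. The sketch assumes a hypothetical profile layout
that maps each concept to its similarity weight sw(c_j) and a term vector, and
a hypothetical `top_n` parameter for selecting C^opt, since the number of
top-ranked concepts is not stated here. The default γ = 0 mirrors the setting
of Section 3.2.4, where only the personalized score is available.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = Dict[str, float]
# Hypothetical profile layout: concept -> (similarity weight sw(c_j), concept vector).
Profile = Dict[str, Tuple[float, Vector]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either has zero norm)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_concepts(profile: Profile, top_n: Optional[int] = None) -> Profile:
    """C^opt: the top_n concepts of G^opt by weight (all of them if top_n is None)."""
    ranked = sorted(profile.items(), key=lambda item: item[1][0], reverse=True)
    return dict(ranked if top_n is None else ranked[:top_n])


def personalized_score(doc: Vector, c_opt: Profile) -> float:
    """Eq. (3): score_c(d_k, G^opt) = sum over c_j in C^opt of sw(c_j) * cos(d_k, c_j)."""
    return sum(sw * cosine(doc, vec) for sw, vec in c_opt.values())


def final_score(original: float, personalized: float, gamma: float) -> float:
    """Eq. (2): gamma * score_o + (1 - gamma) * score_c, with gamma in [0, 1]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * original + (1.0 - gamma) * personalized


def rerank(results: List[Tuple[str, float, Vector]], profile: Profile,
           gamma: float = 0.0, top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Re-rank (doc_id, original_score, doc_vector) triples by score_f, best first."""
    c_opt = top_concepts(profile, top_n)
    scored = [(doc_id, final_score(orig, personalized_score(vec, c_opt), gamma))
              for doc_id, orig, vec in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage with hypothetical concepts and documents.
profile = {
    "Arts/Painting": (0.9, {"painting": 1.0, "monet": 1.0}),
    "Recreation/Gardens": (0.1, {"garden": 1.0, "lily": 1.0}),
}
results = [("d1", 0.0, {"garden": 1.0, "lily": 1.0}),
           ("d2", 0.0, {"painting": 1.0, "monet": 1.0})]
print(rerank(results, profile))  # d2 first, then d1
```

With γ = 0 the original scores passed in are ignored, and documents are
ordered purely by their similarity to the weighted profile concepts.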
3.2.2 Document collection
The document collection consists of about 3750 web pages retrieved by Yahoo
BOSS in response to our set of queries: for each query, we collected the first
150 retrieved documents.
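The k-fold cross-validation protocol described in Section 3.2.4 can be
sketched as follows. This is an illustrative sketch only: the round-robin fold
assignment, the summation used to update concept weights, and the
`query_profile` and `evaluate` callables (hypothetical stand-ins for the
automatic profile generation from the top n relevant documents and for the
re-ranking evaluation) are assumptions. The value k = 5 in the usage example
is chosen only because the museum use case of Table 1 has five queries.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Profile = Dict[str, float]  # hypothetical: concept -> accumulated weight


def k_folds(queries: Sequence[str], k: int) -> List[List[str]]:
    """Split queries into k round-robin folds (sizes differ by at most one)."""
    if not 2 <= k <= len(queries):
        raise ValueError("k must satisfy 2 <= k <= number of queries")
    return [list(queries[i::k]) for i in range(k)]


def cross_validate(
    queries: Sequence[str],
    k: int,
    query_profile: Callable[[str], Profile],
    evaluate: Callable[[str, Profile], float],
) -> List[Tuple[str, float]]:
    """Learn a profile on k-1 folds, then evaluate each query of the held-out fold."""
    folds = k_folds(queries, k)
    outcomes = []
    for i, test_fold in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        # Assumption: the user profile is updated by summing the concept
        # weights of the per-query profiles of the training queries.
        profile: Profile = {}
        for q in train:
            for concept, weight in query_profile(q).items():
                profile[concept] = profile.get(concept, 0.0) + weight
        for q in test_fold:
            outcomes.append((q, evaluate(q, profile)))
    return outcomes


def toy_query_profile(query: str) -> Profile:
    # Stand-in for the automatic process: every query maps to one concept.
    return {"Arts/Painting": 1.0}


def toy_evaluate(query: str, profile: Profile) -> float:
    # Stand-in evaluation: report the learned weight of that concept.
    return profile.get("Arts/Painting", 0.0)


museum = ["M17", "M21", "M23", "M24", "M25"]  # query IDs from Table 1
print(cross_validate(museum, 5, toy_query_profile, toy_evaluate))
# each query scores 4.0, learned from the other four queries
```

Because every query is held out exactly once, each query's result is produced
by a profile that never saw that query during training.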

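The measures reported in Section 3.3 (DCG@10 over the graded judgments, P@n,
and the relative improvement of Table 2) can be sketched as below. The gain
values assigned to the three relevance levels, the DCG discount variant
(gain / log2(rank + 1)), and counting partially relevant documents as relevant
for P@n (`min_gain=1`) are assumptions, since the paper does not specify them.

```python
import math
from typing import Dict, Sequence

# Assumed gains for the three-level relevance scale of the user study.
GAIN: Dict[str, int] = {"relevant": 2, "partially relevant": 1, "not relevant": 0}


def dcg_at_k(labels: Sequence[str], k: int = 10) -> float:
    """DCG@k = sum over ranks i = 1..k of gain(label_i) / log2(i + 1)."""
    return sum(GAIN[label] / math.log2(rank + 1)
               for rank, label in enumerate(labels[:k], start=1))


def precision_at_n(labels: Sequence[str], n: int, min_gain: int = 1) -> float:
    """P@n: share of the top n results whose gain is at least min_gain."""
    return sum(1 for label in labels[:n] if GAIN[label] >= min_gain) / n


def improvement(model: float, baseline: float) -> float:
    """Relative improvement (in %) of the model over the baseline."""
    return (model - baseline) / baseline * 100.0


ranked = ["relevant", "not relevant", "partially relevant"]
print(dcg_at_k(ranked))           # 2/1 + 0/log2(3) + 1/2 = 2.5
print(precision_at_n(ranked, 2))  # 0.5
```

Setting `min_gain=2` gives a stricter precision that counts only fully
relevant documents. Applying `improvement` to the rounded averages printed in
Table 2 does not reproduce the reported percentages exactly (for P@5,
(0.70 - 0.37) / 0.37 is about 89.2% rather than 87.50%), so the reported
figures were presumably computed from unrounded or per-query values.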
Table 1: An example of the use case "museum"

Context  QueryID  Query terms           Narrative
museum   M17      da Vinci              A document is relevant if it speaks about the
                                        painter da Vinci and/or his paintings
         M23      sunflowers            A document is relevant if it speaks about the
                                        painting Sunflowers and/or its painter Van Gogh
                                        and/or his paintings
         M24      woman with a parasol  A document is relevant if it speaks about the
                                        painting Woman with a Parasol and/or its painter
                                        Claude Monet and/or his paintings
         M25      Edgar Degas           A document is relevant if it speaks about the
                                        painter Edgar Degas and/or his paintings
         M21      water lilies          A document is relevant if it speaks about the
                                        painting Water Lilies and/or its painter Claude
                                        Monet and/or his paintings

3.2.3 User profile
The user profiles are integrated into the evaluation strategy through a
simulation algorithm that generates them from hypothetical user interactions
for each query. They are constructed from manual judgments of the <query,
narrative, document> tuples for all the documents in the collection. Profiles
built in this way simulate user click-through data.

3.2.4 Evaluation protocol
Our experimental design evaluates the effectiveness of our personalized
approach when the user profile is used in the IR model over a sequence of user
contexts. Since Yahoo BOSS does not return an initial score for the documents
in its result list, re-ranking is based only on the personalized score (i.e.,
γ = 0 in Equation (2)). The evaluation scenario is based on k-fold
cross-validation, as in [7], and proceeds as follows:

   • for each use case, divide the query set into k equally sized subsets; use
     k−1 subsets as training data for learning the user interests and the
     remaining subset as the test set;

   • for each query in the training set, an automatic process generates the
     associated profile from its top n relevant documents, as listed in the
     manually constructed relevance judgments file;

   • update the user profile concept weights across the queries in the
     training set, and use the resulting profile to re-rank the search results
     of the queries in the test set.

To evaluate the performance of our approach, a user study was conducted to
compare the top 20 results of our approach with those of Yahoo BOSS. Using an
assessment tool available on the OSIRIM web platform, the six users who
participated in the experiment were asked to judge each <query, document,
narrative> tuple within the top 20 results of both our approach and Yahoo
BOSS. Participants did not know which system had produced the results they
judged. Relevance judgments were made on a three-level relevance scale:
relevant, partially relevant, or not relevant.

3.3 Results and Discussion
We evaluate the effectiveness of the personalized search over the six use
cases and compare the results with the initial ones from Yahoo BOSS. To better
estimate the quality of the search results at the top of the ranked list
(since mobile users are unlikely to scroll through long lists of retrieved
items), we compute DCG@10 for all the queries.

[Figure 1: DCG@10 comparison between our personalized search and Yahoo BOSS
over all queries]

Figure 1 compares the effectiveness of the initial Yahoo search lists with
that of the lists re-ranked by our approach over all the queries. We observe
that, in general, our approach improves on the initial DCG@10 of the standard
search and improves the quality of the top of the result lists. We also
computed the percentage improvement of personalized search over the standard
search at different cut-off points (P@5, P@10, P@15 and P@20), averaged over
all the queries. The results are presented in Table 2.

Table 2: Average top-n precision of our personalized search and Yahoo BOSS
over all queries

                 Average precision over all queries at:
                 P@5       P@10      P@15      P@20
Yahoo BOSS       0.37      0.39      0.38      0.36
Our model        0.70      0.64      0.59      0.55
Improvement      87.50%    63.56%    53.49%    50.92%

The results show that personalized search achieves higher retrieval precision
for almost all queries in the six simulated contexts. The best performance is
achieved by personalized search in terms of average precision at all cut-off
points, with improvements over Yahoo BOSS of 87.50% at P@5, 63.56% at P@10,
53.49% at P@15 and 50.92% at P@20. However, the precision improvement varies
between queries: Figure 2 shows this variation for the queries of the museum
context. This is probably due to differences in the degree of ambiguity of the
queries, which cannot be explained by differences in query length alone; it
also depends on the contents of the documents present in the collection.

[Figure 2: Improvement at P@5, P@10, P@15 and P@20 for the queries of the
context "museum"]

4. CONCLUSION
In this paper we have presented our evaluation protocol for a context-aware
personalization approach to mobile search. It is based on a combination of
context simulation and a user study. More precisely, on the one hand, we
exploit context simulation to create user contexts and profiles. On the other
hand, we exploit Yahoo's BOSS web search service and real user judgments,
collected through a user study, to evaluate the search effectiveness of our
approach compared with a standard search. We evaluated our approach according
to the proposed evaluation protocol and showed that it is effective. In future
work, we plan to extend this protocol using real user data from a search
engine log file. The extended protocol aims to test the effectiveness of
personalized search on real mobile search contexts and on the click-through
data available in the log file.

5. ACKNOWLEDGMENTS
The authors acknowledge the support of the QUAERO project, directed by the
OSEO agency, France, and thank the PhD students at IRIT who participated in
the experiment.

6. REFERENCES
[1] A. Aamodt and E. Plaza. Case-based reasoning: Foundational issues,
    methodological variations, and system approaches. AI Communications, 7(1),
    1994.
[2] O. Bouidghaghen, L. Tamine-Lechani, and M. Boughanem. Dynamically
    personalizing search results for mobile users. In Proc. of Flexible Query
    Answering Systems, pages 293–298, 2009.
[3] N. O. Bouvin, B. G. Christensen, K. Grønbæk, and F. A. Hansen. HyCon: a
    framework for context-aware mobile hypermedia. Hypermedia, 9(1):59–88,
    2003.
[4] M. Bylund and F. Espinoza. Testing and demonstrating context-aware
    services with Quake III Arena. Communications of the ACM, 45(1), 2002.
[5] V. Challam, S. Gauch, and A. Chandramouli. Contextual search using
    ontology-based user profiles. In Proceedings of RIAO 2007, 2007.
[6] M. Daoud, L. Tamine, M. Boughanem, and B. Chebaro. A session based
    personalized search using an ontological user profile. In ACM Symposium on
    Applied Computing (SAC), pages 1031–1035, 2009.
[7] M. Daoud, L. Tamine-Lechani, and M. Boughanem. Using a concept-based user
    context for search personalization. In Proc. of the 2008 Internat. Conf.
    of Data Mining and Knowledge Engineering, 2008.
[8] A. Göker and H. I. Myrhaug. Evaluation of a mobile information system in
    context. Information Processing and Management, 44(1):39–65, 2008.
[9] F. Gui, M. Adjouadi, and N. Rishe. A contextualized and personalized
    approach for mobile search. In 2009 Internat. Conf. on Advanced
    Information Networking and Applications Workshops, pages 966–971.
[10] R. Iqbal, J. Sturm, O. Kulyk, J. Wang, and J. Terken. User-centred design
    and evaluation of ubiquitous services. In Proc. of the 23rd annual
    internat. conf. on Design of communication, pages 138–145, 2005.
[11] J. Kjeldskov and C. Graham. A review of mobile HCI research methods. In
    Human-Computer Interaction with Mobile Devices and Services, 5th
    Internat. Symposium, Mobile HCI 2003 proceedings, 2003.
[12] D. Menegon, S. Mizzaro, E. Nazzi, and L. Vassena. Benchmark evaluation of
    context-aware web search. In Proc. of ECIR 2009 Workshop on Contextual
    Information Access, Seeking and Retrieval Evaluation.
[13] S. Mizzaro, E. Nazzi, and L. Vassena. Retrieval of context-aware
    applications on mobile devices: how to evaluate? In Proc. of IIiX'08,
    pages 65–71, 2008.
[14] C. Panayiotou, M. Andreou, G. Samaras, and A. Pitsillides. Time based
    personalization for the moving user. In Proc. of the International
    Conference on Mobile Business (ICMB'05), pages 128–136, 2005.
[15] W. Schwinger, C. Grün, B. Pröll, W. Retschitzegger, and A. Schauerhuber.
    Context-awareness in mobile tourism guides: a comprehensive survey.
    Technical Report, Johannes Kepler University Linz, IFS/TK, 2005.
[16] A. Sieg, B. Mobasher, and R. Burke. Web search personalization with
    ontological user profiles. In Proc. of the 16th ACM conference on
    information and knowledge management, pages 525–534, 2007.
[17] L. Tamine-Lechani, M. Boughanem, and M. Daoud. Evaluation of contextual
    information retrieval effectiveness: Overview of issues and research.
    Knowledge and Information Systems, Springer, 2009.

</pre>