Online evaluation of point-of-interest recommendation systems

Adriel Dean-Hall (University of Waterloo)
Charles L. A. Clarke (University of Waterloo)
Jaap Kamps (University of Amsterdam)
Julia Kiseleva (Eindhoven University of Technology)

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. ECIR Supporting Complex Search Task Workshop '15, Vienna, Austria. Published on CEUR-WS: http://ceur-ws.org/Vol-1338/.

ABSTRACT

In this work we describe a system to evaluate multiple point-of-interest recommendation systems. In this system each recommendation service will be exposed online and crowdsourced assessors will interact with merged results from multiple services, which are responding to suggestion requests live, in order to determine which system performs best. This work builds upon work done previously as part of the TREC Contextual Suggestion Track and describes plans for how the track will be run in 2015.

1. INTRODUCTION

Many point-of-interest recommendation systems have been developed, each using different techniques for making recommendations. Often, when these systems are evaluated, they operate on different datasets or different factors are compared. Having a framework to fit such systems into, and being able to compare them using fair, standardized techniques, will help us determine which systems perform best.

The TREC Contextual Suggestion track [6] has been running for three years, since 2012. In this track, systems which provide personalized point-of-interest suggestions are designed. The past three iterations of the track have followed the traditional TREC evaluation methodology closely: participants are given topics and develop a set of results for each topic, the results are then evaluated by assessors, and scores are assigned to each participating system based on these judgements [11]. Specifically, a set of profiles (ratings for a set of attractions) and contexts (names of cities) were released to participants. For each profile+context pair participants returned a set of ranked suggestions. These suggestions were then judged by the assessors who originally created the profiles and a score was assigned to each participant's set of results.

One disadvantage of this setup is that, for this track, topics are actually personal preferences provided by crowdsourced assessors who, after providing their preferences, have to wait weeks for attraction suggestions. This wait makes the task of assessing more difficult, as judgement is broken over a long period and assessors have to remember previous interactions with the system. Also, the longer the wait, the more difficult it is to get crowdsourced assessors to return to the task.

In this article we describe a setup that allows attraction recommendation services to be compared with users issuing requests for which suggestions are made live. Here participating recommendation services will have an online system which is able to respond to a suggestion request immediately. When a user (or an assessor) is ready for suggestions they make a request. The system will then send that user's profile and the name of the city to services. Suggestions will be made by multiple recommendation services and the returned results will be merged and presented to the user. The user will then interact with the results, which provides feedback on how good each service's suggestions are. A score for each service is continuously updated until the experiment ends.

We will describe the interface recommendation services need to implement in order to participate, how they will be compared, and some challenges that moving from a batch-style to a live evaluation setup introduces.

2. RELATED WORK

Point-of-interest recommendation is an area several researchers are pursuing. Braunhofer et al. [5] worked on an application that made personalized recommendations within cities; Adomavicius et al. [1] used collaborative filtering for similar goals, incorporating temporal features; Baltrunas et al. [3] used information such as budget and familiarity with the area. Additionally, several systems have been developed as part of the Contextual Suggestion track, including a system that used textual similarity between attractions [8] and a system that found reviews to be an informative feature [12]. The goal is to bring the efforts of all these systems into one framework in order to compare them fairly.

In addition to this framework being built upon the Contextual Suggestion track, our work is also inspired by work done during the plista dataset challenge [7]. In these experiments multiple competing systems registered to make recommendations about related articles users might find interesting, based on the article they were currently reading and previous interactions with the system. These recommendations had to be made live as they were presented to users while they were browsing news articles. Another source of inspiration is challenges such as the Netflix prize [4] and various Kaggle competitions (http://www.kaggle.com/), which, while not typically evaluated live, make use of various techniques, e.g., leaderboards, in order to provide feedback to participants.


3. SERVICE INTERFACE

In order to develop a framework for testing point-of-interest recommendation services we need to determine an interface that users will use to communicate with them. As parameters for each recommendation request the systems take in the user's profile (see Section 4 for a description of the profile) and, as context, the city which the user wants recommendations for. We could also gather more contextual information about our users during each request, for example a more precise location, who the user is travelling with (family, friends, alone), etc. However, currently, we only use the city and profile as input. The result returned to users will consist of an ordered list of attractions that the service thinks the user will like.

This is similar to the information participants in previous iterations of the track had available to them, but instead of getting a batch of profiles and cities and returning a batch of results, services will receive a single profile and city for each request. Additionally, instead of having a fixed profile for each user, on each request the profile may be updated with liked suggestions from previous requests.

Figure 1: 1. Request sent; 2. System passes request to services; 3. Services respond with suggestions; 4. Suggestions are merged and sent to user; 5. User sends suggestion interactions to system.
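To make this request/response cycle concrete, the sketch below shows one way a participating service could be implemented, assuming a JSON-over-HTTP interface. The /suggest route, the field names (city, profile, suggestions), and the toy ranking heuristic are illustrative assumptions, not part of the track specification.

```python
# Hypothetical sketch of a participating recommendation service.
# Endpoint name, JSON fields, and ranking heuristic are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Toy candidate pool: attraction ID -> city. The real pool is the track's
# fixed attraction collection (see Section 5).
CANDIDATES = {
    "attr-001": "Vienna",
    "attr-002": "Vienna",
    "attr-003": "Amsterdam",
}

@app.route("/suggest", methods=["POST"])
def suggest():
    payload = request.get_json()
    city = payload["city"]                 # context: name of the city
    profile = payload.get("profile", [])   # past interactions, possibly empty

    # Attractions the user has already liked in previous requests (Section 4).
    liked = {p["attraction_id"] for p in profile if (p.get("rating") or 0) > 0}

    # Toy ranking: candidates in the requested city the user has not already
    # liked, in collection order. A real service would rank by predicted
    # preference.
    ranked = [a for a, c in CANDIDATES.items() if c == city and a not in liked]

    # Respond with an ordered list of attraction IDs, best first.
    return jsonify({"suggestions": ranked[:50]})

if __name__ == "__main__":
    app.run(port=8080)
```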
4. USER PROFILES

This leads us to the question of what a user's profile consists of. Initially, for a new user, no information is in their profile. As the user asks for suggestions and interacts with them their profile will expand. For each attraction that has been recommended to the user, the data about their interaction is added to their profile. Examples of interaction include the user viewing the attraction's website, the user "starring" the attraction, and the user rating the attraction. Currently, these three pieces of information are recorded for each attraction the user has interacted with; however, other interactions, e.g., reviews written, could also be included in the profile. As the user interacts with the system their profile will expand, giving recommendation services a better opportunity to make more personalized suggestions.

Once interaction data for an attraction has been added to the user's profile it is essentially available publicly to all services. One issue with this setup is that certain requests will have small profiles and certain requests will have larger profiles. Limiting the size of the profile will allow us not to worry as much about how the size of the profile affects service performance. One option to resolve this is, instead of adding every piece of interaction data to a user's profile, to only add certain attractions. One simple method of doing this is to only add attraction interaction data for a subset of cities. This will also allow us to ask for suggestions for the same city multiple times (if that city is not part of the profile) and not necessarily have to return to users for result interactions.

Another potential feature is to push any updates to the profiles to services. This will allow them to get feedback on how well they are performing and whether the suggestions they are making are actually liked by users, without having to pool profiles or wait for another request from the same user. Services can then update their strategies and attempt to improve suggestion results for future requests.
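To make the profile contents concrete, a minimal sketch of one possible representation is shown below. The class and field names (Interaction, visited_website, starred, rating) are assumptions made for illustration, not the track's actual schema.

```python
# Illustrative sketch of a user profile that grows with interaction data.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interaction:
    attraction_id: str              # ID from the fixed attraction collection
    city: str                       # city the suggestion was made for
    visited_website: bool = False   # did the user click through to the URL?
    starred: bool = False           # did the user "star" the attraction?
    rating: Optional[int] = None    # explicit rating, if one was given

@dataclass
class Profile:
    user_id: str
    interactions: List[Interaction] = field(default_factory=list)

    def record(self, interaction: Interaction) -> None:
        """Append interaction data after each request; a new user starts with
        an empty profile which expands over time (Section 4)."""
        self.interactions.append(interaction)

# Example: a new user stars one suggested attraction and rates it.
profile = Profile(user_id="assessor-42")
profile.record(Interaction("attr-001", "Vienna",
                           visited_website=True, starred=True, rating=4))
```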
5. DATASET COLLECTION

In previous iterations of this track services were allowed to recommend any attraction they found on the open web. For simplicity, the points-of-interest that services are allowed to recommend in this experiment come from a fixed collection of attractions. Services will simply return a list of attraction IDs. When they are displayed to users, each attraction will consist of a title, short description, and website URL with more information about the attraction. Users will use this information to make a decision about whether they like a particular attraction. Again, here we are presenting this basic information about each attraction, but additional information, such as the attraction's category or reviews about the attraction, could also be presented to users.

This pool of attractions is collected as part of an ongoing effort by several research groups who have expertise in gathering this sort of information due to participation in previous Contextual Suggestion TREC tracks. Having a fixed data collection will allow us to limit which attractions are suggested and will allow for greater reusability of the judgements provided by users.
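An entry in the fixed collection might be represented roughly as follows; the field names here are assumptions based on the description above, not the collection's actual format.

```python
# Minimal sketch of an entry in the fixed attraction collection.
from dataclasses import dataclass

@dataclass(frozen=True)
class Attraction:
    attraction_id: str   # the ID services return in their ranked lists
    title: str           # shown to the user
    description: str     # short, generic description (see Section 10)
    url: str             # website with more information

collection = {
    "attr-001": Attraction("attr-001", "Example Gallery",
                           "A small gallery in the city centre.",
                           "http://example.com/gallery"),
}
```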
6. SERVICE EVALUATION

So, in order to develop a recommendation service that fits into this framework, services must set up a server that responds to suggestion requests with a list of attraction IDs. The goal of forcing services into this framework is so that we can compare the performance of multiple services. Instead of having users communicate directly with a recommendation service, they will communicate with an intermediary system. Services will be required to register themselves with this system and the system will then pass recommendation requests to each service, logging the services' responses.

Each service will be given an opportunity to make suggestions. One option to do this is, for each request, to select one of the services at random and present the results from that service to the user. The user will then interact with the system and based on this interaction we can determine how good the suggestions were. As more suggestion requests are made each service will be given multiple chances to make recommendations and services can be compared.

We take a slightly different approach where, for each suggestion request, a subset of the services are queried and the results from all these services are merged into a final list of suggestions which is presented to users. If the user interacts with a suggestion, the score of the service that made that suggestion will be affected. It is possible for multiple services to make the same suggestion; in this case, if the user interacts with the suggestion, all the services that made it will have their score affected.

This setup will give services more opportunities to make suggestions (during each request rather than only some), however we will need a method of merging results from multiple services. Multiple result interleaving approaches have been discussed by Radlinski and Craswell which would be appropriate for our purposes [9], including the team draft method proposed by Radlinski et al. [10].

The reason a subset of the services rather than all services are selected is that, realistically, most users will only view the attractions near the top of the list. If we try to compare too many services the user may not see any suggestions from some of the services (or only see a few suggestions from each service). In order to prevent this situation five services are chosen for each suggestion request (this number is chosen somewhat arbitrarily). We also limit the number of suggestions each service can make per request to 50.
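The sketch below illustrates a team-draft-style merge generalized to several services. The original team draft method [10] is defined for two rankings, so this multi-service variant is an assumption made for illustration; note also that it credits only the service that contributed each displayed entry, whereas the track's setup credits every service that suggested an interacted-with attraction.

```python
# Team-draft-style interleaving of several services' ranked lists (sketch).
import random
from typing import Dict, List, Tuple

def team_draft_merge(rankings: Dict[str, List[str]], k: int = 10) -> List[Tuple[str, str]]:
    """Merge per-service ranked lists of attraction IDs into one list of
    (attraction_id, credited_service) pairs of length at most k."""
    merged: List[Tuple[str, str]] = []
    seen = set()
    pointers = {service: 0 for service in rankings}

    while len(merged) < k:
        # Each round, services pick in a fresh random order.
        order = list(rankings)
        random.shuffle(order)
        progressed = False
        for service in order:
            ranking = rankings[service]
            i = pointers[service]
            # Skip attractions already placed in the merged list.
            while i < len(ranking) and ranking[i] in seen:
                i += 1
            pointers[service] = i
            if i < len(ranking) and len(merged) < k:
                merged.append((ranking[i], service))
                seen.add(ranking[i])
                pointers[service] = i + 1
                progressed = True
        if not progressed:   # every ranking is exhausted
            break
    return merged

# Example: three registered services each return a short ranked list.
lists = {
    "serviceA": ["attr-1", "attr-2", "attr-3"],
    "serviceB": ["attr-2", "attr-4"],
    "serviceC": ["attr-5", "attr-1"],
}
print(team_draft_merge(lists, k=5))
```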
7. SCORING

Again, we have our suggestion services, which produce ranked lists of suggestions. Each suggestion request will be sent out to multiple services and users will interact with a list of merged responses. The user's interaction with the suggestions will allow a score to be calculated for each service. For each point-of-interest interacted with we can calculate a usefulness score based on how highly the user rated it and whether the user visited the website or "starred" it. We also collect timing data for each session so we can incorporate how long users spent on each attraction into our scoring metric. Precision at rank k, mean reciprocal rank, and a modified version of time-biased gain have been used in previous iterations of this track and can be used here as well.

Since we are not involving each service in each suggestion request, we need to choose which services to involve. The simplest way to choose is to pick services randomly; however, we should keep our end goal in mind here. Our goal is to find the correct ordering of services in terms of performance. So, for example, after a certain number of requests have been made, if a particular service performs much more poorly than any other service, we are not likely to learn more about the correct ordering of services if we pick it as often as other services. On the other hand, if two services have very similar performance it may be worthwhile to pick them more often in order to determine which of the two services performs better. In Figure 2 we already know that service A performs poorly and there is probably more to gain by comparing services B and C.

Figure 2: Potential scores (P@5) given to five systems at some point during the experiment.

We should also note that we are only interested in telling the difference between two systems if there is enough of a difference between them. If one service performs better than another but an end user would not realistically be able to tell the difference between them, then it is not worthwhile spending significant resources determining their correct ordering. In Figure 2, services D and E perform so similarly that it is probably not worth comparing them.

We leave this issue of selecting services based on their current ranking for future work and for now simply select services for each request randomly. It is worth noting that we only expect a handful of services to register for this system initially and they can all probably be involved in every or most suggestion requests.

In previous iterations of this track services waited until the experiments were done to receive feedback on how well they performed. An option being explored for this experiment is to provide scores or a leaderboard for services every so often so that services can see how well they are performing and use that feedback to improve their service throughout the experiment.
vice. For each point-of-interest interacted with we can cal-      real users who can continue to use the system outside of the
culate a usefulness score based on how highly the user rated      track experiments. This will allow services to continue to re-
it and whether the user visited the website or “starred” it.      ceive feedback on their performance even outside of TREC.
We also collect timing data for each session so we can in-
corporate how long users spent on each attraction into our
scoring metric. Precision at rank k, mean reciprocal rank,
9. SERVICE EFFICIENCY

When a request is made the user is expecting a response within a short amount of time. Services will have to be always available and be able to respond quickly. Once requests have been sent out to services, if a response takes too long to be returned, that service will not be given an opportunity to contribute to the final list of suggestions presented to the user. In order to help services maintain responsiveness they will be allowed to register multiple servers. For each request one of the service's servers will be chosen to respond to the request. This will provide some robustness to the system should a particular server become unavailable. Additionally, we can optionally incorporate each service's response time into their score and have efficiency influence the ordering of services.

As an additional fallback mechanism, in case no service responds to a particular request, a baseline service will be developed that is always available and responds to every request. If no service responds, the results from the baseline service will be presented to users. This ensures that users always receive some response. This fallback mechanism will gather its results from a commercial web service.
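A rough sketch of how the intermediary system could enforce this behaviour is shown below: services are queried concurrently, late responses are dropped, and the baseline is used if nothing arrives in time. The two-second budget, the function names, and the baseline behaviour are assumptions for illustration only.

```python
# Sketch of fan-out with a deadline and a baseline fallback.
import concurrent.futures as cf
from typing import Callable, Dict, List

TIMEOUT_SECONDS = 2.0  # assumed per-request budget; the track does not fix a value

def baseline_suggestions(city: str) -> List[str]:
    """Always-available fallback; in the real setup this would be backed by a
    commercial web service, here it just returns a fixed list."""
    return ["attr-001", "attr-002", "attr-003"]

def gather_suggestions(services: Dict[str, Callable[[str], List[str]]],
                       city: str) -> Dict[str, List[str]]:
    """Query every registered service concurrently and keep only the responses
    that arrive before the deadline; fall back to the baseline if none do."""
    results: Dict[str, List[str]] = {}
    pool = cf.ThreadPoolExecutor(max_workers=max(len(services), 1))
    futures = {pool.submit(fn, city): name for name, fn in services.items()}
    try:
        for future in cf.as_completed(futures, timeout=TIMEOUT_SECONDS):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception:
                pass  # a failing service simply does not contribute
    except cf.TimeoutError:
        pass  # deadline reached: services that have not answered are dropped
    finally:
        pool.shutdown(wait=False)  # do not block on stragglers

    if not results:  # every service missed the deadline or failed
        results["baseline"] = baseline_suggestions(city)
    return results
```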
10. PERSONALIZED DESCRIPTIONS

The goal of each service is to select points-of-interest that the service predicts the user will like. These suggestions' titles, descriptions, and URLs are displayed to the user. The descriptions shown to users are generic descriptions of each attraction. Services may want to modify the descriptions slightly in order to include, for example, why this particular user may find the attraction interesting. Services will be given an opportunity to provide personalized descriptions for each attraction in order to include this kind of information. Evaluation of these descriptions will be done separately from the main evaluation of service performance.

In previous iterations of this track every service had to provide descriptions for all suggestions. The decision to make this an optional task was based on feedback that most services were simply providing generic descriptions, which in this experiment we are providing instead. Generating the generic descriptions ourselves will provide us with another point of standardization between services and allow for fairer comparisons.
11. USER INTERFACE

Currently the interface that users use to make suggestion requests and interact with service results is a web-based interface. Users will select a city from a list and then be presented with merged results from multiple systems. This allows crowdsourced assessors to easily provide system feedback. However, the system is designed so that other methods of presenting results to users could easily be used. In particular, the API allows any developer to build a mobile application which enables users to interact with the system. Point-of-interest recommendation lends itself to mobile users, and having multiple vectors for users to interact with the system is one of the future goals for this project.

The source for this project is currently available online: https://github.com/akdh/entertain-me.
12. CONCLUSION

We have briefly given an overview of a system that is used to evaluate multiple point-of-interest recommendation services live using crowdsourced workers. This system will be used to run the TREC 2015 Contextual Suggestion Track. This experiment differs from previous years because services will have to be available online during the experiment, suggestions will have to be delivered live, and assessment and evaluation can be a lot more fluid. If you are interested in registering your service for this experiment you can find out more about participating on the TREC website (http://trec.nist.gov) and the Contextual Suggestion Track website (https://sites.google.com/site/treccontext/).

13. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In Recommender Systems Handbook, pages 217-253. Springer, 2011.
[2] M. Ageev, Q. Guo, D. Lagun, and E. Agichtein. Find it if you can: A game for modeling different types of web search success using interaction data. In Proceedings of ACM SIGIR, 2011.
[3] L. Baltrunas, B. Ludwig, S. Peer, and F. Ricci. Context relevance assessment and exploitation in mobile recommender systems. Personal and Ubiquitous Computing, 16(5):507-526, June 2012.
[4] J. Bennett and S. Lanning. The Netflix prize. 2007.
[5] M. Braunhofer, M. Elahi, and F. Ricci. Usability assessment of a context-aware and personality-based mobile recommender system. In E-Commerce and Web Technologies, volume 188, pages 77-88. Springer, 2014.
[6] A. Dean-Hall, C. L. A. Clarke, J. Kamps, P. Thomas, and E. Voorhees. Overview of the TREC 2014 contextual suggestion track. In Proceedings of TREC, Gaithersburg, Maryland, 2014.
[7] B. Kille, F. Hopfgartner, T. Brodt, and T. Heintz. The plista dataset. In Proceedings of ACM NRS, 2013.
[8] D. Milne, P. Thomas, and C. Paris. Finding, weighting and describing venues: CSIRO at the 2012 TREC contextual suggestion track. In Proceedings of TREC, 2012.
[9] F. Radlinski and N. Craswell. Optimized interleaving for online retrieval evaluation. In Proceedings of ACM WSDM, 2013.
[10] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of ACM CIKM, pages 43-52, 2008.
[11] E. M. Voorhees and D. K. Harman. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, 2005.
[12] P. Yang and H. Fang. An opinion-aware approach to contextual suggestion. In Proceedings of TREC, 2013.