=Paper=
{{Paper
|id=Vol-1338/paper9
|storemode=property
|title=Online Evaluation of Point-Of-Interest Recommendation Systems
|pdfUrl=https://ceur-ws.org/Vol-1338/paper_9.pdf
|volume=Vol-1338
|dblpUrl=https://dblp.org/rec/conf/ecir/Dean-HallCKK15
}}
==Online Evaluation of Point-Of-Interest Recommendation Systems==
Adriel Dean-Hall (University of Waterloo), Charles L. A. Clarke (University of Waterloo), Jaap Kamps (University of Amsterdam), Julia Kiseleva (Eindhoven University of Technology)

ABSTRACT

In this work we describe a system to evaluate multiple point-of-interest recommendation systems. In this system each recommendation service will be exposed online and crowdsourced assessors will interact with merged results from multiple services, which respond to suggestion requests live, in order to determine which system performs best. This work builds upon work done previously as part of the TREC Contextual Suggestion Track and describes plans for how the track will be run in 2015.

1. INTRODUCTION

Many point-of-interest recommendation systems have been developed, each using different techniques for making recommendations. Often, when these systems are evaluated, they operate on different datasets or different factors are compared. Having a framework into which such systems can fit, and being able to compare them using fair, standardized techniques, will help us determine which systems perform best.

The TREC Contextual Suggestion track [6] has been running for three years, since 2012. In this track participants design systems that provide personalized point-of-interest suggestions. The past three iterations of the track have closely followed the traditional TREC evaluation methodology: participants are given topics and develop a set of results for each topic; the results are then evaluated by assessors and scores are assigned to each participating system based on these judgements [11]. Specifically, a set of profiles (ratings for a set of attractions) and contexts (names of cities) were released to participants. For each profile+context pair participants returned a set of ranked suggestions. These suggestions were then judged by the assessors who originally created the profiles and a score was assigned to each participant's set of results.

One disadvantage of this setup is that, for this track, topics are actually personal preferences provided by crowdsourced assessors who, after providing their preferences, have to wait weeks for attraction suggestions. This wait makes the task of assessing more difficult, as judgement is spread over a long period and assessors have to remember previous interactions with the system. Also, the longer the wait, the more difficult it is to get crowdsourced assessors to return to the task.

In this article we describe a setup that allows attraction recommendation services to be compared live, with users issuing requests and suggestions being made immediately. Participating recommendation services will run an online system that is able to respond to a suggestion request immediately. When a user (or an assessor) is ready for suggestions they make a request. The system will then send that user's profile and the name of the city to the services. Suggestions will be made by multiple recommendation services and the returned results will be merged and presented to the user. The user will then interact with the results, which provides feedback on how good each service's suggestions are. A score for each service is continuously updated until the experiment ends.

We will describe the interface recommendation services need to implement in order to participate, how they will be compared, and some challenges that moving from a batch-style to a live evaluation setup introduces.
2. RELATED WORK

Point-of-interest recommendation is an area several researchers are pursuing. Braunhofer et al. [5] worked on an application that made personalized recommendations within cities; Adomavicius et al. [1] used collaborative filtering for similar goals, incorporating temporal features; Baltrunas et al. [3] used information such as budget and familiarity with the area. Additionally, several systems have been developed as part of the Contextual Suggestion track, including a system that used textual similarity between attractions [8] and a system that found reviews to be an informative feature [12]. The goal is to bring the efforts of all these systems into one framework in order to compare them fairly.

In addition to this framework being built upon the Contextual Suggestion track, our work is also inspired by work done during the plista dataset challenge [7]. In those experiments multiple competing systems registered to make recommendations about related articles users might find interesting, based on the article they were currently reading and previous interactions with the system. These recommendations had to be made live, as they were presented to users while they were browsing news articles. Other sources of inspiration are challenges such as the Netflix Prize [4] and various Kaggle competitions (http://www.kaggle.com/), which, while not typically evaluated live, make use of techniques such as leaderboards in order to provide feedback to participants.

3. SERVICE INTERFACE

In order to develop a framework for testing point-of-interest recommendation services we need to determine an interface that users will use to communicate with them. As parameters for each recommendation request the systems take in the user's profile (see Section 4 for a description of the profile) and, as context, the city which the user wants recommendations for. We could also gather more contextual information about our users during each request, for example a more precise location, or who the user is travelling with (family, friends, alone). However, currently, we only use the city and profile as input. The result returned to users will consist of an ordered list of attractions that the service thinks the user will like.

Figure 1: 1. Request sent; 2. System passes request to services; 3. Services respond with suggestions; 4. Suggestions are merged and sent to user; 5. User sends suggestion interactions to system.

This is similar to the information participants in previous iterations of the track had available to them, but instead of getting a batch of profiles and cities and returning a batch of results, services will receive a single profile and city for each request. Additionally, instead of having a fixed profile for each user, on each request the profile may be updated with liked suggestions from previous requests.
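To make the interface concrete, the sketch below shows what a participating service's endpoint could look like: it receives a request carrying the user's profile and a city, and answers with a ranked list of attraction IDs. This is only a sketch; the JSON field names, the port, and the rank_attractions helper are illustrative assumptions, since the paper does not specify the actual wire format.

<pre>
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class SuggestionHandler(BaseHTTPRequestHandler):
    """Minimal sketch of a recommendation service endpoint (field names assumed)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))

        profile = request.get("profile", [])  # interactions from previous requests (may be empty)
        city = request.get("city")            # context: the city the user wants suggestions for

        # A real service would personalize the ranking using the profile;
        # this placeholder simply returns a fixed list of attraction IDs.
        ranked_ids = self.rank_attractions(profile, city)

        body = json.dumps({"suggestions": ranked_ids[:50]}).encode("utf-8")  # at most 50 per request
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def rank_attractions(self, profile, city):
        # Hypothetical helper: rank candidates from the fixed attraction collection.
        return ["poi-101", "poi-042", "poi-007"]

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), SuggestionHandler).serve_forever()
</pre>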
4. USER PROFILES

This leads us to the question of what a user's profile consists of. Initially, for a new user, no information is in their profile. As the user asks for suggestions and interacts with them their profile will expand. For each attraction that has been recommended to the user, the data about their interaction is added to their profile. Examples of interaction include the user viewing the attraction's website, the user "starring" the attraction, and the user rating the attraction. Currently these three pieces of information are recorded for each attraction the user has interacted with; however, other interactions, e.g., reviews written, could also be included in the profile. As the user interacts with the system their profile will expand, giving recommendation services a better opportunity to make more personalized suggestions.

Once interaction data from an attraction has been added to the user's profile it is essentially available publicly to all services. One issue with this setup is that some requests will carry small profiles while others carry larger ones. Limiting the size of the profile will allow us not to worry as much about how the size of the profile affects service performance. One option is, instead of adding every interaction to a user's profile, to only add certain attractions. One simple method of doing this is to only add attraction interaction data for a subset of cities. This will also allow us to ask for suggestions for the same city multiple times (if that city is not part of the profile) and not necessarily have to return to users for result interactions.

Another potential feature is to push any updates to the profiles to services. This will allow them to get feedback on how well they are performing and whether the suggestions they are making are actually liked by users, without having to pool profiles or wait for another request from the same user. Services can then update their strategies and attempt to improve suggestion results for future requests.
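As a rough illustration of what such a profile might carry, the sketch below models the three recorded interaction types (website visit, "starring", rating) and the option of only folding interactions from a designated subset of cities into the profile. The class and field names are assumptions for illustration, not the track's actual schema.

<pre>
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Interaction:
    """One recorded interaction with a recommended attraction (fields assumed)."""
    attraction_id: str
    visited_website: bool = False  # did the user view the attraction's website?
    starred: bool = False          # did the user "star" the attraction?
    rating: Optional[int] = None   # explicit rating, if the user gave one

@dataclass
class UserProfile:
    user_id: str
    profile_cities: Set[str] = field(default_factory=set)  # cities whose interactions feed the profile
    interactions: List[Interaction] = field(default_factory=list)

    def record(self, city: str, interaction: Interaction) -> None:
        # Only keep interaction data for the designated subset of cities, so the
        # profile stays bounded and the remaining cities can be requested again fresh.
        if city in self.profile_cities:
            self.interactions.append(interaction)
</pre>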
5. DATASET COLLECTION

In previous iterations of this track services were allowed to recommend any attraction they found on the open web. For simplicity, the points-of-interest that services are allowed to recommend in this experiment come from a fixed collection of attractions. Services will simply return a list of attraction IDs. When they are displayed to users each attraction will consist of a title, short description, and website URL with more information about the attraction. Users will use this information to decide whether they like a particular attraction. Again, here we are presenting this basic information about each attraction, but additional information, such as the attraction's category or reviews about the attraction, could also be presented to users.

This pool of attractions is collected as part of an ongoing effort by several research groups who have expertise in gathering this sort of information due to participation in previous Contextual Suggestion TREC tracks. Having a fixed data collection will allow us to limit which attractions are suggested and will allow for greater reusability of the judgements provided by users.

6. SERVICE EVALUATION

So, in order to develop a recommendation service that fits into this framework, services must set up a server that responds to suggestion requests with a list of attraction IDs. The goal of fitting services into this framework is to allow performance to be compared across multiple services. Instead of having users communicate directly with a recommendation service, they will communicate with an intermediary system. Services will be required to register themselves with this system and the system will then pass recommendation requests to each service, logging the services' responses.

Each service will be given an opportunity to make suggestions. One option is, for each request, to select one of the services at random and present the results from that service to the user. The user will then interact with the system and based on this interaction we can determine how good the suggestions were. As more suggestion requests are made each service will be given multiple chances to make recommendations and services can be compared.

We take a slightly different approach where, for each suggestion request, a subset of the services are queried and the results from all these services are merged into a final list of suggestions which is presented to users. If the user interacts with a suggestion, the score of the service that made that suggestion will be affected. It is possible for multiple services to make the same suggestion; in this case, if the user interacts with the suggestion, all the services that made it will have their score affected.

This setup will give services more opportunities to make suggestions (during each request rather than only some), however we will need a method of merging results from multiple services. Several result interleaving approaches that would be appropriate for our purposes have been discussed by Radlinski and Craswell [9], including the team draft method proposed by Radlinski et al. [10].

The reason a subset of the services rather than all services are selected is that, realistically, most users will only view the attractions near the top of the list. If we try to compare too many services the user may not see any suggestions from some of the services (or only see a few suggestions from each service). In order to prevent this situation five services are chosen for each suggestion request (this number is chosen somewhat arbitrarily). We also limit the number of suggestions each service can make per request to 50.
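The sketch below illustrates this step under stated assumptions: a random subset of five services is chosen, their ranked lists are merged with a team-draft-style draw (the cited papers describe the two-ranker case; extending the drafting rounds to several services is our own simplification), and the contributing service is recorded for each slot. A small helper also credits every service that suggested an interacted-with attraction, since duplicates earn credit for all services that made them.

<pre>
import random
from typing import Dict, List, Tuple

def team_draft_merge(rankings: Dict[str, List[str]], length: int = 10) -> List[Tuple[str, str]]:
    """Merge ranked attraction-ID lists from several services, team-draft style.

    In each round the services pick in a random order and each contributes its
    highest-ranked attraction not already in the merged list. The returned
    (service, attraction) pairs record which service filled each slot.
    """
    merged: List[Tuple[str, str]] = []
    seen = set()
    while len(merged) < length:
        order = list(rankings)
        random.shuffle(order)
        progressed = False
        for service in order:
            pick = next((a for a in rankings[service] if a not in seen), None)
            if pick is not None:
                merged.append((service, pick))
                seen.add(pick)
                progressed = True
                if len(merged) == length:
                    break
        if not progressed:  # every service's list is exhausted
            break
    return merged

def credited_services(attraction_id: str, rankings: Dict[str, List[str]]) -> List[str]:
    # If several services suggested the same attraction, all of them are
    # credited when the user interacts with it.
    return [s for s, ranked in rankings.items() if attraction_id in ranked]

if __name__ == "__main__":
    registered = ["A", "B", "C", "D", "E", "F", "G"]
    chosen = random.sample(registered, k=min(5, len(registered)))  # five services per request
    # rankings would come from querying the chosen services; values here are made up
    rankings = {s: [f"poi-{i}-{s}" for i in range(3)] for s in chosen}
    print(team_draft_merge(rankings, length=10))
</pre>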
7. SCORING

Again, we have our suggestion services which produce ranked lists of suggestions. Each suggestion request will be sent out to multiple services and users will interact with a list of merged responses. The user's interaction with the suggestions will allow a score to be calculated for each service. For each point-of-interest interacted with we can calculate a usefulness score based on how highly the user rated it and whether the user visited the website or "starred" it. We also collect timing data for each session so we can incorporate how long users spent on each attraction into our scoring metric. Precision at rank k, mean reciprocal rank, and a modified version of time-biased gain have been used in previous iterations of this track and can be used here as well.

Figure 2: Potential scores given to five systems at some point during the experiment (P@5 score for services A–E).

Since we are not involving each service in each suggestion request we need to choose which services to involve. The simplest way to choose is to pick services randomly; however, we should keep our end goal in mind here. Our goal is to find the correct ordering of services in terms of performance. So, for example, after a certain number of requests have been made, if a particular service performs much more poorly than any other service we are not likely to learn more about the correct ordering of services by picking it as often as the other services. On the other hand, if two services have very similar performance it may be worthwhile to pick them more often in order to determine which of the two performs better. In Figure 2 we already know that service A performs poorly and there is probably more to gain by comparing services B and C.

We should also note that we are only interested in telling the difference between two systems if there is a large enough difference between them. If one service performs better than another, but an end user would not realistically be able to tell the difference between them, then it is not worthwhile spending substantial resources determining their correct ordering. In Figure 2 services D and E perform so similarly that it is probably not worth comparing them.

We leave this issue of selecting services based on their current ranking for future work and for now simply select services for each request randomly. It is worth noting that we only expect a handful of services to register for this system initially and they can all probably be involved in every or most suggestion requests.

In previous iterations of this track services waited until the experiments were done to receive feedback on how well they performed. An option being explored for this experiment is to provide scores or a leaderboard for services every so often, so that services can see how well they are performing and use that feedback to improve their service throughout the experiment.
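As a sketch of how such a score might be put together, the snippet below combines the recorded signals into a per-attraction usefulness value and computes precision at rank 5 for one service's contributions, in the spirit of the P@5 scores in Figure 2. The weights, the assumed 0–4 rating scale, and the relevance threshold are invented for illustration; the track's actual scoring formula is not given here.

<pre>
from typing import List, Optional

def usefulness(rating: Optional[int], visited_website: bool, starred: bool) -> float:
    # Illustrative combination of the recorded signals; the weights and the
    # assumed 0-4 rating scale are made up, not the track's formula.
    score = 0.0
    if rating is not None:
        score += rating / 4.0
    if visited_website:
        score += 0.5
    if starred:
        score += 0.5
    return score

def precision_at_k(usefulness_scores: List[float], k: int = 5, threshold: float = 0.5) -> float:
    # Treat a suggestion as "relevant" if its usefulness clears an (assumed)
    # threshold, then compute precision over the top k suggestions.
    top = usefulness_scores[:k]
    return sum(1 for s in top if s >= threshold) / k

# Example: P@5 for one service's five top-ranked suggestions in a session.
scores = [usefulness(4, True, False), usefulness(None, False, False),
          usefulness(2, True, True), usefulness(None, True, False),
          usefulness(0, False, False)]
print(precision_at_k(scores, k=5))
</pre>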
8. ASSESSORS

So far we have been discussing a system which allows users to interact with different recommendation services. Because we don't have an existing userbase to run these experiments on, we will use paid assessors to interact with the system in a similar way to real users. In past iterations of this track we have found crowdsourced workers to be useful in these sorts of tasks. Additionally, Ageev et al. were successful in simulating search interaction data with crowdsourced workers [2]. For this experiment we will solicit hundreds of crowdsourced workers to make suggestion requests and interact with the results. They will be asked to interact with the results based on their own personal preferences. Payment will be issued based on how many result lists are interacted with and for how long.

Additionally, once the system has been set up and services are registered and running, the setup can provide value to real users who can continue to use the system outside of the track experiments. This will allow services to continue to receive feedback on their performance even outside of TREC.

9. SERVICE EFFICIENCY

When a request is made the user expects a response within a short amount of time. Services will have to be always available and able to respond quickly. Once requests have been sent out to services, if a response takes too long to be returned that service will not be given an opportunity to contribute to the final list of suggestions presented to the user. In order to help services maintain responsiveness they will be allowed to register multiple servers. For each request one of the service's servers will be chosen to respond to the request. This will provide some robustness to the system should a particular server become unavailable. Additionally, we can optionally incorporate each service's response time into its score and have efficiency influence the ordering of services.

As an additional fallback mechanism, in case no service responds to a particular request, a baseline service will be developed that is always available and responds to every request. If no service responds, the results from the baseline service will be presented to users. This ensures that users always receive some response. This fallback mechanism will gather its results from a commercial web service.
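A sketch of how the intermediary system might enforce this is shown below: requests are fanned out to the chosen services concurrently, services that miss an (assumed) deadline are dropped from this request, and the baseline kicks in if nobody answers in time. The timeout value and the query/baseline helpers are placeholders, not part of the track's specification.

<pre>
import asyncio
import random
from typing import Dict, List

TIMEOUT_SECONDS = 5.0  # assumed per-request budget; the track would fix the actual value

async def query_service(name: str, request: dict) -> List[str]:
    # Hypothetical stand-in for an HTTP call to one of the service's registered
    # servers; a real implementation would pick a server and POST profile + city.
    await asyncio.sleep(random.uniform(0.1, 8.0))  # simulated (sometimes too slow) latency
    return [f"{name}-poi-{i}" for i in range(3)]

async def baseline_service(request: dict) -> List[str]:
    # Always-available fallback drawing on a commercial web service.
    return ["baseline-poi-1", "baseline-poi-2"]

async def gather_suggestions(services: List[str], request: dict) -> Dict[str, List[str]]:
    tasks = {name: asyncio.create_task(query_service(name, request)) for name in services}
    done, pending = await asyncio.wait(tasks.values(), timeout=TIMEOUT_SECONDS)
    for task in pending:      # services that miss the deadline are dropped
        task.cancel()         # from this request's merged list
    results = {name: task.result() for name, task in tasks.items()
               if task in done and not task.exception()}
    if not results:           # nobody answered in time: fall back to the baseline
        results["baseline"] = await baseline_service(request)
    return results

if __name__ == "__main__":
    print(asyncio.run(gather_suggestions(["A", "B", "C"], {"city": "Vienna", "profile": []})))
</pre>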
10. PERSONALIZED DESCRIPTIONS

The goal of each service is to select points-of-interest that the service predicts the user will like. These suggestions' titles, descriptions, and URLs are displayed to the user. The descriptions shown to users are generic descriptions of each attraction. Services may want to modify the descriptions slightly in order to include, for example, why this particular user may find the attraction interesting. Services will be given an opportunity to provide personalized descriptions for each attraction in order to include this kind of information. Evaluation of these descriptions will be done separately from the main evaluation of service performance.

In previous iterations of this track every service had to provide descriptions for all suggestions. The decision to make this an optional task was based on feedback that most services were simply providing generic descriptions, which in this experiment we provide instead. Generating the generic descriptions ourselves will provide another point of standardization between services to allow for fairer comparisons.

11. USER INTERFACE

Currently the interface that users use to make suggestion requests and interact with service results is a web-based interface. Users will select a city from a list and then be presented with merged results from multiple systems. This allows crowdsourced assessors to easily provide system feedback. However, the system is designed so that other methods of presenting results to users could easily be used. In particular the API allows any developer to build a mobile application which enables users to interact with the system. Point-of-interest recommendation lends itself to mobile users, and having multiple vectors for users to interact with the system is one of the future goals for this project. The source for this project is currently available online: https://github.com/akdh/entertain-me.

12. CONCLUSION

We have briefly given an overview of a system that is used to evaluate multiple point-of-interest recommendation services live using crowdsourced workers. This system will be used to run the TREC 2015 Contextual Suggestion Track. This experiment differs from previous years because services will have to be available online during the experiment, suggestions will have to be delivered live, and assessment and evaluation can be much more fluid. If you are interested in registering your service for this experiment you can find out more about participating on the TREC website (http://trec.nist.gov) and the Contextual Suggestion Track website (https://sites.google.com/site/treccontext/).

13. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In Recommender Systems Handbook, pages 217–253. Springer, 2011.
[2] M. Ageev, Q. Guo, D. Lagun, and E. Agichtein. Find it if you can: A game for modeling different types of web search success using interaction data. In Proceedings of ACM SIGIR, 2011.
[3] L. Baltrunas, B. Ludwig, S. Peer, and F. Ricci. Context relevance assessment and exploitation in mobile recommender systems. Personal and Ubiquitous Computing, 16(5):507–526, June 2012.
[4] J. Bennett and S. Lanning. The Netflix Prize. 2007.
[5] M. Braunhofer, M. Elahi, and F. Ricci. Usability assessment of a context-aware and personality-based mobile recommender system. In E-Commerce and Web Technologies, volume 188, pages 77–88. Springer, 2014.
[6] A. Dean-Hall, C. L. A. Clarke, J. Kamps, P. Thomas, and E. Voorhees. Overview of the TREC 2014 Contextual Suggestion track. In Proceedings of TREC, Gaithersburg, Maryland, 2014.
[7] B. Kille, F. Hopfgartner, T. Brodt, and T. Heintz. The plista dataset. In Proceedings of ACM NRS, 2013.
[8] D. Milne, P. Thomas, and C. Paris. Finding, weighting and describing venues: CSIRO at the 2012 TREC Contextual Suggestion track. In Proceedings of TREC, 2012.
[9] F. Radlinski and N. Craswell. Optimized interleaving for online retrieval evaluation. In Proceedings of ACM WSDM, 2013.
[10] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of ACM CIKM, pages 43–52, 2008.
[11] E. M. Voorhees and D. K. Harman. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, 2005.
[12] P. Yang and H. Fang. An opinion-aware approach to contextual suggestion. In Proceedings of TREC, 2013.

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. ECIR Supporting Complex Search Task Workshop '15, Vienna, Austria. Published on CEUR-WS: http://ceur-ws.org/Vol-1338/.