Recommender Systems Evaluation: A 3D Benchmark

Alan Said, TU Berlin, alan@dai-lab.de
Domonkos Tikk, Gravity R&D, domonkos.tikk@gravityrd.com
Yue Shi, TU-Delft, y.shi@tudelft.nl
Martha Larson, TU-Delft, m.a.larson@tudelft.nl
Klara Stumpf, Gravity R&D, klara@gravityrd.com
Paolo Cremonesi, Politecnico di Milano, paolo.cremonesi@polimi.it

ABSTRACT
Recommender systems add value to vast content resources by matching users with items of interest. In recent years, immense progress has been made in recommendation techniques. The evaluation of these techniques has, however, not kept pace, and this threatens to impede the further development of recommender systems. In this paper we propose an approach that addresses this impasse by formulating a novel evaluation concept that adopts aspects from both recommender systems research and industry. Our model can express the quality of a recommender algorithm from three perspectives: the end consumer (user), the service provider and the vendor (business and technique for both). We review current benchmarking activities and point out their shortcomings, which are addressed by our model. We also explain how our 3D benchmarking framework would apply to a specific use case.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval models

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012, September 9, 2012, Dublin, Ireland.

1. INTRODUCTION & MOTIVATION
Recommender systems identify items suitable for specific users in large content collections. Despite recent commercial and research efforts, a systematic evaluation model that addresses and considers all aspects and participants of a recommender system is still missing.
In this paper we propose a 3D Recommender System Benchmarking Model that covers all dimensions that impact the effectiveness of recommender systems in real-world settings. The concept builds on a study of benchmarking settings from research and industry and provides a common basis for comparing recommender systems, independent of setting, data and purpose. Our benchmarking concept captures three evaluation aspects which are shared by all recommender systems, independent of whether they are research systems or industrial products. As the three main evaluation dimensions we identify user requirements, business requirements and technological constraints, each represented by a set of qualities which ensure the general applicability of these procedures. For each particular recommendation problem, the instantiation and relevance of these requirements should be specified.

The motivation behind this framework is the growing importance of recommender systems. Users cannot be assumed to have the necessary overview to specify their information needs in vast content collections. However, given the variety of data and recommendation tasks, the comparison of algorithms, approaches and general concepts becomes infeasible due to the inherent differences in requirements, design choices, etc. This calls for a comprehensive benchmarking framework that sets data- and task-specific requirements driven by particular real-world applications.

The benefits of benchmarking. Benchmarks formulate standardized tasks, making it possible to compare the performance of algorithms. They have been highly successful in information retrieval, e.g. the Text Retrieval Conference (TREC) [12], and in multimedia retrieval, e.g. ImageCLEF [7], TRECVid [9] and MediaEval [6]. Benchmarks yield two types of benefits: (1) they serve to support the development of new technologies in the research community [9, 11] and (2) they create economic impact by bringing research closer to the market [8].

Existing recommendation benchmarks. Today's benchmarks are limited by their simplified views of users and of data. The problem setting of the Netflix Prize (http://www.netflixprize.com), ground-breaking at its time, was focused on a single functional requirement: the qualitative assessment of recommendation was simplified to the root mean squared error (RMSE) of predicted ratings. Its simplified view treated users as needing no further output from the recommender system than a rating on individual items. The data set was equally restricted to user ratings; additional information available in a real-world recommender system environment was not considered.
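For reference, the Netflix Prize criterion is the root mean squared error over a held-out test set T of (user, item) pairs; written in our own notation (not the paper's), with r_{ui} the observed and \hat{r}_{ui} the predicted rating of user u for item i, it is

    RMSE = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} ( \hat{r}_{ui} - r_{ui} )^2 }.

Lower values indicate more accurate rating prediction, but, as argued above, this single number says nothing about the non-functional qualities of a deployed system.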
Furthermore, the Prize did not take non-functional requirements into account, which arise from the business goals and technical parameters of the recommendation service, even though aspects such as scalability, reactivity, robustness and adaptability are key for the productive operation of recommender systems.

The series of Context-Aware Movie Recommendation (CAMRa) challenges explored the usefulness of contextual data in recommendations. The 2010 challenge (http://www.dai-labor.de/camra2010/challenge/) provided special features on movie mood, movie location and intended audience (Moviepilot track), as well as social relationships between users and user activities on a movie-related social site (Filmtipset track). The time of the recommendation was also considered as context (Week track). Although the challenges expanded the data sources used, the evaluation translated real-world user needs into classification accuracy metrics for judging the systems in the contest, and non-functional requirements of the solutions were not investigated.

The limitations of the Netflix Prize and the CAMRa series are characteristic of currently existing benchmarks and data sets. The concept presented in this paper approaches this challenge by placing central focus on real-world user needs; on large, heterogeneous, multi-source data sets; and on evaluating both functional (quality-related) and non-functional (technical and business goal-related) requirements.

2. 3D RECOMMENDATION EVALUATION
In order to extend the state of the art of evaluation, we propose a concept for evaluation metrics that incorporates the needs of all perspectives in the recommendation spectrum. The concept defines a set of benchmarking techniques that select the correct combination of (i) data sets, (ii) evaluation methods and (iii) metrics according to a three-dimensional requirement space: business models, user requirements and technical constraints, see Fig. 1.

[Figure 1: The three proposed evaluation axes: business models, user requirements and technical constraints.]

Business models allow a company to generate revenue. Different models lead to different requirements in terms of the expected value from a recommender system. For instance, in a pay-per-view video-on-demand business model, the goal of the recommender system is to increase sales and thereby allow the company to maximize revenues. In subscriber-based video-on-demand business models, however, the driving force may be to get users to return to the service in the future (a typical showcase where recommender systems help [1]). Business models may be influenced by the choice of the objective function in the recommender algorithm; prediction-based or ranking-based functions reflect different business metrics.

User requirements reflect users' perspectives. Recommenders are assets for user satisfaction and persuasion, i.e., they try to influence a user's attitude or behavior [4]; the usability of the system affects the user's perception of it. Recommendations may have different goals, e.g. to reduce information overload, to facilitate search, and to find interesting items, increasing the quality and decreasing the duration of the decision-making process.

Technical constraints. Recommender systems in real life must take into account a number of technical requirements and constraints. These can be classified as data and system constraints, scalability and robustness requirements. Data constraints relate to the service architecture, e.g. satellite TV lacks a return channel for feedback, hindering the use of collaborative filtering algorithms. System constraints derive from hardware and/or software limitations, e.g. in a mobile TV scenario the processing power of the hand-held device is limited, excluding resource-heavy algorithms on the client side. Scalability requirements derive from the need to provide instant recommendations to all users on all items; these requirements are particularly strict in linear TV, where viewers are used to quick responsiveness. Robustness requirements are needed to create good services that keep working in case of data or component failures in distributed systems.
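To make the three-dimensional requirement space more concrete, the sketch below shows one possible way a requirement profile and the benchmark components selected for it could be written down. It is purely illustrative: the class and field names (BenchmarkSpec and the rest) are our own assumptions and not part of the proposed model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BenchmarkSpec:
    """Illustrative specification of one point in the 3D requirement space."""
    # Business-model axis: how the service generates revenue and what it values.
    business_model: str               # e.g. "pay-per-view VoD" or "subscriber-based VoD"
    business_metrics: List[str]       # e.g. ["revenue lift", "customer retention"]
    # User-requirement axis: what users need from the recommendations.
    user_requirements: List[str]      # e.g. ["easy content exploration", "context-dependent lists"]
    # Technical-constraint axis: limits imposed by the deployment environment.
    technical_constraints: List[str]  # e.g. ["response within 10-1000 ms", "no return channel"]
    # The combination the benchmark selects for this requirement profile.
    data_sets: List[str] = field(default_factory=list)
    evaluation_methods: List[str] = field(default_factory=list)  # off-line, on-line, or both
    metrics: List[str] = field(default_factory=list)             # e.g. ["RMSE", "precision@N"]

# Example instantiation for the pay-per-view VoD scenario mentioned above.
ppv_vod = BenchmarkSpec(
    business_model="pay-per-view VoD",
    business_metrics=["revenue lift"],
    user_requirements=["easy content exploration"],
    technical_constraints=["recommendations within 10-1000 ms"],
    data_sets=["service interaction logs"],
    evaluation_methods=["off-line", "on-line"],
    metrics=["precision@N", "revenue lift"],
)
```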
3. EVALUATION SETTING

3.1 Current evaluation methodologies
Existing evaluation methods for recommender systems can be classified into system-oriented evaluation, user-oriented evaluation, or a combination of both [3]. In system-oriented (off-line) evaluation users are not involved; instead, a data set is partitioned into training and test sets, and the data points in the test set are predicted using a model built on the training set. In user-oriented (on-line) evaluation, feedback from users interacting with the system is collected through explicit questions or implicit observation.
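As a minimal sketch of the off-line protocol just described (the triple-based data format, the split ratio and the item-mean baseline are our own assumptions, chosen only for illustration):

```python
import random
from collections import defaultdict
from math import sqrt

def offline_evaluate(ratings, test_fraction=0.2, seed=42):
    """Off-line (system-oriented) evaluation: hold out part of the data,
    predict the held-out ratings and report the prediction error.

    `ratings` is a list of (user, item, rating) triples; the item-mean
    predictor used here is only a placeholder for a real algorithm."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]

    # "Train": estimate a mean rating per item from the training set.
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in train:
        sums[item] += r
        counts[item] += 1
    global_mean = sum(r for _, _, r in train) / len(train)

    def predict(item):
        return sums[item] / counts[item] if counts[item] else global_mean

    # "Test": predict the held-out data points and measure RMSE.
    squared_error = sum((predict(item) - r) ** 2 for _, item, r in test)
    return sqrt(squared_error / len(test))

if __name__ == "__main__":
    toy = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0),
           ("u2", "i3", 2.0), ("u3", "i2", 4.0), ("u3", "i3", 1.0)]
    print("RMSE:", offline_evaluate(toy, test_fraction=0.3))
```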
Competitions and challenges built around recommender systems are mostly organized to find the most accurate models. As described in Table 1, recommender systems are mostly evaluated off-line, and often the business value of the technologies is not examined. Even though accuracy may influence user satisfaction and revenue indirectly, there is no established way to evaluate the dimensions of user requirements and business models. In most cases the off-line evaluation scheme is chosen, and algorithms are evaluated by error, ranking or classification accuracy measures. Many challenges (e.g. the Netflix Prize) use explicit ratings to profile users; other recommender scenarios (e.g. item-to-item recommendation) are not addressed. Technical constraints are uncommon in contests, the exception being the RecLab Prize (http://overstockreclabprize.com/). Even if a certain method performs well on a data set, its integrability into a real-world system is still not addressed. This deficiency is partially solved by on-line testing methods (as seen in CAMRa), where recommender systems were tested in a real environment, but an objective metric showing the real applicability of the tested system is missing. In the RecLab Prize, the evaluated metric is the revenue increase generated by the system; the organizers also specified non-functional requirements for eligibility in the semi-final (top 10 teams), but user requirements were not considered. These approaches all contain metrics and methods moving towards our 3D model, but none of them provides a comprehensive model.

Table 1: An overview of some recommender system-related contests from the perspective of our 3D evaluation.

Challenge     | Task(s)                                                                | Metric          | Mode               | User                                                          | Business                      | Technical
--------------|------------------------------------------------------------------------|-----------------|--------------------|---------------------------------------------------------------|-------------------------------|-------------------------------------
Netflix Prize | minimize rating prediction error                                       | RMSE            | off-line           | indirect: error measure                                       | not addressed                 | not addressed
KDD-Cup'07    | 1: predict who rated what; 2: predict number of ratings                | RMSE            | off-line           | not addressed                                                 | detect trends & popular items | not addressed
RecLab Prize  | increase revenue                                                       | revenue lift    | on-line & off-line | not addressed                                                 | revenue lift                  | response/learning time, scalability
KDD-Cup'11    | 1: minimize rating prediction error; 2: split popular/unpopular items  | RMSE; ErrorRate | off-line           | indirect: error measure; find interesting or irrelevant items | not addressed                 | not addressed
KDD-Cup'12    | 1: predict followed users; 2: click-through rate (CTR) prediction      | MAP@3; MAE, AUC | off-line           | exploring interesting users & sources; ad targeting           | not addressed                 | not addressed
CAMRa'10      | context-aware: 1: temporal, 2: emotional, 3: social                    | MAP, P@N, AUC   | off-line & on-line | contextual information influences preference                  | not addressed                 | not addressed
CAMRa'11      | group recommendation; rater identification                             | ErrorRate       | off-line           | group & target recommendation                                 | indirect: satisfaction        | not addressed
CAMRa'12      | find users for specific items                                          | impact          | on-line            | split interesting and irrelevant content                      | increase audience             | not addressed

3.2 Currently existing metrics
On-line evaluation is the only technique able to measure true user satisfaction; conducting such evaluations is, however, time consuming and cannot be applied generally, but only to limited scenarios [2]. In contrast, off-line testing has the advantage of being immediate and easy to perform on several data sets with multiple algorithms. The question is whether differences between the off-line performance of algorithms carry over to differences in their on-line performance in various recommendation situations.

Classification metrics measure how well a system is able to classify items correctly, e.g. precision and recall. Predictive metrics measure to what extent a system can predict the ratings of users; as rated items have an order, predictive accuracy metrics can also be used to measure item ranking ability. Coverage metrics measure the percentage of items for which the system can make recommendations [13]. Confidence metrics measure how certain the system is of the accuracy of its recommendations. Additionally, learning rate metrics capture how gradually the quality of a recommender algorithm increases.
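A compact sketch of one common way to compute two of the metric families just mentioned from top-N recommendation lists; the function names and list-based data layout are our assumptions, chosen for illustration only:

```python
def precision_recall_at_n(recommended, relevant, n):
    """Classification metrics: fraction of the top-n list that is relevant
    (precision) and fraction of relevant items that appear in it (recall)."""
    top_n = recommended[:n]
    hits = len(set(top_n) & set(relevant))
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def catalog_coverage(recommendation_lists, catalog):
    """Coverage metric: percentage of catalog items that the system is able
    to recommend to at least one user."""
    recommended = set(item for items in recommendation_lists for item in items)
    return 100.0 * len(recommended & set(catalog)) / len(catalog)

# Toy example with two users and a five-item catalog.
lists = {"u1": ["i1", "i2", "i3"], "u2": ["i2", "i4", "i1"]}
print(precision_recall_at_n(lists["u1"], relevant=["i2", "i5"], n=3))        # (0.33..., 0.5)
print(catalog_coverage(lists.values(), catalog=["i1", "i2", "i3", "i4", "i5"]))  # 80.0
```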
A recommender system can recommend accurate items, have good coverage and diversity, and still not satisfy a user if the recommendations are trivial [10]. The state of the art in recommendation evaluation metrics reflects different recommendation tasks. Diversity, novelty, serendipity and user satisfaction are especially difficult to measure off-line. Diversity is important for the usefulness of a recommendation, and therefore there is a need to define an intra-list similarity metric [13]. Novelty and serendipity are two dimensions of non-obviousness [3].

3.3 Possible Extensions of Methods & Metrics
Real-world recommender systems should satisfy (1) functional requirements that relate to the qualitative assessment of recommendations and (2) non-functional requirements specified by the technological parameters and business goals of the service. Functional and non-functional requirements should be evaluated together: without the ability to provide accurate recommendations, no recommender system can be valuable; as poor quality has adverse effects on customers, it will not serve the business goal. Similarly, if the recommender does not scale with the service and cannot provide recommendations in real time, neither users nor the service provider benefit from it. Thus, a trade-off between these requirements is needed for an impartial and comprehensive evaluation of real-world recommenders.

Scalable recommenders provide good quality recommendations independently of data size, growth and dynamics. They are able to (1) process huge volumes of data during initialization using computational resources that scale linearly with data size, and (2) serve large numbers of parallel recommendation requests in real time without significant degradation in service quality. In our model, scalability is found on the technical requirement axis.

Reactivity ensures good recommendations in real time, where the time threshold depends on the use case, typically in the range of 10-1000 ms. Adaptability is important to react to changes in user preferences, content availability and contextual parameters. In our 3D model, reactivity and adaptability belong to the user requirement axis.

Robustness is needed to handle partial, missing or corrupted data, both in the system initialization and operational phases. Robustness belongs to the business axis of our model.

Generally speaking, none of the requirements are mutually exclusive; instead, optimization should be based on a combination of them, adapted to the setting in which the recommender system will be deployed [5].

As an example, a Video-on-Demand (VoD) service from the IPTV industry serves as a potential scenario for our model. Business goals include increased VoD sales and customer retention, but may have additional aspects (e.g. promoting content). The technical constraints are partly specified by the middleware and the hardware/software configuration of the service provider; these all influence the response time of the service, which is crucial. Via the service interface, the user gets recommendations based on the context, which might be translated into different recommendation tasks. From a user perspective, easy content exploration and context-dependent recommendation may be the most important aspects.

4. CONCLUSION
We proposed a 3D Recommender System Benchmarking Model that extends the state of the art and addresses both functional and non-functional, real-world, application-driven aspects of recommender systems. Following the proposed concept, the benchmarking activities within the community will be able to encompass the full range of recommender system use cases and algorithmic approaches. The comprehensive evaluation methodology will boost the development of more effective recommender systems, make it possible to focus research resources productively, and help industry technology providers to increase the uptake of recommender technology.

5. REFERENCES
[1] M. B. Dias, D. Locher, M. Li, W. El-Deredy, and P. J. Lisboa. The value of personalised recommender systems to e-business: a case study. In RecSys '08. ACM, 2008.
[2] M. Gorgoglione, U. Panniello, and A. Tuzhilin. The effect of context-aware recommendations on customer purchasing behavior and trust. In RecSys '11, pages 85-92. ACM, 2011.
[3] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1), 2004.
[4] R. Hu. Design and user issues in personality-based recommender systems. In RecSys '10. ACM, 2010.
[5] T. Jambor and J. Wang. Optimizing multiple objectives in collaborative filtering. In RecSys '10. ACM, 2010.
[6] M. Larson, M. Soleymani, P. Serdyukov, S. Rudinac, C. Wartena, V. Murdock, G. Friedland, R. Ordelman, and G. J. F. Jones. Automatic tagging and geotagging in video collections and communities. In ICMR '11. ACM, 2011.
[7] H. Müller. ImageCLEF: experimental evaluation in visual information retrieval. Springer, Heidelberg, 2010.
[8] B. Rowe, D. Wood, A. Link, and D. Simoni. Economic impact assessment of NIST's Text Retrieval Conference (TREC) program. Technical report, July 2010.
[9] A. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In MIR '06, 2006.
[10] L. Terveen and W. Hill. Beyond recommender systems: Helping people help each other. In HCI in the New Millennium. Addison-Wesley, 2001.
[11] T. Tsikrika, J. Kludas, and A. Popescu. Building reliable and reusable test collections for image retrieval: The Wikipedia task at ImageCLEF. IEEE Multimedia, 99(PrePrints), 2012.
[12] E. M. Voorhees. Overview of TREC 2005. In TREC, 2005.
[13] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW '05. ACM, 2005.