Recommender Systems Evaluation: A 3D Benchmark

Alan Said, TU Berlin, alan@dai-lab.de
Domonkos Tikk, Gravity R&D, domonkos.tikk@gravityrd.com
Yue Shi, TU-Delft, y.shi@tudelft.nl
Martha Larson, TU-Delft, m.a.larson@tudelft.nl
Klara Stumpf, Gravity R&D, klara@gravityrd.com
Paolo Cremonesi, Politecnico di Milano, paolo.cremonesi@polimi.it

ABSTRACT
Recommender systems add value to vast content resources by matching users with items of interest. In recent years, immense progress has been made in recommendation techniques. The evaluation of these techniques has, however, not kept pace, and this threatens to impede the further development of recommender systems. In this paper we propose an approach that addresses this impasse by formulating a novel evaluation concept that adopts aspects from both recommender systems research and industry. Our model can express the quality of a recommender algorithm from three perspectives: the end consumer (user), the service provider and the vendor (business and technique for both). We review current benchmarking activities and point out their shortcomings, which are addressed by our model. We also explain how our 3D benchmarking framework would apply to a specific use case.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval models

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012, September 9, 2012, Dublin, Ireland.

1. INTRODUCTION & MOTIVATION
Recommender systems identify items suitable for specific users in large content collections. Despite recent commercial and research efforts, a systematic evaluation model that addresses and considers all aspects and participants of a recommender system is still missing.
In this paper we propose a 3D Recommender System Benchmarking Model that covers all dimensions that impact the effectiveness of recommender systems in real-world settings. The concept builds on a study of benchmarking settings from research and industry and provides a common basis for comparing recommender systems, independent of setting, data and purpose. Our benchmarking concept captures three evaluation aspects which are shared by all recommender systems, independent of whether they are research systems or industrial products. As the three main evaluation dimensions we identify user requirements, business requirements and technological constraints, each represented by a set of qualities which ensure the general applicability of these procedures. For each particular recommendation problem, the instantiation and relevance of these requirements should be specified.

The motivation behind this framework is the growing importance of recommender systems. Users cannot be assumed to have the necessary overview to specify their information needs in vast content collections. However, given the variety of data and recommendation tasks, the comparison of algorithms, approaches and general concepts becomes infeasible due to the inherent differences in requirements, design choices, etc. This calls for a comprehensive benchmarking framework that sets data- and task-specific requirements driven by particular real-world applications.

The benefits of benchmarking. Benchmarks formulate standardized tasks, making it possible to compare the performance of algorithms. They have been highly successful in information retrieval, e.g. the Text Retrieval Conference (TREC) [12], and in multimedia retrieval, e.g. ImageCLEF [7], TRECVid [9] and MediaEval [6]. Benchmarks yield two types of benefits: (1) they serve to support the development of new technologies in the research community [9, 11] and (2) they create economic impact by bringing research closer to the market [8].

Existing recommendation benchmarks. Today's benchmarks are limited by their simplified views of users and of data. The problem setting of the Netflix Prize (http://www.netflixprize.com), ground-breaking at its time, was focused on a single functional requirement: the qualitative assessment of recommendation was simplified to the root mean squared error (RMSE) of predicted ratings. Its simplified view treated users as needing no further output from the recommender system than a rating on individual items. The data set was equally restricted to user ratings; additional information available in a real-world recommender system environment was not considered.
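For reference, the Netflix Prize criterion is the root mean squared error over a held-out test set T of (user, item) pairs; written in our own notation (not the paper's), with r_{ui} the observed and \hat{r}_{ui} the predicted rating of user u for item i, it is

    RMSE = \sqrt{ \frac{1}{|T|} \sum_{(u,i) \in T} ( \hat{r}_{ui} - r_{ui} )^2 }.

Lower values indicate more accurate rating prediction, but, as argued above, this single number says nothing about the non-functional qualities of a deployed system.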
Furthermore, the Prize did not take non-functional requirements into account, which arise from the business goals and technical parameters of the recommendation service, even though aspects such as scalability, reactivity, robustness and adaptability are key for the productive operation of recommender systems.

The series of Context-Aware Movie Recommendation (CAMRa) challenges explored the usefulness of contextual data in recommendations. The 2010 challenge (http://www.dai-labor.de/camra2010/challenge/) provided special features on movie mood, movie location and intended audience (Moviepilot track), as well as social relationships between users and user activities on a movie-related social site (Filmtipset track). The time of the recommendation was also considered as context (Week track). Although the challenges expanded the data sources used, the evaluation translated real-world user needs into classification accuracy metrics for judging the systems in the contest, and non-functional requirements of the solutions were not investigated.

The limitations of the Netflix Prize and the CAMRa series are characteristic of currently existing benchmarks and data sets. The concept presented in this paper approaches this challenge by placing central focus on real-world user needs; on large, heterogeneous, multi-source data sets; and on evaluating both functional (quality-related) and non-functional (technical and business goal-related) requirements.

2. 3D RECOMMENDATION EVALUATION
In order to extend the state of the art of evaluation, we propose a concept for evaluation metrics that incorporates the needs of all perspectives in the recommendation spectrum. The concept defines a set of benchmarking techniques that select the correct combination of (i) data sets, (ii) evaluation methods and (iii) metrics according to a three-dimensional requirement space: business models, user requirements and technical constraints, see Fig. 1.

[Figure 1: The three proposed evaluation axes: business models, user requirements and technical constraints.]

Business models allow a company to generate revenue. Different models lead to different requirements in terms of the expected value from a recommender system. For instance, in a pay-per-view video-on-demand business model, the goal of the recommender system is to increase sales and thereby allow the company to maximize revenues. In subscriber-based video-on-demand business models, however, the driving force may be to get users to return to the service in the future (a typical showcase where recommender systems help [1]). Business models may be influenced by the choice of the objective function in the recommender algorithm; prediction-based or ranking-based functions reflect different business metrics.

User requirements reflect users' perspectives. Recommenders are assets for user satisfaction and persuasion, i.e., they try to influence a user's attitude or behavior [4]; the usability of the system affects the user's perception of it. Recommendations may have different goals, e.g. to reduce information overload, to facilitate search, and to find interesting items, increasing the quality and decreasing the duration of the decision-making process.

Technical constraints. Recommender systems in real life must take into account a number of technical requirements and constraints. These can be classified as data and system constraints, scalability and robustness requirements. Data constraints relate to the service architecture, e.g. satellite TV lacks a return channel for feedback, hindering the use of collaborative filtering algorithms. System constraints derive from hardware and/or software limitations, e.g. in a mobile TV scenario the processing power of the hand-held device is limited, excluding resource-heavy algorithms on the client side. Scalability requirements derive from the need to provide instant recommendations to all users on all items; these requirements are particularly strict in linear TV, where viewers are used to quick responsiveness. Robustness requirements are needed to create good services that keep working in case of data or component failures in distributed systems.
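To make the three-dimensional requirement space more concrete, the sketch below shows one possible way a requirement profile and the benchmark components selected for it could be written down. It is purely illustrative: the class and field names (BenchmarkSpec and the rest) are our own assumptions and not part of the proposed model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BenchmarkSpec:
    """Illustrative specification of one point in the 3D requirement space."""
    # Business-model axis: how the service generates revenue and what it values.
    business_model: str               # e.g. "pay-per-view VoD" or "subscriber-based VoD"
    business_metrics: List[str]       # e.g. ["revenue lift", "customer retention"]
    # User-requirement axis: what users need from the recommendations.
    user_requirements: List[str]      # e.g. ["easy content exploration", "context-dependent lists"]
    # Technical-constraint axis: limits imposed by the deployment environment.
    technical_constraints: List[str]  # e.g. ["response within 10-1000 ms", "no return channel"]
    # The combination the benchmark selects for this requirement profile.
    data_sets: List[str] = field(default_factory=list)
    evaluation_methods: List[str] = field(default_factory=list)  # off-line, on-line, or both
    metrics: List[str] = field(default_factory=list)             # e.g. ["RMSE", "precision@N"]

# Example instantiation for the pay-per-view VoD scenario mentioned above.
ppv_vod = BenchmarkSpec(
    business_model="pay-per-view VoD",
    business_metrics=["revenue lift"],
    user_requirements=["easy content exploration"],
    technical_constraints=["recommendations within 10-1000 ms"],
    data_sets=["service interaction logs"],
    evaluation_methods=["off-line", "on-line"],
    metrics=["precision@N", "revenue lift"],
)
```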
3. EVALUATION SETTING

3.1 Current evaluation methodologies
Existing evaluation methods for recommender systems can be classified into system-oriented evaluation, user-oriented evaluation, or a combination of both [3]. In system-oriented (off-line) evaluation users are not involved; instead, a data set is partitioned into training and test sets, and the data points in the test set are predicted using a model built on the training set. In user-oriented (on-line) evaluation, feedback from users interacting with the system is collected through explicit questions or implicit observation.
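As a minimal sketch of the off-line protocol just described (the triple-based data format, the split ratio and the item-mean baseline are our own assumptions, chosen only for illustration):

```python
import random
from collections import defaultdict
from math import sqrt

def offline_evaluate(ratings, test_fraction=0.2, seed=42):
    """Off-line (system-oriented) evaluation: hold out part of the data,
    predict the held-out ratings and report the prediction error.

    `ratings` is a list of (user, item, rating) triples; the item-mean
    predictor used here is only a placeholder for a real algorithm."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]

    # "Train": estimate a mean rating per item from the training set.
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in train:
        sums[item] += r
        counts[item] += 1
    global_mean = sum(r for _, _, r in train) / len(train)

    def predict(item):
        return sums[item] / counts[item] if counts[item] else global_mean

    # "Test": predict the held-out data points and measure RMSE.
    squared_error = sum((predict(item) - r) ** 2 for _, item, r in test)
    return sqrt(squared_error / len(test))

if __name__ == "__main__":
    toy = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0),
           ("u2", "i3", 2.0), ("u3", "i2", 4.0), ("u3", "i3", 1.0)]
    print("RMSE:", offline_evaluate(toy, test_fraction=0.3))
```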
Competitions and challenges built around recommender systems are mostly organized to find the most accurate models. As described in Table 1, recommender systems are mostly evaluated off-line, and often the business value of the technologies is not examined. Even though accuracy may influence user satisfaction and revenue indirectly, there is no established way to evaluate the dimensions of user requirements and business models. In most cases the off-line evaluation scheme is chosen, and algorithms are evaluated by error, ranking or classification accuracy measures. Many challenges (e.g. the Netflix Prize) use explicit ratings to profile users; other recommender scenarios (e.g. item-to-item recommendation) are not addressed. Technical constraints are uncommon in contests, the exception being the RecLab Prize (http://overstockreclabprize.com/). Even if a certain method performs well on a data set, its integrability into a real-world system is still not addressed. This deficiency is partially solved by on-line testing methods (as seen in CAMRa), where recommender systems were tested in a real environment, but an objective metric showing the real applicability of the tested system is missing. In the RecLab Prize, the evaluated metric is the revenue increase generated by the system; the organizers also specified non-functional requirements for eligibility in the semi-final (top 10 teams), but user requirements were not considered. These approaches all contain metrics and methods moving towards our 3D model, but none of them provides a comprehensive model.

Table 1: An overview of some recommender system-related contests from the perspective of our 3D evaluation.

Challenge     | Task(s)                                                                | Metric          | Mode               | User                                                          | Business                      | Technical
--------------|------------------------------------------------------------------------|-----------------|--------------------|---------------------------------------------------------------|-------------------------------|-------------------------------------
Netflix Prize | minimize rating prediction error                                       | RMSE            | off-line           | indirect: error measure                                       | not addressed                 | not addressed
KDD-Cup'07    | 1: predict who rated what; 2: predict number of ratings                | RMSE            | off-line           | not addressed                                                 | detect trends & popular items | not addressed
RecLab Prize  | increase revenue                                                       | revenue lift    | on-line & off-line | not addressed                                                 | revenue lift                  | response/learning time, scalability
KDD-Cup'11    | 1: minimize rating prediction error; 2: split popular/unpopular items  | RMSE; ErrorRate | off-line           | indirect: error measure; find interesting or irrelevant items | not addressed                 | not addressed
KDD-Cup'12    | 1: predict followed users; 2: click-through rate (CTR) prediction      | MAP@3; MAE, AUC | off-line           | exploring interesting users & sources; ad targeting           | not addressed                 | not addressed
CAMRa'10      | context-aware: 1: temporal, 2: emotional, 3: social                    | MAP, P@N, AUC   | off-line & on-line | contextual information influences preference                  | not addressed                 | not addressed
CAMRa'11      | group recommendation; rater identification                             | ErrorRate       | off-line           | group & target recommendation                                 | indirect: satisfaction        | not addressed
CAMRa'12      | find users for specific items                                          | impact          | on-line            | split interesting and irrelevant content                      | increase audience             | not addressed

3.2 Currently existing metrics
On-line evaluation is the only technique able to measure true user satisfaction; conducting such evaluations is, however, time consuming and cannot be applied generally, but only to limited scenarios [2]. In contrast, off-line testing has the advantage of being immediate and easy to perform on several data sets with multiple algorithms. The question is whether differences between the off-line performance of algorithms carry over to differences in their on-line performance in various recommendation situations.

Classification metrics measure how well a system is able to classify items correctly, e.g. precision and recall. Predictive metrics measure to what extent a system can predict the ratings of users; as rated items have an order, predictive accuracy metrics can also be used to measure item ranking ability. Coverage metrics measure the percentage of items for which the system can make recommendations [13]. Confidence metrics measure how certain the system is of the accuracy of its recommendations. Additionally, learning rate metrics capture how gradually the quality of a recommender algorithm increases.
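A compact sketch of one common way to compute two of the metric families just mentioned from top-N recommendation lists; the function names and list-based data layout are our assumptions, chosen for illustration only:

```python
def precision_recall_at_n(recommended, relevant, n):
    """Classification metrics: fraction of the top-n list that is relevant
    (precision) and fraction of relevant items that appear in it (recall)."""
    top_n = recommended[:n]
    hits = len(set(top_n) & set(relevant))
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def catalog_coverage(recommendation_lists, catalog):
    """Coverage metric: percentage of catalog items that the system is able
    to recommend to at least one user."""
    recommended = set(item for items in recommendation_lists for item in items)
    return 100.0 * len(recommended & set(catalog)) / len(catalog)

# Toy example with two users and a five-item catalog.
lists = {"u1": ["i1", "i2", "i3"], "u2": ["i2", "i4", "i1"]}
print(precision_recall_at_n(lists["u1"], relevant=["i2", "i5"], n=3))        # (0.33..., 0.5)
print(catalog_coverage(lists.values(), catalog=["i1", "i2", "i3", "i4", "i5"]))  # 80.0
```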
A recommender system can recommend accurate items, have good coverage and diversity, and still not satisfy a user if the recommendations are trivial [10]. The state of the art in recommendation evaluation metrics reflects different recommendation tasks. Diversity, novelty, serendipity and user satisfaction are especially difficult to measure off-line. Diversity is important for the usefulness of a recommendation, and therefore there is a need to define an intra-list similarity metric [13]. Novelty and serendipity are two dimensions of non-obviousness [3].

3.3 Possible Extensions of Methods & Metrics
Real-world recommender systems should satisfy (1) functional requirements that relate to the qualitative assessment of recommendations and (2) non-functional requirements specified by the technological parameters and business goals of the service. Functional and non-functional requirements should be evaluated together: without the ability to provide accurate recommendations, no recommender system can be valuable; as poor quality has adverse effects on customers, it will not serve the business goal. Similarly, if the recommender does not scale with the service and cannot provide recommendations in real time, neither users nor the service provider benefit from it. Thus, a trade-off between these requirements is needed for an impartial and comprehensive evaluation of real-world recommenders.

Scalable recommenders provide good quality recommendations independently of data size, growth and dynamics. They are able to (1) process huge volumes of data during initialization using computational resources that scale linearly with data size, and (2) serve large numbers of parallel recommendation requests in real time without significant degradation in service quality. In our model, scalability is found on the technical requirement axis.

Reactivity ensures good recommendations in real time, where the time threshold depends on the use case, typically in the range of 10-1000 ms. Adaptability is important to react to changes in user preferences, content availability and contextual parameters. In our 3D model, reactivity and adaptability belong to the user requirement axis.

Robustness is needed to handle partial, missing or corrupted data, both in the system initialization and operational phases. Robustness belongs to the business axis of our model.

Generally speaking, none of the requirements are mutually exclusive; instead, optimization should be based on a combination of them, adapted to the setting in which the recommender system will be deployed [5].

As an example, a Video-on-Demand (VoD) service from the IPTV industry serves as a potential scenario for our model. Business goals include increased VoD sales and customer retention, but may have additional aspects (e.g. promoting content). The technical constraints are partly specified by the middleware and the hardware/software configuration of the service provider; these all influence the response time of the service, which is crucial. Via the service interface, the user gets recommendations based on the context, which might be translated into different recommendation tasks. From a user perspective, easy content exploration and context-dependent recommendation may be the most important aspects.

4. CONCLUSION
We proposed a 3D Recommender System Benchmarking Model that extends the state of the art and addresses both functional and non-functional, real-world, application-driven aspects of recommender systems. Following the proposed concept, the benchmarking activities within the community will be able to encompass the full range of recommender system use cases and algorithmic approaches. The comprehensive evaluation methodology will boost the development of more effective recommender systems, make it possible to focus research resources productively, and help industry technology providers to increase the uptake of recommender technology.

5. REFERENCES
[1] M. B. Dias, D. Locher, M. Li, W. El-Deredy, and P. J. Lisboa. The value of personalised recommender systems to e-business: a case study. In RecSys '08. ACM, 2008.
[2] M. Gorgoglione, U. Panniello, and A. Tuzhilin. The effect of context-aware recommendations on customer purchasing behavior and trust. In RecSys '11, pages 85-92. ACM, 2011.
[3] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1), 2004.
[4] R. Hu. Design and user issues in personality-based recommender systems. In RecSys '10. ACM, 2010.
[5] T. Jambor and J. Wang. Optimizing multiple objectives in collaborative filtering. In RecSys '10. ACM, 2010.
[6] M. Larson, M. Soleymani, P. Serdyukov, S. Rudinac, C. Wartena, V. Murdock, G. Friedland, R. Ordelman, and G. J. F. Jones. Automatic tagging and geotagging in video collections and communities. In ICMR '11. ACM, 2011.
[7] H. Müller. ImageCLEF: experimental evaluation in visual information retrieval. Springer, Heidelberg, 2010.
[8] B. Rowe, D. Wood, A. Link, and D. Simoni. Economic impact assessment of NIST's Text Retrieval Conference (TREC) program. Technical report, July 2010.
[9] A. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In MIR '06, 2006.
[10] L. Terveen and W. Hill. Beyond recommender systems: Helping people help each other. In HCI in the New Millennium. Addison-Wesley, 2001.
[11] T. Tsikrika, J. Kludas, and A. Popescu. Building reliable and reusable test collections for image retrieval: The Wikipedia task at ImageCLEF. IEEE Multimedia, 99(PrePrints), 2012.
[12] E. M. Voorhees. Overview of TREC 2005. In TREC, 2005.
[13] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW '05. ACM, 2005.