=Paper=
{{Paper
|id=None
|storemode=property
|title=Toward a New Protocol to Evaluate Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-910/paper2.pdf
|volume=Vol-910
|dblpUrl=https://dblp.org/rec/conf/recsys/MeyerFCG12
}}
==Toward a New Protocol to Evaluate Recommender Systems==
Frank Meyer, Françoise Fessant, Fabrice Clérot
Orange Labs, av. Pierre Marzin, 22307 Lannion cedex, France
{franck.meyer,francoise.fessant,fabrice.clerot}@orange.com

Eric Gaussier
University of Grenoble - LIG, UFR IM2AG - LIG/AMA, Grenoble Cedex 9, France
eric.gaussier@imag.fr
ABSTRACT
In this paper, we propose an approach to analyze the performance and the added value of automatic recommender systems in an industrial context. We show that recommender systems are multifaceted and can be organized around 4 structuring functions: help users to decide, help users to compare, help users to discover, help users to explore. A global off-line protocol is then proposed to evaluate recommender systems. This protocol is based on the definition of appropriate evaluation measures for each aforementioned function. The evaluation protocol is discussed from the perspective of the usefulness and trust of the recommendation. A new measure called Average Measure of Impact is introduced. This measure evaluates the impact of the personalized recommendation. We experiment with two classical methods, K-Nearest Neighbors (KNN) and Matrix Factorization (MF), on the well-known Netflix dataset. A segmentation of both users and items is proposed to analyze finely where the algorithms perform well or badly. We show that the performance is strongly dependent on the segments and that there is no clear correlation between the RMSE and the quality of the recommendation.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering – collaborative filtering, recommender system; H.3.4 [Systems and Software]: Performance evaluation (efficiency and effectiveness) – performance measures, usefulness of recommendation.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Recommender systems, Industrial context, evaluation, Compare, Explore, Decide, Discover, RMSE, utility of recommendation

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.

1. INTRODUCTION
The aim of recommender systems is to help users find items that should interest them in large catalogs. One frequently adopted measure of the quality of a recommender system is accuracy (for the prediction of the ratings of users on items) [1,14]. Yet in many implementations of recommender system services, the rating prediction function is either not provided, or not highlighted when it is provided (in industrial contexts, the generated recommendations themselves and their utility are more important than the rating predictions). There is increasing consensus in the community that accuracy alone is not enough to assess the practical effectiveness and added value of recommendations [8,13]. Recommender systems in industrial contexts are multifaceted, and we propose to consider them around the definition of 4 key recommendation functions which meet the needs of users facing a huge catalog of items: how to decide, how to compare, how to explore and how to discover. Once the main functions are defined, the next question is how to evaluate a recommender system on its various facets. We will review, for each function, the key points for its evaluation and the available measures, if they exist. In particular, we will introduce a dedicated measure for the function "help to discover". This function raises the question of the evaluation from the point of view of the usefulness of the recommendation. We will also present a global evaluation protocol able to deal with the multifaceted aspect of recommender systems, which requires at least a simple segmentation of users and items. The remainder of the paper is organized as follows: the next section introduces the four core functions of an industrial recommender system. Then the appropriate measures for each core function are presented, as well as the global evaluation protocol. The last part of the paper is dedicated to experimental results and conclusion.

2. MAIN FEATURES OF RECOMMENDER SYSTEMS
Automatic recommender systems are often used on e-commerce websites. These systems work in conjunction with a search engine for assistance in catalog browsing, to help users find relevant content. As many users of e-commerce websites are anonymous, a very important feature is the contextual recommendation of items for anonymous users. Since the purpose of these systems is also to increase usage (the audience of a site) or sales, the recommendation itself is more important than the predicted rating. Moreover, prioritizing a list of items on a display page is a more important functionality than the prediction of a rating. These observations, completed with interviews with marketers and project managers of Orange about their requirements regarding recommender systems and an overview of recommender systems in both the academic and the industrial fields [10], have led us to organize the recommender systems' functionalities into 4 main features:

Help to Decide. Given an item, a user wants to know whether he will appreciate the item. This feature consists of the prediction of a rating for a user and an item, and is today mainstream in the academic literature [14].
Help to Compare. Given several items, a user wants to know which item to choose. This feature corresponds to a ranking function. It can be used to provide recommendation lists [5] or to provide personalized sorting of the results of requests on a catalog.

Help to Discover. Given a huge catalog of items, a user wants to find a short list of new interesting items. This feature is usually called item-based top-N recommendation in the academic literature [6]. It corresponds to personalized recommendation. Note that the prediction of the highest rated item is not necessarily the most useful recommendation [5]. For instance, the item with the highest predicted rating will most likely be already known by the user.

Help to Explore (or Navigate). Given one item, an (anonymous) user wants to know what the related items are. This feature corresponds to the classical item-to-item recommendation to anonymous users popularized by the e-commerce website Amazon [9] during catalog browsing. This function is widely used in the industry because it can make recommendations for anonymous users, based on the items they consult. It requires a similarity function between items.

3. EVALUATION OF INDUSTRIAL RECOMMENDER SYSTEMS
In this section we discuss the appropriate measures for each core function and a global protocol for the evaluation of the recommender system. The evaluation is viewed from the standpoint of the utility of the recommendation for each user and each item.

3.1 Utility of the recommendation
A good recommender system should avoid bad and trivial recommendations. The fact that a user likes an item and the fact that an item is already known by the user have to be distinguished [7]. A good recommendation corresponds to an item that would probably be well rated by the user, but also an item that the user does not know. For instance, it is worthless to recommend the blockbuster of the year to all users: it may be a well-rated movie on average, but it is not a useful recommendation as most people have probably already seen it.

3.2 Item segmentation and user segmentation
Another important issue for an industrial application is to fully exploit the available catalog, including its long tail, consisting of items rarely purchased [2]. A system's ability to make a recommendation, in a relevant way, for all items in the catalog is therefore important. However, Tan and Netessine [16] have observed, on the Netflix dataset for instance, that the long tail effect is not so obvious. There is more of a Pareto distribution (20% of the most rated items represent 80% of the global ratings) in the Netflix data than a long tail distribution as proposed by Anderson [2] (where infrequent items globally represent more ratings). They also noticed that the behavior of the users and the type of items they purchase are linked. In particular, customers who watch items in the long tail are in fact heavy users, while light users tend to focus only on popular items. These observations lead us to introduce the notion of segments of items and users. The definition of the segment thresholds must be relative and catalog dependent. We will use the terms light/heavy user segment and unpopular/popular item segment instead of the long tail and short head concepts. In a first step we will use this simple segmentation to analyze how an industrial recommender system can help all users, both heavy and light, and how it can recommend all items, both popular and unpopular.

3.3 Measures of performance
For our protocol we use a classic train/test split of the data. The train set will be used to compute statistics and thresholds and to build a predictive model. The test set will be used to compute the performance measures. The predictive model should at least be able to provide a rating prediction function for any couple of user and item. We will see that, to provide the "Help to Explore" functionality, the predictive model must also be able, in some way, to produce an item-item similarity matrix allowing it to select, for each item i, its most similar items (the related items). We first detail the performance measures we use for our protocol, according to the 4 core functions.

Help to Decide. The main use case is a user watching an item description on a screen and wondering if he would enjoy it. Giving a good personalized rating prediction will help the user to choose. The "help to decide" function can be given by the rating prediction function and must be measured by an accuracy measure which penalizes extreme errors. The Root Mean Squared Error (RMSE) is the natural candidate [14].

Help to Compare. The main use case here is a user getting an intermediate short list of items after having given her preferences. This user then wants to compare the items of this short list, in order to choose the one she will enjoy most. The function needs a ranking mechanism with a homogeneous quality of ranking over the catalog. A simple measure is the percentage of compatible rank indexes. After modeling, for each user u and for each couple of items (i, j) in the test set rated by u with r_{u,i} \neq r_{u,j}, the preference given by u is compared with the predicted preference given by the recommender method, using the predicted ratings \hat{r}_{u,i} and \hat{r}_{u,j}. The percentage of compatible preferences is given by:

COMP_u = \frac{1}{|T_u|} \sum_{(i,j) \in T_u} comp_u(i,j)    (3-1)

with T_u = \{(i,j) \text{ rated by } u \text{ in the test set} : r_{u,i} \neq r_{u,j}\}, where comp_u(i,j) is 1 if \hat{r}_{u,i} - \hat{r}_{u,j} has the same sign as r_{u,i} - r_{u,j} and 0 otherwise, and |T_u| is the number of elements of T_u.

Help to Discover. The main use case here is a user getting recommended items: these recommendations must be relevant and useful. For relevancy our approach is the following: an item i recommended to the user u
- is considered relevant if u has rated i in the test set with a rating greater than or equal to u's mean of ratings,
- is considered irrelevant if u has rated i in the test set with a rating lower than u's mean of ratings,
- is not evaluated if it is not present for u (not rated by u) in the test set.

The classical measure to evaluate a recommendation list is the precision measure (recall being difficult to apply in the context of recommendation, as in huge catalogs one does not know all the items relevant for each user). For each user u:

precision_u = \frac{|\{(u,i) \in H_u : i \text{ is relevant for } u\}|}{|H_u|}    (3-2)

H_u stands for the subset of evaluable recommendations in the test set for u, that is to say the set of couples (u,i), i being an item recommended to the user u. |H_u| is the size of H_u, in number of couples (u,i).
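To make the ranking measure concrete, here is a minimal sketch of the per-user COMP of equation (3-1). It is an illustration under our own assumptions (ratings held in plain Python dicts, our function names), not the authors' benchmark code:

```python
from itertools import combinations

def comp_u(test_ratings, predicted_ratings):
    """Percentage of compatible rank indexes for one user, cf. eq. (3-1).

    test_ratings:      dict item_id -> rating given by the user in the test set
    predicted_ratings: dict item_id -> rating predicted by the model for this user
                       (assumed to cover every item of test_ratings)
    Returns None when no pair with distinct true ratings exists.
    """
    compatible, total = 0, 0
    for i, j in combinations(test_ratings, 2):
        if test_ratings[i] == test_ratings[j]:
            continue  # pairs with equal true ratings are not counted
        total += 1
        true_diff = test_ratings[i] - test_ratings[j]
        pred_diff = predicted_ratings[i] - predicted_ratings[j]
        if true_diff * pred_diff > 0:  # same sign: the predicted preference agrees
            compatible += 1
    return compatible / total if total else None
```

Averaging this value over all users gives the global COMP figures reported in Section 5.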
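The relevance rule and the precision of equation (3-2) can be sketched in the same way; again the data layout and the names are our assumptions, not the paper's:

```python
def precision_u(recommended_items, test_ratings, user_mean):
    """Precision over the evaluable recommendations H_u, cf. eq. (3-2).

    recommended_items: the Top-N items recommended to the user
    test_ratings:      dict item_id -> rating of the user in the test set
    user_mean:         the user's rating average (step 6.3 of the protocol)
    """
    evaluable = [i for i in recommended_items if i in test_ratings]  # H_u
    if not evaluable:
        return None  # no evaluable recommendation for this user
    relevant = sum(1 for i in evaluable if test_ratings[i] >= user_mean)
    return relevant / len(evaluable)
```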
However, the precision is not able to measure the usefulness of the recommendations: recommending well-known blockbusters, already known by the user, will lead to a very high precision although it is of very low utility. To account for this, we introduce here the concept of recommendation impact. The basic idea is that the more frequent a recommended item is, the less impact the recommendation has. This is summarized in Table 1.

Table 1. The notion of recommendation impact

Recommending a popular item:
- Impact if the user likes the item: low. The item is likely to be already known, at least by name, by the user.
- Impact if the user dislikes the item: low. Even if the user dislikes this item, he can understand that, as a popular item, this recommendation is likely to appear... at least at the beginning.

Recommending an unpopular (infrequent) item:
- Impact if the user likes the item: high. The service provided by the recommender system is efficient. The rarer the item, the less likely the user would have found it alone.
- Impact if the user dislikes the item: high. Not only was the item unknown and did not inspire confidence, but it also was not good.

We then define the Average Measure of Impact (AMI) for the performance evaluation of the function "Help user to Discover". The AMI of a recommendation list Z for a user u with rating average \bar{r}_u is given by equation (3-3), where H_u denotes the subset of the evaluable recommendations in the test set, Z denotes the set of couples (user, item) representing a set of recommendations, count(i) the number of logs in the train set related to the item i, and |I| the size of the catalog of items. The rarer an item i (rarity being estimated on the train set), the greater the AMI if i is both recommended and relevant for a user u. The greater the AMI, the better the positive impact of the recommendations on u. The AMI will have to be calibrated, as we do not yet know what a "good AMI" is. But we can already compare different algorithms, or different recommendation strategies (such as post-filtering methods to add serendipity), with this measure.

Help to Explore. The main case here is the item-to-item recommendation for an anonymous user who is watching an item description on a screen: the recommender system should propose items similar to the one being watched. We can try to evaluate the performance of this functionality by associating, with each context item i, the KNN of i, using an overall precision measure for the recommended items. But we will have an issue: it can be more effective to associate each context item i with N items optimized only for precision, rather than N items similar to the context item i. It may be more efficient, to optimize precision, to associate blockbusters with each source item. In fact we want to assess the quality of the Help to Explore (navigate) function: we want a good semantic, meaningful similarity for each associated item. But only an experiment with real users can assess this semantic similarity. Our solution is to use the underlying item-item similarity matrix for this evaluation. We can assess the overall quality of the pairs of similar items by an indirect method: 1. given a predictive model, find a way to compute similarities between any pair of items, building an item-item similarity matrix; 2. use an item-item K-Nearest Neighbors (KNN) model [12] based on this matrix. The assumption is that a good similarity matrix must lead to good performances for the other aspects of the recommendation when used in an item-item KNN model. This is the approach we take, using RMSE, precision, and ranking performance measures. For a KNN-type algorithm, this analysis is straightforward and simple: the similarity matrix is already the kernel of the model. The algorithms that are not directly based on a similarity measure need a method for extracting the similarities between the items. For matrix-factorization-based algorithms, this can correspond to a method computing similarities between the factors of the items.

3.4 Evaluation Protocol
The evaluation protocol is then designed thanks to the mapping between the 4 core functions and the associated performance measures, as summarized in Table 2.

Table 2. Adapted measures for each core function

- Decide. Quality criterion: accuracy of the rating prediction, with penalization of extreme errors to minimize the risk of a wrong decision. Measure: RMSE.
- Compare. Quality criterion: good predicted ranking for every couple of items of the catalog. Measure: COMP, the % of compatible rank indexes.
- Discover. Quality criteria: selection, for a user, of the most preferred items in a list of items (Precision, not recommended alone!); identification of good/bad, i.e. precise, useful and trusted, recommendations (Average Measure of Impact, AMI).
- Explore. Quality criteria: precise recommendations; identification of good/bad recommendations. Measure: a similarity matrix leading to good performances in accuracy, relevancy, usefulness and trust.
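The AMI of equation (3-3) combines the relevance rule of Section 3.3 with a rarity weight derived from count(i) and |I|. The exact closed form is not reproduced here, so the sketch below is only one plausible instantiation: we assume a signed log-rarity weight, positive when the recommended item is relevant for the user and negative otherwise, which matches the qualitative behaviour described above but is our choice, not the paper's formula.

```python
import math

def ami_u(recommendations, test_ratings, user_mean, item_count, catalog_size):
    """Sketch of an Average-Measure-of-Impact-like score for one user (cf. eq. 3-3).

    recommendations: items recommended to the user (the list Z restricted to u)
    test_ratings:    dict item_id -> rating of the user in the test set
    user_mean:       the user's rating average
    item_count:      dict item_id -> count(i), number of train-set logs for item i
    catalog_size:    |I|, the number of items in the catalog
    The log2 rarity weight below is an assumption, chosen only to reproduce the
    behaviour described in the text: rarer items yield larger impacts, relevant
    items count positively, irrelevant ones negatively.
    """
    evaluable = [i for i in recommendations if i in test_ratings]  # H_u ∩ Z
    if not evaluable:
        return None
    total = 0.0
    for i in evaluable:
        rarity = math.log2(catalog_size / max(item_count.get(i, 0), 1))
        sign = 1.0 if test_ratings[i] >= user_mean else -1.0
        total += sign * rarity
    return total / len(evaluable)
```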
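The indirect "Help to Explore" evaluation of Section 3.3 needs an item-item similarity matrix even for models, such as MF, that do not contain one. Section 5.5 indicates that item factor vectors are compared with a Pearson similarity; the sketch below illustrates that idea with NumPy (the array names and the neighborhood size are ours, and for a large catalog the dense similarity matrix would have to be computed in blocks):

```python
import numpy as np

def item_knn_from_factors(item_factors, k=100):
    """Build a K-nearest-neighbour list per item from MF item factor vectors.

    item_factors: array of shape (n_items, n_factors), one row per item.
    Returns, for each item index, the indices of its k most similar items,
    using the Pearson correlation between factor vectors as the similarity.
    """
    centered = item_factors - item_factors.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    norms[norms == 0] = 1e-12              # guard against constant factor vectors
    normalized = centered / norms[:, None]
    sim = normalized @ normalized.T        # item x item Pearson correlation matrix
    np.fill_diagonal(sim, -np.inf)         # an item is not its own neighbour
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    return neighbours, sim
```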
The following notations are adopted: a log (u, i, r) corresponds to a user u who rated an item i with the rating r. U is the set of all the users, I is the set of all the items. Given a dataset D of logs and an algorithm A, the evaluation protocol we propose is as follows:

Initialization
Randomly split the dataset into 2 datasets, train and test.
Use the train dataset to generate a model with the algorithm A.

Evaluation
1. For each log (u, i, r) of the test set:
   1.1 compute the predicted rating of the model
   1.2 compute the predicted rating error
2. Use the RMSE, which gives an indicator of the performance of the Help to Decide function.
3. For each user u of U:
   3.1 sort all u's logs of the test set by rating
   3.2 sort all u's logs of the test set by rating prediction
   3.3 compute COMP by comparing the indexes of u's logs and the indexes of the predicted ratings of the logs.
4. Use the averaged COMP as an indicator of the Help to Compare function.
5. For each item i of I, compute count(i), which is the number of logs in the train set referencing i.
6. For each user u of U:
   6.1 compute the predicted rating of each item i of I.
   6.2 select the top-N highest predicted rating items, noted i_{u,1} to i_{u,N}, which are the Top-N recommended items.
   6.3 compute the rating average of u, noted \bar{r}_u.
   6.4 for each recommended item i_{u,j} of u:
       6.4.1 check if a corresponding log (u, i_{u,j}, r) exists. If so, the recommendation of i_{u,j} is evaluable, else skip step 6.4.2.
       6.4.2 if r ≥ \bar{r}_u then the recommendation is considered relevant (and irrelevant otherwise).
   6.5 compute the Precision and the AMI for the evaluable recommendations.
7. Use the Precision and the AMI, averaged over users, as the indicators for the Help to Discover function.
8. Specify a way to compute efficiently, using the model of the algorithm A, the similarity between every couple of items (i, j).
9. Compute the similarity matrix of all the couples (i, j) of I×I.
10. Use this similarity matrix as the kernel of an item-item K-Nearest Neighbor model, then run the protocol steps 1 to 7 for RMSE, COMP, AMI and Precision to obtain a 4-dimensional indicator of the quality of the Help to Explore function.

4. EXPERIMENTS
4.1 Datasets and configuration
Experiments are conducted on the widely used Netflix dataset [3]. This dataset has the advantage of being public and allows performance comparisons with many other techniques. Agnostic thresholds are used for the segments of users and items, depending on the dataset. We used simple thresholds based on the mean number of ratings to split items into popular and unpopular (infrequent) items, and similarly to split users into heavy and light users. For instance, on Netflix, using a train set of 90% of the total of logs, the mean number of ratings for the users is 190 (heavy users are users who gave more than 190 ratings, otherwise they are light users) and the mean number of ratings for the items is 5089 (popular items are items with more than 5089 ratings, otherwise they are unpopular items). The number of generated items for the Top-N recommendation is always N=10. All our tests are carried out on this configuration: a personal computer with 12 GB of RAM, a 64-bit 4-core Intel Xeon W3530 processor running at 2.8 GHz and a 350 GB hard disk. All algorithms and the benchmark process are written in Java.

4.2 Algorithms
We chose to use 2 models: fast matrix factorization using the MF algorithm presented in [15] and an item-item KNN algorithm [12]. These algorithms are mainstream techniques for recommender systems. For MF we analyze the effect of the number of factors; for the KNN algorithm we analyze the effect of K, the number of nearest neighbors kept in the model. In addition, to compare the performances of these 2 algorithms, 2 baseline algorithms are also used:
- a simple default predictor using the mean of the item and the mean of the user (the sum of the two means, if available, divided by 2). This algorithm is also used by the KNN algorithm when no KNN items are available for a given item to score.
- a random predictor, generating uniform ratings in [1..5] for each rating prediction.

One industrial requirement of our system was that it could take into account new items and new users every 2 hours. Considering other process and I/O constraints, the modeling time was then restricted to 1.5 hours for all the algorithms. This has implications for the MF algorithm, as on Netflix it always reaches an optimum between 16 and 32 factors: this is constant across all our tests, for all the performance measures. Beyond 32 factors, MF does not have enough time to converge. Note that this convergence may be slow, longer than 24 hours for more than 100 factors on the Netflix dataset.

Implementation details
Our implementation of MF is similar to that of the BRISMF implementation [15], with a learning rate of 0.030 and a regularization factor of 0.008, with early stopping. The learning process is stopped after 1.5 hours, or when the RMSE increases three consecutive times (the increase or decrease of the RMSE is controlled on a validation set consisting of 1.5% of the train set). We used an implementation of the item-item KNN model as described in [11]. The similarity function is the Weighted Pearson similarity [4]. All details about the implementations can be found in [10].
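As an illustration of the training regime just described (BRISMF-like SGD, learning rate 0.030, regularization 0.008, a 1.5 hour budget, early stopping when the validation RMSE increases three consecutive times), here is a schematic training loop. It is a sketch under those stated hyper-parameters, not the authors' Java implementation, and it omits the bias handling of BRISMF; the time budget is only checked between passes.

```python
import time
import numpy as np

def train_mf(train_logs, val_logs, n_users, n_items, n_factors=16,
             lr=0.030, reg=0.008, time_budget_s=1.5 * 3600):
    """Regularized SGD matrix factorization with time-budgeted early stopping."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item factors

    def val_rmse():
        se = sum((r - P[u] @ Q[i]) ** 2 for u, i, r in val_logs)
        return (se / len(val_logs)) ** 0.5

    start, prev_rmse, worse_in_a_row = time.time(), float("inf"), 0
    while time.time() - start < time_budget_s:
        for u, i, r in train_logs:                  # one pass over the train set
            pu = P[u].copy()
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
        rmse = val_rmse()
        worse_in_a_row = worse_in_a_row + 1 if rmse > prev_rmse else 0
        prev_rmse = rmse
        if worse_in_a_row >= 3:                     # 3 consecutive validation increases
            break
    return P, Q
```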
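The segment thresholds of Section 4.1 and the two baselines of Section 4.2 are simple train-set statistics. A minimal sketch, assuming the logs are available as (user, item, rating) triples and using our own function names:

```python
import random
from collections import defaultdict

def train_statistics(train_logs):
    """Per-user and per-item counts and means from (user, item, rating) logs."""
    user_ratings, item_ratings = defaultdict(list), defaultdict(list)
    for u, i, r in train_logs:
        user_ratings[u].append(r)
        item_ratings[i].append(r)
    user_mean = {u: sum(rs) / len(rs) for u, rs in user_ratings.items()}
    item_mean = {i: sum(rs) / len(rs) for i, rs in item_ratings.items()}
    user_count = {u: len(rs) for u, rs in user_ratings.items()}
    item_count = {i: len(rs) for i, rs in item_ratings.items()}
    return user_mean, item_mean, user_count, item_count

def segment(user_count, item_count):
    """Heavy/light users and popular/unpopular items, split at the mean count (Section 4.1)."""
    user_threshold = sum(user_count.values()) / len(user_count)   # 190 on the Netflix train set
    item_threshold = sum(item_count.values()) / len(item_count)   # 5089 on the Netflix train set
    heavy_users = {u for u, c in user_count.items() if c > user_threshold}
    popular_items = {i for i, c in item_count.items() if c > item_threshold}
    return heavy_users, popular_items

def default_predict(u, i, user_mean, item_mean):
    """Default predictor of Section 4.2: average of the available user and item means."""
    means = [m[k] for m, k in ((user_mean, u), (item_mean, i)) if k in m]
    return sum(means) / len(means) if means else 3.0  # the 3.0 fallback is our assumption

def random_predict():
    """Random baseline: uniform rating in [1..5]."""
    return random.uniform(1, 5)
```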
5. NUMERICAL RESULTS
The following abbreviations are used for the segmentation of the performances: Huser: heavy users, Luser: light users, Pitem: popular items and Uitem: unpopular items (the meaning of unpopular is rather "rare", "infrequent"). For MF we analyze the number of factors used, and for KNN the number of nearest neighbors kept. The full results of our experiments are available in [10].

5.1 "Help to Decide" performances
The global default predictor has an RMSE of 0.964 and the global random predictor has an RMSE of 1.707.

KNN's RMSE performances: Different sizes of neighborhood (K) have been tested, compliant with our tasks in an industrial context. Increasing K generally increases the performances. However, the associated similarity matrix weights must be kept in RAM for efficiency purposes, which is difficult, if not impossible, with high values of K. For very large catalog applications, the size of the KNN matrix must be reasonable (up to 200 neighbors in our tests). The KNN method performs well except when K is small and except for the light-user-unpopular-item segment (Luser Uitem). There is a significant gap between the RMSE for the Luser-Uitem segment (RMSE=1.05) and the RMSE of the heavy-user-popular-item segment (RMSE=0.8). Clearly, the KNN model is not adapted to the former, whereas it performs well on the latter. The optimal number of neighbors is around K=100.

MF's RMSE performances: Different numbers of factors have been tested. MF has difficulties modeling the Luser-Uitem segment: on this segment the RMSE never decreases below 0.96. On the contrary, the RMSE for heavy-user-popular-item is close to 0.81, and the two symmetrical segments, light-user-popular-item and heavy-user-unpopular-item, both have a good (low) RMSE (0.84 and 0.85). The RMSE decreases when the number of factors increases, up to around 20 factors. After that number, the RMSE increases. This is a consequence of our time-constrained early stopping condition. This corresponds to about 140 passes on the train set. The optimal number of factors seems to be between 16 and 32.

5.2 "Help to Compare" performance
The global default predictor has a percentage of compatible rank indexes (COMP) of 69% and the global random predictor has a performance of 49.99%.

MF's and KNN's ranking performances: The results are given for the time-limited run of MF. MF outperforms the KNN model for the light-user segments (with a COMP of 73.5% for MF and 66% for KNN). For the rest, the performances are similar to those of KNN. The maximum ranking compatibility is around 77% for the heavy-user segments.

5.3 "Help to Discover" performance
5.3.1 Analysis using the Precision
The global default predictor has a precision of 92.86%, which is questionable: one can see that a simple Top-10 based on a high rating average is sufficient to obtain a good precision performance. The global random predictor has a precision of 53.04%.

KNN's precision performances: The precision increases as K increases. But the results are not significantly better than those of the default predictor. The precision is better than the default predictor only for the Huser-Pitem segment, and only for at least K=200. Under K=100, it seems better to use a default predictor than a KNN predictor for ranking tasks. Nevertheless, the Huser-Pitem segment is well modeled: the precision for 10 generated items for the KNN model is greater than 97% for the model with 200 neighbors.

MF's precision performances: MF behaves better than the KNN model, especially for the light-user-unpopular-item segment (precision of 96% for F=32 factors, against 83% for the KNN with K>=100).

5.3.2 Analysis using the AMI
The Average Measure of Impact gives slightly negative performances for the random predictor and a small performance for the default predictor: the default predictor "wins" its impact values on unpopular items. Note that the supports of the different evaluated segments are very different, and the weights of the two popular-item segments are significantly higher. The KNN model behaves significantly better than the default predictor for the AMI. For MF, the behavior is much worse than that of a KNN model. In general, the impact of MF is similar to, or lower than, that of the default predictor. An analysis according to the segmentation gives a more detailed view of where the impacts are. Numerical results are summarized in Table 3.

Table 3. AMI according to the segmentation

- MF F=32: Huser-Pitem 0.38; Luser-Pitem 0.26; Huser-Uitem 8.93; Luser-Uitem 10.61; global 0.5.
- KNN K=100: Huser-Pitem 0.71; Luser-Pitem 0.43; Huser-Uitem 9.59; Luser-Uitem 8.84; global 2.0.
- Default Pred: Huser-Pitem 0.29; Luser-Pitem 0.25; Huser-Uitem 21.22; Luser-Uitem 12.31; global 0.5.
- Random Pred: Huser-Pitem 0.00; Luser-Pitem 0.03; Huser-Uitem -5.13; Luser-Uitem -0.53; global -0.6.
- Best algorithm: Huser-Pitem KNN; Luser-Pitem KNN; Huser-Uitem Default Predictor; Luser-Uitem Default Predictor; global KNN.

5.4 Summary for Decide, Compare, Discover
Four models have been analyzed: a KNN model, an MF model, a random model and a default predictor model, on 3 tasks adapted to a rating-predictor-based recommender system: Decide, Compare, Discover, and on 4 user-item segments: heavy-user-popular-item, heavy-user-unpopular-item, light-user-popular-item and light-user-unpopular-item. A summary of the results is given in Table 4. An analysis of the results by segments shows that, globally, KNN is well adapted to the heavy-user segments and that MF and the default predictor are well adapted to the light-user segments. Globally, for the tasks "Help to Decide" and "Help to Compare", MF is the best-suited algorithm in our tests. For the task "Help to Discover", KNN is more appropriate. Note that a switch-based hybrid recommender [14], based on item and user segmentation, could exploit this information to improve the global performances of the system. Finally, 3 main facts have to be considered:

1. Performances strongly vary across the different segments of users and items.
2. MF, KNN and default methods are complementary, as they perform differently across the different segments.
3. RMSE is not strictly linked to the other performance measures, as mentioned for instance in [5].

Table 4. Global results, summary (best algorithm per segment)

- Decide (RMSE): heavy-popular KNN; heavy-unpopular MF; light-popular MF; light-unpopular MF.
- Compare (% compatible preferences): heavy-popular KNN; heavy-unpopular KNN; light-popular MF; light-unpopular MF.
- Discover (Precision): heavy-popular KNN; heavy-unpopular MF; light-popular Default Predictor; light-unpopular MF.
- Discover (Average Measure of Impact): heavy-popular KNN; heavy-unpopular Default Predictor; light-popular KNN; light-unpopular Default Predictor.

When designing a recommender engine, we have to think about the impact of the recommender: recommending popular items to heavy users might not be so useful. On the other hand, it can be illusory to make personalized recommendations of unpopular (and unknown) items to light (and unknown) users. A possible simple strategy, sketched in code after this list, could be:
- rely on robust default predictors, for instance based on robust means of items, to try to push unknown golden nuggets to unknown users,
- use personalized algorithms to recommend popular items to light users,
- finally, use personalized algorithms to recommend unpopular items of the long tail to heavy "connoisseurs".
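As an illustration of that strategy, the following switch routes each (user, item) pair to one of the previously trained predictors according to the user and item segments. The routing follows the three bullets above and Table 4; the exact mapping is our reading of the results, not a component proposed in the paper.

```python
def switch_predict(u, i, heavy_users, popular_items,
                   knn_predict, mf_predict, default_predict):
    """Segment-based switching hybrid, following the strategy sketched above.

    heavy_users / popular_items: the segments computed on the train set (Section 4.1)
    knn_predict, mf_predict, default_predict: rating predictors taking (u, i)
    """
    heavy, popular = u in heavy_users, i in popular_items
    if heavy and not popular:
        return knn_predict(u, i)       # long-tail items for heavy "connoisseurs"
    if not heavy and popular:
        return mf_predict(u, i)        # personalized popular items for light users
    if not heavy and not popular:
        return default_predict(u, i)   # robust default for unknown items and unknown users
    return knn_predict(u, i)           # heavy user, popular item: KNN performs well here (Table 4)
```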
5.5 "Help to Explore" performance
To analyze the performance of the "Help to Explore" functionality, we have to compare the quality of the similarities extracted from the models. We use the protocol defined before: a good similarity matrix for the task "Help to Explore" is a similarity matrix leading to good global performances when used in a KNN model. We choose a similarity matrix with 100 neighbors for each item: this is largely enough for item-to-item tasks, where a page generally displays 10 to 20 similar items. Results are presented in Table 5 for the KNN models with K=100, comparing a KNN computed on MF's item factors, the native KNN, and a random KNN model used as baseline. As the item-item similarity matrix is the kernel of an item-item KNN model, computing similarities in this case is straightforward. To compute similarities between items for MF, we use the MF-based representation of the items (the factor vectors of the items), with a Pearson similarity. The KNN model computed on the MF factors of the items can be viewed as an MF-emulated KNN model. Note that, as the default predictor model based on items' means and users' means cannot itself produce a similarity matrix, it is disqualified for this task. For the RMSE, the MF-emulated KNN model loses about 0.025 points, going from 0.844 to 0.869. Compared with the other models, it still performs correctly.

Table 5. Quality of an item-item similarity matrix according to 4 measures: results on Netflix (K=100)

- RMSE: native KNN 0.8440; KNN computed on MF's item factors (16 factors) 0.8691.
- Ranking (% compatible): native KNN 77.03%; MF-emulated KNN 75.67%.
- Precision: native KNN 91.90%; MF-emulated KNN 86.39%.
- AMI: native KNN 2.043; MF-emulated KNN 2.025.
- Global time of the modeling task: native KNN 5290 seconds; MF-emulated KNN 3758 seconds.

For the global ranking, the difference between the MF-emulated model and the native KNN model is still low, whereas a random KNN model performs very badly. For the precision, for a Top-10 ranking, the MF-emulated KNN model performs significantly worse than a native KNN model. For the Average Measure of Impact, the MF-emulated KNN model and the native KNN model perform almost identically. These results show that MF could be used to implement a similarity function between items to support the "Help to Explore" function, and that MF could be used as a component for faster KNN search.

6. CONCLUSION
We have proposed a new approach to analyze the performance and the added value of automatic recommender systems in an industrial context. First, we have defined 4 core functions for these systems: Help users to Decide, Help users to Compare, Help users to Discover, Help users to Explore. Then we proposed a general off-line protocol crossing our 4 core functions with a simple 4-segment users×items segmentation, to evaluate a recommender system according to industrial and marketing requirements. We compared two major state-of-the-art methods, item-item KNN and MF, with 2 baseline methods used as reference. We showed that the two major methods are complementary, as they perform differently across the different segments. We proposed a new measure, the Average Measure of Impact, to deal with the usefulness and the trust of the recommendations. Using the precision measure and the AMI, we showed that there is no clear evidence of correlation between the RMSE and the quality of the recommendation. We have demonstrated the utility of our protocol, as it may change:
- the classical vision of recommendation evaluation, often focused on the RMSE/MAE measures as they are assumed to be correlated with the system's overall performances,
- and the way to improve recommender systems to achieve their tasks.

7. REFERENCES
[1] Adomavicius, G. and Tuzhilin, A. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Trans. Knowl. Data Eng., 17 (6), 2005, pp. 734-749.
[2] Anderson, C. 2006. The Long Tail. Why the Future of Business is Selling Less of More. Hyperion Verlag.
[3] Bennett, J. and Lanning, S. 2007. The Netflix Prize. KDD Cup and Workshop, 2007. www.netflixprize.com.
[4] Candillier, L., Meyer, F., Fessant, F. 2008. Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.
[5] Cremonesi, P., Koren, Y., and Turrin, R. 2010. Performance of recommender algorithms on Top-N recommendation tasks. RecSys 2010.
[6] Deshpande, M., and Karypis, G. 2004. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1), 143-177.
[7] Herlocker, J. L., Konstan, J. A., Terveen, L. G. and Riedl, J. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5-53.
[8] Knijnenburg, B. P., Willemsen, M. C., Kobsa, A. 2011. A pragmatic procedure to support the user-centric evaluation of recommender systems. RecSys 2011, 321-324.
[9] Linden, G., Smith, B., and York, J. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7 (1), 2003, pp. 76-80.
[10] Meyer, F. 2012. Recommender systems in industrial contexts. ArXiv e-prints. http://arxiv.org/abs/1203.4487.
[11] Meyer, F., Fessant, F. 2011. Reperio: a generic and flexible recommender system. IEEE/WIC/ACM Conference on Web Intelligence, 2011.
[12] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In WWW'01: Proceedings of the 10th International Conference on World Wide Web, pages 285-295.
[13] Schroder, G., Thiele, M. and Lehner, W. 2011. Setting Goals and Choosing Metrics for Recommender System Evaluation. UCERSTI 2 - RecSys 2011.
[14] Su, X., and Khoshgoftaar, T. M. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.
[15] Takács, G., Pilászy, I., Németh, B., Tikk, D. 2009. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656, 2009.
[16] Tan, T. F. and Netessine, S. 2011. Is Tom Cruise Threatened? An Empirical Study of the Impact of Product Variety on Demand Concentration. ICIS 2011.