=Paper=
{{Paper
|id=None
|storemode=property
|title=Toward a New Protocol to Evaluate Recommender Systems
|pdfUrl=https://ceur-ws.org/Vol-910/paper2.pdf
|volume=Vol-910
|dblpUrl=https://dblp.org/rec/conf/recsys/MeyerFCG12
}}
==Toward a New Protocol to Evaluate Recommender Systems==
Frank Meyer, Françoise Fessant, Fabrice Clérot
Orange Labs, av. Pierre Marzin, 22307 Lannion cedex, France
{franck.meyer,francoise.fessant,fabrice.clerot}@orange.com

Eric Gaussier
University of Grenoble - LIG, UFR IM2AG - LIG/AMA, Grenoble Cedex 9, France
eric.gaussier@imag.fr
ABSTRACT
In this paper, we propose an approach to analyze the performance and the added value of automatic recommender systems in an industrial context. We show that recommender systems are multifaceted and can be organized around 4 structuring functions: help users to decide, help users to compare, help users to discover, help users to explore. A global off-line protocol is then proposed to evaluate recommender systems. This protocol is based on the definition of appropriate evaluation measures for each aforementioned function. The evaluation protocol is discussed from the perspective of the usefulness and trust of the recommendation. A new measure called Average Measure of Impact is introduced. This measure evaluates the impact of the personalized recommendation. We experiment with two classical methods, K-Nearest Neighbors (KNN) and Matrix Factorization (MF), on the well-known Netflix dataset. A segmentation of both users and items is proposed to analyze finely where the algorithms perform well or badly. We show that the performance is strongly dependent on the segments and that there is no clear correlation between the RMSE and the quality of the recommendation.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering – collaborative filtering, recommender system; H.3.4 [Systems and Software]: Performance evaluation (efficiency and effectiveness) – performance measures, usefulness of recommendation.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Recommender systems, Industrial context, evaluation, Compare, Explore, Decide, Discover, RMSE, utility of recommendation

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.

1. INTRODUCTION
The aim of recommender systems is to help users find items that should interest them in large catalogs. One frequently adopted measure of the quality of a recommender system is accuracy (for the prediction of the ratings of users on items) [1,14]. Yet in many implementations of recommender system services, the rating prediction function is either not provided, or not highlighted when it is provided (in industrial contexts, the generated recommendations themselves and their utility are more important than the rating predictions). There is increasing consensus in the community that accuracy alone is not enough to assess the practical effectiveness and added value of recommendations [8,13]. Recommender systems in industrial contexts are multifaceted, and we propose to consider them around the definition of 4 key recommendation functions which meet the needs of users facing a huge catalog of items: how to decide, how to compare, how to explore and how to discover. Once the main functions are defined, the next question is how to evaluate a recommender system on its various facets. We will review, for each function, the key points for its evaluation and the available measures, if they exist. In particular, we will introduce a dedicated measure for the function "help to discover". This function raises the question of the evaluation from the point of view of the usefulness of the recommendation. We will also present a global evaluation protocol able to deal with the multifaceted aspect of recommender systems, which requires at least a simple segmentation of users and items. The remainder of the paper is organized as follows: the next section introduces the four core functions of an industrial recommender system. Then the appropriate measures for each core function are presented, as well as the global evaluation protocol. The last part of the paper is dedicated to experimental results and conclusion.

2. MAIN FEATURES OF RECOMMENDER SYSTEMS
Automatic recommender systems are often used on e-commerce websites. These systems work in conjunction with a search engine for assistance in catalog browsing, to help users find relevant content. As many users of e-commerce websites are anonymous, a very important feature is the contextual recommendation of items for anonymous users. Since the purpose of these systems is also to increase usage (the audience of a site) or sales, the recommendation itself is more important than the predicted rating. Moreover, prioritizing a list of items on a display page is a more important functionality than the prediction of a rating. These observations, completed with interviews with marketers and project managers of Orange about their requirements regarding recommender systems and an overview of recommender systems in both the academic and the industrial fields [10], have led us to organize the recommender systems' functionalities into 4 main features:

Help to Decide. Given an item, a user wants to know whether he will appreciate the item. This feature consists of the prediction of a rating for a user and an item, and is today mainstream in the academic literature [14].
Help to Compare. Given several items, a user wants to know which item to choose. This feature corresponds to a ranking function. It can be used to provide recommendation lists [5] or to provide personalized sorting of the results of requests on a catalog.

Help to Discover. Given a huge catalog of items, a user wants to find a short list of new interesting items. This feature is usually called item-based top-N recommendation in the academic literature [6]. It corresponds to personalized recommendation. Note that the prediction of the highest rated item is not necessarily the most useful recommendation [5]. For instance, the item with the highest predicted rating will most likely be already known by the user.

Help to Explore (or Navigate). Given one item, an (anonymous) user wants to know what the related items are. This feature corresponds to the classical item-to-item recommendation to anonymous users popularized by the e-commerce website Amazon [9] during catalog browsing. This function is widely used in the industry because it can make recommendations for anonymous users, based on the items they consult. It requires a similarity function between items.

3. EVALUATION OF INDUSTRIAL RECOMMENDER SYSTEMS
In this section we discuss the appropriate measures for each core function and a global protocol for the evaluation of the recommender system. The evaluation is viewed from the standpoint of the utility of the recommendation for each user and each item.

3.1 Utility of the recommendation
A good recommender system should avoid bad and trivial recommendations. The fact that a user likes an item and the fact that an item is already known by the user have to be distinguished [7]. A good recommendation corresponds to an item that would probably be well rated by the user, but also an item that the user does not know. For instance, it is worthless to recommend the blockbuster of the year to all users: it may be a well-rated movie on average, but it is not a useful recommendation as most people have probably already seen it.

3.2 Item segmentation and user segmentation
Another important issue for an industrial application is to fully exploit the available catalog, including its long tail, consisting of items rarely purchased [2]. A system's ability to make a recommendation, in a relevant way, for all items in the catalog is therefore important. However, Tan and Netessine [16] have observed, on the Netflix dataset for instance, that the long tail effect is not so obvious. There is more of a Pareto distribution (20% of the most rated items represent 80% of the global ratings) in the Netflix data than a long tail distribution as proposed by Anderson [2] (where infrequent items globally represent more ratings). They also noticed that the behavior of the users and the type of items they purchase are linked. In particular, customers who watch items in the long tail are in fact heavy users, while light users tend to focus only on popular items. These observations lead us to introduce the notion of segments of items and users. The definition of the segment thresholds must be relative and catalog dependent. We will use the terms light/heavy user segment and unpopular/popular item segment instead of the long tail and short head concepts. In a first step we will use this simple segmentation to analyze how an industrial recommender system can help all users, both heavy and light, and how it can recommend all items, both popular and unpopular.

3.3 Measures of performance
For our protocol we use a classic train/test split of the data. The train set will be used to compute statistics and thresholds and to build a predictive model. The test set will be used to compute the performance measures. The predictive model should at least be able to provide a rating prediction function for any couple of user and item. We will see that, to provide the "Help to Explore" functionality, the predictive model must also be able, in some way, to produce an item-item similarity matrix allowing it to select, for each item i, its most similar items (the related items). We first detail the performance measures we use for our protocol, according to the 4 core functions.

Help to Decide. The main use case is a user watching an item description on a screen and wondering if he would enjoy it. Giving a good personalized rating prediction will help the user to choose. The "help to decide" function can be given by the rating prediction function and must be measured by an accuracy measure which penalizes extreme errors. The Root Mean Squared Error (RMSE) is the natural candidate [14].

Help to Compare. The main use case here is a user getting an intermediate short list of items after having given her preferences. This user then wants to compare the items of this short list, in order to choose the one she will enjoy most. The function needs a ranking mechanism with a homogeneous quality of ranking over the catalog. A simple measure is the percentage of compatible rank indexes. After modeling, for each user u and for each couple of items (i, j) in the test set rated by u with r_{u,i} \neq r_{u,j}, the preference given by u is compared with the predicted preference given by the recommender method, using the predicted ratings \hat{r}_{u,i} and \hat{r}_{u,j}. The percentage of compatible preferences is given by:

COMP_u = \frac{1}{|T_u|} \sum_{(i,j) \in T_u} comp_u(i,j)    (3-1)

with T_u = \{(i,j) \text{ rated by } u \text{ in the test set} : r_{u,i} \neq r_{u,j}\}, where comp_u(i,j) is 1 if \hat{r}_{u,i} - \hat{r}_{u,j} has the same sign as r_{u,i} - r_{u,j} and 0 otherwise, and |T_u| is the number of elements of T_u.

Help to Discover. The main use case here is a user getting recommended items: these recommendations must be relevant and useful. For relevancy our approach is the following: an item i recommended to the user u
- is considered relevant if u has rated i in the test set with a rating greater than or equal to u's mean of ratings,
- is considered irrelevant if u has rated i in the test set with a rating lower than u's mean of ratings,
- is not evaluated if it is not present for u (not rated by u) in the test set.

The classical measure to evaluate a recommendation list is the precision measure (recall being difficult to apply in the context of recommendation, as in huge catalogs one does not know all the items relevant for each user). For each user u:

precision_u = \frac{|\{(u,i) \in H_u : i \text{ is relevant for } u\}|}{|H_u|}    (3-2)

H_u stands for the subset of evaluable recommendations in the test set for u, that is to say the set of couples (u,i), i being an item recommended to the user u. |H_u| is the size of H_u, in number of couples (u,i).
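To make the ranking measure concrete, here is a minimal sketch of the per-user COMP of equation (3-1). It is an illustration under our own assumptions (ratings held in plain Python dicts, our function names), not the authors' benchmark code:

```python
from itertools import combinations

def comp_u(test_ratings, predicted_ratings):
    """Percentage of compatible rank indexes for one user, cf. eq. (3-1).

    test_ratings:      dict item_id -> rating given by the user in the test set
    predicted_ratings: dict item_id -> rating predicted by the model for this user
                       (assumed to cover every item of test_ratings)
    Returns None when no pair with distinct true ratings exists.
    """
    compatible, total = 0, 0
    for i, j in combinations(test_ratings, 2):
        if test_ratings[i] == test_ratings[j]:
            continue  # pairs with equal true ratings are not counted
        total += 1
        true_diff = test_ratings[i] - test_ratings[j]
        pred_diff = predicted_ratings[i] - predicted_ratings[j]
        if true_diff * pred_diff > 0:  # same sign: the predicted preference agrees
            compatible += 1
    return compatible / total if total else None
```

Averaging this value over all users gives the global COMP figures reported in Section 5.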
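The relevance rule and the precision of equation (3-2) can be sketched in the same way; again the data layout and the names are our assumptions, not the paper's:

```python
def precision_u(recommended_items, test_ratings, user_mean):
    """Precision over the evaluable recommendations H_u, cf. eq. (3-2).

    recommended_items: the Top-N items recommended to the user
    test_ratings:      dict item_id -> rating of the user in the test set
    user_mean:         the user's rating average (step 6.3 of the protocol)
    """
    evaluable = [i for i in recommended_items if i in test_ratings]  # H_u
    if not evaluable:
        return None  # no evaluable recommendation for this user
    relevant = sum(1 for i in evaluable if test_ratings[i] >= user_mean)
    return relevant / len(evaluable)
```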
However, the precision is not able to measure the usefulness of the recommendations: recommending well-known blockbusters, already known by the user, will lead to a very high precision although it is of very low utility. To account for this, we introduce here the concept of recommendation impact. The basic idea is that the more frequent a recommended item is, the less impact the recommendation has. This is summarized in Table 1.

Table 1. The notion of recommendation impact

Recommending a popular item:
- Impact if the user likes the item: low. The item is likely to be already known, at least by name, by the user.
- Impact if the user dislikes the item: low. Even if the user dislikes this item, he can understand that, as a popular item, this recommendation is likely to appear... at least at the beginning.

Recommending an unpopular (infrequent) item:
- Impact if the user likes the item: high. The service provided by the recommender system is efficient. The rarer the item, the less likely the user would have found it alone.
- Impact if the user dislikes the item: high. Not only was the item unknown and did not inspire confidence, but it also was not good.

We then define the Average Measure of Impact (AMI) for the performance evaluation of the function "Help user to Discover". The AMI of a recommendation list Z for a user u with rating average \bar{r}_u is given by equation (3-3), where H_u denotes the subset of the evaluable recommendations in the test set, Z denotes the set of couples (user, item) representing a set of recommendations, count(i) the number of logs in the train set related to the item i, and |I| the size of the catalog of items. The rarer an item i (rarity being estimated on the train set), the greater the AMI if i is both recommended and relevant for a user u. The greater the AMI, the better the positive impact of the recommendations on u. The AMI will have to be calibrated, as we do not yet know what a "good AMI" is. But we can already compare different algorithms, or different recommendation strategies (such as post-filtering methods to add serendipity), with this measure.

Help to Explore. The main case here is the item-to-item recommendation for an anonymous user who is watching an item description on a screen: the recommender system should propose items similar to the one being watched. We can try to evaluate the performance of this functionality by associating, with each context item i, the KNN of i, using an overall precision measure for the recommended items. But we will have an issue: it can be more effective to associate each context item i with N items optimized only for precision, rather than N items similar to the context item i. It may be more efficient, to optimize precision, to associate blockbusters with each source item. In fact we want to assess the quality of the Help to Explore (navigate) function: we want a good semantic, meaningful similarity for each associated item. But only an experiment with real users can assess this semantic similarity. Our solution is to use the underlying item-item similarity matrix for this evaluation. We can assess the overall quality of the pairs of similar items by an indirect method: 1. given a predictive model, find a way to compute similarities between any pair of items, building an item-item similarity matrix; 2. use an item-item K-Nearest Neighbors (KNN) model [12] based on this matrix. The assumption is that a good similarity matrix must lead to good performances for the other aspects of the recommendation when used in an item-item KNN model. This is the approach we take, using RMSE, precision, and ranking performance measures. For a KNN-type algorithm, this analysis is straightforward and simple: the similarity matrix is already the kernel of the model. The algorithms that are not directly based on a similarity measure need a method for extracting the similarities between the items. For matrix-factorization-based algorithms, this can correspond to a method computing similarities between the factors of the items.

3.4 Evaluation Protocol
The evaluation protocol is then designed thanks to the mapping between the 4 core functions and the associated performance measures, as summarized in Table 2.

Table 2. Adapted measures for each core function

- Decide. Quality criterion: accuracy of the rating prediction, with penalization of extreme errors to minimize the risk of a wrong decision. Measure: RMSE.
- Compare. Quality criterion: good predicted ranking for every couple of items of the catalog. Measure: COMP, the % of compatible rank indexes.
- Discover. Quality criteria: selection, for a user, of the most preferred items in a list of items (Precision, not recommended alone!); identification of good/bad, i.e. precise, useful and trusted, recommendations (Average Measure of Impact, AMI).
- Explore. Quality criteria: precise recommendations; identification of good/bad recommendations. Measure: a similarity matrix leading to good performances in accuracy, relevancy, usefulness and trust.
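The AMI of equation (3-3) combines the relevance rule of Section 3.3 with a rarity weight derived from count(i) and |I|. The exact closed form is not reproduced here, so the sketch below is only one plausible instantiation: we assume a signed log-rarity weight, positive when the recommended item is relevant for the user and negative otherwise, which matches the qualitative behaviour described above but is our choice, not the paper's formula.

```python
import math

def ami_u(recommendations, test_ratings, user_mean, item_count, catalog_size):
    """Sketch of an Average-Measure-of-Impact-like score for one user (cf. eq. 3-3).

    recommendations: items recommended to the user (the list Z restricted to u)
    test_ratings:    dict item_id -> rating of the user in the test set
    user_mean:       the user's rating average
    item_count:      dict item_id -> count(i), number of train-set logs for item i
    catalog_size:    |I|, the number of items in the catalog
    The log2 rarity weight below is an assumption, chosen only to reproduce the
    behaviour described in the text: rarer items yield larger impacts, relevant
    items count positively, irrelevant ones negatively.
    """
    evaluable = [i for i in recommendations if i in test_ratings]  # H_u ∩ Z
    if not evaluable:
        return None
    total = 0.0
    for i in evaluable:
        rarity = math.log2(catalog_size / max(item_count.get(i, 0), 1))
        sign = 1.0 if test_ratings[i] >= user_mean else -1.0
        total += sign * rarity
    return total / len(evaluable)
```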
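The indirect "Help to Explore" evaluation of Section 3.3 needs an item-item similarity matrix even for models, such as MF, that do not contain one. Section 5.5 indicates that item factor vectors are compared with a Pearson similarity; the sketch below illustrates that idea with NumPy (the array names and the neighborhood size are ours, and for a large catalog the dense similarity matrix would have to be computed in blocks):

```python
import numpy as np

def item_knn_from_factors(item_factors, k=100):
    """Build a K-nearest-neighbour list per item from MF item factor vectors.

    item_factors: array of shape (n_items, n_factors), one row per item.
    Returns, for each item index, the indices of its k most similar items,
    using the Pearson correlation between factor vectors as the similarity.
    """
    centered = item_factors - item_factors.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    norms[norms == 0] = 1e-12              # guard against constant factor vectors
    normalized = centered / norms[:, None]
    sim = normalized @ normalized.T        # item x item Pearson correlation matrix
    np.fill_diagonal(sim, -np.inf)         # an item is not its own neighbour
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    return neighbours, sim
```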
The following notations are adopted: a log (u, i, r) corresponds to a user u who rated an item i with the rating r. U is the set of all the users, I is the set of all the items. Given a dataset D of logs and an algorithm A, the evaluation protocol we propose is as follows:

Initialization
Randomly split the dataset into 2 datasets, train and test.
Use the train dataset to generate a model with the algorithm A.

Evaluation
1. For each log (u, i, r) of the test set:
   1.1 compute the predicted rating of the model
   1.2 compute the predicted rating error
2. Use the RMSE, which gives an indicator of the performance of the Help to Decide function.
3. For each user u of U:
   3.1 sort all u's logs of the test set by rating
   3.2 sort all u's logs of the test set by rating prediction
   3.3 compute COMP by comparing the indexes of u's logs and the indexes of the predicted ratings of the logs.
4. Use the averaged COMP as an indicator of the Help to Compare function.
5. For each item i of I, compute count(i), which is the number of logs in the train set referencing i.
6. For each user u of U:
   6.1 compute the predicted rating of each item i of I.
   6.2 select the top-N highest predicted rating items, noted i_{u,1} to i_{u,N}, which are the Top-N recommended items.
   6.3 compute the rating average of u, noted \bar{r}_u.
   6.4 for each recommended item i_{u,j} of u:
       6.4.1 check if a corresponding log (u, i_{u,j}, r) exists. If so, the recommendation of i_{u,j} is evaluable, else skip step 6.4.2.
       6.4.2 if r ≥ \bar{r}_u then the recommendation is considered relevant (and irrelevant otherwise).
   6.5 compute the Precision and the AMI for the evaluable recommendations.
7. Use the Precision and the AMI, averaged over users, as the indicators for the Help to Discover function.
8. Specify a way to compute efficiently, using the model of the algorithm A, the similarity between every couple of items (i, j).
9. Compute the similarity matrix of all the couples (i, j) of I×I.
10. Use this similarity matrix as the kernel of an item-item K-Nearest Neighbor model, then run the protocol steps 1 to 7 for RMSE, COMP, AMI and Precision to obtain a 4-dimensional indicator of the quality of the Help to Explore function.

4. EXPERIMENTS
4.1 Datasets and configuration
Experiments are conducted on the widely used Netflix dataset [3]. This dataset has the advantage of being public and allows performance comparisons with many other techniques. Agnostic thresholds are used for the segments of users and items, depending on the dataset. We used simple thresholds based on the mean number of ratings to split items into popular and unpopular (infrequent) items, and similarly to split users into heavy and light users. For instance, on Netflix, using a train set of 90% of the total of logs, the mean number of ratings for the users is 190 (heavy users are users who gave more than 190 ratings, otherwise they are light users) and the mean number of ratings for the items is 5089 (popular items are items with more than 5089 ratings, otherwise they are unpopular items). The number of generated items for the Top-N recommendation is always N=10. All our tests are carried out on this configuration: a personal computer with 12 GB of RAM, a 64-bit 4-core Intel Xeon W3530 processor running at 2.8 GHz and a 350 GB hard disk. All algorithms and the benchmark process are written in Java.

4.2 Algorithms
We chose to use 2 models: fast matrix factorization using the MF algorithm presented in [15] and an item-item KNN algorithm [12]. These algorithms are mainstream techniques for recommender systems. For MF we analyze the effect of the number of factors; for the KNN algorithm we analyze the effect of K, the number of nearest neighbors kept in the model. In addition, to compare the performances of these 2 algorithms, 2 baseline algorithms are also used:
- a simple default predictor using the mean of the item and the mean of the user (the sum of the two means, if available, divided by 2). This algorithm is also used by the KNN algorithm when no KNN items are available for a given item to score.
- a random predictor, generating uniform ratings in [1..5] for each rating prediction.

One industrial requirement of our system was that it could take into account new items and new users every 2 hours. Considering other process and I/O constraints, the modeling time was then restricted to 1.5 hours for all the algorithms. This has implications for the MF algorithm, as on Netflix it always reaches an optimum between 16 and 32 factors: this is constant across all our tests, for all the performance measures. Beyond 32 factors, MF does not have enough time to converge. Note that this convergence may be slow, longer than 24 hours for more than 100 factors on the Netflix dataset.

Implementation details
Our implementation of MF is similar to that of the BRISMF implementation [15], with a learning rate of 0.030 and a regularization factor of 0.008, with early stopping. The learning process is stopped after 1.5 hours, or when the RMSE increases three consecutive times (the increase or decrease of the RMSE is controlled on a validation set consisting of 1.5% of the train set). We used an implementation of the item-item KNN model as described in [11]. The similarity function is the Weighted Pearson similarity [4]. All details about the implementations can be found in [10].
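As an illustration of the training regime just described (BRISMF-like SGD, learning rate 0.030, regularization 0.008, a 1.5 hour budget, early stopping when the validation RMSE increases three consecutive times), here is a schematic training loop. It is a sketch under those stated hyper-parameters, not the authors' Java implementation, and it omits the bias handling of BRISMF; the time budget is only checked between passes.

```python
import time
import numpy as np

def train_mf(train_logs, val_logs, n_users, n_items, n_factors=16,
             lr=0.030, reg=0.008, time_budget_s=1.5 * 3600):
    """Regularized SGD matrix factorization with time-budgeted early stopping."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))   # item factors

    def val_rmse():
        se = sum((r - P[u] @ Q[i]) ** 2 for u, i, r in val_logs)
        return (se / len(val_logs)) ** 0.5

    start, prev_rmse, worse_in_a_row = time.time(), float("inf"), 0
    while time.time() - start < time_budget_s:
        for u, i, r in train_logs:                  # one pass over the train set
            pu = P[u].copy()
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
        rmse = val_rmse()
        worse_in_a_row = worse_in_a_row + 1 if rmse > prev_rmse else 0
        prev_rmse = rmse
        if worse_in_a_row >= 3:                     # 3 consecutive validation increases
            break
    return P, Q
```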
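The segment thresholds of Section 4.1 and the two baselines of Section 4.2 are simple train-set statistics. A minimal sketch, assuming the logs are available as (user, item, rating) triples and using our own function names:

```python
import random
from collections import defaultdict

def train_statistics(train_logs):
    """Per-user and per-item counts and means from (user, item, rating) logs."""
    user_ratings, item_ratings = defaultdict(list), defaultdict(list)
    for u, i, r in train_logs:
        user_ratings[u].append(r)
        item_ratings[i].append(r)
    user_mean = {u: sum(rs) / len(rs) for u, rs in user_ratings.items()}
    item_mean = {i: sum(rs) / len(rs) for i, rs in item_ratings.items()}
    user_count = {u: len(rs) for u, rs in user_ratings.items()}
    item_count = {i: len(rs) for i, rs in item_ratings.items()}
    return user_mean, item_mean, user_count, item_count

def segment(user_count, item_count):
    """Heavy/light users and popular/unpopular items, split at the mean count (Section 4.1)."""
    user_threshold = sum(user_count.values()) / len(user_count)   # 190 on the Netflix train set
    item_threshold = sum(item_count.values()) / len(item_count)   # 5089 on the Netflix train set
    heavy_users = {u for u, c in user_count.items() if c > user_threshold}
    popular_items = {i for i, c in item_count.items() if c > item_threshold}
    return heavy_users, popular_items

def default_predict(u, i, user_mean, item_mean):
    """Default predictor of Section 4.2: average of the available user and item means."""
    means = [m[k] for m, k in ((user_mean, u), (item_mean, i)) if k in m]
    return sum(means) / len(means) if means else 3.0  # the 3.0 fallback is our assumption

def random_predict():
    """Random baseline: uniform rating in [1..5]."""
    return random.uniform(1, 5)
```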
5. NUMERICAL RESULTS
The following abbreviations are used for the segmentation of the performances: Huser: heavy users, Luser: light users, Pitem: popular items and Uitem: unpopular items (the meaning of unpopular is rather "rare", "infrequent"). For MF we analyze the number of factors used, and for KNN the number of nearest neighbors kept. The full results of our experiments are available in [10].

5.1 "Help to Decide" performances
The global default predictor has an RMSE of 0.964 and the global random predictor has an RMSE of 1.707.

KNN's RMSE performances: Different sizes of neighborhood (K) have been tested, compliant with our tasks in an industrial context. Increasing K generally increases the performances. However, the associated similarity matrix weights must be kept in RAM for efficiency purposes, which is difficult, if not impossible, with high values of K. For very large catalog applications, the size of the KNN matrix must be reasonable (up to 200 neighbors in our tests). The KNN method performs well except when K is small and except for the light-user-unpopular-item segment (Luser Uitem). There is a significant gap between the RMSE for the Luser-Uitem segment (RMSE=1.05) and the RMSE of the heavy-user-popular-item segment (RMSE=0.8). Clearly, the KNN model is not adapted to the former, whereas it performs well on the latter. The optimal number of neighbors is around K=100.

MF's RMSE performances: Different numbers of factors have been tested. MF has difficulties modeling the Luser-Uitem segment: on this segment the RMSE never decreases below 0.96. On the contrary, the RMSE for heavy-user-popular-item is close to 0.81, and the two symmetrical segments, light-user-popular-item and heavy-user-unpopular-item, both have a good (low) RMSE (0.84 and 0.85). The RMSE decreases when the number of factors increases, up to around 20 factors. After that number, the RMSE increases. This is a consequence of our time-constrained early stopping condition. This corresponds to about 140 passes on the train set. The optimal number of factors seems to be between 16 and 32.

5.2 "Help to Compare" performance
The global default predictor has a percentage of compatible rank indexes (COMP) of 69% and the global random predictor has a performance of 49.99%.

MF's and KNN's ranking performances: The results are given for the time-limited run of MF. MF outperforms the KNN model for the light-user segments (with a COMP of 73.5% for MF and 66% for KNN). For the rest, the performances are similar to those of KNN. The maximum ranking compatibility is around 77% for the heavy-user segments.

5.3 "Help to Discover" performance
5.3.1 Analysis using the Precision
The global default predictor has a precision of 92.86%, which is questionable: one can see that a simple Top-10 based on a high rating average is sufficient to obtain a good precision performance. The global random predictor has a precision of 53.04%.

KNN's precision performances: The precision increases as K increases. But the results are not significantly better than those of the default predictor. The precision is better than the default predictor only for the Huser-Pitem segment, and only for at least K=200. Under K=100, it seems better to use a default predictor than a KNN predictor for ranking tasks. Nevertheless, the Huser-Pitem segment is well modeled: the precision for 10 generated items for the KNN model is greater than 97% for the model with 200 neighbors.

MF's precision performances: MF behaves better than the KNN model, especially for the light-user-unpopular-item segment (precision of 96% for F=32 factors, against 83% for the KNN with K>=100).

5.3.2 Analysis using the AMI
The Average Measure of Impact gives slightly negative performances for the random predictor and a small performance for the default predictor: the default predictor "wins" its impact values on unpopular items. Note that the supports of the different evaluated segments are very different, and the weights of the two popular-item segments are significantly higher. The KNN model behaves significantly better than the default predictor for the AMI. For MF, the behavior is much worse than that of a KNN model. In general, the impact of MF is similar to, or lower than, that of the default predictor. An analysis according to the segmentation gives a more detailed view of where the impacts are. Numerical results are summarized in Table 3.

Table 3. AMI according to the segmentation

- MF F=32: Huser-Pitem 0.38; Luser-Pitem 0.26; Huser-Uitem 8.93; Luser-Uitem 10.61; global 0.5.
- KNN K=100: Huser-Pitem 0.71; Luser-Pitem 0.43; Huser-Uitem 9.59; Luser-Uitem 8.84; global 2.0.
- Default Pred: Huser-Pitem 0.29; Luser-Pitem 0.25; Huser-Uitem 21.22; Luser-Uitem 12.31; global 0.5.
- Random Pred: Huser-Pitem 0.00; Luser-Pitem 0.03; Huser-Uitem -5.13; Luser-Uitem -0.53; global -0.6.
- Best algorithm: Huser-Pitem KNN; Luser-Pitem KNN; Huser-Uitem Default Predictor; Luser-Uitem Default Predictor; global KNN.

5.4 Summary for Decide, Compare, Discover
Four models have been analyzed: a KNN model, an MF model, a random model and a default predictor model, on 3 tasks adapted to a rating-predictor-based recommender system: Decide, Compare, Discover, and on 4 user-item segments: heavy-user-popular-item, heavy-user-unpopular-item, light-user-popular-item and light-user-unpopular-item. A summary of the results is given in Table 4. An analysis of the results by segments shows that, globally, KNN is well adapted to the heavy-user segments and that MF and the default predictor are well adapted to the light-user segments. Globally, for the tasks "Help to Decide" and "Help to Compare", MF is the best-suited algorithm in our tests. For the task "Help to Discover", KNN is more appropriate. Note that a switch-based hybrid recommender [14], based on item and user segmentation, could exploit this information to improve the global performances of the system. Finally, 3 main facts have to be considered:

1. Performances strongly vary across the different segments of users and items.
2. MF, KNN and default methods are complementary, as they perform differently across the different segments.
3. RMSE is not strictly linked to the other performance measures, as mentioned for instance in [5].

Table 4. Global results, summary (best algorithm per segment)

- Decide (RMSE): heavy-popular KNN; heavy-unpopular MF; light-popular MF; light-unpopular MF.
- Compare (% compatible preferences): heavy-popular KNN; heavy-unpopular KNN; light-popular MF; light-unpopular MF.
- Discover (Precision): heavy-popular KNN; heavy-unpopular MF; light-popular Default Predictor; light-unpopular MF.
- Discover (Average Measure of Impact): heavy-popular KNN; heavy-unpopular Default Predictor; light-popular KNN; light-unpopular Default Predictor.

When designing a recommender engine, we have to think about the impact of the recommender: recommending popular items to heavy users might not be so useful. On the other hand, it can be illusory to make personalized recommendations of unpopular (and unknown) items to light (and unknown) users. A possible simple strategy, sketched in code after this list, could be:
- rely on robust default predictors, for instance based on robust means of items, to try to push unknown golden nuggets to unknown users,
- use personalized algorithms to recommend popular items to light users,
- finally, use personalized algorithms to recommend unpopular items of the long tail to heavy "connoisseurs".
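As an illustration of that strategy, the following switch routes each (user, item) pair to one of the previously trained predictors according to the user and item segments. The routing follows the three bullets above and Table 4; the exact mapping is our reading of the results, not a component proposed in the paper.

```python
def switch_predict(u, i, heavy_users, popular_items,
                   knn_predict, mf_predict, default_predict):
    """Segment-based switching hybrid, following the strategy sketched above.

    heavy_users / popular_items: the segments computed on the train set (Section 4.1)
    knn_predict, mf_predict, default_predict: rating predictors taking (u, i)
    """
    heavy, popular = u in heavy_users, i in popular_items
    if heavy and not popular:
        return knn_predict(u, i)       # long-tail items for heavy "connoisseurs"
    if not heavy and popular:
        return mf_predict(u, i)        # personalized popular items for light users
    if not heavy and not popular:
        return default_predict(u, i)   # robust default for unknown items and unknown users
    return knn_predict(u, i)           # heavy user, popular item: KNN performs well here (Table 4)
```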
5.5 "Help to Explore" performance
To analyze the performance of the "Help to Explore" functionality, we have to compare the quality of the similarities extracted from the models. We use the protocol defined before: a good similarity matrix for the task "Help to Explore" is a similarity matrix leading to good global performances when used in a KNN model. We choose a similarity matrix with 100 neighbors for each item: this is largely enough for item-to-item tasks, where a page generally displays 10 to 20 similar items. Results are presented in Table 5 for the KNN models with K=100, comparing a KNN computed on MF's item factors, the native KNN, and a random KNN model used as baseline. As the item-item similarity matrix is the kernel of an item-item KNN model, computing similarities in this case is straightforward. To compute similarities between items for MF, we use the MF-based representation of the items (the factor vectors of the items), with a Pearson similarity. The KNN model computed on the MF factors of the items can be viewed as an MF-emulated KNN model. Note that, as the default predictor model based on items' means and users' means cannot itself produce a similarity matrix, it is disqualified for this task. For the RMSE, the MF-emulated KNN model loses about 0.025 points, going from 0.844 to 0.869. Compared with the other models, it still performs correctly.

Table 5. Quality of an item-item similarity matrix according to 4 measures: results on Netflix (K=100)

- RMSE: native KNN 0.8440; KNN computed on MF's item factors (16 factors) 0.8691.
- Ranking (% compatible): native KNN 77.03%; MF-emulated KNN 75.67%.
- Precision: native KNN 91.90%; MF-emulated KNN 86.39%.
- AMI: native KNN 2.043; MF-emulated KNN 2.025.
- Global time of the modeling task: native KNN 5290 seconds; MF-emulated KNN 3758 seconds.

For the global ranking, the difference between the MF-emulated model and the native KNN model is still low, whereas a random KNN model performs very badly. For the precision, for a Top-10 ranking, the MF-emulated KNN model performs significantly worse than a native KNN model. For the Average Measure of Impact, the MF-emulated KNN model and the native KNN model perform almost identically. These results show that MF could be used to implement a similarity function between items to support the "Help to Explore" function, and that MF could be used as a component for faster KNN search.

6. CONCLUSION
We have proposed a new approach to analyze the performance and the added value of automatic recommender systems in an industrial context. First, we have defined 4 core functions for these systems: Help users to Decide, Help users to Compare, Help users to Discover, Help users to Explore. Then we proposed a general off-line protocol crossing our 4 core functions with a simple 4-segment users×items segmentation, to evaluate a recommender system according to industrial and marketing requirements. We compared two major state-of-the-art methods, item-item KNN and MF, with 2 baseline methods used as reference. We showed that the two major methods are complementary, as they perform differently across the different segments. We proposed a new measure, the Average Measure of Impact, to deal with the usefulness and the trust of the recommendations. Using the precision measure and the AMI, we showed that there is no clear evidence of correlation between the RMSE and the quality of the recommendation. We have demonstrated the utility of our protocol, as it may change:
- the classical vision of recommendation evaluation, often focused on the RMSE/MAE measures as they are assumed to be correlated with the system's overall performances,
- and the way to improve recommender systems to achieve their tasks.

7. REFERENCES
[1] Adomavicius, G. and Tuzhilin, A. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Trans. Knowl. Data Eng., 17 (6), 2005, pp. 734-749.
[2] Anderson, C. 2006. The Long Tail. Why the Future of Business is Selling Less of More. Hyperion Verlag.
[3] Bennett, J. and Lanning, S. 2007. The Netflix Prize. KDD Cup and Workshop, 2007. www.netflixprize.com.
[4] Candillier, L., Meyer, F., Fessant, F. 2008. Designing Specific Weighted Similarity Measures to Improve Collaborative Filtering Systems. ICDM 2008: 242-255.
[5] Cremonesi, P., Koren, Y., and Turrin, R. 2010. Performance of recommender algorithms on Top-N recommendation tasks. RecSys 2010.
[6] Deshpande, M., and Karypis, G. 2004. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1), 143-177.
[7] Herlocker, J. L., Konstan, J. A., Terveen, L. G. and Riedl, J. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5-53.
[8] Knijnenburg, B. P., Willemsen, M. C., Kobsa, A. 2011. A pragmatic procedure to support the user-centric evaluation of recommender systems. RecSys 2011, 321-324.
[9] Linden, G., Smith, B., and York, J. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing, 7 (1), 2003, pp. 76-80.
[10] Meyer, F. 2012. Recommender systems in industrial contexts. ArXiv e-prints. http://arxiv.org/abs/1203.4487.
[11] Meyer, F., Fessant, F. 2011. Reperio: a generic and flexible recommender system. IEEE/WIC/ACM Conference on Web Intelligence, 2011.
[12] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2001. Item-based collaborative filtering recommendation algorithms. In WWW'01: Proceedings of the 10th International Conference on World Wide Web, pages 285-295.
[13] Schroder, G., Thiele, M. and Lehner, W. 2011. Setting Goals and Choosing Metrics for Recommender System Evaluation. UCERSTI 2 - RecSys 2011.
[14] Su, X., and Khoshgoftaar, T. M. 2009. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.
[15] Takács, G., Pilászy, I., Németh, B., Tikk, D. 2009. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research 10: 623-656, 2009.
[16] Tan, T. F. and Netessine, S. 2011. Is Tom Cruise Threatened? An Empirical Study of the Impact of Product Variety on Demand Concentration. ICIS 2011.