=Paper=
{{Paper
|id=None
|storemode=property
|title=Setting Goals and Choosing Metrics for Recommender System Evaluations
|pdfUrl=https://ceur-ws.org/Vol-811/paper12.pdf
|volume=Vol-811
}}
==Setting Goals and Choosing Metrics for Recommender System Evaluations==
<pdf width="1500px">https://ceur-ws.org/Vol-811/paper12.pdf</pdf>
<pre>
     Setting Goals and Choosing Metrics for Recommender
                      System Evaluations

                       Gunnar Schröder                                      Maik Thiele, Wolfgang Lehner
             T-Systems Multimedia Solutions GmbH                              Dresden University of Technology
                       Riesaer Straße 5                                    Faculty of Computer Science, Database
                   01129 Dresden, Germany                                             Technology Group
            gunnar.schroeder@t-systems.com                                        01062 Dresden, Germany
                                                                             {maik.thiele,wolfgang.lehner}
                                                                                   @tu-dresden.de

ABSTRACT                                                              the available results are sometimes contradictory for di↵er-
Recommender systems have become an important person-                  ent data sets or metrics. How efficient and successful a spe-
alization technique on the web and are widely used espe-              cific recommender system is also depends on the specific
cially in e-commerce applications. However, operators of              purpose of a recommender system and the characteristics of
web shops and other platforms are challenged by the large             the domain it is applied to. It is very unlikely that there is a
variety of available algorithms and the multitude of their            single best solution for any domain and context. If we want
possible parameterizations. Since the quality of the recom-           to increase the e↵ectiveness of recommendations we have to
mendations that are given can have a significant business             determine the best fit for a given scenario through thorough
impact, the selection of a recommender system should be               evaluation of available algorithms and parameterizations.
made based on well-founded evaluation data. The literature            This is particularly important for the usage of recommender
on recommender system evaluation o↵ers a large variety of             systems in e-commerce applications where the choice of al-
evaluation metrics but provides little guidance on how to             gorithms can have a significant business impact. In a web
choose among them. This paper focuses on the often ne-                shop a better recommender system can have a direct e↵ect
glected aspect of clearly defining the goal of an evaluation          on the company’s revenue since the recommendations can
and how this goal relates to the selection of an appropriate          significantly influence the users’ buying decisions [15].
metric. We discuss several well-known accuracy metrics and            The research that is presented in this paper is derived from
analyze how these reflect di↵erent evaluation goals. Further-         an evaluation of various recommendation algorithms that we
more we present some less well-known metrics as well as a             conducted for a large German e-commerce portal. In order
variation of the area under the curve measure that are par-           to achieve meaningful evaluation results we developed an
ticularly suitable for the evaluation of recommender systems          evaluation methodology and went through a wide range of
in e-commerce applications.                                           literature on the evaluation of recommender systems and in-
                                                                      formation retrieval systems and developed a framework for
                                                                      the evaluation of recommender systems.
Categories and Subject Descriptors                                    In this paper we will focus on the specific evaluation de-
H.4 [Information Systems Applications]: Miscellaneous;                mands of recommender systems in e-commerce applications.
D.2.8 [Software Engineering]: Metrics—complexity mea-                 We discuss the importance of defining a sufficiently detailed
sures, performance measures                                           goal and analyze which aspects of the recommender’s user
                                                                      interface and the used preference data influence a reasonable
                                                                      choice of metrics. We give an overview of applicable accu-
General Terms                                                         racy metrics, explain how to choose among the large variety
Theory                                                                and highlight some metrics that are particularly well-suited
                                                                      to the evaluation of recommender systems. In order to dis-
                                                                      cuss accuracy metrics in detail a discussion of non-accuracy
Keywords                                                              measures has to be omitted due to space constraints.
recommender systems, e-commerce, evaluation, metrics, mea-
sure, area under the curve, auc, informedness, markedness,
matthews correlation, precision, recall, roc
                                                                      2.    RELATED WORK
                                                                         Over time numerous quantitative metrics and qualitative
                                                                      techniques for o✏ine and online evaluations of recommender
1.   INTRODUCTION                                                     systems have been published in research. We will start by
  There is a large variety of algorithms for recommender              giving a short overview of some of the most important pub-
systems that were published in research and have been im-             lications that are relevant to this paper.
plemented by the industry. Almost every author claims that            Herlocker et al. provide a comprehensive overview of the ex-
a particular algorithm or implementation is superior to an-           isting methods for evaluating collaborative filtering systems
other in a certain respect, which makes it difficult to choose        [9]. Although they focus on collaborative filtering meth-
among them. Currently a comprehensive and objective com-              ods, many of the presented evaluation approaches, metrics
parison of existing recommender systems is hard to find and           and techniques are applicable to other types of recommender


                                                                 78
systems as well. They conducted an insightful study on the             3.    SETTING THE EVALUATION GOAL
correlation of di↵erent metrics and concluded that the ana-               The first step of an evaluation should be to define its goal
lyzed metrics can be subdivided in three major classes.                as precisely as possible. Although this seems rather obvious,
Olmo and Gaudioso build on this survey and derive a frame-             we would like to emphasize this step since it is often not ex-
work for recommender systems that divides recommender                  ecuted with sufficient care in evaluations in general. We can
systems into a filter and a guide component [4]. Their aim             only choose one or several adequate metrics and interpret
is to separate the calculation of recommendations from their           their results, if we have set the purpose of our evaluation
presentation. They propose to use metrics that focus on the            beforehand. The results of popular evaluation metrics such
fact whether the recommendations presented by the system               as RMSE, precision or F1 -measure do not necessarily lead us
are actually followed by the users of the recommender sys-             to the best algorithm for our purpose, as we try to illustrate
tem and suggest to pay more attention to the presentation              later on. Only if the set of metrics accurately reflects the
of recommendations as well as the respective objective of              specific objectives that are connected to the recommender
recommender systems.                                                   system in our evaluation and the concrete usage scenario,
Cremonesi and Lentini present an evaluation methodology                can we be sure that the obtained results are useful [11, 3].
for collaborative filtering recommender systems [2]. They              Moreover, the specific interface choice for the recommender
use mean squared error, root mean squared error (RMSE)                 system and the usage patterns that are to be expected should
and mean absolute error (MAE) as well as the classifica-               be taken into account. Such a precise evaluation goal could
tion accuracy metrics precision, recall, f-measure and ROC             be: Find the recommendation algorithm and parameteri-
graphs to compare two collaborative filtering algorithms us-           zation that leads to the highest overall turnover on a spe-
ing the MovieLens1 data set and a further movie data set               cific e-commerce web site, if four product recommendations
obtained from an IPTV service provider.                                are displayed as a vertical list below the currently displayed
Konstan et al. summarize findings about the usage of auto-             product.
mated recommender systems for information-seeking tasks
[12]. They list many problems and challenges in evaluat-               3.1    Objectives for Recommender Usage
ing recommender systems and emphasize the importance
                                                                          Having a clear conception of the objectives we want to
of standardized evaluation methodologies, metrics and test
                                                                       achieve by designing, implementing or applying a recom-
data sets for progress in research.
                                                                       mender system is a worthwhile e↵ort, even if we recognize
Kohavi et al. provide a survey and practical guide for con-
                                                                       that it may be too difficult or expensive to measure them
ducting controlled experiments on the web [11]. In a follow-
                                                                       accurately. In many cases we may discover that only the
up paper Crook et al. describe common pitfalls they experi-
                                                                       results of a large scale field study would accurately reflect
enced and emphasize the importance of choosing an overall
                                                                       our evaluation goal. Nevertheless a defined goal will help
evaluation criterion that truly reflects business goals [3].
                                                                       us in selecting the most appropriate metric or set of metrics
Jannach et al. provide a chapter on evaluating recommender
                                                                       that allows us to obtain the most precise measurement of
systems in their book [10]. They review the current state
                                                                       the recommender’s suitability for the usage scenario which
of research and survey the approaches taken in published
                                                                       can be achieved with a reasonable e↵ort.
papers on the evaluation of recommender systems.
                                                                       Common reasons for implementing a recommender system
Shani and Gunawardana contributed a chapter on evaluat-
                                                                       are the desire to improve user satisfaction and to increase
ing recommender systems to the handbook by Ricci et al.
                                                                       the economic success of a platform. Although both goals
[14] and describe the most important aspects in conducting
                                                                       are interrelated they may be competing in some scenarios.
o✏ine and online experiments as well as user studies. They
                                                                       As an example, in e-commerce a recommender may either
outline basic approaches for all three types of evaluations,
                                                                       determine the top recommendations based on the best price-
some important accuracy metrics and discuss various other
                                                                       performance ratio for the customer but it may also show the
properties of recommenders that should be considered when
                                                                       products that are likely to lead to the highest revenue for
evaluating recommenders.
                                                                       the business. For this purpose commercial recommenders
A further valuable source for metrics and measures is the
                                                                       for web shops often consider a reward attribute for items
literature on information retrieval. It is, by its very nature,
                                                                       that models how much the company profits from a sale of a
concerned with the quality of search results and provides
                                                                       certain item. This information can be used e.g. in combina-
insight into evaluation methodologies and metrics (e.g.[5]).
                                                                       tion with classification accuracy metrics (see Section 4.2).
   The evaluation of recommender systems has emerged as
                                                                       A recommender system in a live environment usually has
an important topic as more and more algorithms and tech-
                                                                       to be optimized with regard to various other objectives that
niques for recommenders systems are presented. Many dif-
                                                                       are related to the technical performance and the system’s life
ferent evaluation metrics can be applied in evaluations, al-
                                                                       cycle such as responsiveness, scalability, peak load, reliabil-
though some (e.g. RMSE, MAE and precision) are more
                                                                       ity, ramp-up e↵orts, maintainability, extensibility and cost
frequently used than others. However, authors rarely justify
                                                                       of ownership [10]. These aspects have a strong influence on
their particular choice and little guidance is o↵ered on how
                                                                       the decision among a set of algorithms and implementations
to choose a metric for a specific purpose. Some notable ex-
                                                                       and put further constraints on a choice. Further aspects
ceptions are the comparison of metrics given by Herlocker et
                                                                       that extend beyond the accuracy of a recommender are cov-
al. [9] and the overview by Jannach et al. [10]. In this pa-
                                                                       erage, novelty, serendipity, confidence, persuasiveness, trust,
per we try to provide the reader with a better understanding
                                                                       robustness, security and many more (cf. [9, 14]). Measuring
of the existing accuracy metrics and how to apply them in
                                                                       these diverse qualities of a recommender system is a large
order to evaluate recommenders for specific goals.
                                                                       field that is beyond the scope of the presented work. We
                                                                       will confine our analysis to accuracy metrics and attempt to
1
    http://www.grouplens.org/node/73                                   analyze how di↵erent measures reflect varying objectives.


                                                                  79
3.2    Analyzing the Recommender System and                              product or clicking on a link to display an article usually has
       its Data                                                          a di↵erent primary goal than expressing a positive rating, so
   In order to further specify our evaluation goal we have to            these actions are considered an implicit rating.
take a closer look at the recommender’s task, its interface                 The user ratings are usually collected using either a nu-
and the utilized data.                                                   merical, a binary or a unary rating scale. Although other
What is the recommender’s task and how does the                          preference models such as textual reviews are conceivable
system interact with the user? In our opinion the                        they are rarely used in today’s recommender systems.
most common usage scenarios for recommender systems are                  A numerical rating is represented by a number from either
prediction, ranking and classification tasks which manifest              a discrete or a continuous rating scale, in most cases with
themselves in di↵erent types of user interfaces.                         a limited range. Typical examples of a discrete rating scale
In a prediction task the focus lies on predicting ratings for            are ratings on a scale from zero to five stars or Likert re-
unrated items (e.g. movies, music or news articles) of a user            sponse scales that are commonly used in questionnaires. To
and showing these predictions to him. For example, a movie               be precise, these rating scales are actually ordinal scales, a
rental company might show predicted ratings to users in or-              fact which is ignored by predictive accuracy metrics (cf. 4.1)
der to aid them in their decision making.                                that make intensive use of ratios. An example of a contin-
In a ranking task the recommender tries to determine an or-              uous rating scale could be a slider that is set by a user and
der of items, often with the purpose of creating a top-k list            translated to a real value.
of items. This task is particularly common in e-commerce                 A binary rating scale allows users to assign items to two dif-
applications, where a top-k or unlimited ordered list of rec-            ferent classes (like/dislike). YouTube, for example, allows
ommended products is shown in a sidebar or on a dedicated                users to rate movies with either thumb up or thumb down.
page. But also web portals for news, content and multi-                  A unary rating, by contrast, allows users to assign items only
media make heavy use of top-k lists. A further interesting               to a single class, which is in most cases positive (e.g. like).
ranking task for a recommender is to order a limited set of              A prominent example of an explicit unary rating is Face-
search results according the user’s predicted preference.                book’s “Like”-button. Implicit unary ratings can be pur-
In a classification task the recommender determines a set                chased products in a web shop or clicked links on a news
with a limited (often fixed) number of recommended items                 page.
with no particular order among them implied. E.g. articles               The important di↵erence between binary and unary rat-
or products that are predicted to be of interest to a user are           ings is that unary ratings o↵er no distinction between dis-
highlighted or annotated on a web page. A top-k list of rec-             liked and unrated items. With unary ratings we cannot tell
ommended items can be seen as a classification task as well,             whether a user actually dislikes an item or simply does not
especially if the number of items is very small and the order            know the item or does not bother to rate it. In a large
is less prominent e.g. recommended items are arranged in a               web shop, such as Amazon, a customer will only be aware
grid or scrollable in two directions in a circular fashion (cf.          of a small portion of the product catalog. Furthermore,
Amazon).                                                                 other factors such as limited budget, limited time, external
Recommender systems are further used to determine simi-                  constraints, and products the user already owns determine
lar items or similar users. E.g. in web shops on a product               whether a customer will actually buy a product he or she
detail page usually a list of similar products is shown or a             likes. Being aware of this di↵erence and the implied biases is
news web site often lists articles similar to the one currently          important when conducting an evaluation and interpreting
shown. Recommending similar users is of particular interest              the results of various metrics.
for social web applications such as Facebook. These tasks
which focus on the items and users themselves instead of the             4.    OVERVIEW OF EVALUATION METRICS
user’s ratings, can be seen as ranking or classification tasks              Evaluation metrics for recommender systems can be di-
depending on the chosen interface.                                       vided into four major classes [9, 4]: 1) Predictive accuracy
Realizing for which of these tasks the recommender system                metrics, 2) Classification accuracy metrics, 3) Rank accu-
is used and what the user interface looks like is important for          racy metrics and 4) Non-accuracy metrics. Since we confine
choosing the most appropriate metric. The number of dis-                 our analysis to accuracy metrics we will omit the discus-
played recommendations and their visual arrangement (To                  sion of this class and suggest to consult the literature (e.g.
which extent does it convey a ranking?) should also be con-              [9, 14]).
sidered before choosing a metric. In many usage scenarios
more than one of these tasks has to be fulfilled by a rec-               4.1    Predictive Accuracy Metrics
ommender system. Therefore it may be sensible to apply                      Predictive accuracy or rating prediction metrics embark on
one accuracy metric for each of the usage scenarios to find              the question of how close the ratings estimated by a recom-
a reasonable compromise or to even consider using di↵erent               mender are to the true user ratings. This type of measures
recommendation algorithms for each task.                                 is very popular for the evaluation of non-binary ratings. It
What kind of user preference data is used by the                         is most appropriate for usage scenarios in which an accurate
recommender? The preferences of a user can be gathered                   prediction of the ratings for all items is of high importance.
either through explicit or implicit ratings. While an explicit           The most important representatives of this class are mean
rating is made as a deliberate statement by a user who has               absolute error (MAE), mean squared error (MSE), root mean
the intention to express his or her opinion, an implicit rat-            squared error (RMSE) and normalized mean absolute error
ing is deducted from actions of the user that had a di↵erent             (NMAE) (cf. [9, 8]). MSE and RMSE use the squared de-
primary goal. If, for example, a user rates a movie on a web             viations and thus emphasize larger errors in comparison to
site, this is a direct expression of his or her opinion and pref-        the MAE metric. MAE and RMSE describe the error in
erences, so we consider it an explicit rating. Purchasing a              the same units as the computed values, while MSE yields


                                                                    80
squared units. NMAE normalizes the MAE metric to the                                              Relevant     Irrelevant      Total
range of the respective rating scale in order to make results             Recommended                tp             fp       tp + f p
comparable among recommenders with varying rating scales.                 Not Recommended            fn            tn        f n + tn
The RMSE metric has been used in the Netflix competition                  Total                   tp + f n      f p + tn        N
in order to determine the improvement in comparison to the
Cinematch algorithm as well as the prize winner. This was              Table 1: A contingency table or confusion matrix
a significant source of discussion over the course of the com-         that accumulates the numbers of true/false posi-
petition.                                                              tive/negative recommendations.
These predictive accuracy error metrics are frequently used
for the evaluation of recommender systems since they are
easy to compute and understand. They are well-studied and              lowing) IR metrics can be observed in Figure 1 which shows
are also applied in many contexts other than recommender               a potential ranking of four relevant items in a set of ten
systems. However, they do not necessarily correspond di-               items by a recommender. In examples (a) to (j) the length
rectly to the most popular usage scenarios for recommender             k of the top-k list varies, while the item order remains fixed.
systems. There are few cases where users are in fact in-               Recall or true positive rate (also called sensitivity in psychol-
terested in the overall prediction accuracy. Recommender               ogy) is calculated as the ratio of recommended items that
systems are more commonly used to display a limited list               are relevant to the total number of relevant items:
of top ranked items or the set of all items that have been                                                     tp
                                                                                           recall = tpr =
rated above a certain threshold. Many recommendation al-                                                   tp + f n
gorithms are able to provide more accurate statements about
                                                                       This is the probability that a relevant item is recommended.
a limited set of items that the user either likes or dislikes.
                                                                       The two measures, precision and recall, are inversely related
The estimations for many other items are rather inaccurate
                                                                       which we notice if we vary the size of the set of recommen-
but often also significantly less important to users.
                                                                       dations (cf. Fig. 1). In most cases, increasing the size of the
This applies in particular to e-commerce applications where
                                                                       recommendation set will increase recall but decrease preci-
we are usually more concerned with suggesting some prod-
                                                                       sion. Because of this mutual dependence it makes sense to
ucts to customers that they will like in comparison to es-
                                                                       consider precision and recall in conjunction with two other
timating the most accurate ratings for the large amount of
                                                                       metrics that are called fallout and miss rate.
items that customers would never purchase.
                                                                       Fallout or false positive rate is calculated as the ratio of rec-
4.2    Classification Accuracy Metrics                                 ommended items that are irrelevant to the total number of
   Classification accuracy metrics try to assess the successful        irrelevant items:
decision making capacity (SDMC) of recommendation algo-                                                         fp
                                                                                          fallout = fpr =
rithms. They measure the amount of correct and incorrect                                                     f p + tn
classifications as relevant or irrelevant items that are made
                                                                       This is the probability that an irrelevant item is recom-
by the recommender system and are therefore useful for user
                                                                       mended.
tasks such as finding good items. SDMC metrics ignore the
                                                                       Miss rate or false negative rate is calculated as the ratio of
exact rating or ranking of items as only the correct or incor-
                                                                       items not recommended but actually relevant to the total
rect classification is measured [9, 4]. This type of metric is
                                                                       number of relevant items:
particularly suitable for applications in e-commerce that try
to convince users of making certain decisions such as pur-                                                       fn
                                                                                         missRate = fnr =
chasing products or services.                                                                                 tp + f n
A comprehensive overview of classic, basic information re-
                                                                       This is the probability that a relevant item is not recom-
trieval metrics, such as recall and precision and further im-
                                                                       mended.
proved metrics, is given in the survey by Powers [13]. The
                                                                       Just as precision and recall describe the recommender’s per-
following short summary is based on his terminology and
                                                                       formance regarding the true positives, two similar metrics
descriptions for these metrics. In section 5.1 we will discuss
                                                                       can be defined that measure how the algorithm behaves re-
three less well-known metrics that were described in this
                                                                       garding true negatives. Since this corresponds to interchang-
paper as well and that we deem very useful. To the best of
                                                                       ing the true and false values of the underlying data for the
our knowledge they have not been used for the evaluation of
                                                                       precision and recall metric, they are called inverse precision
recommender systems so far.
                                                                       and inverse recall.
The common basic information retrieval (IR) metrics are
                                                                       Inverse precision or true negative accuracy is calculated as
calculated from the number of items that are either relevant
                                                                       the ratio of items not recommended that are indeed irrele-
or irrelevant and either contained in the recommendation
                                                                       vant to the total number of not recommended items:
set of a user or not. These numbers can be clearly arranged
in a contingency table that is sometimes also called the con-                                                     tn
                                                                                     inversePrecision = tna =
fusion matrix (see Table 1).                                                                                   f n + tn
Precision or true positive accuracy (also confidence in data           This is the probability that an item which is not recom-
mining) is calculated as the ratio of recommended items that           mended is indeed irrelevant.
are relevant to the total number of recommended items:                 Inverse recall or true negative rate (also called specificity) is
                                         tp                            calculated as the ratio of items not recommended that are
                   precision = tpa =
                                      tp + f p                         really irrelevant to the total number of irrelevant items:
This is the probability that a recommended item corresponds                                                  tn
to the user’s preferences. The behavior of this (and the fol-                    inverseRecall = tnr =             =1       fpr
                                                                                                          f p + tn


                                                                  81
       Figure 1: Varying the length k of a top-k list for a possible recommender’s ranking of ten items.


This is the probability that an irrelevant item is indeed not           know whether they measure total or partial orderings. Most
recommended.                                                            rank accuracy metrics such as Kendall’s tau and Spearman’s
The F1 -measure and F -measure try to combine precision                 rho compare two total orderings. The problem with these
and recall into a single score by calculating di↵erent types            measures is that in most cases we are not provided with a
of means of both metrics. The F1 -measure or F1 -score is               full ranking of all items. Many recommendation algorithms
calculated as the standard harmonic mean of precision and               can generate only a partial list of items that are most likely
recall:                                                                 to be preferred by a user. All other items would have to be
                        2           2 · precision · recall              concatenated to the list in an arbitrary order. Furthermore,
         F1 =      1         1    =                                     the recommendation list can contain groups of items with a
               precision
                         + recall
                                     precision + recall
                                                                        similar rating that can appear in varying orders. The same
   Moreover there is a large variety of metrics that are de-            applies to the true preferences of a user. In order to create
rived from these basic information retrieval metrics [9, 5].            a full ranking of the items all preference values for the user
Mean average precision (MAP) is a popular metric for search             have to be known. Since the user might express the same
engines and is applied, for example, to report results at the           rating for several items the list will again contain groups
Text Retrieval Conference (TREC) [5]. It takes each rele-               of items that can appear in an arbitrary order. The largest
vant item and calculates the precision of the recommenda-               problem is posed by items for which no user rating is known.
tion set with the size that corresponds to the rank of the              These items could in fact hold an arbitrary place within the
relevant item. Then the arithmetic mean of all these preci-             ranking. Sometimes it is assumed that ratings for items
sions is formed.                                                        that are preferred are known, so that the unknown items
                         PN                                             are concatenated to the end of the list. In general, however,
                            r=1 (P (r) ⇥ rel(r))                        the unknown items could as well contain items that would
            AP =
                    number of relevant documents                        appear within the top ranks if rated by the user [9].
Afterwards we calculate the arithmetic mean of the average              The bottom line is that in most cases a rank metric for
precisions of all users to get the final mean average precision:        partial orderings would be more appropriate for comparing
                                PM                                      recommendation lists that are produced by recommenders
                                   u=1 APu                              to item rankings from known user preferences. One pos-
                      MAP =
                                     M                                  sibility is to use an arbitrary total ordering that complies
Various other means including the geometric mean (GMAP),                with the partial ordering, though the results will become
harmonic mean (HMAP) and quadratic mean (QMAP) can                      less meaningful when the number and size of item groups
be applied instead. All of these measures emphasize true                with identical rating increases. An evaluation might state
positives at the top of a ranking list. This behaviour can be           that a certain ranking is inferior even though only items
observed in Figure 2 (a) to (j) where the position of a single          within groups of identical ratings switched their places. A
relevant item within a ranking of ten items is varied.                  better solution is to compare all or a subset of the total or-
Further derived measures that we will discuss later are ROC             derings that comply with the partial orderings and average
curves and the area under the curve (AUC) measure.                      the results. However, this can easily become combinatorially
                                                                        challenging when both rankings are in fact partial orderings.
4.3    Rank Accuracy Metrics                                            Fagin et al. discuss several derived metrics for comparing
   A rank accuracy or ranking prediction metric measures                partial orderings that could be applied for recommender sys-
the ability of a recommender to estimate the correct order              tems [6]. Bansal and Fernandez-Baca provide runtime effi-
of items concerning the user’s preference, which is called the          cient implementations of these metrics [1].
measurement of rank correlation in statistics. Therefore this           We think that rank accuracy metrics are very well suited
type of measure is most adequate if the user is presented with          for e-commerce applications if they allow to compare par-
a long ordered list of items recommended to him. A rank                 tial rankings in a meaningful way. In our implementation
prediction metric uses only the relative ordering of prefer-            we used a variation of the Kendall’s tau metric for partial
ence values so that is independent of the exact values that             rankings on boolean classifications. That is, we compare
are estimated by a recommender. For example, a recom-                   the partial ranking provided by recommended and not rec-
mender that constantly estimates items’ ratings to be lower             ommended items with the partial ranking of relevant and
than the true user preferences, would still achieve a perfect           irrelevant items. This boolean Kendall’s tau is in fact highly
score as long as the ranking is correct.                                correlated to the AUC metric (e.g. Fig. 2).
For the usage of rank accuracy metrics, it is important to


                                                                   82
5.    IMPROVED METRICS FOR
      RECOMMENDER SYSTEMS
5.1   Informedness, Markedness and Matthews
      Correlation
   A major problem with the frequently used metrics preci-
sion, recall and F1 -measure is that they su↵er from severe
biases. The outcome of these measures not only depends on
the accuracy of the recommender (or classifier) but also very
much on the ratio of relevant items. In order to illustrate
this we imagine an ideal recommender and its inverse and
apply them to generate top-k lists of four items for varying
ratios of relevant items (see Fig. 3). The examples show
that the informatory value of all three measures varies. In             Figure 5: An example for the ROC curve used by
extreme cases with a highly skewed ratio its value can be               our variant of the AUC measure.
even misleading (cf. Fig. 3 e – h).
                                                                        in-depth discussion of ROC curves can be found in the pa-
To solve this problem Powers [13] introduced three new met-
                                                                        per by Fawcett [7].
rics that try avoid these biases by integrating the inverse
                                                                        A perfect recommender would yield a ROC curve that goes
precision and inverse recall respectively.
                                                                        straight up towards 1.0 recall and 0.0 fallout until all relevant
Markedness combines precision and inverse precision into a
                                                                        items are retrieved. Afterwards it would go straight right to-
single measure and expresses how marked the classifications
                                                                        wards 1.0 fallout while the remaining irrelevant items follow.
of a recommender are in comparison to chance:
                                                                        The obvious aim is consequently to maximize the area under
       markedness = precision + inversePrecision 1                      the ROC curve. The area under the curve (AUC) can there-
                       tp         tn                                    fore be used as a single measure for the overall quality of a
                  =          +            1                             recommender system. However, a frequently uttered point
                    tp + f p   f n + tn
                                                                        of criticism is that users are often more interested in the
Informedness combines recall and inverse recall into a single           items at the top of recommendation lists but that the AUC
measure and expresses how informed the classifications of a             measure is equally a↵ected by swaps at the top or the bottom
recommender are in comparison to chance:                                of a recommendation list (cf. Figures 2 and 4 e – h). This
          informedness = recall + inverseRecall 1                       may be a disadvantage if we are mainly interested in find-
                                                                        ing the top ranked items and thus care mostly for the first
                            tp          tn
                       =           +           1                        part of the ROC graph. Therefore, in addition to the stan-
                         tp + f n    f p + tn                           dard AUC measure, we implemented a slight variation of the
Both markedness and informedness return values in the range             AUC measure. This limited area under the curve (LAUC)
[ 1, 1].                                                                measure uses the same approach but instead of measuring
The Matthews Correlation combines the informedness and                  the complete area, it takes only the area under the first part
markedness measures into a single metric by calculating                 of the curve into account. This is the part of the curve which
their geometric mean:                                                   is formed by the first k recommendations.
                                                                        However, measuring this partial area is not easy due to sev-
                           (tp · tn) (f p · f n)
correlation = p                                                         eral problems:
               (tp + f n) · (f p + tn) · (tp + f p) · (f n + tn)
               p                                                           • The measure has to be normalized in a certain way to
            = ± informedness · markedness                                    be comparable.
The range of the Matthews Correlation is [ 1, 1], so the sign              • The normalized measure should return one if all rele-
in the second representation of the formula actually depends                 vant items are retrieved first by the recommendation
on the respective signs for markedness and informedness.                     algorithm and fit within the recommendation list.
We propose to replace the measures precision and recall by
markedness and informedness for most evaluation purposes                   • If the recommendation list contains only relevant items,
in order to avoid being misled by underlying biases. As                      then the area under the curve is in fact zero. Never-
we can see in Figures 1 and 3, markedness, informedness                      theless the normalized measure should still return one.
and Matthews Correlation are significantly more helpful in                 • Relevant items that are retrieved at the end of the list
choosing an adequate size for a top-k recommendation list.                   with no irrelevant items following do not add to the
Furthermore derived metrics such as mean average precision                   area under the limited curve.
(MAP) could as well be replaced by their respective equiv-
alents (e.g. mean average markedness).                                  A good solution for these problems is to generate just a lim-
                                                                        ited recommendation list that only contains a fixed number
5.2   Limited Area Under the Curve                                      of items and to assume that all other relevant items will be
   ROC curves provide a graphical representation for the                distributed uniformly over the rest of the ranking list until
performance of a recommender system, an information re-                 all items are retrieved. This means that we calculate the
trieval system or any other type of binary classifier. A ROC            AUC measure for the first part of the ROC curve in the
curve plots recall (true positive rate) against fallout (false          standard way until the first k recommendations have been
positive rate) for increasing recommendation set size. An               retrieved. Then we take the end point of the ROC curve


                                                                   83
        Figure 2: Varying the position of a single relevant item on a four out of ten recommendation list.


Figure 3: Varying the ratio of relevant items in a four out of ten recommendation list and applying two
recommenders with perfect and inverse accuracy.


formed by the first k recommendations and draw a straight                 following questions:
line to the upper right corner (see Fig. 5). The area under               Is there a distinction between rated and unrated
the curve that is situated to the right of the curve formed               items? If all items are implicitly rated predictive accuracy
by the top-k list thus is a simple trapezoid.                             metrics are not applicable, because there are no unrated
The resulting LAUC measure is very useful for the evalua-                 items for which we can predict a rating and measure the
tion of recommender systems that are applied to generate                  accuracy.
top-k lists of items (cf. Figures 2 and 4):                               Are items rated on a numerical or a binary scale? A
                                                                          binary rating scale usually suggests a classification or rank-
     • The measure returns one if all relevant items are re-
                                                                          ing task.
       trieved within the top-k list.
                                                                          Are users interested in rating predictions or only in
     • The measure becomes minimal if no relevant items are               top-ranked items? If users only care about top-ranked (or
       retrieved within the top-k list. If all irrelevant items           lowest-ranked) items and not about individual rating scores
       are retrieved first and fit within the top-k list the mea-         for items this suggests a classification or ranking task. Al-
       sure returns zero.                                                 though an algorithm may use predicted ratings internally,
                                                                          it has to succeed in estimating the top-ranked (or lowest-
     • A top-k list that contains more relevant items will yield          ranked) items and a higher error on rating predictions for
       a higher score than a list with less relevant items, ex-           items is acceptable as long as top-ranked (or lowest-ranked)
       cept if the length of the list is close to the total number        items are identified correctly. Therefore an evaluation should
       of items. In this case the order of relevant and irrele-           focus on classification or ranking metrics.
       vant items within the recommendation list would have               Is a limited list of top-ranked items shown? If yes,
       a higher influence on the overall score.                           a metric that measures the overall predictive accuracy or
     • If a relevant item moves towards the top of the list the           overall ranking accuracy is not appropriate. The exact rat-
       measure increases.                                                 ing predictions and ranking of other items are irrelevant to
                                                                          users and should not be considered by the metric.
     • A swap at the top or bottom of the top-k list has the              Do the recommended items have or imply an order?
       same e↵ect on the measure’s value.                                 Users will usually consider recommendations in a certain or-
                                                                          der, in particular if many recommendations are shown. If
     • All changes beyond the end of the top-k list are ignored           this is the case, basic information retrieval metrics such as
       by the measure.                                                    precision, recall, markedness and informedness are not suf-
A similar variation of the boolean Kendall’s tau, which con-              ficient since they ignore the order among the recommended
siders only the top-k recommendations, is another useful and              items. A metric that considers the order of recommended
highly correlated metric (e.g. Fig. 2).                                   items as well is more appropriate for this purpose.
                                                                          How fast does the user’s interest in lower ranked
6.     SOME GUIDELINES FOR CHOOSING A                                     items decay? The metric should reflect the user’s decay in
                                                                          interest. MAP and GMAP, for example, emphasize the first
       METRIC                                                             recommendations in contrast to the AUC or Kendall’s tau
  In order to choose an appropriate evaluation metric for a               measure that weighs swaps in lower and higher ranks in the
given recommendation scenario, it is helpful to answer the


                                                                     84
       Figure 4: Varying the position of three relevant items on a four out of ten recommendation list.


same way (cf. Fig. 2). If users hardly look at the ranking                experiments on the web. In Proceedings of the 15th
of lower ranked items, a classification or ranking error for              ACM SIGKDD, pages 1105–1114, 2009.
lower ranked items becomes irrelevant.                                [4] F. H. del Olmo and E. Gaudioso. Evaluation of
                                                                          recommender systems: A new approach. Expert
7.   CONCLUSION                                                           Systems with Applications, 35(3):790–804, October
                                                                          2008.
   In this paper we analyzed the relation between objectives          [5] Eighteenth Text REtrieval Conference. Appendix a:
for applying recommender systems and metrics used for their               Common evaluation measures. In The Eighteenth Text
evaluation. We emphasized the importance of defining a pre-               REtrieval Conference (TREC 2009) Proceedings, 2009.
cise goal for the evaluation and discussed how the data used
                                                                      [6] R. Fagin, R. Kumar, M. Mahdian, D. Sivakumar, and
by the recommender as well as its user interface a↵ect the
                                                                          E. Vee. Comparing partial rankings. SIAM Journal on
specific task for the recommender. We gave an overview of
                                                                          Discrete Mathematics, 20(3):628–648, 2006.
the three main classes of accuracy related evaluation met-
rics and discussed their applicability for di↵erent types of          [7] T. Fawcett. An introduction to roc analysis. Pattern
recommender systems. In order to illustrate the specific                  Recognition Letters, 27(8):861–874, 2006.
advantages and disadvantages of various metrics discussed,            [8] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins.
we compared them using several informative examples. We                   Eigentaste: A constant time collaborative filtering
proposed to utilize markedness, informedness and Matthews                 algorithm. Information Retrieval, 4(2):133–151, 2001.
correlation as classification metrics since they are superior         [9] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and
to precision, recall and F1 -measure for most purposes. We                J. T. Riedl. Evaluating collaborative filtering
presented a new variation of the area under the curve mea-                recommender systems. ACM Transaction on
sure that is particularly suited for top-k recommendations                Information Systems, 22(1):5–53, January 2004.
which are used in many e-commerce applications. This lim-            [10] D. Jannach, M. Zanker, A. Felfernig, and G. Friedich.
ited area under the curve measure combines classification                 Recommender Systems: An Introduction. Cambridge
and ranking accuracy to create a better measure for this                  University Press, 1 edition, 9 2010.
purpose. Furthermore, we provided some crisp guidelines              [11] R. Kohavi, R. Longbotham, D. Sommerfield, and
that help to choose an appropriate evaluation metric for a                R. M. Henne. Controlled experiments on the web:
specific usage scenario.                                                  survey and practical guide. Data Min. Knowl. Discov.,
                                                                          18:140–181, February 2009.
                                                                     [12] J. A. Konstan, S. M. McNee, C.-N. Ziegler, R. Torres,
8.   ACKNOWLEDGMENTS                                                      N. Kapoor, and J. Riedl. Lessons on applying
  The work presented in this paper is funded by the Euro-                 automated recommender systems to
pean Social Fund and the Free State of Saxony under the                   information-seeking tasks. In AAAI, 2006.
grant agreement number 080954843.                                    [13] D. M. Powers. Evaluation: From precision, recall and
                                                                          f-factor to roc, informedness, markedness and
9.   REFERENCES                                                           correlation. Technical report, School of Informatics
                                                                          and Engineering, Flinders University Adelaide, South
 [1] M. S. Bansal and D. Fernández-Baca. Computing                       Australia, 2007.
     distances between partial rankings. Information
                                                                     [14] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor,
     Processing Letters, 109(4):238 – 241, 2009.
                                                                          editors. Recommender Systems Handbook. Springer,
 [2] P. Cremonesi, R. Turrin, E. Lentini, and                             Berlin, 1st edition, 10 2010.
     M. Matteucci. An evaluation methodology for
                                                                     [15] M. Zanker, M. Bricman, S. Gordea, D. Jannach, and
     collaborative recommender systems. In AXMEDIS,
                                                                          M. Jessenitschnig. Persuasive online-selling in quality
     pages 224–231, Washington, DC, USA, 2008.
                                                                          and taste domains. In Electronic Commerce and Web
 [3] T. Crook, B. Frasca, R. Kohavi, and R. Longbotham.                   Technologies, pages 51–60, 2006.
     Seven pitfalls to avoid when running controlled


                                                                85

</pre>