On the Use of Weighted Mean Absolute Error in Recommender Systems

S. Cleger-Tamayo (Dpto. de Informática, Universidad de Holguín, Cuba; sergio@facinf.uho.edu.cu)
J.M. Fernández-Luna & J.F. Huete (Dpto. de Ciencias de la Computación e I.A., CITIC-UGR, Universidad de Granada, Spain; {jmfluna,jhg}@decsai.ugr.es)


ABSTRACT

The classical strategy to evaluate the performance of a Recommender System is to measure the error in rating predictions. But when focusing on a particular dimension of the recommending process, it is reasonable to assume that not every prediction should be treated equally: its importance depends on the degree to which the predicted item matches the dimension or feature under consideration. In this paper we shall explore the use of the weighted Mean Absolute Error (wMAE) as an alternative for capturing and measuring these effects on the recommendations. In order to illustrate our approach, two different dimensions are considered, one item-dependent and the other dependent on the user preferences.

1. INTRODUCTION

Several algorithms based on different ideas and concepts have been developed to compute recommendations and, as a consequence, several metrics can be used to measure the performance of the system. In recent years, increasing efforts have been devoted to research on Recommender System (RS) evaluation. According to [2], "the decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm". The selected metric depends on the particular recommendation task to be analyzed. Two main tasks might be considered: the first, measuring the capability of an RS to predict the rating that a user would give to an unobserved item; the second, related to the ability of an RS to rank a set of unobserved items in such a way that the items most relevant to the user are placed in the top positions of the ranking. Our interest in this paper is the measurement of the capability of a system to predict user interest in an unobserved item, so we focus on rating prediction.

For this purpose, two standard metrics [3, 2] have traditionally been considered: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). Both metrics try to estimate the expected error of the system, RMSE being more sensitive to the occasional large error: the squaring process gives higher weight to very large errors. A valuable property of both metrics is that they take their values in the same range as the error being estimated, so they can be easily understood by the users.

But these metrics assume that the standard deviation of the error term is constant over all the predictions, i.e. that each prediction provides equally precise information about the error variation. This assumption, however, does not hold, even approximately, in every recommending application. In this paper we will focus on the weighted Mean Absolute Error, wMAE, as an alternative for measuring the impact of a given feature on the recommendations(1). There are two main purposes for using this metric. On the one hand, it serves as an enhanced evaluation tool for better assessing the RS performance with respect to the goals of the application. For example, in the case of recommending books or movies, the accuracy of the predictions may vary when focusing on past or recent products. In this situation it is not reasonable that every error be treated equally, so more stress should be put on recent items. On the other hand, it can also be useful as a diagnosis tool that, acting as a "magnifying lens", can help to identify those cases where an algorithm is having trouble. For both purposes, different features shall be considered. These might depend on the items: for example, in the case of a movie-based RS, the genre, the release date, the price, etc. But the dimension might also be user-dependent, considering, for example, the location of the user, the user's rating distribution, etc.

This metric has been widely used for the evaluation of model performance in fields such as meteorology or economic forecasting [8]. But little has been said about its use in the recommending field; in isolation, several papers use small tweaks on error metrics in order to explore different aspects of RSs [5, 7]. The next section presents the weighted mean absolute error, illustrating its performance with two different features, user- and item-dependent, respectively. Lastly, we present the concluding remarks.

Acknowledgements: This work was jointly supported by the Spanish Ministerio de Educación y Ciencia and Junta de Andalucía, under projects TIN2011-28538-C02-02 and Excellence Project TIC-04526, respectively, as well as the Spanish AECID fellowship program.

(1) A similar reasoning can be applied to the squared error, which yields the weighted Root Mean Squared Error, wRMSE.

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.

2. WEIGHTED MEAN ABSOLUTE ERROR

The objective of this paper is to study the use of a weighting factor in the average error. In order to illustrate its functionality, we will consider simple examples obtained using four different collaborative-based RSs (using Mahout implementations): i) Means, which predicts using the average ratings for each user; ii) LM [1], following a nearest-neighbors approach; iii) SlopeOne [4], predicting based on the average difference between preferences; and iv) SVD [6], based on a matrix factorization technique. The metric performance is shown using an empirical evaluation based on the classic MovieLens 100K data set.

A weighting factor indicates the subjective importance we wish to place on each prediction, relating the error to any feature that might be relevant from either the user's or the seller's point of view. For instance, considering the release date, we can assign weights in such a way that the higher the weight, the more importance we place on more recent data. In this case we could observe that, even when the MAE is under a reasonable threshold, the performance of a system might be inadequate when analyzing this particular feature.

The weighted Mean Absolute Error can be computed as

    wMAE = \frac{\sum_{i=1}^{U} \sum_{j=1}^{N_i} w_{i,j} \, |p_{i,j} - r_{i,j}|}{\sum_{i=1}^{U} \sum_{j=1}^{N_i} w_{i,j}},    (1)

where U represents the number of users; N_i, the number of items predicted for the i-th user; r_{i,j}, the rating given by the i-th user to the item I_j; p_{i,j}, the rating predicted by the model; and w_{i,j}, the weight associated to this prediction. Note that when all the individual differences are weighted equally, wMAE coincides with MAE.

In order to illustrate our approach, we shall consider two factors, assuming that w_{i,j} ∈ [0, 1].

• Item popularity: we would like to investigate whether the error in the predictions depends on the number of users who rated the items. Two alternatives will be considered:

  i+ The weights put more of a penalty on bad predictions when an item has been rated quite frequently (the item has a high number of ratings). We still penalize bad predictions when it has a small number of ratings, but not as much as when we have more samples, since it may just be that the limited number of ratings does not provide much information about the latent factors which influence the users' ratings. Particularly, for each item I_i we shall take as its weight the probability that this item was rated in the training set, i.e. w_i = pr(I_i).

  i– This is the inverse of the previous criterion, where we put more emphasis on the predictions for those items with fewer ratings. So the weights are w_i = 1 − pr(I_i).

• Rating distribution: it is well known that users do not rate the items uniformly; they tend to use high-valued ratings. By means of this feature we can measure whether or not the error depends on the rating distribution. Particularly, we shall consider four different alternatives:

  rS+ Considering the overall rating distribution in the system, putting more emphasis on the error in the predictions for the common ratings. So the weights are w_i = pr_S(r_i), r_i being the rating given by the user to the item I_i.

  rS– Inversely, we assign more weight to the less common ratings, i.e. w_i = 1 − pr_S(r_i).

  rU+ Different users can follow different rating patterns, so we consider the rating distribution of the user, in such a way that the common ratings for a particular user will have greater weights, i.e. w_i = pr_U(r_i).

  rU– The last one assigns more weight to the less frequent ratings, i.e. w_i = 1 − pr_U(r_i).

Figures 1-A and 1-B present the absolute values of the MAE and wMAE errors for the four RSs considered in this paper. Figure 1-A shows the results when the weights are positively correlated with the feature distribution, whereas Figure 1-B presents the results when they are negatively correlated. Here, using wMAE, we can determine that the error is highly dependent on the users' pattern of ratings, and more weakly dependent on item popularity. Moreover, if we compare the two figures we can observe that all the models perform better when predicting the most common ratings. In this sense, they are able to learn the most frequent preferences, and greater errors (bad performance) are obtained when focusing on less frequent rating values. With respect to item popularity, these differences are less conclusive. In some sense, the way in which a user rates an item does not depend on how popular the item is.

2.1 Relative Weights vs. Relative Error

Another alternative for exploring the benefits of the wMAE metric is to consider the ratio between wMAE and MAE. Denoting e_{i,j} = |p_{i,j} − r_{i,j}|, we have that

    \frac{wMAE}{MAE} = \frac{\sum_{i,j} w_{i,j} e_{i,j} / \sum_{i,j} e_{i,j}}{\sum_{i,j} w_{i,j} / N}.

Taking into account that we restrict the weights to take their values in the [0, 1] interval, the denominator might represent the average percentage of the mass of the items that is related to the dimension under consideration, whereas the numerator represents the average percentage of the error coming from this feature. So, when wMAE > MAE, the percentage of error coming from the feature is greater than its associated mass, meaning the system is not able to predict that dimension properly. When both metrics are equal, the expected error is independent of the feature.

In Figure 1-C we present the values of wMAE/MAE, where, again, we can see that there exists a dependence between the rating and the error. The error associated with the common ratings is smaller than the relative importance of this feature in the system, whereas for the less common ratings the system is not able to make good predictions, the relative error being greater than its associated weight. This situation does not hold when considering item popularity.

Figures 1-D and 1-E present a scatter plot relating the relative weights (horizontal axis) to the relative error (vertical axis) for each user in the system and for each RS used(2). Particularly, in Figure 1-D we are considering rU+ as the weighting factor. Note that, since there are 5 possible ratings, the relative weight is equal to 0.2 when all the ratings are equally probable, and its value increases with the importance of the most used ratings. In this figure we can see that the percentage of mass and the percentage of error are positively correlated, with wMAE/MAE < 1 for most of the users. Moreover, there is a trend toward better predictions for those users with higher relative mass (for example, we can see how the regression line for the LM model

(2) We have included all the users with at least 10 predictions.
[Figure 1: Using wMAE in recommendations: absolute and relative values. Panels A and B plot the absolute MAE and wMAE values of the four models (Means, LM, Slope One, SVD) under the positively correlated weightings (rU+, rS+, i+) and the negatively correlated ones (rU–, rS–, i–), respectively. Panel C plots the wMAE/MAE ratios of each model for each weighting. Panels D and E are per-user scatter plots of relative weight vs. relative error, for rU+ and i+ respectively (D includes the regression line for LM). Panel F plots the wMAE of LM, Slope One and SVD relative to the Means baseline for each weighting.]
gets further away(3) from the line y = x). In some way, we can conclude that the recommendation usefulness of the rating distribution is consistent across all the users and RS models. On the other hand, Figure 1-E considers i+ as the weights. In this case, although weights and error are positively correlated, there exist significant differences between users. This result is hidden in the global measures.

(3) The other models perform similarly, but we have decided not to include these regression lines for clarity reasons.

2.2 Relative Comparison Among Models

Although wMAE might give some information about how the error has been obtained, there is no criterion for what a good prediction is. In order to tackle this situation, we propose the use of the relative rather than the absolute error, i.e. the weighted Mean Absolute Percentage Error, wMAPE. Then, given two models, M1 and M2, the relative metric is defined as wMAE_{M1}/wMAE_{M2}. In this metric, the lower the value, the greater the improvement. Thus, if we fix the model M2 to be a simple model (such as the average rating), we obtain the wMAE values in Figure 1-F. From these values we can draw some conclusions: for instance, that LM better fits the common user preferences (rU+), whereas Slope One and SVD are more biased toward the overall rating distribution in the system (rS+). Similarly, we found that greater improvements with respect to the average ratings are obtained when focusing on less frequent ratings. Finally, with respect to item popularity, all the models obtain greater improvements when considering the most popular items, although these differences are less significant.

3. CONCLUSIONS

In this paper we have explored the use of the weighted Mean Absolute Error as a means to measure an RS's performance by focusing on a given dimension or feature, being able to uncover specific cases where a recommendation algorithm may be performing suboptimally. This is a very useful way to trace the origin of the errors found in the recommendations, and therefore useful for improving RSs, although its main drawback is that it is not absolute, as MAE is.

4. REFERENCES

[1] S. Cleger-Tamayo, J.M. Fernández-Luna and J.F. Huete. A New Criteria for Selecting Neighborhood in Memory-Based Recommender Systems. Proc. of 14th CAEPIA'11, pp. 423-432. 2011.
[2] A. Gunawardana and G. Shani. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. Journal of Machine Learning Research 10, pp. 2935-2962. 2009.
[3] J.L. Herlocker, J.A. Konstan, L.G. Terveen and J.T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), pp. 5-53. 2004.
[4] D. Lemire and A. Maclachlan. Slope One Predictors for Online Rating-Based Collaborative Filtering. Proc. of SIAM Data Mining (SDM'05). 2005.
[5] P. Massa and P. Avesani. Trust metrics in recommender systems. Computing with Social Trust, pp. 259-285. Springer, 2009.
[6] B.M. Sarwar, G. Karypis, J. Konstan and J. Riedl. Incremental SVD-Based Algorithms for Highly Scaleable Recommender Systems. 5th International Conf. on Computer and Information Technology. 2002.
[7] T. Jambor and J. Wang. Goal-driven collaborative filtering: A directional error based approach. In Proc. ECIR'2010, pp. 407-419. 2010.
[8] C.J. Willmott. Statistics for the Evaluation and Comparison of Models. Journal of Geophysical Research, 90, pp. 8995-9005. 1985.
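
To make the computations of Section 2 concrete, the following minimal sketch (not part of the paper's original evaluation) shows how MAE, the wMAE of Eq. (1), and the wMAE/MAE ratio of Section 2.1 can be computed under the rU+ weighting. The tuples in `data` are hypothetical toy ratings on the 1-5 MovieLens scale, and pr_U is estimated by simple counting over that same toy set; any real experiment would estimate it from the training data.

```python
from collections import Counter

# Toy test set: (user, item, true_rating, predicted_rating) tuples.
# All values are made up for illustration.
data = [
    ("u1", "i1", 4, 3.5), ("u1", "i2", 4, 4.5), ("u1", "i3", 1, 3.0),
    ("u2", "i1", 5, 4.0), ("u2", "i4", 5, 5.0), ("u2", "i5", 2, 4.0),
]

def mae(data):
    """Plain Mean Absolute Error over all predictions."""
    return sum(abs(p - r) for _, _, r, p in data) / len(data)

def wmae(data, weight):
    """Weighted MAE, Eq. (1): sum of w * |p - r| over the sum of the weights."""
    num = sum(weight(u, i, r) * abs(p - r) for u, i, r, p in data)
    den = sum(weight(u, i, r) for u, i, r, p in data)
    return num / den

# rU+ weighting: w_i = pr_U(r_i), the per-user empirical probability of the
# true rating, estimated here by counting over the toy data itself.
user_rating_freq = {}
for u, _, r, _ in data:
    user_rating_freq.setdefault(u, Counter())[r] += 1

def w_rU_plus(u, i, r):
    counts = user_rating_freq[u]
    return counts[r] / sum(counts.values())

m = mae(data)
wm = wmae(data, w_rU_plus)
# A ratio > 1 would mean the share of error carried by the feature exceeds
# its weight mass, i.e. the model struggles on that dimension; here the
# common ratings are predicted better, so the ratio is below 1.
print(f"MAE = {m:.3f}, wMAE(rU+) = {wm:.3f}, ratio = {wm / m:.3f}")
# → MAE = 1.000, wMAE(rU+) = 0.800, ratio = 0.800
```

Swapping in `1 - counts[r] / sum(counts.values())` gives the rU– variant, and analogous item-indexed counts give i+ and i–, so all six weightings of Section 2 fit the same `wmae(data, weight)` interface.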