On the Use of Weighted Mean Absolute Error in Recommender Systems

S. Cleger-Tamayo (Dpto. de Informática, Universidad de Holguín, Cuba; sergio@facinf.uho.edu.cu)
J.M. Fernández-Luna & J.F. Huete (Dpto. de Ciencias de la Computación e I.A., CITIC-UGR, Universidad de Granada, Spain; {jmfluna,jhg}@decsai.ugr.es)


ABSTRACT

The classical strategy to evaluate the performance of a Recommender System is to measure the error in rating predictions. But when focusing on a particular dimension of the recommending process, it is reasonable to assume that not every prediction should be treated equally: its importance depends on the degree to which the predicted item matches the dimension or feature under consideration. In this paper we shall explore the use of the weighted Mean Absolute Error (wMAE) as an alternative for capturing and measuring these effects on the recommendations. In order to illustrate our approach, two different dimensions are considered, one item-dependent and the other dependent on the user preferences.

1. INTRODUCTION

Several algorithms based on different ideas and concepts have been developed to compute recommendations and, as a consequence, several metrics can be used to measure the performance of the system. In recent years, increasing efforts have been devoted to research on Recommender System (RS) evaluation. According to [2], "the decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm". The selected metric depends on the particular recommendation task to be analyzed. Two main tasks might be considered: the first, measuring the capability of an RS to predict the rating that a user would give to an unobserved item; the second, related to the ability of an RS to rank a set of unobserved items in such a way that the items most relevant to the user are placed in the top positions of the ranking. Our interest in this paper is the measurement of the capability of a system to predict user interest in an unobserved item, so we focus on rating prediction.

For this purpose, two standard metrics [3, 2] have traditionally been considered: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). Both metrics try to estimate the expected error of the system, RMSE being more sensitive to the occasional large error: the squaring process gives higher weight to very large errors. A valuable property of both metrics is that they take their values in the same range as the error being estimated, so they can be easily understood by the users.

But these metrics assume that the standard deviation of the error term is constant over all the predictions, i.e. that each prediction provides equally precise information about the error variation. This assumption, however, does not hold, even approximately, in every recommending application. In this paper we will focus on the weighted Mean Absolute Error, wMAE, as an alternative for measuring the impact of a given feature on the recommendations(1). There are two main purposes for using this metric. On the one hand, it serves as an enhanced evaluation tool for better assessing the RS performance with respect to the goals of the application. For example, in the case of recommending books or movies, the accuracy of the predictions may vary when focusing on past or recent products. In this situation it is not reasonable that every error be treated equally, so more stress should be put on recent items. On the other hand, it can also be useful as a diagnosis tool that, acting as a "magnifying lens", can help to identify those cases where an algorithm is having trouble. For both purposes, different features shall be considered. These might depend on the items: for example, in the case of a movie-based RS, the genre, the release date, the price, etc. But the dimension might also be user-dependent, considering, for example, the location of the user, the user's rating distribution, etc.

This metric has been widely used for the evaluation of model performance in fields such as meteorology or economic forecasting [8]. But little has been said about its use in the recommending field; in isolation, several papers use small tweaks on error metrics in order to explore different aspects of RSs [5, 7]. The next section presents the weighted mean absolute error, illustrating its performance with two different features, user- and item-dependent, respectively. Lastly, we present the concluding remarks.

Acknowledgements: This work was jointly supported by the Spanish Ministerio de Educación y Ciencia and Junta de Andalucía, under projects TIN2011-28538-C02-02 and Excellence Project TIC-04526, respectively, as well as the Spanish AECID fellowship program.

(1) A similar reasoning can be applied to the squared error, which yields the weighted Root Mean Squared Error, wRMSE.

Copyright is held by the author/owner(s). Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE 2012), held in conjunction with ACM RecSys 2012. September 9, 2012, Dublin, Ireland.

2. WEIGHTED MEAN ABSOLUTE ERROR

The objective of this paper is to study the use of a weighting factor in the average error. In order to illustrate its functionality, we will consider simple examples obtained using four different collaborative-based RSs (using Mahout implementations): i) Means, which predicts using the average ratings for each user; ii) LM [1], following a nearest-neighbors approach; iii) SlopeOne [4], predicting based on the average difference between preferences; and iv) SVD [6], based on a matrix factorization technique. The metric performance is shown using an empirical evaluation based on the classic MovieLens 100K data set.

A weighting factor indicates the subjective importance we wish to place on each prediction, relating the error to any feature that might be relevant from either the user's or the seller's point of view. For instance, considering the release date, we can assign weights in such a way that the higher the weight, the more importance we place on more recent data. In this case we could observe that, even when the MAE is under a reasonable threshold, the performance of a system might be inadequate when analyzing this particular feature.

The weighted Mean Absolute Error can be computed as

    wMAE = \frac{\sum_{i=1}^{U} \sum_{j=1}^{N_i} w_{i,j} \, |p_{i,j} - r_{i,j}|}{\sum_{i=1}^{U} \sum_{j=1}^{N_i} w_{i,j}},    (1)

where U represents the number of users; N_i, the number of items predicted for the i-th user; r_{i,j}, the rating given by the i-th user to the item I_j; p_{i,j}, the rating predicted by the model; and w_{i,j}, the weight associated to this prediction. Note that when all the individual differences are weighted equally, wMAE coincides with MAE.

In order to illustrate our approach, we shall consider two factors, assuming that w_{i,j} ∈ [0, 1].

• Item popularity: we would like to investigate whether the error in the predictions depends on the number of users who rated the items. Two alternatives will be considered:

  i+ The weights put more of a penalty on bad predictions when an item has been rated quite frequently (the item has a high number of ratings). We still penalize bad predictions when it has a small number of ratings, but not as much as when we have more samples, since it may just be that the limited number of ratings does not provide much information about the latent factors which influence the users' ratings. Particularly, for each item I_i we shall take as its weight the probability that this item was rated in the training set, i.e. w_i = pr(I_i).

  i– This is the inverse of the previous criterion, where we put more emphasis on the predictions for those items with fewer ratings. So the weights are w_i = 1 − pr(I_i).

• Rating distribution: it is well known that users do not rate the items uniformly; they tend to use high-valued ratings. By means of this feature we can measure whether or not the error depends on the rating distribution. Particularly, we shall consider four different alternatives:

  rS+ Considering the overall rating distribution in the system, putting more emphasis on the error in the predictions for the common ratings. So the weights are w_i = pr_S(r_i), r_i being the rating given by the user to the item I_i.

  rS– Inversely, we assign more weight to the less common ratings, i.e. w_i = 1 − pr_S(r_i).

  rU+ Different users can follow different rating patterns, so we consider the rating distribution of the user, in such a way that the common ratings for a particular user will have greater weights, i.e. w_i = pr_U(r_i).

  rU– The last one assigns more weight to the less frequent ratings, i.e. w_i = 1 − pr_U(r_i).

Figures 1-A and 1-B present the absolute values of the MAE and wMAE errors for the four RSs considered in this paper. Figure 1-A shows the results when the weights are positively correlated with the feature distribution, whereas Figure 1-B presents the results when they are negatively correlated. Here, using wMAE, we can determine that the error is highly dependent on the users' pattern of ratings, and more weakly dependent on item popularity. Moreover, if we compare the two figures we can observe that all the models perform better when predicting the most common ratings. In this sense, they are able to learn the most frequent preferences, and greater errors (bad performance) are obtained when focusing on less frequent rating values. With respect to item popularity, these differences are less conclusive. In some sense, the way in which a user rates an item does not depend on how popular the item is.

2.1 Relative Weights vs. Relative Error

Another alternative for exploring the benefits of the wMAE metric is to consider the ratio between wMAE and MAE. Denoting e_{i,j} = |p_{i,j} − r_{i,j}|, we have that

    \frac{wMAE}{MAE} = \frac{\sum_{i,j} w_{i,j} e_{i,j} / \sum_{i,j} e_{i,j}}{\sum_{i,j} w_{i,j} / N}.

Taking into account that we restrict the weights to take their values in the [0, 1] interval, the denominator might represent the average percentage of the mass of the items that is related to the dimension under consideration, whereas the numerator represents the average percentage of the error coming from this feature. So, when wMAE > MAE, the percentage of error coming from the feature is greater than its associated mass, meaning the system is not able to predict that dimension properly. When both metrics are equal, the expected error is independent of the feature.

In Figure 1-C we present the values of wMAE/MAE, where, again, we can see that there exists a dependence between the rating and the error. The error associated with the common ratings is smaller than the relative importance of this feature in the system, whereas for the less common ratings the system is not able to make good predictions, the relative error being greater than its associated weight. This situation does not hold when considering item popularity.

Figures 1-D and 1-E present a scatter plot relating the relative weights (horizontal axis) to the relative error (vertical axis) for each user in the system and for each RS used(2). Particularly, in Figure 1-D we are considering rU+ as the weighting factor. Note that, since there are 5 possible ratings, the relative weight is equal to 0.2 when all the ratings are equally probable, and its value increases with the importance of the most used ratings. In this figure we can see that the percentage of mass and the percentage of error are positively correlated, with wMAE/MAE < 1 for most of the users. Moreover, there is a trend toward better predictions for those users with higher relative mass (for example, we can see how the regression line for the LM model

(2) We have included all the users with at least 10 predictions.
[Figure 1: Using wMAE in recommendations: absolute and relative values. Panels A and B plot the absolute MAE and wMAE values of the four models (Means, LM, Slope One, SVD) under the positively correlated weightings (rU+, rS+, i+) and the negatively correlated ones (rU–, rS–, i–), respectively. Panel C plots the wMAE/MAE ratios of each model for each weighting. Panels D and E are per-user scatter plots of relative weight vs. relative error, for rU+ and i+ respectively (D includes the regression line for LM). Panel F plots the wMAE of LM, Slope One and SVD relative to the Means baseline for each weighting.]
gets further away(3) from the line y = x). In some way, we can conclude that the recommendation usefulness of the rating distribution is consistent across all the users and RS models. On the other hand, Figure 1-E considers i+ as the weights. In this case, although weights and error are positively correlated, there exist significant differences between users. This result is hidden in the global measures.

(3) The other models perform similarly, but we have decided not to include these regression lines for clarity reasons.

2.2 Relative Comparison Among Models

Although wMAE might give some information about how the error has been obtained, there is no criterion for what a good prediction is. In order to tackle this situation, we propose the use of the relative rather than the absolute error, i.e. the weighted Mean Absolute Percentage Error, wMAPE. Then, given two models, M1 and M2, the relative metric is defined as wMAE_{M1}/wMAE_{M2}. In this metric, the lower the value, the greater the improvement. Thus, if we fix the model M2 to be a simple model (such as the average rating), we obtain the wMAE values in Figure 1-F. From these values we can draw some conclusions: for instance, that LM better fits the common user preferences (rU+), whereas Slope One and SVD are more biased toward the overall rating distribution in the system (rS+). Similarly, we found that greater improvements with respect to the average ratings are obtained when focusing on less frequent ratings. Finally, with respect to item popularity, all the models obtain greater improvements when considering the most popular items, although these differences are less significant.

3. CONCLUSIONS

In this paper we have explored the use of the weighted Mean Absolute Error as a means to measure an RS's performance by focusing on a given dimension or feature, being able to uncover specific cases where a recommendation algorithm may be performing suboptimally. This is a very useful way to trace the origin of the errors found in the recommendations, and therefore useful for improving RSs, although its main drawback is that it is not absolute, as MAE is.

4. REFERENCES

[1] S. Cleger-Tamayo, J.M. Fernández-Luna and J.F. Huete. A New Criteria for Selecting Neighborhood in Memory-Based Recommender Systems. Proc. of 14th CAEPIA'11, pp. 423-432. 2011.
[2] A. Gunawardana and G. Shani. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. Journal of Machine Learning Research 10, pp. 2935-2962. 2009.
[3] J.L. Herlocker, J.A. Konstan, L.G. Terveen and J.T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), pp. 5-53. 2004.
[4] D. Lemire and A. Maclachlan. Slope One Predictors for Online Rating-Based Collaborative Filtering. Proc. of SIAM Data Mining (SDM'05). 2005.
[5] P. Massa and P. Avesani. Trust metrics in recommender systems. Computing with Social Trust, pp. 259-285. Springer, 2009.
[6] B.M. Sarwar, G. Karypis, J. Konstan and J. Riedl. Incremental SVD-Based Algorithms for Highly Scaleable Recommender Systems. 5th International Conf. on Computer and Information Technology. 2002.
[7] T. Jambor and J. Wang. Goal-driven collaborative filtering: A directional error based approach. In Proc. ECIR'2010, pp. 407-419. 2010.
[8] C.J. Willmott. Statistics for the Evaluation and Comparison of Models. Journal of Geophysical Research, 90, pp. 8995-9005. 1985.
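
To make the computations of Section 2 concrete, the following minimal sketch (not part of the paper's original evaluation) shows how MAE, the wMAE of Eq. (1), and the wMAE/MAE ratio of Section 2.1 can be computed under the rU+ weighting. The tuples in `data` are hypothetical toy ratings on the 1-5 MovieLens scale, and pr_U is estimated by simple counting over that same toy set; any real experiment would estimate it from the training data.

```python
from collections import Counter

# Toy test set: (user, item, true_rating, predicted_rating) tuples.
# All values are made up for illustration.
data = [
    ("u1", "i1", 4, 3.5), ("u1", "i2", 4, 4.5), ("u1", "i3", 1, 3.0),
    ("u2", "i1", 5, 4.0), ("u2", "i4", 5, 5.0), ("u2", "i5", 2, 4.0),
]

def mae(data):
    """Plain Mean Absolute Error over all predictions."""
    return sum(abs(p - r) for _, _, r, p in data) / len(data)

def wmae(data, weight):
    """Weighted MAE, Eq. (1): sum of w * |p - r| over the sum of the weights."""
    num = sum(weight(u, i, r) * abs(p - r) for u, i, r, p in data)
    den = sum(weight(u, i, r) for u, i, r, p in data)
    return num / den

# rU+ weighting: w_i = pr_U(r_i), the per-user empirical probability of the
# true rating, estimated here by counting over the toy data itself.
user_rating_freq = {}
for u, _, r, _ in data:
    user_rating_freq.setdefault(u, Counter())[r] += 1

def w_rU_plus(u, i, r):
    counts = user_rating_freq[u]
    return counts[r] / sum(counts.values())

m = mae(data)
wm = wmae(data, w_rU_plus)
# A ratio > 1 would mean the share of error carried by the feature exceeds
# its weight mass, i.e. the model struggles on that dimension; here the
# common ratings are predicted better, so the ratio is below 1.
print(f"MAE = {m:.3f}, wMAE(rU+) = {wm:.3f}, ratio = {wm / m:.3f}")
# → MAE = 1.000, wMAE(rU+) = 0.800, ratio = 0.800
```

Swapping in `1 - counts[r] / sum(counts.values())` gives the rU– variant, and analogous item-indexed counts give i+ and i–, so all six weightings of Section 2 fit the same `wmae(data, weight)` interface.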