          Measuring the Concentration Reinforcement Bias of
                       Recommender Systems
            Panagiotis Adamopoulos   Alexander Tuzhilin                                            Peter Mountanos
            padamopo@stern.nyu.edu atuzhili@stern.nyu.edu                                      peter.mountanos@nyu.edu
Department of Information, Operations, and Management Sciences
Leonard N. Stern School of Business, New York University

Copyright is held by the author(s). RecSys 2015 Poster Proceedings, September 16–20, 2015, Vienna, Austria.

ABSTRACT
In this paper, we propose new metrics to accurately measure the concentration reinforcement of recommender systems and the enhancement of the “long tail”. We also conduct a comparative analysis of various RS algorithms illustrating the usefulness of the proposed metrics.

Keywords
Dispersion; Diversity; Long Tail; Popularity Reinforcement

1. INTRODUCTION
Even though many researchers have focused on developing efficient algorithms for generating more accurate recommendations, there is increasing interest in metrics that go beyond this paradigm [1, 2] and evaluate various other properties and dimensions of recommender system (RS) algorithms, including the popularity bias and dispersion of recommendations. However, following the currently established evaluation protocols and simply evaluating the generated recommendation lists in terms of dispersion and inequality of recommendations does not provide any information about the concentration reinforcement and popularity bias of the recommendations (i.e., whether popular or long-tail items are more likely to be recommended), since these metrics do not consider the prior popularity of the candidate items. Focusing on improving the current evaluation protocols of RSes by alleviating this problem, we propose new metrics to accurately measure the concentration reinforcement and “long-tail enhancement” of recommender system algorithms.

2. RELATED WORK
Several measures have been employed in prior research to measure the concentration reinforcement and popularity bias of RSes, as well as other similar concepts. These metrics include catalog coverage, aggregate diversity, and the Gini coefficient. In particular, catalog coverage measures the percentage of items for which the RS is able to make predictions [9], while aggregate diversity uses the total number of distinct items among the top-N recommendation lists across all users to measure the absolute long-tail diversity of recommendations [5]. The Gini coefficient [7] is used to measure the distributional dispersion of the number of times each item is recommended across all users; similar measures are the Hoover (Robin Hood) index and the Lorenz curve.
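
For concreteness, the following is a minimal sketch of how these dispersion measures can be computed from per-user top-N lists; the function names and dictionary-based inputs are of our own choosing, not taken from the cited works:

```python
# A minimal sketch of the dispersion measures above, assuming recommendations
# are per-user top-N lists and `counts` covers every candidate item (zero for
# items that are never recommended).

def aggregate_diversity(rec_lists):
    """Total number of distinct items across all top-N lists [5]."""
    return len({item for items in rec_lists.values() for item in items})

def gini_coefficient(counts):
    """Gini coefficient [7] of the recommendation-frequency distribution."""
    values = sorted(counts.values())
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over the ascending-ordered values.
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Toy usage: two users with top-2 lists over a four-item catalog.
recs = {"u1": ["a", "b"], "u2": ["a", "c"]}
counts = {i: 0 for i in ["a", "b", "c", "d"]}
for items in recs.values():
    for item in items:
        counts[item] += 1
print(aggregate_diversity(recs), round(gini_coefficient(counts), 3))  # 3 0.375
```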

However, these metrics do not take into consideration the prior popularity of candidate items and, hence, do not provide sufficient evidence on whether the prior concentration of popularity is reinforced or alleviated by the RS. Moving in this direction, [3, 4] employ a popularity reinforcement measure M to assess whether a RS follows or changes the prior popularity of items when recommendations are generated. To evaluate the concentration reinforcement bias of recommendations, [3, 4] measure the proportion of items that changed from “long-tail” in terms of prior sales (or number of positive ratings) to popular in terms of recommendation frequency as M = 1 - \sum_{i=1}^{K} \pi_i \rho_{ii}, where the vector \pi denotes the initial distribution over the K popularity categories and \rho_{ii} the probability of staying in category i, given that i was the initial category. In [3, 4], the popularity categories, labeled “head” and “tail”, are based on the Pareto principle; hence, the “head” category contains the top 20% of items (in terms of positive ratings or recommendation frequency, respectively) and the “tail” category the remaining 80%. However, this metric of concentration reinforcement (popularity) bias entails an arbitrary selection of popularity categories. Moreover, all items included in the same popularity category contribute equally to this metric, despite any differences in popularity.
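
As a rough illustration, a sketch of this categorical measure under our reading of [3, 4] is given below; the helper names are ours, and we assume \pi_i is the share of items that start in category i (the cited works may define the initial distribution differently):

```python
# A rough sketch of M = 1 - sum_i pi_i * rho_ii with Pareto-based "head"/"tail"
# categories. Assumption (ours): pi_i is the share of items initially in
# category i; [3, 4] may define the initial distribution differently.
# Both count dictionaries should cover every candidate item (zero if unused).

def head_items(counts, head_share=0.2):
    """Items in the top `head_share` of the ranking induced by `counts`."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    cutoff = max(1, int(head_share * len(ranked)))
    return set(ranked[:cutoff])

def popularity_reinforcement(prior_counts, rec_counts):
    items = set(prior_counts)
    prior_head = head_items(prior_counts)
    rec_head = head_items(rec_counts)
    m = 1.0
    for members, stayed in (
        (prior_head, prior_head & rec_head),                    # head -> head
        (items - prior_head, (items - prior_head) - rec_head),  # tail -> tail
    ):
        if members:
            pi = len(members) / len(items)    # initial share of the category
            rho = len(stayed) / len(members)  # probability of staying in it
            m -= pi * rho
    return m
```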

3. CONCENTRATION REINFORCEMENT
To precisely measure the concentration reinforcement (popularity) bias of RSes and alleviate the problems of the aforementioned metrics, we propose a new metric as follows:

CI@N = \sum_{i \in I} \left[ \frac{1}{2}\, \frac{s(i)}{\sum_{j \in I} s(j)}\, \ln \frac{(s(i)+1)\,/\sum_{j \in I}(s(j)+1)}{(r^N(i)+1)\,/\,(N \cdot |U| + |I|)} + \frac{1}{2}\, \frac{r^N(i)}{N \cdot |U|}\, \ln \frac{(r^N(i)+1)\,/\,(N \cdot |U| + |I|)}{(s(i)+1)\,/\sum_{j \in I}(s(j)+1)} \right],
where s(i) is the prior popularity of item i (i.e., the number of positive ratings for item i in the training set or, correspondingly, the number of prior sales of item i), r^N(i) is the number of times item i is included in the generated top-N recommendation lists, and U and I are the sets of users and items, respectively.¹ In essence, following the notion of the Jensen-Shannon divergence in probability theory and statistics, the proposed metric captures the distributional divergence between the popularity of each item in terms of prior sales (or number of positive ratings) and the number of times each item is recommended across all users. Based on this metric, a score of zero denotes no change (i.e., the number of times an item is recommended is proportional to its prior popularity), whereas a (more) positive score denotes that the generated recommendations deviate (more) from the prior popularity (i.e., sales or positive ratings) of items.

¹ Another smoothed version of the proposed metric is: CI@N = \sum_{i \in I} \left[ \frac{1}{2}\, s_{\%}(i) \ln \frac{s_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r^N_{\%}(i)} + \frac{1}{2}\, r^N_{\%}(i) \ln \frac{r^N_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r^N_{\%}(i)} \right], where s_{\%}(i) = s(i)/\sum_{j \in I} s(j) and r^N_{\%}(i) = r^N(i)/(N \cdot |U|).
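
To make the computation concrete, the following is a minimal sketch of CI@N exactly as defined above; the function name and the dictionary-based input conventions are of our own choosing:

```python
import math

# A minimal sketch of CI@N as defined above. `prior` maps item -> s(i) and
# `rec` maps item -> r^N(i); `n` is the list length N and `n_users` is |U|.
# The +1 smoothing follows the definition in the text.

def concentration_index(prior, rec, n, n_users):
    items = list(prior)
    s_total = sum(prior.values())
    s_smooth = s_total + len(items)        # sum_j (s(j) + 1)
    r_smooth = n * n_users + len(items)    # N*|U| + |I|
    ci = 0.0
    for i in items:
        p = prior[i] / s_total             # unsmoothed prior share
        q = rec.get(i, 0) / (n * n_users)  # unsmoothed recommendation share
        p1 = (prior[i] + 1) / s_smooth     # smoothed prior share
        q1 = (rec.get(i, 0) + 1) / r_smooth
        ci += 0.5 * p * math.log(p1 / q1) + 0.5 * q * math.log(q1 / p1)
    return ci

# Toy usage: three items, five users receiving top-2 lists.
prior = {"a": 8, "b": 1, "c": 1}
rec = {"a": 4, "b": 4, "c": 2}
print(round(concentration_index(prior, rec, n=2, n_users=5), 4))
```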

In order to measure whether the deviation of recommendations from the distribution of prior sales (or positive ratings) promotes long-tail rather than popular items, we also propose a measure of “long-tail enforcement” as follows:
LTI_\lambda@N = \frac{1}{|I|} \sum_{i \in I} \left[ \lambda \left( 1 - \frac{s(i)}{\sum_{j \in I} s(j)} \right) \ln \frac{(r^N(i)+1)\,/\,(N \cdot |U| + |I|)}{(s(i)+1)\,/\sum_{j \in I}(s(j)+1)} + (1 - \lambda)\, \frac{s(i)}{\sum_{j \in I} s(j)}\, \ln \frac{(s(i)+1)\,/\sum_{j \in I}(s(j)+1)}{(r^N(i)+1)\,/\,(N \cdot |U| + |I|)} \right],

where \lambda \in (0, 1) controls which items are considered long-tail (i.e., the percentile of popularity below which a RS should increase the frequency of recommendation of an item). In essence, the proposed metric rewards a RS for increasing the frequency of recommendation of long-tail items while penalizing it for frequently recommending already popular items.
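
Analogously, a sketch of LTI_\lambda@N under the same input conventions as the CI@N sketch above (again with names of our own choosing):

```python
import math

# A sketch of LTI_lambda@N as defined above, using the same inputs and
# smoothing as the CI@N sketch: `prior` maps item -> s(i), `rec` maps
# item -> r^N(i), `n` is N, `n_users` is |U|, and `lam` is lambda.

def long_tail_index(prior, rec, n, n_users, lam=0.5):
    items = list(prior)
    s_total = sum(prior.values())
    s_smooth = s_total + len(items)        # sum_j (s(j) + 1)
    r_smooth = n * n_users + len(items)    # N*|U| + |I|
    lti = 0.0
    for i in items:
        p = prior[i] / s_total             # prior popularity share
        p1 = (prior[i] + 1) / s_smooth     # smoothed prior share
        q1 = (rec.get(i, 0) + 1) / r_smooth
        # Reward promoting unpopular items; penalize promoting popular ones.
        lti += (lam * (1.0 - p) * math.log(q1 / p1)
                + (1.0 - lam) * p * math.log(p1 / q1))
    return lti / len(items)
```

Under this sketch, positive values arise when long-tail items are recommended more, and popular items less, than their prior-popularity shares would suggest.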

4.    EXPERIMENTAL RESULTS
To empirically illustrate the usefulness of the proposed metrics, we conduct a large number of experiments comparing various algorithms across different performance measures. The data sets used are MovieLens 100k (ML-100k), MovieLens 1M (ML-1m), MovieLens “latest-small” (ML-ls), and FilmTrust (FT). The recommendations were produced using the algorithms of association rules (AR), item-based collaborative filtering (CF) nearest neighbors (ItemKNN), user-based CF nearest neighbors (UserKNN), CF ensemble for ranking (RankSGD) [10], list-wise learning to rank with matrix factorization (LRMF) [12], Bayesian personalized ranking (BPR) [11], and BPR for non-uniformly sampled items (WBPR) [6], as implemented in [8].

[Figure 1: Performance (ranking) of various RS algorithms.]
Figure 1 illustrates the results of the comparative analysis of the different algorithms across various metrics. In particular, Fig. 1 shows the relative ranking in performance for each algorithm based on popular metrics of predictive accuracy and dispersion as well as the newly proposed metrics; green (red) squares indicate that the specific algorithm achieved the best (worst) relative performance among all the algorithms for the corresponding dataset and metric.²

Based on the results, we can see that the proposed metrics capture different performance dimensions of an algorithm compared to the related metrics of the Gini coefficient and aggregate diversity. Comparing performance based on the proposed concentration bias metric (CI@N) with the Gini coefficient, we see that even though an algorithm might, in aggregate, distribute the number of times each item is recommended more equally than another algorithm, it might still achieve this while deviating less from the prior popularity (i.e., number of sales or positive ratings) of each item separately (e.g., green for the Gini coefficient but red for concentration reinforcement). Furthermore, the differences between the LTI_\lambda performance and the other metrics (e.g., aggregate diversity) indicate that even though some algorithms might recommend fewer (more) items than others, or distribute the number of times each item is recommended less (more) equally among the recommended items, they might achieve this by frequently recommending more (fewer) long-tail items rather than more (fewer) popular items (e.g., red for the Gini coefficient but green for “long-tail enforcement”). Hence, the two proposed metrics should be used in combination in order to evaluate i) how much the recommendations of a RS algorithm deviate from the prior popularity of items and ii) whether this deviation occurs by promoting long-tail rather than already popular items.

² We have reversed the scale of the Gini coefficient for easier interpretation of the results (i.e., green corresponds to the most uniformly distributed recommendations).

5. CONCLUSIONS
We propose new metrics to accurately measure the concentration reinforcement and “long-tail enforcement” of recommender systems. The proposed metrics capture performance dimensions of an algorithm different from those of existing RS metrics, as they take into consideration the prior distribution of positive ratings and sales of the candidate items in order to accurately measure the effect of a RS. We also conduct a comparative analysis of various RS algorithms illustrating the usefulness of the proposed metrics.

6. REFERENCES
[1] P. Adamopoulos. Beyond rating prediction accuracy: On new perspectives in recommender systems. In RecSys. ACM, 2013.
[2] P. Adamopoulos. On discovering non-obvious recommendations: Using unexpectedness and neighborhood selection methods in collaborative filtering systems. In RecSys. ACM, 2014.
[3] P. Adamopoulos and A. Tuzhilin. Probabilistic neighborhood selection in collaborative filtering systems. Working Paper CBA-13-04, NYU, 2013. http://hdl.handle.net/2451/31988.
[4] P. Adamopoulos and A. Tuzhilin. On over-specialization and concentration bias of recommendations: Probabilistic neighborhood selection in CF systems. In RecSys. ACM, 2014.
[5] G. Adomavicius and Y. Kwon. Improving aggregate recommendation diversity using ranking-based techniques. IEEE Trans. on Knowl. and Data Eng., 24(5):896–911, 2012.
[6] Z. Gantner, L. Drumond, et al. Personalized ranking for non-uniformly sampled items. In Proceedings of KDD Cup, 2012.
[7] C. Gini. Measurement of inequality of incomes. The Economic Journal, pages 124–126, 1921.
[8] G. Guo, J. Zhang, Z. Sun, and N. Yorke-Smith. LibRec: A Java library for recommender systems. In UMAP '15, 2015.
[9] J. Herlocker, J. Konstan, et al. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1), 2004.
[10] M. Jahrer and A. Töscher. Collaborative filtering ensemble for ranking. In Proceedings of the KDD Cup competition, 2012.
[11] S. Rendle, C. Freudenthaler, et al. BPR: Bayesian personalized ranking from implicit feedback. In UAI '09, 2009.
[12] Y. Shi, M. Larson, et al. List-wise learning to rank with matrix factorization for collaborative filtering. In RecSys '10, 2010.