Measuring the Concentration Reinforcement Bias of Recommender Systems

Panagiotis Adamopoulos, Alexander Tuzhilin, Peter Mountanos
padamopo@stern.nyu.edu, atuzhili@stern.nyu.edu, peter.mountanos@nyu.edu
Department of Information, Operations, and Management Sciences
Leonard N. Stern School of Business, New York University

Copyright is held by the author(s). RecSys 2015 Poster Proceedings, September 16–20, 2015, Vienna, Austria.

ABSTRACT
In this paper, we propose new metrics to accurately measure the concentration reinforcement of recommender systems and the enhancement of the "long tail". We also conduct a comparative analysis of various RS algorithms illustrating the usefulness of the proposed metrics.

Keywords
Dispersion; Diversity; Long Tail; Popularity Reinforcement

1. INTRODUCTION
Even though many researchers have focused on developing efficient algorithms for generating more accurate recommendations, there is increasing interest in metrics that go beyond this paradigm [1, 2] and evaluate various other properties and dimensions of recommender system (RS) algorithms, including the popularity bias and dispersion of recommendations. However, following the currently established evaluation protocols and simply evaluating the generated recommendation lists in terms of dispersion and inequality does not provide any information about the concentration reinforcement and popularity bias of the recommendations (i.e., whether popular or long-tail items are more likely to be recommended), since these metrics do not consider the prior popularity of the candidate items. To improve the current evaluation protocols of RSes by alleviating this problem, we propose new metrics that accurately measure the concentration reinforcement and "long-tail enhancement" of recommender system algorithms.

2. RELATED WORK
Several measures have been employed in prior research to quantify the concentration reinforcement and popularity bias of RSes as well as other similar concepts. These metrics include catalog coverage, aggregate diversity, and the Gini coefficient. In particular, catalog coverage measures the percentage of items for which the RS is able to make predictions [9], while aggregate diversity uses the total number of distinct items among the top-N recommendation lists across all users to measure the absolute long-tail diversity of recommendations [5]. The Gini coefficient [7] is used to measure the distributional dispersion of the number of times each item is recommended across all users; the Hoover (Robin Hood) index and the Lorenz curve are similar in spirit.

However, these metrics do not take into consideration the prior popularity of candidate items and, hence, do not provide sufficient evidence on whether the prior concentration of popularity is reinforced or alleviated by the RS. Moving in this direction, [3, 4] employ a popularity reinforcement measure $M$ to assess whether a RS follows or changes the prior popularity of items when recommendations are generated. To evaluate the concentration reinforcement bias of recommendations, [3, 4] measure the proportion of items that changed from "long-tail" in terms of prior sales (or number of positive ratings) to popular in terms of recommendation frequency as

$$M = 1 - \sum_{i=1}^{K} \pi_i \, \rho_{ii},$$

where the vector $\pi$ denotes the initial distribution of the $K$ popularity categories and $\rho_{ii}$ the probability of staying in category $i$, given that $i$ was the initial category. In [3, 4], the popularity categories, labeled as "head" and "tail", are based on the Pareto principle; hence, the "head" category contains the top 20% of items (in terms of positive ratings or recommendation frequency, respectively) and the "tail" category the remaining 80%. However, this metric of concentration reinforcement (popularity) bias entails an arbitrary selection of popularity categories. Besides, all items included in the same popularity category contribute equally to this metric, despite any differences in popularity.
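For concreteness, a minimal sketch of one way to compute this category-based measure, under our reading of [3, 4] that $\pi_i$ is the share of items initially in each category and that head/tail membership is determined by ranking items on prior popularity and on recommendation frequency, respectively; the function name and the head_share parameter are our own illustrative choices, not from [3, 4]:

```python
def popularity_reinforcement_m(s, r, head_share=0.2):
    """Category-based measure M = 1 - sum_i pi_i * rho_ii:
    items are split into a Pareto "head" (top head_share of items)
    and a "tail", once by prior popularity s and once by recommendation
    frequency r; M is then the probability that an item changes category."""
    n = len(s)
    k = max(1, round(head_share * n))                  # size of the "head"
    head_s = set(sorted(range(n), key=lambda i: -s[i])[:k])
    head_r = set(sorted(range(n), key=lambda i: -r[i])[:k])
    tail_s = set(range(n)) - head_s
    pi_head, pi_tail = k / n, (n - k) / n              # initial category shares
    rho_hh = len(head_s & head_r) / k                  # head items staying head
    rho_tt = len(tail_s - head_r) / len(tail_s)        # tail items staying tail
    return 1.0 - (pi_head * rho_hh + pi_tail * rho_tt)
```

As the section notes, any such implementation is sensitive to the arbitrary category boundary (here head_share) and ignores popularity differences among items within the same category.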
3. CONCENTRATION REINFORCEMENT
To precisely measure the concentration reinforcement (popularity) bias of RSes and alleviate the problems of the aforementioned metrics, we propose a new metric as follows:

$$CI@N = \sum_{i \in I} \left[ \frac{1}{2} \frac{s(i)}{\sum_{j \in I} s(j)} \ln\!\left( \frac{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}}{\frac{r_N(i)+1}{N \cdot |U| + |I|}} \right) + \frac{1}{2} \frac{r_N(i)}{N \cdot |U|} \ln\!\left( \frac{\frac{r_N(i)+1}{N \cdot |U| + |I|}}{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}} \right) \right],$$

where $s(i)$ is the prior popularity of item $i$ (i.e., the number of positive ratings for item $i$ in the training set or, correspondingly, the number of prior sales of item $i$), $r_N(i)$ is the number of times item $i$ is included in the generated top-$N$ recommendation lists, and $U$ and $I$ are the sets of users and items, respectively.¹ In essence, following the notion of Jensen-Shannon divergence in probability theory and statistics, the proposed metric captures the distributional divergence between the popularity of each item in terms of prior sales (or number of positive ratings) and the number of times each item is recommended across all users. Based on this metric, a score of zero denotes no change (i.e., the number of times an item is recommended is proportional to its prior popularity), whereas a (more) positive score denotes that the generated recommendations deviate (more) from the prior popularity (i.e., sales or positive ratings) of items.

¹ Another smoothed version of the proposed metric is
$$CI@N = \sum_{i \in I} \left[ \frac{1}{2} s_{\%}(i) \ln\!\left( \frac{s_{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r_N^{\%}(i)} \right) + \frac{1}{2} r_N^{\%}(i) \ln\!\left( \frac{r_N^{\%}(i)}{\frac{1}{2} s_{\%}(i) + \frac{1}{2} r_N^{\%}(i)} \right) \right],$$
where $s_{\%}(i) = \frac{s(i)}{\sum_{j \in I} s(j)}$ and $r_N^{\%}(i) = \frac{r_N(i)}{N \cdot |U|}$.

In order to measure whether the deviation of recommendations from the distribution of prior sales (or positive ratings) promotes long-tail rather than popular items, we also propose a measure of "long-tail enforcement" as follows:

$$LTI_{\lambda}@N = \frac{1}{|I|} \sum_{i \in I} \left[ \lambda \left( 1 - \frac{s(i)}{\sum_{j \in I} s(j)} \right) \ln\!\left( \frac{\frac{r_N(i)+1}{N \cdot |U| + |I|}}{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}} \right) + (1 - \lambda) \frac{s(i)}{\sum_{j \in I} s(j)} \ln\!\left( \frac{\frac{s(i)+1}{\sum_{j \in I} (s(j)+1)}}{\frac{r_N(i)+1}{N \cdot |U| + |I|}} \right) \right],$$

where $\lambda \in (0, 1)$ controls which items are considered long-tail (i.e., the percentile of popularity below which a RS should increase the frequency of recommendation of an item). In essence, the proposed metric rewards a RS for increasing the frequency of recommendations of long-tail items while penalizing it for frequently recommending already popular items.
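As a concrete reference, the following minimal sketch transcribes both formulas directly (the function names, the use of plain per-item count lists for $s$ and $r_N$, and the default $\lambda = 0.8$ are our own illustrative assumptions):

```python
import math

def ci_at_n(s, r, n_users, N):
    """CI@N: symmetrized, add-one-smoothed divergence between the
    prior-popularity counts s(i) and the top-N recommendation
    frequencies r_N(i) over the item set I."""
    num_items, S, R = len(s), sum(s), N * n_users    # |I|, sum_j s(j), N*|U|
    ci = 0.0
    for si, ri in zip(s, r):
        p = (si + 1) / (S + num_items)               # smoothed popularity share
        q = (ri + 1) / (R + num_items)               # smoothed recommendation share
        ci += 0.5 * (si / S) * math.log(p / q) + 0.5 * (ri / R) * math.log(q / p)
    return ci

def lti_at_n(s, r, n_users, N, lam=0.8):
    """LTI_lambda@N: rewards shifting recommendation frequency toward
    long-tail items and penalizes reinforcing already popular ones."""
    num_items, S, R = len(s), sum(s), N * n_users
    lti = 0.0
    for si, ri in zip(s, r):
        p = (si + 1) / (S + num_items)
        q = (ri + 1) / (R + num_items)
        w = si / S                                   # popularity share of item i
        lti += lam * (1 - w) * math.log(q / p) + (1 - lam) * w * math.log(p / q)
    return lti / num_items
```

Note that the add-one smoothing keeps both logarithms finite for items that are never rated or never recommended.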
4. EXPERIMENTAL RESULTS
To empirically illustrate the usefulness of the proposed metrics, we conduct a large number of experiments comparing various algorithms across different performance measures. The data sets we used are MovieLens 100k (ML-100k), MovieLens 1M (ML-1m), MovieLens "latest-small" (ML-ls), and FilmTrust (FT). The recommendations were produced using the algorithms of association rules (AR), item-based collaborative filtering (CF) nearest neighbors (ItemKNN), user-based CF nearest neighbors (UserKNN), a CF ensemble for ranking (RankSGD) [10], list-wise learning to rank with matrix factorization (LRMF) [12], Bayesian personalized ranking (BPR) [11], and BPR for non-uniformly sampled items (WBPR) [6], as implemented in [8].

[Figure 1: Performance (ranking) of various RS algorithms.]

Figure 1 illustrates the results of the comparative analysis of the different algorithms across various metrics. In particular, Fig. 1 shows the relative ranking in performance for each algorithm based on popular metrics of predictive accuracy and dispersion as well as the newly proposed metrics; green (red) squares indicate that the specific algorithm achieved the best (worst) relative performance among all the algorithms for the corresponding data set and metric.²

² We have reversed the scale of the Gini coefficient for easier interpretation of the results (i.e., the green color corresponds to the most uniformly distributed recommendations).

Based on the results, we can see that the proposed metrics capture different performance dimensions of an algorithm compared to the related metrics of the Gini coefficient and aggregate diversity. Comparing the performance based on the proposed concentration bias metric (CI@N) with the Gini coefficient, we see that even though an algorithm might, on aggregate, distribute the number of times each item is recommended more equally than another algorithm, it might still achieve this by deviating less from the prior popularity (i.e., number of sales or positive ratings) of each item separately (e.g., green color for the Gini coefficient and red color for concentration reinforcement).
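A toy example (illustrative numbers only, not drawn from Fig. 1) shows how the two measures can disagree, reusing ci_at_n from the sketch in Section 3:

```python
# 3 items, 10 users, top-10 lists, so N * |U| = 100 recommendations in total.
s   = [80, 10, 10]   # prior popularity: item 0 dominates
r_a = [80, 10, 10]   # algorithm A mirrors the prior distribution
r_b = [10, 10, 80]   # algorithm B: identical dispersion (same Gini), inverted

print(ci_at_n(s, r_a, n_users=10, N=10))  # 0.0  -> prior concentration preserved
print(ci_at_n(s, r_b, n_users=10, N=10))  # ~1.4 -> strong deviation from the prior
```

Both frequency vectors have the same Gini coefficient (one is a permutation of the other), yet only CI@N distinguishes the algorithm that reproduces the prior concentration from the one that departs from it.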
Nevertheless, the differences between the $LTI_{\lambda}$ performance and the other metrics (e.g., aggregate diversity) indicate that even though some algorithms might recommend fewer (more) items than others, or distribute the number of times each item is recommended less (more) equally among the recommended items, they might achieve this by frequently recommending more (fewer) long-tail items rather than more (fewer) popular items (e.g., red color for the Gini coefficient and green color for "long-tail enforcement"). Hence, the two proposed metrics should be used in combination in order to evaluate i) how much the recommendations of a RS algorithm deviate from the prior popularity of items and ii) whether this deviation occurs by promoting long-tail rather than already popular items.

5. CONCLUSIONS
We propose new metrics to accurately measure the concentration reinforcement and "long-tail enforcement" of recommender systems. The proposed metrics capture different performance dimensions of an algorithm compared to existing metrics of RSes, as they take into consideration the prior distribution of positive ratings and sales of the candidate items in order to accurately measure the effect of a RS. We also conduct a comparative analysis of various RS algorithms illustrating the usefulness of the proposed metrics.

6. REFERENCES
[1] P. Adamopoulos. Beyond rating prediction accuracy: On new perspectives in recommender systems. In RecSys. ACM, 2013.
[2] P. Adamopoulos. On discovering non-obvious recommendations: Using unexpectedness and neighborhood selection methods in collaborative filtering systems. In RecSys. ACM, 2014.
[3] P. Adamopoulos and A. Tuzhilin. Probabilistic neighborhood selection in collaborative filtering systems. Working Paper CBA-13-04, NYU, 2013. http://hdl.handle.net/2451/31988.
[4] P. Adamopoulos and A. Tuzhilin. On over-specialization and concentration bias of recommendations: Probabilistic neighborhood selection in CF systems. In RecSys. ACM, 2014.
[5] G. Adomavicius and Y. Kwon. Improving aggregate recommendation diversity using ranking-based techniques. IEEE Trans. on Knowl. and Data Eng., 24(5):896–911, 2012.
[6] Z. Gantner, L. Drumond, et al. Personalized ranking for non-uniformly sampled items. In Proceedings of KDD Cup, 2012.
[7] C. Gini. Measurement of inequality of incomes. The Economic Journal, pages 124–126, 1921.
[8] G. Guo, J. Zhang, Z. Sun, and N. Yorke-Smith. LibRec: A Java library for recommender systems. In UMAP '15, 2015.
[9] J. Herlocker, J. Konstan, et al. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1), 2004.
[10] M. Jahrer and A. Töscher. Collaborative filtering ensemble for ranking. In Proceedings of KDD Cup, 2012.
[11] S. Rendle, C. Freudenthaler, et al. BPR: Bayesian personalized ranking from implicit feedback. In UAI '09, 2009.
[12] Y. Shi, M. Larson, et al. List-wise learning to rank with matrix factorization for collaborative filtering. In RecSys '10, 2010.