The Demographics of Cool: Popularity and Recommender Performance for Different Groups of Users

Michael D. Ekstrand and Maria Soledad Pera
People and Information Research Team
Dept. of Computer Science, Boise State University, Boise, Idaho, USA
{michaelekstrand,solepera}@boisestate.edu

RecSys 2017 Poster Proceedings, August 27–31, Como, Italy.
© 2017 Copyright held by the owner/author(s).

ABSTRACT
Typical recommender evaluations treat users as a homogeneous unit. However, user subgroups often differ in their tastes, which can result more broadly in diverse recommender needs. Thus, these groups may have different degrees of satisfaction with the provided recommendations. We explore the offline top-N performance of collaborative filtering algorithms across two domains. We find that several strategies achieve higher accuracy for dominant demographic groups, thus increasing the overall performance of the strategy, without providing increased benefits for other users.

CCS CONCEPTS
• Information systems → Recommender systems;

KEYWORDS
collaborative filtering, evaluation, popularity bias

1 INTRODUCTION
Recommender system evaluation—offline and online—typically focuses on the system's effectiveness, in aggregate over the entire user population. While individual user characteristics are sometimes taken into account, as in demographic-informed recommendation, evaluations typically still aggregate over all users [8]. In this work, we connect recent work that leverages user demographics to deepen understanding of different users' satisfaction with search engines [7] with the work of Bellogin et al. [1] on measuring recommenders' performance for different items, in order to examine recommender system accuracy for users in different demographic groups in an offline setting.

This attention is necessary because, by default, the largest subgroup of users will dominate overall statistics; if other subgroups have different needs, their satisfaction will carry less weight in the final analysis. This can result in an incomplete picture of the performance of the system and obscure the need to identify how to better serve specific demographic groups. To the well-known problems of popularity bias [2] and misclassified decoys [3, 5] (a good item recommendation counted as an error because the user has yet to interact with the item in the available data), we add a third consideration: demographic bias, where the satisfaction (approximated in offline settings by top-N accuracy) of some demographic groups is weighted more heavily than others. Demographic bias also has a complex expected interaction with popularity bias: the most active and numerous users will have a greater impact on popularity than other users, so popularity bias in evaluation will further encourage the selection of algorithms that perform well on the largest subgroup's tastes.

Our central research question is this: what changes about our assessment of relative or absolute recommender effectiveness when we consider performance for different subgroups of users—basically, when we consider all subgroups' satisfaction to be equally important? Does popularity bias exacerbate demographic bias effects? How do popularity bias mitigations affect demographic bias?

2 INITIAL ANALYSIS
We answer these questions with an offline analysis using LensKit [4]¹ and two datasets that provide user demographics of some form. MovieLens-1M² [6] contains 1M 5-star ratings of 3,900 movies by 6,040 users who joined MovieLens through 2000. Each user has self-reported age, gender, occupation, and zip code. LastFM contains data on 359,347 users who played 294,015 unique artists. The main record set consists of 17,559,530 tuples of the form ⟨user, artist, playCount⟩. For most users, gender, age, country, and sign-up date are provided. We employed several classical and widely-used collaborative filtering algorithms: (1) Popular (Pop), recommending the most frequently rated or played items; (2) Item-Item (II), an item-based collaborative filter using 20 neighbors and cosine similarity; (3) User-User (UU), a user-based collaborative filter configured to use 30 neighbors and cosine similarity; and (4) FunkSVD (MF), a gradient-descent matrix factorization technique with 40 latent features and 150 training iterations per feature. Each algorithm is tagged with its variant: '-E' are explicit-feedback recommenders (applicable only to MovieLens); '-B' are implicit-feedback recommenders that only consider whether an item was rated or played, disregarding its rating value or play count; '-C' are implicit-feedback recommenders that consider the number of times an artist was played as repeated implicit feedback (LastFM only). We applied 5-fold cross-validation, using two methods: (1) LensKit's default strategy, and (2) Bellogin's UAR method [1] for neutralizing popularity bias, which works like the default except that it picks test sets of items instead of users. An initial experiment revealed that the algorithms exhibit similar behavior regardless of the metric (Recall, Mean Reciprocal Rank (MRR), or Mean Average Precision), so we report our results using MRR.

¹ Code and scripts are available at https://doi.org/10.18122/B2ND8P
² Later MovieLens datasets do not include demographic information.
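As a concrete illustration of the metric we report, the following is a minimal sketch of computing per-user reciprocal rank from a table of ranked recommendations and a table of held-out test items. It is not the released experiment scripts (see footnote 1); the DataFrame names and column layout are assumptions for illustration.

import pandas as pd

def reciprocal_ranks(recs: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Per-user reciprocal rank of the first relevant recommendation.

    recs: columns (user, item, rank), with rank starting at 1.
    test: columns (user, item), the held-out items for each user.
    MRR over a set of users is simply the mean of the returned series.
    """
    # Keep only recommended items that appear in the user's test set.
    hits = recs.merge(test, on=['user', 'item'])
    # Reciprocal rank of the earliest hit; users with no hit score 0.
    rr = 1.0 / hits.groupby('user')['rank'].min()
    return rr.reindex(test['user'].unique(), fill_value=0.0)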
Demographic distribution and its impact on evaluation. Figure 1 shows the user gender distribution, with the majority of users reporting as male. The age distribution reveals some differences: the largest block of MovieLens users belongs to the [25–35] group, whereas a plurality of LastFM users belong to the [18–24] group.³

³ For consistency, we binned LastFM users into the same groups used in MovieLens-1M.

Figure 1: User distribution based on age and gender (proportion of users per age bin and gender group, for LastFM and ML-1M).

Standard Results. Figure 2 shows the MRR achieved by each algorithm, grouped by demographic group. For each demographic characteristic, All is the accuracy achieved by averaging across all users, and Bucketed is the result of first averaging within each demographic group and then averaging the groups' results (thus giving each group equal weight, instead of each user). The results across subgroups are broadly similar for both data sets, though the All analysis tracks most closely with the dominant group. However, if a decision is to be made based on "performs best", then the small differences become non-trivial, as they will affect the final decision. One example case emerges from our analysis: on LastFM, II performs better using play counts ("-C") for some age groups, while the "-B" variant is more effective for other age groups. While we cannot conclude, based on this ongoing study, which is the right decision, our preliminary analysis demonstrates the need for further exploration from a demographic perspective.

Figure 2: Results of basic run (MRR by algorithm and demographic group, for LastFM and ML-1M).
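The distinction between the All and Bucketed aggregations can be made precise with a small sketch that continues the assumptions above: a pandas Series rr of per-user reciprocal ranks and a Series groups of per-user demographic labels. This illustrates the averaging scheme only; it is not the exact analysis code.

import pandas as pd

def all_vs_bucketed(rr: pd.Series, groups: pd.Series) -> tuple:
    """rr: per-user reciprocal ranks, indexed by user id.
    groups: demographic label (e.g. age bin or gender), indexed by user id."""
    # "All": every user counts once, so large groups dominate the average.
    all_mrr = rr.mean()
    # "Bucketed": average within each group first, then across groups,
    # so every group counts once regardless of its size.
    bucketed_mrr = rr.groupby(groups).mean().mean()
    return all_mrr, bucketed_mrr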
Popularity Bias Mitigating Results. We also seek to understand how demographic bias interacts with mitigation techniques for other issues, such as popularity bias. To that end, we performed a version of our analysis using Bellogin's UAR technique [1]. We see (in Figure 3) that several of the smaller user groups have substantially higher accuracy measures than the larger groups, particularly on age. An analysis using this method would find that the recommender is delivering better recommendations to these groups.

Figure 3: Results of UAR experiment (MRR by algorithm and demographic group, for LastFM and ML-1M).

The differences obtained using UAR and traditional evaluations show that mitigating popularity bias comes with the cost of significantly changing the distribution of measured accuracy across user subgroups. (Analysis using 1R [1] did not produce results significantly different from Figure 2.) Which evaluation strategy better reflects actual user experience is still up for debate.
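To make the contrast between the two protocols concrete, here is a rough sketch of holding out a few interactions per user (the default-style split) versus holding out the interactions of a uniformly sampled set of items (the idea behind the item-sampled split). It illustrates the distinction described above under our own simplifying assumptions; it is not Bellogin's exact UAR procedure, and the function and column names are hypothetical.

import numpy as np
import pandas as pd

def split_per_user(ratings: pd.DataFrame, n_held_out: int = 5, seed: int = 0):
    """Default-style protocol: hold out a few interactions from every user."""
    test = (ratings.groupby('user', group_keys=False)
                   .apply(lambda df: df.sample(min(n_held_out, len(df)),
                                               random_state=seed)))
    return ratings.drop(test.index), test

def split_uniform_items(ratings: pd.DataFrame, n_test_items: int = 500, seed: int = 0):
    """Item-sampled protocol: choose test *items* uniformly at random, so
    popular items are no more likely to be held out than obscure ones."""
    rng = np.random.default_rng(seed)
    items = ratings['item'].unique()
    test_items = rng.choice(items, size=min(n_test_items, len(items)), replace=False)
    in_test = ratings['item'].isin(test_items)
    return ratings[~in_test], ratings[in_test]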
3 DISCUSSION AND FUTURE WORK
Our analysis showed that, unsurprisingly, a number of recommendation strategies achieve moderately higher accuracy metric values for dominant demographic groups. This can cause an algorithm's measured performance to increase without delivering benefit to smaller subgroups of the user population. In other words, the perceived satisfaction with a recommender may not be the same for the "cool" users—those in the dominant group—as it is for those in smaller groups.

Demographic bias in accuracy metric results also has a complex interaction with mitigation strategies for other offline evaluation ailments such as popularity bias. A uniform item strategy results in disproportionately higher accuracy values for users in some smaller subgroups. Further work is needed to understand which paradigm maps most closely to actual user experience or response.

Our findings highlight the need for careful and multi-faceted consideration of recommender system behavior across a range of both users and items. Just as prior work has found that recommenders are not equally good at recommending all items, we find that recommenders are not equally good for all users, in predictable and socially relevant ways. While the full social and business ramifications of our findings have yet to be explored, we encourage researchers and practitioners to pay attention to which users receive how much benefit from a particular recommender.

ACKNOWLEDGMENTS
We thank Ion Madrazo for helping with analysis, and the People and Information Research Team (PIReT) for their support.

REFERENCES
[1] A. Bellogin. Performance Prediction and Evaluation in Recommender Systems: An Information Retrieval Perspective. PhD thesis, UAM, 2012.
[2] A. Bellogin, P. Castells, and I. Cantador. Precision-oriented evaluation of recommender systems: An algorithmic comparison. In Proc. ACM RecSys '11, 2011.
[3] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In Proc. ACM RecSys '10, pages 39–46, 2010.
[4] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl. Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit. In Proc. ACM RecSys '11, 2011.
[5] M. D. Ekstrand and V. Mahant. Sturgeon and the cool kids: Problems with top-N recommender evaluation. In Proc. FLAIRS 30. AAAI Press, 2017.
[6] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19, 2016.
[7] R. Mehrotra, A. Anderson, F. Diaz, A. Sharma, H. Wallach, and E. Yilmaz. Auditing search engines for differential satisfaction across demographics. In Proc. WWW '17 Companion, 2017.
[8] G. Shani and A. Gunawardana. Evaluating recommendation systems. In Recommender Systems Handbook, pages 257–297. Springer, 2011.