Exploring Cross-group Discrepancies in Calibrated Popularity for Accuracy/Fairness Trade-off Optimization

OLEG LESOTA, Johannes Kepler University Linz and Linz Institute of Technology, Austria
STEFAN BRANDL, Johannes Kepler University Linz, Austria
MATTHIAS WENZEL, Johannes Kepler University Linz, Austria
ALESSANDRO B. MELCHIORRE, Johannes Kepler University Linz and Linz Institute of Technology, Austria
ELISABETH LEX, Graz University of Technology, Austria
NAVID REKABSAZ, Johannes Kepler University Linz and Linz Institute of Technology, Austria
MARKUS SCHEDL (corresponding author), Johannes Kepler University Linz and Linz Institute of Technology, Austria

Copyright 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the MORS workshop held in conjunction with the 16th ACM Conference on Recommender Systems (RecSys), 2022, in Seattle, USA.

Popularity bias is an important issue in recommender systems, as it affects end users, content creators, and content provider platforms alike. It can cause users to miss out on less popular items that would fit their preferences, prevent new content creators from finding their audience, and force providers to pay higher royalties for serving expensive popular content. Over the past years, various approaches to mitigate popularity bias in recommender systems have been proposed. Among them, post-processing methods are widely accepted due to their versatility and ease of implementation. While previous studies have investigated the effects of different post-processing techniques on the accuracy and fairness of recommendations, the influence of different algorithms on different user groups has not received much attention in this context. Addressing this research gap, we study the effect of a recent mitigation strategy, Calibrated Popularity, in conjunction with a selection of state-of-the-art recommender algorithms: BPR, ItemKNN, LightGCN, MultiVAE, and NeuMF. We show that these algorithms exhibit different characteristics in terms of the trade-off between accuracy and fairness, both within and between user groups defined by gender and by inclination towards consumption of mainstream items. Finally, we demonstrate how these discrepancies can be exploited to achieve a more effective trade-off between utility and fairness of recommender systems.

1 INTRODUCTION

Recommender systems (RSs) are ubiquitous decision support tools, assisting all kinds of users in their personal and business tasks. They help connect content creators and consumers on streaming platforms, suggest products in online stores, and even influence whether a person finds a fitting job. Considering the important role of RSs, it is crucial to monitor the societal and statistical biases they often suffer from. While not all biases are harmful (recommendation results need to be biased in the sense of personalization to match the end user's preferences), data, algorithmic, and presentation biases may lead to unfair behavior of RSs, i.e., the recommender "systematically and unfairly discriminates against certain individuals or groups of individuals in favor of others" [8]. Popularity bias denotes the tendency of some RSs to favor popular items over less popular ones and is considered a harmful phenomenon [1, 7, 13, 16]. It has been a long-studied problem in the RS community (e.g., [3, 4, 17, 20, 28]). An RS with popularity bias creates recommendation lists with highly popular items ranked on top, suppressing the exposure of long-tail items.
This often leads to low satisfaction among end users (especially those interested in niche items), unfairly limited exposure for new and niche item producers, and higher expenses for content providers, as serving popular items on online platforms in most cases entails higher royalties. To measure and capture different aspects of popularity bias, various metrics have been introduced; for an overview, see Abdollahpouri et al. [4]. In this paper, we concentrate on the user side of popularity bias.

Over the past years, researchers have proposed a multitude of bias mitigation strategies, operating at different stages of the recommendation pipeline. One can distinguish pre-, in-, and post-processing methods. Pre-processing methods act before the main RS, often applying transformations to the data the RS is trained on, in an attempt to make its output less biased. In-processing methods usually introduce additional debiasing training objectives, e.g., through adversarial training [9, 21] or regularization [26]. Post-processing methods act on the output of the RS, usually by re-ranking recommended items to satisfy a certain fairness goal. Post-processing bias mitigation techniques have the advantage of versatility: being independent of the main RS, they can work in conjunction with almost any algorithm. In addition, a number of calibration-based post-processing techniques have been shown to be effective in popularity bias mitigation for matrix factorization algorithms.

Previous studies have shown that not only do different RSs vary in the degree to which they are susceptible to popularity bias, but also different user groups suffer from it to various extents. These findings lead us to the following research questions, which we tackle in this work:

RQ1: Is the post-processing mitigation technique equally effective for all algorithms? Do all algorithms show the same character of trade-off between utility and calibration?
RQ2: Are all user groups equally affected by the mitigation procedure? Are the optimal mitigation parameters the same for all user groups?
RQ3: To what extent can the trade-off between utility and calibration be softened by using specific mitigation parameters for each user group?

To answer these questions, we pair a recent mitigation strategy, Calibrated Popularity, with an array of recommender algorithms, analyzing the mitigation effectiveness and the utility-fairness trade-off for each of them. We also consider the effect of mitigation on different user groups, and through this characterize the scoring strategy of each investigated recommendation algorithm. Finally, we conduct an experiment to evaluate the potential gains of mitigation approaches tailored specifically to each user group.

2 RELATED WORK

Many studies formalize fairness of an RS on the user level through calibration of a certain item attribute (such as genre) [24]. That is, a recommendation is considered fair only when the distribution of the attribute over the recommended list matches its distribution over some reference list (e.g., each user's consumption history or the whole list of items in the collection).
A number of studies follow this approach to investigate and enforce fairness of recommendations [5, 14]. Lesota et al. [17] study differences between the popularity distributions of consumed and recommended items for each user, tackling the problem of measuring popularity bias as miscalibration between the two. They express it in terms of the median as well as several statistical moments and similarity measures. In addition, they combine research strands on popularity bias and gender bias by analyzing how female and male listeners are affected by popularity bias. Abdollahpouri et al. [3] show that state-of-the-art movie recommendation algorithms suffer from popularity bias and introduce the delta-GAP metric to quantify the level of underrepresentation. Kowald et al. [16] reproduce these results for the music domain.

Works on bias mitigation often adopt post-processing strategies, which operate on the output of an RS, re-ranking the items to create a list that satisfies both utility and fairness objectives. A big advantage of post-processing is its flexibility to be used with almost any RS algorithm. A multitude of post-processing techniques for popularity bias mitigation have been proposed over the years. Abdollahpouri et al. [2] propose an algorithm calibrating the proportions of head and tail items from the overall popularity distribution. Zehlike et al. [25] propose FA*IR, a method for boosting the exposure of items of some protected category (of low popularity). Abdollahpouri et al. [4] take a more user-oriented approach, called Calibrated Popularity (CP), calibrating three-bin item popularity distributions between users' consumption histories and their recommendations. Klimashevskaia et al. [15] take a wide perspective on post-processing popularity bias mitigation techniques and analyze them on both platform-wide and user-preference levels. They show that CP is preferable for providing fairness on the per-user level. Bias mitigation algorithms usually allow adjusting the weight distribution between the utility and fairness objectives. da Silva et al. [5] take this idea further, experimenting with learning a personal weighting for every user to ensure proper genre diversity in the recommendation lists. To the best of our knowledge, this approach has not been adopted for popularity bias mitigation. In addition, most studies presenting mitigation techniques limit their demonstration to a narrow scope of algorithms and mainly consider the population of users as a whole. We address these limitations in our research by (1) conducting a set of bias mitigation experiments on state-of-the-art RSs of different architectures, (2) considering two ways of user grouping as well as the whole population, and (3) carrying out an experiment with learning bias mitigation weights for every group separately. We investigate two datasets: MovieLens-1M (ML-1M) [10] from the movie domain and LFM-2b [23] from the music domain.

3 METHODOLOGY

We base our study on the common assumption that consumers prefer calibrated recommendations [24], i.e., the distribution of item popularity in a user's recommendation list should match that of their interaction history.
We investigate the trade-off between popularity calibration and utility of recommendations considering different recommender algorithms, user groups, and settings of the mitigation technique.

Item Popularity. Following common practice [4, 17], we define the popularity of each item through the number of interactions with it. We distinguish Popular, Niche, and Mid categories of items. Popular items are the most interacted-with items, jointly receiving 20% of all user-item interactions. Similarly, Niche items are the least interacted-with, receiving 20% of aggregated user-item interactions. The remaining items fall into the Mid category.

User Groups. This work considers both the overall user population and specific user groups. We investigate two ways of user grouping: by users' gender (due to limitations of the considered datasets, we have to treat gender as a binary concept, male versus female) and by their inclination towards consumption of popular items. For the latter, similar to [4], we define three user groups, HighPop, MidPop, and LowPop, based on the proportion of popular items in their consumption histories. The groups are defined by sorting users in descending order with respect to the proportion of popular items they consume, and then selecting the top 20%, mid 60%, and bottom 20% of users, respectively.

Metrics. We consider the trade-off between recommenders' utility, expressed through the NDCG@10 metric, and their proneness to popularity bias. Following previous work, we define the latter on the per-user level as the Jensen-Shannon divergence between the popularity distributions of a user's already consumed items and of the top 10 recommended items. If $H_u$ and $R_u$ are the item popularity (probability) distributions of the consumption history and the recommendations of user $u$, respectively, we calculate the Jensen-Shannon divergence as

$$JSD(H_u, R_u) = \frac{1}{2}\left(\sum_c H_u(c)\,\log_2\frac{2\,H_u(c)}{H_u(c)+R_u(c)} \;+\; \sum_c R_u(c)\,\log_2\frac{2\,R_u(c)}{H_u(c)+R_u(c)}\right) \tag{1}$$

where $H_u(c)$ is the proportion of items of popularity category $c$ in the consumption history of user $u$. $JSD$ can be seen as a symmetrized version of the Kullback-Leibler divergence. Note that using $\log_2$ ensures that the value of $JSD$ is bounded between 0 and 1 [19]. We express the degree of exposure to popularity bias of a user group $g$ as $JSD_g$, defined as the average $JSD$ over all users in $g$.

Bias Mitigation Technique. Calibrated Popularity (CP) is a recent post-processing technique for popularity bias mitigation [4]. It re-ranks a recommendation list $L'_u$ of $m$ items initially recommended to each user to create a personalized popularity-aware recommendation list $L^*_u$ of $n$ items ($n \ll m$):

$$L^*_u = \underset{L_u,\,|L_u|=n}{\arg\max}\;\big[(1-\lambda)\cdot Rel(L_u) \;-\; \lambda\cdot JSD(H_u, P(L_u))\big] \tag{2}$$

where $Rel(L_u)$ is the sum of relevance scores and $P(L_u)$ the item popularity distribution of the $n$-item candidate list. To ensure consistency of the mitigation procedure across different recommenders, we re-scale the relevance scores constituting $Rel(L_u)$ to the interval [0, 1] where needed. The parameter $\lambda$ allows prioritizing between the utility (first term) and bias mitigation (second term) objectives. Similar to [4], the final recommendation list $L^*_u$ is created through greedy optimization, adding one item at a time such that each addition maximizes the objective of Equation 2.
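For illustration, the following is a minimal Python sketch of the pipeline described above: assigning items to the three popularity bins, computing the per-user JSD of Equation 1, and greedily re-ranking a candidate list according to Equation 2. It uses NumPy; the input names (`counts`, `candidates`, `hist_dist`) are our own, and the exact handling of items at the 20% bin boundaries is a simplifying assumption rather than the exact implementation used in our experiments.

```python
import numpy as np

def popularity_categories(counts):
    """Assign items to bins 0=Popular, 1=Mid, 2=Niche: the most
    interacted-with items jointly receiving 20% of all interactions
    are Popular; the least interacted-with receiving 20% are Niche."""
    order = np.argsort(-counts)                      # descending popularity
    share = np.cumsum(counts[order]) / counts.sum()  # cumulative interaction share
    cats = np.ones(len(counts), dtype=int)           # default: Mid
    cats[order[share <= 0.2]] = 0                    # head items -> Popular
    cats[order[share >= 0.8]] = 2                    # tail items -> Niche
    return cats

def bin_distribution(items, cats):
    """Three-bin popularity distribution of an item list."""
    dist = np.zeros(3)
    for i in items:
        dist[cats[i]] += 1
    return dist / max(len(items), 1)

def jsd(h, r):
    """Jensen-Shannon divergence with log base 2 (Equation 1), in [0, 1]."""
    total = 0.0
    for p, q in ((h, r), (r, h)):
        mask = p > 0                                 # 0 * log(0) := 0
        total += np.sum(p[mask] * np.log2(2 * p[mask] / (p[mask] + q[mask])))
    return 0.5 * total

def calibrated_popularity(candidates, scores, hist_dist, cats, lam, n=10):
    """Greedy CP re-ranking (Equation 2): starting from an empty list,
    repeatedly add the candidate maximizing
    (1 - lam) * Rel(L) - lam * JSD(H_u, P(L))."""
    lo, hi = scores.min(), scores.max()
    rel = {c: (s - lo) / (hi - lo + 1e-12)           # re-scale scores to [0, 1]
           for c, s in zip(candidates, scores)}
    selected, rel_sum = [], 0.0
    remaining = list(candidates)
    for _ in range(n):
        def objective(item):
            dist = bin_distribution(selected + [item], cats)
            return (1 - lam) * (rel_sum + rel[item]) - lam * jsd(hist_dist, dist)
        best = max(remaining, key=objective)
        selected.append(best)
        rel_sum += rel[best]
        remaining.remove(best)
    return selected
```

Applied to a top-100 candidate list with n = 10, setting lam = 0 reproduces the original top 10, while larger values trade relevance for calibration.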
Choosing Optimal Mitigation Parameters. To illustrate the potential gains of choosing group-specific mitigation parameters for different user groups, we introduce a way of selecting an optimal value of the parameter $\lambda$ that takes into account both utility and fairness of the recommendations. For ease of notation, we denote by $JSF(H_u, R_u) = 1 - JSD(H_u, R_u)$ a fairness measure; similar to NDCG, higher values are better. We define the optimal value of $\lambda$ for a user group $g$ as follows:

$$\lambda_g = \underset{\lambda \in [0,1]}{\arg\max}\;\; \frac{NDCG_g \cdot JSF_g}{NDCG_g + JSF_g} \tag{3}$$

In other words, for a given group $g$ we select the $\lambda$ that maximizes the harmonic mean between utility (NDCG) and fairness (JSF) for the group. The selection is done by conducting a grid search on the interval [0, 1].
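As a sketch of this selection step, the code below performs the grid search of Equation 3 for each mainstreaminess group, following the 20/60/20 split defined above. The helpers `ndcg_at_10(user, lam)` and `jsd_at_10(user, lam)`, which would evaluate a user's CP re-ranked list for a given λ, are hypothetical placeholders for the evaluation pipeline.

```python
import numpy as np

def mainstreaminess_groups(users, pop_share):
    """Split users into HighPop (top 20%), MidPop (mid 60%), and LowPop
    (bottom 20%) by the share of Popular items in their histories."""
    order = sorted(users, key=lambda u: pop_share[u], reverse=True)
    k = len(order)
    return {"HighPop": order[:int(0.2 * k)],
            "MidPop":  order[int(0.2 * k):int(0.8 * k)],
            "LowPop":  order[int(0.8 * k):]}

def optimal_lambda(group, ndcg_at_10, jsd_at_10, grid=None):
    """Grid search for the lambda maximizing the criterion of Equation 3,
    i.e., NDCG_g * JSF_g / (NDCG_g + JSF_g) with JSF_g = 1 - JSD_g."""
    grid = np.round(np.arange(0.0, 1.0, 0.1), 1) if grid is None else grid
    best_lam, best_val = 0.0, -np.inf
    for lam in grid:
        ndcg_g = np.mean([ndcg_at_10(u, lam) for u in group])   # group utility
        jsf_g = 1.0 - np.mean([jsd_at_10(u, lam) for u in group])  # group fairness
        val = ndcg_g * jsf_g / (ndcg_g + jsf_g + 1e-12)
        if val > best_val:
            best_lam, best_val = lam, val
    return best_lam
```

The λ values found on the train half of the users are then re-used for the test users of the corresponding group (see Section 4).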
4 EXPERIMENT SETUP

Datasets. We investigate two datasets, MovieLens-1M (ML-1M, https://grouplens.org/datasets/movielens/1m) in the movie domain and LFM-2b (http://www.cp.jku.at/datasets/LFM-2b) in the music domain. The former provides ratings of 6K users for almost 4K movies. The latter is a Last.fm music listening dataset, which we modify to fit our experimental setup (see the preprocessing sketch at the end of this section). First, we consider only listening events from the year 2019 of users with meta-information regarding age, gender, and country. Second, all user-item interactions with a playcount (PC) below 2 are removed to reduce the number of spurious interactions and noise. Third, we only consider tracks that were listened to by at least 5 different users (constraint 1) and users who have listened to at least 5 different tracks (constraint 2). Lastly, we treat each user-item interaction in a binary way: 1 if the user has listened to a track, 0 otherwise. Furthermore, we sample 100K tracks uniformly at random to ensure that items of different characteristics are equally likely to be included in the final subset, and then re-apply constraints 1 and 2. This ultimately results in almost 10K users retained, with a total of 10.7M listening events. See Table 1 for details.

Table 1. Statistics of the LFM-2b and MovieLens-1M datasets, broken down into the investigated user gender groups.

Dataset        Demographic   Users   Items (Tracks/Movies)   Listening Events/Interactions   Sparsity
LFM-2b         All           9,759   99,922                  10,746,088                      99.8063%
               Female        1,820   70,780                  1,856,757                       99.8359%
               Male          7,939   99,890                  8,889,331                       99.7995%
MovieLens-1M   All           6,040   3,706                   1,000,209                       95.5316%
               Female        1,709   3,481                   246,440                         96.1090%
               Male          4,331   3,671                   753,769                         95.3038%

Algorithms and Baselines. We study popular collaborative filtering algorithms (i.e., neighborhood-based, neural matrix factorization, autoencoder-based, and graph convolution networks), briefly described in the following. For consistency, we use the algorithm implementations from the RecBole framework [27] (https://recbole.io/). Bayesian Personalized Ranking (BPR) [22] adopts an optimization function that ranks the items consumed by the users according to their preferences, by defining an implicit order between pairs of items. Item k-Nearest Neighbors (ItemKNN) [6] recommends items based on item-to-item similarity; specifically, an item is recommended to a user if it is similar (in terms of ratings or interactions) to the items the user previously interacted with. Light Graph Convolution Network (LightGCN) [11] learns user and item embeddings by linearly propagating them through the user-item interaction graph, using the weighted sum of the embeddings learned at all layers as the final embedding. Neural Matrix Factorization (NeuMF) [12] builds on the basic matrix factorization approach but replaces the inner product with a neural architecture that can learn an arbitrary function from the interaction data. The Variational Autoencoder (MultiVAE) [18] estimates a probability distribution over all items, given the user's interaction vector.

Training and Evaluation. To evaluate the aforementioned algorithms, we partition the interactions of each user into train/validation/test sets with a 60-20-20 ratio split; hence, 60% of all users' interactions are used to train the algorithms. We maximize the NDCG@10 metric over the validation set. All results are reported on the test set.

Popularity Bias Mitigation. We conduct a series of tests comparing the utility and fairness of recommendation lists produced by the above-mentioned algorithms and re-ranked by the CP post-processing mitigation technique with different settings. For every algorithm (with relevance scores re-scaled to the interval [0, 1] to ensure comparability of mitigation results; see Equation 2), we re-rank the list of the top 100 recommendations to create the final list of 10 items for each user. We test ten values of the weighting parameter λ, from 0 (no mitigation) to 0.9 (weight 0.9 on the fairness objective and 0.1 on utility), with a step size of 0.1. We aim to exploit potential differences in the way various user groups respond to the popularity bias mitigation to achieve a better trade-off between utility and fairness. To this end, we split users uniformly at random into train and test sets of equal size, ensuring that all user groups are represented in both. We search for optimal values of λ for each user group and for the whole population using the criterion in Equation 3 on the train set. We then compute the new recommendation lists for the test users, applying mitigation with the weights learned for their corresponding user groups.
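The following is a minimal pandas sketch of the LFM-2b preprocessing steps described above (playcount filtering, the two 5-core constraints, and the uniform 100K-track sample). The column names (`user_id`, `track_id`, `playcount`) and the iterative re-application of the two constraints are our assumptions; the input frame is assumed to already contain only 2019 events of users with demographic metadata.

```python
import pandas as pd

def five_core(df: pd.DataFrame) -> pd.DataFrame:
    """Keep tracks with >= 5 distinct listeners (constraint 1) and users
    with >= 5 distinct tracks (constraint 2), repeating until stable,
    since enforcing one constraint can violate the other."""
    while True:
        before = len(df)
        track_users = df.groupby("track_id")["user_id"].nunique()
        df = df[df["track_id"].isin(track_users[track_users >= 5].index)]
        user_tracks = df.groupby("user_id")["track_id"].nunique()
        df = df[df["user_id"].isin(user_tracks[user_tracks >= 5].index)]
        if len(df) == before:
            return df

def preprocess_lfm(events: pd.DataFrame, n_tracks=100_000, seed=42):
    """Playcount filter -> 5-core -> uniform track sample -> 5-core -> binarize."""
    df = events[events["playcount"] >= 2]               # drop spurious interactions
    df = five_core(df)
    tracks = pd.Series(df["track_id"].unique())
    sample = tracks.sample(min(n_tracks, len(tracks)), random_state=seed)
    df = df[df["track_id"].isin(set(sample))]           # uniform 100K-track subset
    df = five_core(df)                                  # re-apply constraints 1 and 2
    return df[["user_id", "track_id"]].assign(label=1)  # binary interactions
```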
5 RESULTS

We approach RQ1 by analyzing popularity calibration over various factors, as shown in Figures 1 and 2. The figures report NDCG (utility measure) against JSD (popularity bias measure) for every recommendation algorithm and the ten values of λ, for the LFM-2b and ML-1M datasets, respectively. In the plots, the opacity of the points corresponds to the value of λ, such that the palest points show the results for λ = 0.

Fig. 1. Trade-off between utility and fairness on the LFM-2b dataset (JSD vs. NDCG@10 for BPR, ItemKNN, LightGCN, MultiVAE, and NeuMF; rows: whole population, mainstreaminess groups LowPop/MidPop/HighPop, and gender groups Male/Female). Paler points denote a smaller weight for the fairness objective (λ); the point with the highest JSD on each curve corresponds to λ = 0.

Fig. 2. Trade-off between utility and fairness on the ML-1M dataset (JSD vs. NDCG@10 for the five algorithms; rows: whole population and mainstreaminess groups LowPop/MidPop/HighPop). Paler points denote a smaller weight for the fairness objective (λ); the point with the highest JSD on each curve corresponds to λ = 0.

Let us first look at the top row of plots in each figure, which shows the results achieved on the whole population of the respective dataset. On LFM-2b, we notice differences in the behavior of the recommendation lists produced by different algorithms. In particular, KNN shows an increase in NDCG combined with an improvement in fairness at λ = 0.1. BPR and LightGCN show steady progress towards debiased results of lower utility with growing λ. At the same time, MultiVAE and NeuMF demonstrate a notably larger drop in utility over the first step (from λ = 0 to 0.1). Considering that every point on each plot corresponds to a new set of items recommended to most of the users, we can examine the quality of the top 100 recommendation lists produced by the different algorithms. In this regard, a smooth decay of NDCG and JSD signifies a better overall quality of the top 100 list, as it allows debiasing the recommendations gradually, without a sudden drop in utility. A sudden drop in both metrics, however, would mean that the achieved utility to a large degree comes from a concentration of popular items at the top, while the rest of the top 100 contains less relevant items. Considering this, we observe that on the ML-1M dataset all five recommenders show very similar behavior.

Approaching RQ2, we look at the remaining plots in Figures 1 and 2, which as before show the trade-off between utility and fairness, but here for specific user groups. The second row of plots in each figure corresponds to grouping users according to their inclination towards the consumption of popular content (mainstreaminess). The plots in the third row of Figure 1 show the results for users grouped by gender. (We do not report this grouping for ML-1M, as all five algorithms display the same behavior for both genders: a steady decrease of the bias metric with only a slight decrease in utility.) Analyzing the results according to the mainstreaminess groups for LFM-2b, we observe that initially all algorithms provide the best utility to the HighPop group, while the group most exposed to popularity bias varies from model to model.
For BPR, LightGCN, and KNN, the LowPop user group benefits from the mitigation method in terms of utility; for KNN, the same holds for the MidPop group. In most cases, the HighPop group experiences the largest drop in utility through popularity bias mitigation. These findings support our hypothesis that selecting mitigation parameters separately for each user group can improve the trade-off between utility and fairness. On the ML-1M dataset, we see that all three groups maintain a certain level of utility while steadily decreasing the bias metric, showing that on this dataset all five algorithms can successfully achieve good results in terms of bias mitigation while maintaining utility. Considering the results on gender, we do not observe a notable difference in the overall patterns between the genders.

Table 2. Results of popularity bias mitigation with different settings.

                           LFM-2b                                ML-1M
Algorithm   Metric         λ=0     λ_all   λ_pop   λ_gender      λ=0     λ_all   λ_pop   λ_gender
BPR         NDCG ↑         0.102   0.102   0.102   0.102         0.315   0.314   0.314   0.314
            JSD ↓          0.253   0.253   0.252   0.253         0.044   0.044   0.041   0.044
KNN         NDCG ↑         0.252   0.254   0.254   0.254         0.354   0.354   0.354   0.354
            JSD ↓          0.163   0.138   0.129   0.139         0.044   0.044   0.044   0.044
LightGCN    NDCG ↑         0.152   0.152   0.152   0.152         0.352   0.352   0.352   0.352
            JSD ↓          0.230   0.229   0.229   0.229         0.045   0.045   0.044   0.044
MultiVAE    NDCG ↑         0.110   0.110   0.110   0.110         0.357   0.357   0.357   0.357
            JSD ↓          0.200   0.200   0.196   0.200         0.046   0.046   0.046   0.046
NeuMF       NDCG ↑         0.050   0.050   0.049   0.050         0.277   0.273   0.275   0.275
            JSD ↓          0.358   0.358   0.292   0.358         0.043   0.016   0.020   0.024

Finally, addressing RQ3, Table 2 shows the results of the experiment with group-specific values of λ. For every algorithm and dataset, we report the utility and bias metrics under four conditions: λ = 0 (no mitigation); λ_all, where one optimal parameter value is selected for the whole population; λ_pop, where a specific optimal parameter value is selected for each popularity-inclination user group; and λ_gender, indicating the selection of a specific parameter value for each gender. We observe that, except for NeuMF on ML-1M, group-specific λs provide the best results on the bias metric and the best trade-off between utility and fairness, namely a lower value of JSD with utility staying at the same level or decreasing only slightly. Among all algorithms, LightGCN shows the lowest sensitivity to the mitigation, as its values do not notably drop in either the utility or the bias metric. All algorithms show low bias metrics without any mitigation on ML-1M; nevertheless, leveraging group-specific λ allows improving the trade-off between utility and fairness for BPR and NeuMF.

6 CONCLUSION AND FUTURE WORK

We explore the effectiveness of a post-processing popularity bias mitigation technique, Calibrated Popularity, applied to an array of state-of-the-art recommendation algorithms. We conduct experiments on two datasets from the music and movie domains, of different size and sparsity, considering how various user groups (defined by gender and mainstreaminess) are affected by bias mitigation. The larger music dataset, LFM-2b, exposes discrepancies in the behavior of the different algorithms. NeuMF and MultiVAE show a notable drop in utility and bias metrics even with light bias mitigation settings. KNN shows an increase in utility with moderate bias mitigation applied. We also show that for BPR, KNN, and LightGCN, the users least interested in popular items can, in contrast to other users, benefit in terms of utility from popularity bias mitigation. Our experiments show that different user groups respond to the mitigation differently, depending on their inclination towards the consumption of popular content, while the responses of the different gender groups are relatively similar. Finally, we conduct an experiment showing that selecting mitigation parameters individually for every user group (by interest towards popular items) leads to a better overall trade-off between utility and fairness.
In future work, other mitigation strategies as well as other criteria for selecting optimal mitigation parameters could be tested. Additional user groups could also be addressed, up to choosing individual settings for each user.

ACKNOWLEDGMENTS

This work received financial support from the Austrian Science Fund (FWF): P33526 and DFH-23; and from the State of Upper Austria and the Federal Ministry of Education, Science, and Research, through grants LIT-2020-9-SEE-113 and LIT-2021-YOU-215.

REFERENCES

[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 42–46.
[2] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2019. Managing Popularity Bias in Recommender Systems with Personalized Re-ranking. In The Thirty-Second International FLAIRS Conference.
[3] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. 2019. The Unfairness of Popularity Bias in Recommendation. In Proceedings of the Workshop on Recommendation in Multi-stakeholder Environments co-located with the 13th ACM Conference on Recommender Systems (RecSys 2019), Copenhagen, Denmark, September 20, 2019 (CEUR Workshop Proceedings, Vol. 2440). CEUR-WS.org. http://ceur-ws.org/Vol-2440/paper4.pdf
[4] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, Bamshad Mobasher, and Edward C. Malthouse. 2021. User-centered Evaluation of Popularity Bias in Recommender Systems. In Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization (UMAP 2021), Utrecht, The Netherlands, June 21-25, 2021, Judith Masthoff, Eelco Herder, Nava Tintarev, and Marko Tkalcic (Eds.). ACM, 119–129. https://doi.org/10.1145/3450613.3456821
[5] Diego Corrêa da Silva, Marcelo Garcia Manzato, and Frederico Araújo Durão. 2021. Exploiting Personalized Calibration and Metrics for Fairness Recommendation. Expert Systems with Applications 181 (2021), 115112. https://doi.org/10.1016/j.eswa.2021.115112
[6] Mukund Deshpande and George Karypis. 2004. Item-Based Top-N Recommendation Algorithms. ACM Transactions on Information Systems 22, 1 (2004), 143–177. https://doi.org/10.1145/963770.963776
[7] Michael D. Ekstrand, Mucun Tian, Ion Madrazo Azpiazu, Jennifer D. Ekstrand, Oghenemaro Anuyah, David McNeill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In?: Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Conference on Fairness, Accountability and Transparency. 172–186.
[8] Batya Friedman and Helen Nissenbaum. 1996. Bias in Computer Systems. ACM Transactions on Information Systems 14, 3 (1996), 330–347. https://doi.org/10.1145/230538.230561
[9] Christian Ganhör, David Penz, Navid Rekabsaz, Oleg Lesota, and Markus Schedl. 2022. Unlearning Protected User Attributes in Recommendations with Adversarial Training. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR '22). Association for Computing Machinery, New York, NY, USA, 2142–2147. https://doi.org/10.1145/3477495.3531820
[10] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.
[11] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. Association for Computing Machinery, New York, NY, USA, 639–648. https://doi.org/10.1145/3397271.3401063
[12] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 173–182. https://doi.org/10.1145/3038912.3052569
[13] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What Recommenders Recommend: An Analysis of Recommendation Biases and Possible Countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.
[14] Michael Jugovac, Dietmar Jannach, and Lukas Lerche. 2017. Efficient Optimization of Multiple Recommendation Quality Factors According to Individual User Tendencies. Expert Systems with Applications 81 (2017), 321–331. https://doi.org/10.1016/j.eswa.2017.03.055
[15] Anastasiia Klimashevskaia, Mehdi Elahi, Dietmar Jannach, Christoph Trattner, and Lars Skjærven. 2022. Mitigating Popularity Bias in Recommendation: Potential and Limits of Calibration Approaches. 82–90. https://doi.org/10.1007/978-3-031-09316-6_8
[16] Dominik Kowald, Markus Schedl, and Elisabeth Lex. 2020. The Unfairness of Popularity Bias in Music Recommendation: A Reproducibility Study. In Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II (Lecture Notes in Computer Science, Vol. 12036). Springer, 35–42. https://doi.org/10.1007/978-3-030-45442-5_5
[17] Oleg Lesota, Alessandro B. Melchiorre, Navid Rekabsaz, Stefan Brandl, Dominik Kowald, Elisabeth Lex, and Markus Schedl. 2021. Analyzing Item Popularity Bias of Music Recommender Systems: Are Different Genders Equally Affected?. In RecSys '21: Fifteenth ACM Conference on Recommender Systems, Amsterdam, The Netherlands, 27 September 2021 - 1 October 2021. ACM, 601–606. https://doi.org/10.1145/3460231.3478843
[18] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 689–698. https://doi.org/10.1145/3178876.3186150
[19] J. Lin. 1991. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory 37, 1 (1991), 145–151. https://doi.org/10.1109/18.61115
[20] Masoud Mansoury, Himan Abdollahpouri, Mykola Pechenizkiy, Bamshad Mobasher, and Robin Burke. 2020. Feedback Loop and Bias Amplification in Recommender Systems. In CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 2145–2148. https://doi.org/10.1145/3340531.3412152
[21] Navid Rekabsaz, Simone Kopeinik, and Markus Schedl. 2021. Societal Biases in Retrieved Contents: Measurement Framework and Adversarial Mitigation of BERT Rankers. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada). 306–316.
[22] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (Montreal, Quebec, Canada) (UAI '09). AUAI Press, Arlington, Virginia, USA, 452–461.
[23] Markus Schedl, Stefan Brandl, Oleg Lesota, Emilia Parada-Cabaleiro, David Penz, and Navid Rekabsaz. 2022. LFM-2b: A Dataset of Enriched Music Listening Events for Recommender Systems Research and Fairness Analysis. In ACM SIGIR Conference on Human Information Interaction and Retrieval (Regensburg, Germany) (CHIIR '22). Association for Computing Machinery, New York, NY, USA, 337–341. https://doi.org/10.1145/3498366.3505791
[24] Harald Steck. 2018. Calibrated Recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. ACM, 154–162. https://doi.org/10.1145/3240323.3240372
[25] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo Baeza-Yates. 2017. FA*IR: A Fair Top-k Ranking Algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore) (CIKM '17). Association for Computing Machinery, New York, NY, USA, 1569–1578. https://doi.org/10.1145/3132847.3132938
[26] George Zerveas, Navid Rekabsaz, Daniel Cohen, and Carsten Eickhoff. 2022. Mitigating Bias in Search Results Through Contextual Document Reranking and Neutrality Regularization. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR '22). Association for Computing Machinery, New York, NY, USA, 2532–2538. https://doi.org/10.1145/3477495.3531891
[27] Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Kaiyuan Li, Yushuo Chen, Yujie Lu, Hui Wang, Changxin Tian, Xingyu Pan, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. 2020. RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms. CoRR abs/2011.01731 (2020). arXiv:2011.01731 https://arxiv.org/abs/2011.01731
[28] Ziwei Zhu, Yun He, Xing Zhao, and James Caverlee. 2021. Popularity Bias in Dynamic Recommendation. In KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021. ACM, 2439–2449. https://doi.org/10.1145/3447548.3467376