=Paper=
{{Paper
|id=Vol-2440/paper2
|storemode=property
|title= Crank up the Volume: Preference Bias Amplification in Collaborative Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2440/paper2.pdf
|volume=Vol-2440
|authors=Kun Lin,Nasim Sonboli,Bamshad Mobasher,Robin Burke
|dblpUrl=https://dblp.org/rec/conf/recsys/LinSMB19
}}
== Crank up the Volume: Preference Bias Amplification in Collaborative Recommendation ==
Kun Lin, DePaul University, Chicago, USA (linkun.nicole@gmail.com)
Nasim Sonboli, University of Colorado Boulder, Boulder, USA (nasim.sonboli@colorado.edu)
Bamshad Mobasher, DePaul University, Chicago, USA (mobasher@cs.depaul.edu)
Robin Burke, University of Colorado Boulder, Boulder, USA (robin.burke@colorado.edu)

Kun Lin and Nasim Sonboli contributed equally to this research. Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.

ABSTRACT

Recommender systems are personalized: we expect the results given to a particular user to reflect that user's preferences. Some researchers have studied the notion of calibration, how well recommendations match users' stated preferences, and bias disparity, the extent to which mis-calibration affects different user groups. In this paper, we examine bias disparity over a range of different algorithms and for different item categories and demonstrate significant differences between model-based and memory-based algorithms.

KEYWORDS

algorithmic bias, bias amplification, collaborative filtering, bias disparity, calibration, fairness, recommendation algorithms

1 INTRODUCTION

Recommender systems have become ubiquitous and are increasingly influencing our daily decisions in a variety of online domains. Recently, there has been a shift of focus from achieving the best accuracy [10] in recommendation to other important measures such as diversity and novelty, as well as socially-sensitive concerns such as fairness [12, 13]. One of the key issues with which to contend is that biases in the input data (used for training predictive models) are reflected, and in some cases amplified, in the results of recommender system algorithms. This is especially important in contexts where fairness and equity matter or are required by laws and regulations, such as lending (Equal Credit Opportunity Act), education (Civil Rights Act of 1964; Education Amendments of 1972), housing (Fair Housing Act), and employment (Civil Rights Act of 1964), with similar provisions in effect in other countries.

The biases in the outputs of recommendation algorithms can be due to a variety of factors in the input data that is fed to the algorithms; as the saying goes, "garbage in, garbage out". These underlying factors include sample size disparity, limited features for protected groups, features that are proxies of demographic attributes, human factors, or skewed findings [2]. These causes are not mutually exclusive, can be present at the same time, and can result in disparate negative outcomes.

In this paper, we model bias as the preferences of users and their tendency to choose one type of item over another. In and of itself, this type of bias is not necessarily a negative phenomenon. In fact, patterns in preference bias are a key ingredient that recommendation algorithms use to construct predictive models and provide users with personalized outputs. However, in certain contexts the propagation of preference biases can be problematic. For example, in the news recommendation domain, preference biases can cause filter bubbles [19] and limit users' exposure to diversified items. And in the job recommendation and lending domains, existing biases in the input data may reflect historical societal biases against protected groups, which must be accounted for by learning systems [18].
Our main goal in this paper is to study how different collaborative filtering algorithms might propagate or amplify existing preference biases in the input data, and the different kinds of impact such disparity between the input and the output might have on users. For the purpose of this analysis, we use bias disparity, a recently introduced group-based metric [24, 26]. This metric considers biases with respect to the preferences of specific user groups, such as men or women, towards specific item categories, such as different movie genres. It evaluates and compares the preference ratio in both the input and the output data and measures the degree to which recommendation algorithms may propagate these biases, in some cases dampening them and in others amplifying them. Throughout this paper we use the notions of preference bias and preference ratio interchangeably.

Our preliminary experiments on a movie rating dataset show that different types of algorithms behave quite differently in the way they propagate preference biases in the input data. These findings may be especially important for system designers in determining the choice of algorithms and parameter settings in critical domains where the output of the system must conform to legal and ethical standards, or to prevent discriminatory behavior by the system. As far as we know, this paper is among the first works to have observed this phenomenon in recommendation algorithms.

We are specifically interested in answering the following research questions:
* RQ1: How do different recommendation algorithms propagate existing preference biases in the input data to the generated recommendation lists?
* RQ2: How does the bias disparity between the input and the output differ for different user groups (e.g., men versus women)?
* RQ3: How does bias disparity impact individual users with extreme preferences (positive or negative) with respect to particular categories of items?
2 RELATED WORK

As the authors in [3] mention, fairness can be a multi-sided notion. Recommender systems often involve multiple stakeholders, including consumers and providers [4], and fairness can be sought for these different stakeholders. In general, fairness is a system goal, as neither side has a good view of the ecosystem and the distribution of resources. Fairness for users/consumers could mean providing similar recommendations to similar users without considering their protected attributes, such as certain demographic features. Methods that seek fairness for consumers of a system fall under the category of consumer-side fairness (C-fairness). Fairness to item providers (for example, sellers on Amazon) may mean giving their items a reasonable chance of being exposed and recommended to consumers. This kind of fairness is called provider-side fairness (P-fairness).

Various metrics have been introduced for detecting model biases. The metrics presented in [25], such as absolute unfairness, value unfairness, and underestimation and overestimation unfairness, focus on the discrepancies between the predicted scores and the true scores across protected and unprotected groups and consider the results unfair if the model consistently deviates (overestimates or underestimates) from the true ratings for specific groups. These metrics capture unfairness towards consumers.

Equality of opportunity, discussed in [9], detects whether there are equal proportions of individuals from the qualified fractions of each group (equality in true positive rate). This metric can be used to detect unfairness for both consumers and providers.

Steck [22] has proposed an approach for calibrating recommender systems to reflect the various interests of users relative to their initial preference proportions. The degree of calibration is quantified using the Kullback-Leibler (KL) divergence. This metric compares the distribution over all genres of the set of movies played by the user with the same distribution in the user's recommendation list. A post-processing re-ranking algorithm is then used to adjust the degree of calibration in the recommendation list.

The authors in [5] have discussed another type of bias called popularity bias. Many e-commerce domains exhibit this kind of bias, where a small set of popular items, such as those from established sellers, may dominate recommendation lists while newly-arrived or niche items receive less attention. In this situation, the likelihood of being recommended is considerably higher for popular items than for the rest of the (long-tail) items, potentially resulting in unfair treatment of some sellers. The methods presented in [1, 14] have tried to break the feedback loop and mitigate this issue. These methods generally try to increase fairness for item providers (P-fairness) in the system by diversifying the recommendation lists of users.

The authors in [6] have looked into the influence of algorithms on the output data; they tracked the extent to which the diversity in user profiles changes in the output recommendations. [7] has also looked into the author gender distribution in user profiles in the BookCrossing (BX) dataset and compared it with that of the output recommendations. According to their results, the nearest neighbor methods propagate the biases and strengthen them, and matrix factorization methods strengthen the biases even more. Interestingly, our results for matrix factorization methods show the opposite trend, possibly indicating the different behavior of algorithms in different domains and datasets.

The work by Tsintzou et al. [24] sought to demonstrate unfairness for consumers/users by modeling bias as the preferences of users. Their proposed metric is called bias disparity and is similar in logic to the metric proposed in Steck's work. Both take a user-centric point of view and aim at group fairness, and both calculate the difference between the preference of the user in the input data and the preference predicted for the user by the recommendation algorithm. The bias disparity metric looks at these differences in a more fine-grained way, evaluating the preferences of specific user groups for specific item categories. The KL divergence used in Steck's approach measures more generally the difference in preference distributions across genres. The sign of the bias disparity, on the other hand, gives us information about how input and output biases differ relative to specific categories: negative values indicate that the bias has been reversed and positive values indicate that it has been amplified. KL divergence, in contrast, produces non-negative values and cannot differentiate between these two cases.

One of the limitations of the work of Tsintzou et al. [24] is that they perform their analysis only for K-nearest-neighbor models. In this paper, we build on their work by considering a variety of recommendation algorithms. We are also interested in understanding how bias affects the female and male user groups separately and how it might affect individual users.
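As a concrete illustration of the calibration measure by Steck [22] summarized above, the sketch below compares a user's profile genre distribution with the genre distribution of their recommendation list using KL divergence. This is a minimal sketch for orientation only: the function names, the dictionary-based genre lookup, and the smoothing constant alpha are our own illustrative assumptions, not code from that paper or from this one.

<pre>
import numpy as np

def genre_distribution(items, item_genres, genres):
    """Normalized genre counts over a list of item ids.
    item_genres maps an item id to a set of genre labels (assumed structure)."""
    counts = np.array([sum(g in item_genres[i] for i in items) for g in genres], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else np.full(len(genres), 1.0 / len(genres))

def calibration_kl(profile_items, rec_items, item_genres, genres, alpha=0.01):
    """KL(p || q~), where p is the profile genre distribution and q~ is the
    recommendation-list distribution smoothed toward p to avoid division by zero."""
    p = genre_distribution(profile_items, item_genres, genres)
    q = genre_distribution(rec_items, item_genres, genres)
    q_tilde = (1 - alpha) * q + alpha * p   # smoothing in the spirit of Steck [22]
    mask = p > 0                            # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q_tilde[mask])))
</pre>

Lower values indicate recommendations whose genre mix stays closer to the user's profile. Unlike this non-negative divergence, the bias disparity metric used in this paper looks at one category at a time and keeps the sign of the deviation.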
3 METHODOLOGY

Bias Disparity

Let U be the set of n users, I the set of m items, and S the n × m input matrix, where S(u, i) = 1 if user u has selected item i, and zero otherwise.

Let A_U be an attribute that is associated with users and partitions them into groups that have that attribute in common, such as gender. Similarly, let A_I be the attribute that is associated with items and partitions the items into categories, e.g., movie genres.

Given matrix S, the input preference ratio for user group G on item category C is the fraction of the items liked by group G that fall in category C:

PR_S(G, C) = \frac{\sum_{u \in G} \sum_{i \in C} S(u, i)}{\sum_{u \in G} \sum_{i \in I} S(u, i)}    (1)

Eq. (1) is essentially the conditional probability of selecting an item from category C given that the selection is made by a user in group G.

The bias disparity is the relative difference of the preference bias value between the input S and the output of a recommendation algorithm R, and is defined as follows:

BD(G, C) = \frac{PR_R(G, C) - PR_S(G, C)}{PR_S(G, C)}    (2)

Here we assume that PR_S(G, C) > 0 and PR_R(G, C) ≥ 0. A bias disparity of zero or near zero means that the input and output of the algorithm are almost the same with respect to the prevalence of the chosen category: the algorithm reflects the users' preferences quite closely. A negative bias disparity means that the output preference bias is less than that of the input; in other words, the preference bias towards the given category is dampened. The extreme value, BD = -1, would indicate that a category important in a user's profile is completely missing from the system's recommendations (PR_R = 0). If the bias disparity value is positive, the output preference bias towards an item category is higher than that of the input, indicating that the importance of the given category has been amplified by the algorithm.

We assume that a recommendation algorithm provides each user u with a list of r ranked items R_u. Let R be the collection of all the recommendations to all the users, represented as a binary matrix where R(u, i) = 1 if item i is recommended to user u, and zero otherwise. The overall bias disparity for a category C is obtained by averaging bias disparities across all users regardless of group. For more details on this metric, interested readers can refer to [24].

In this paper, we use the bias disparity metric on two levels: (1) group-based bias disparity, calculated based on Eq. (2) for the two user groups of women and men, and (2) general bias disparity, also calculated based on Eq. (2) but for all the users in the dataset regardless of their group membership.
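The two definitions above translate directly into code. The following is a minimal sketch, assuming binary numpy matrices S (input selections) and R (recommendations) plus boolean masks for a user group G and an item category C; the variable and function names are ours, not from the paper's implementation.

<pre>
import numpy as np

def preference_ratio(M, group_mask, category_mask):
    """Eq. (1): fraction of group G's selections that fall in category C.
    M is a binary user-item matrix; masks are boolean arrays over users / items."""
    selections_in_c = M[group_mask][:, category_mask].sum()
    selections_total = M[group_mask].sum()
    return selections_in_c / selections_total if selections_total > 0 else 0.0

def bias_disparity(S, R, group_mask, category_mask):
    """Eq. (2): relative change of the preference ratio from input S to output R."""
    pr_s = preference_ratio(S, group_mask, category_mask)
    pr_r = preference_ratio(R, group_mask, category_mask)
    return (pr_r - pr_s) / pr_s   # assumes PR_S(G, C) > 0, as stated above
</pre>

The group-based variant uses masks for the women and men groups; the general variant uses a mask selecting every user. The overall bias disparity described above can be obtained by evaluating the same formula with a mask selecting a single user and then averaging across users.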
Algorithms

The experiments were performed using the librec-auto experimentation platform [17], a Python wrapper built around the Java-based LibRec [8] recommendation library. All experiments were performed using a 5-fold cross-validation setting in which 80% of each user's rating data is used for training and the rest as the test set (LibRec's userfixed configuration).

We tested four groups of algorithms: memory-based, model-based (ranking), model-based (rating), and baseline. We selected both user-based and item-based k-nearest-neighbor methods from the memory-based category. BPR [21] and RankALS [23] were selected from the learning-to-rank category. From the rating-oriented latent factor models [16], we chose Biased Matrix Factorization (BiasedMF) [20], SVD++ [15], and Weighted Regularized Matrix Factorization (WRMF) [11]. We used a most-popular recommender as a baseline, as this algorithm would be expected to maximally amplify the popularity bias in the recommendation outputs.

For each algorithm, we tuned the parameters and picked the setting that gives the best performance in terms of normalized Discounted Cumulative Gain (nDCG) of the top 10 listed items. The nDCG values of the algorithms over the two experiments in the paper are shown in Table 1.

Table 1: nDCG values with selected parameters for the two experiments
Algorithm     Experiment 1   Experiment 2
MostPopular   0.480          0.460
ItemKNN       0.524          0.515
UserKNN       0.572          0.559
BPR           0.616          0.588
RankALS       0.446          0.374
BiasedMF      0.200          0.200
SVD++         0.167          0.239
WRMF          0.507          0.498
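The nDCG evaluation itself was done inside LibRec; for reference only, a standard binary-relevance nDCG@10 for one user can be written as below. This is an illustrative helper under our own assumptions about the data structures, not LibRec code.

<pre>
import numpy as np

def ndcg_at_k(recommended, relevant, k=10):
    """Standard binary-relevance nDCG@k.
    recommended: ranked list of item ids; relevant: set of held-out test items."""
    rec_k = list(recommended)[:k]
    gains = np.array([1.0 if item in relevant else 0.0 for item in rec_k])
    discounts = 1.0 / np.log2(np.arange(2, len(rec_k) + 2))   # ranks 1..k -> log2(2..k+1)
    dcg = float((gains * discounts).sum())
    ideal_hits = min(len(relevant), len(rec_k))
    idcg = float(discounts[:ideal_hits].sum())                 # best case: hits at the top
    return dcg / idcg if idcg > 0 else 0.0
</pre>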
Dataset

We ran our experiments on the MovieLens 1M dataset (ML, https://grouplens.org/datasets/movielens), a publicly available dataset for movie recommendation which is widely used in recommender systems experimentation. ML contains 6,040 users, 3,702 movies, and 1M ratings. The sparsity of ratings in this dataset is about 96%.

Experiment Design

In this section, we look to address these questions:
* What values of bias disparity are produced by different recommendation algorithms? (RQ1)
* Do bias disparity values differ across male and female users in the dataset? (RQ2)
* How are users with an extreme initial preference ratio affected by bias disparities? (RQ3)

We addressed these questions in three steps. Initially, we selected a subset of the ML dataset consisting of male and female user groups and two movie genres as our item groups. Then, in the first step, we separately calculated the preference ratio (Eq. 1) of males and females (user groups) on these genres and computed the corresponding bias disparity values (Eq. 2). In the second step, we calculated the preference ratios and bias disparities for our movie genres on the whole user data (without partitioning into separate user groups). In the third step, we looked into users with a zero initial preference ratio on one of the genres to see the effects of the different algorithms on bias disparity. Our goal was to determine whether input preference ratios were significantly different from the output preference ratios in the recommendations (i.e., whether bias disparity was significantly different from 0, due to the dampening or amplification of preference biases).

In the first step of the experiments, we calculated the group-based bias disparity. As bias disparity represents a form of inaccuracy (users getting results different from their interests), bias disparity differences between groups represent a form of unfairness, since the system is working better for some than for others.

In the second step of the experiment, we calculated the general bias disparity for the whole population. Comparing the bias disparity for the whole population (step 2) with that of specific user sub-groups (step 1) can help us understand how algorithms differ in terms of bias disparity across the whole user population.

We ran two sets of experiments, first with Action and Romance genre movies as our item groups, and then with Crime and Sci-Fi genre movies. More details are given with each experiment.

4 EXPERIMENTAL RESULTS

Experiment 1: Action and Romance Categories

In this experiment, we keep the number of items in the item groups approximately the same while the user group sizes are unbalanced. The Action and Romance genres are taken as item categories, with 468 and 436 movies in each group respectively. We have 278 women and 981 men for our user groups, and each user has at least 90 ratings. After filtering the dataset, we ended up with 207,002 ratings from 1,259 users on 904 items, with a sparsity of 18%, for experiment one.

As we see in Table 2, the preference ratio of male users is higher for the Action genre (≈ 0.70) than for the Romance genre (≈ 0.30), whereas female users have a more balanced preference ratio (≈ 0.50) over these two movie genres. Comparing the preference ratios of the whole population and the sub-groups (Table 2), we observe an overall tendency to prefer the Action genre over the Romance genre. This overall bias mainly comes from the preference ratio of the majority male user group.

Table 2: Input preference ratio for Action and Romance
Genre     Whole Population   Male    Female
Action    0.675              0.721   0.502
Romance   0.325              0.279   0.498
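Input preference ratios of the kind reported in Table 2 can be reproduced along the following lines. The pandas column names (userId, movieId, gender, genres), the genre-matching logic, and the placement of the 90-rating filter are assumptions about the preprocessing, not code from the paper.

<pre>
import pandas as pd

def experiment_one_preference_ratios(ratings, users, movies):
    """Input preference ratio per gender over Action vs. Romance (cf. Table 2).
    ratings: userId, movieId, rating; users: userId, gender; movies: movieId, genres."""
    action = set(movies[movies["genres"].str.contains("Action")]["movieId"])
    romance = set(movies[movies["genres"].str.contains("Romance")]["movieId"])
    subset = ratings[ratings["movieId"].isin(action | romance)]

    # keep users with at least 90 ratings over the two genres (our reading of the filter)
    counts = subset.groupby("userId").size()
    subset = subset[subset["userId"].isin(counts[counts >= 90].index)]

    subset = subset.merge(users[["userId", "gender"]], on="userId")
    subset["category"] = subset["movieId"].isin(action).map({True: "Action", False: "Romance"})

    # fraction of each group's selections falling in each category (Eq. 1)
    return subset.groupby("gender")["category"].value_counts(normalize=True)
</pre>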
Step 1: Group-based Bias Disparity. According to the results shown in Figure 1, both of the neighborhood-based methods, UserKNN and ItemKNN, increase the output preference ratio (PR) of both the male and female user groups on the Action genre, by 50% and around 20% respectively. While both of these algorithms increase the preference ratio on the Action genre, they dramatically decrease it for the Romance genre, even though the preference ratio of women over the two genres in the input data was balanced. Accordingly, as we see in Figure 2, both of these algorithms show negative bias disparities (BD) on Romance for both men and women. These results show different outcomes for the two groups because of the different input preference ratios. For the female group, the neighborhood-based algorithms induced a bias towards Action not present in the input; for the male group, the algorithms tend to perpetuate and amplify the existing biases in the input data.

Figure 1: Output Preference Ratio for Action and Romance
Figure 2: Bias disparity for Action and Romance

The matrix factorization algorithms show different tendencies. With BiasedMF, the output preference ratio is much lower than the input preference ratio for male users in the Action genre (the opposite of what we observed for the neighborhood-based methods), while the PR for the female group is approximately the same. With BiasedMF, the preference ratios of both the female and male groups are pushed close to 0.5. We thus have a negative bias disparity, as we see in Figure 2, which means that the original preference ratio is underestimated. Interestingly, this algorithm strengthens the bias disparity of both men and women on the Romance genre, which is an overestimation of their actual preference. We see a similar pattern for SVD++ as well.

WRMF, the other latent factor model, gives results inconsistent with BiasedMF and SVD++. It slightly decreases the preference ratio of women on Action and increases it for men on Action. We see the opposite trend on the Romance genre; in other words, the output preference ratio for women on Romance is slightly higher while for men it is lower.

Generally, the absolute values of the BD for the two user groups are not similar. Men have higher absolute values of BD on Romance, while women have higher absolute values of BD on Action. As we see in Figure 2, different algorithms affect women and men differently. ItemKNN affects women more than men in both genres, while UserKNN amplifies the bias more for men than for women in both genres. BiasedMF and SVD++ increase the bias more for men; WRMF increases the bias slightly more for women.

Table 3: Bias disparity absolute value sum over Action and Romance
Algorithm     Women   Men     P-value
MostPopular   2.519   0.659   4.29e-05
ItemKNN       2.100   0.994   8.89e-25
UserKNN       0.749   1.091   2.44e-19
BPR           0.285   0.678   7.79e-03
RankALS       0.368   0.306   9.99e-01
BiasedMF      1.230   2.660   1.35e-02
SVD++         0.803   2.364   6.66e-04
WRMF          0.585   0.523   1.76e-05

In this experiment, women had an almost balanced preference over Action and Romance movies, while men preferred Action movies to Romance movies. A well-calibrated algorithm would preserve these tendencies. However, under the influence of the male group, most of the recommendation algorithms provide an unbalanced recommendation list, especially for women (the minority group). BiasedMF and SVD++, however, run counter to this trend, reversing the bias disparity for both genres. The influence of men's preference for Action in the overall data is reduced, resulting in fewer unwanted Action movie recommendations for women, which is fairer for this group. These two algorithms balance out the exposure of the Action and Romance genres for both user groups.

K-nearest-neighbor methods amplify the bias significantly, and this behavior could be due to their sensitivity to popularity bias. Both of the neighborhood-based models show a trend similar to the most-popular recommender (the light blue bar). The Romance genre is less favored by the majority group (981 men vs. 278 women) in the dataset compared to the Action genre. So we end up having more neighbors from the majority group as the nearest neighbors (UserKNN), or more ratings from the majority group on a specific genre (ItemKNN), and the majority group's preferences dominate the preferences of the other group on both genres. These methods not only prioritize the preference of the majority group over the minority group, but they also amplify this bias.
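Table 3 reports, for each algorithm, an aggregate of the absolute bias disparity per group together with a p-value, but the exact aggregation and the statistical test are not spelled out in the text. The sketch below is one plausible reading under our own assumptions: the per-user form of Eq. (2), summed in absolute value over the two categories, compared between women and men with a two-sample test (Mann-Whitney U chosen here purely for illustration).

<pre>
import numpy as np
from scipy import stats

def per_user_abs_bd_sum(S, R, category_masks):
    """Sum of |bias disparity| over the given categories for each user,
    using the per-user version of Eq. (2) on binary matrices S and R."""
    sums = []
    for u in range(S.shape[0]):
        total = 0.0
        for mask in category_masks:          # e.g. [action_mask, romance_mask]
            pr_s = S[u, mask].sum() / max(S[u].sum(), 1)
            pr_r = R[u, mask].sum() / max(R[u].sum(), 1)
            if pr_s > 0:
                total += abs((pr_r - pr_s) / pr_s)
        sums.append(total)
    return np.array(sums)

def compare_groups(bd_sums, is_female):
    """Group-level summaries plus a two-sample p-value (illustrative test choice)."""
    women, men = bd_sums[is_female], bd_sums[~is_female]
    return women.mean(), men.mean(), stats.mannwhitneyu(women, men).pvalue
</pre>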
Step 2: General Bias Disparity. In Figure 1, the bars show the preference ratio in the recommendation output and the dashed line shows the input preference ratio for the corresponding category.

In general, the Action genre (with a preference ratio of 0.675) is preferred to the Romance genre (preference ratio of 0.325). As in the previous step, the two neighborhood-based methods, UserKNN and ItemKNN, both increase the general preference ratio significantly. The latent factor models (BiasedMF, SVD++, WRMF) show different effects on the preference ratio: none of the matrix factorization algorithms significantly increases the original input preference ratio in the Action genre. BiasedMF and SVD++ significantly decrease the output preference ratio in the Action genre, while WRMF keeps the output preference ratio close to the initial preference.

The Romance category has a lower input preference ratio than the Action genre, which means that in the input dataset the population on average prefers Action to Romance. The output preference ratios for this genre show the reverse pattern compared to the Action genre: the neighborhood-based algorithms decrease the preference ratio, and most of the matrix factorization algorithms do not change the preference ratio by much, except for BiasedMF and SVD++, which significantly increase it. We can see the corresponding bias disparity changes in Figure 2 as well.

Step 3: Users with Extreme Preferences. To examine extreme preference cases, we concentrated on users with very low preference ratios across the genres we studied (we excluded users that had a zero preference ratio on both genres). There were 10 men who had a zero preference ratio on the Romance genre, which means that they only watched Action movies. Figure 3 shows the preference ratio in their recommendations. Some algorithms, like UserKNN, BPR, and WRMF, recommend only Action movies, which is totally consistent with these users' initial preference. Other algorithms, including BiasedMF and SVD++, de-amplify the effects of the preference and show a more diverse recommendation set. When analyzing the preference ratio of the extreme group, the effects of some algorithms become clearer because of the consistency between the general population and the extreme group.

Figure 3: Output PR for users with extreme preferences for Action and Romance
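The extreme-preference selection in Step 3 (users whose input preference ratio for one category is zero) can be expressed compactly with the same illustrative matrices and masks used in the earlier sketches; again, the helper names are ours.

<pre>
import numpy as np

def extreme_users(S, category_mask, other_mask):
    """Users who selected nothing from one category but something from the other,
    e.g. zero Romance ratings but at least one Action rating."""
    none_in_category = S[:, category_mask].sum(axis=1) == 0
    some_in_other = S[:, other_mask].sum(axis=1) > 0
    return np.where(none_in_category & some_in_other)[0]

def output_pr_for_users(R, users, category_mask):
    """Average output preference ratio of the given users for one category (cf. Figure 3)."""
    sub = R[users]
    per_user = sub[:, category_mask].sum(axis=1) / np.maximum(sub.sum(axis=1), 1)
    return float(per_user.mean())
</pre>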
Experiment 2: Crime and Sci-Fi

In this experiment, our item groups were Crime and Sci-Fi, with 211 and 276 movies in each group respectively. The number of users in the two user groups was still unbalanced: 259 female users and 1,335 male users. All of the users had at least 50 ratings from both genres, which leaves us with 37,897 ratings from 1,594 users on 487 items. The sparsity of the dataset was around 95%.

As shown in Table 4, the preference ratios of male and female users for Crime and Sci-Fi movies are similar. Both men and women have a preference ratio of around 0.7 on Sci-Fi and around 0.3 on Crime. According to Table 4, the whole population prefers Sci-Fi movies to Crime movies, and we see a similar trend in both user groups, male and female.

Table 4: Input preference ratio for Crime and Sci-Fi
Genre    Whole Population   Male    Female
Crime    0.317              0.302   0.334
Sci-Fi   0.683              0.698   0.666

Step 1: Group-Based Bias Disparity. Overall, the group-based bias disparity is very similar to the pattern seen in the whole population. Based on the patterns shown in Figure 4, the difference between what we see in the Crime genre for men and women is minimal, and the same holds for the Sci-Fi genre. The difference in the absolute values of bias disparity between the groups is not as large as the difference we saw for Action and Romance (Figure 2), which is partly because the two groups have similar preferences over the two categories.

Neighborhood-based algorithms amplify the existing preference bias for both groups. The matrix factorization algorithms either dampen the input bias, like BiasedMF and SVD++, or do not change the input preference ratio significantly, like WRMF.

Table 5: Bias disparity absolute value sum over Crime and Sci-Fi
Algorithm     Women   Men     P-value
MostPopular   0.740   0.951   0.73
ItemKNN       1.137   0.898   0.08
UserKNN       0.818   0.792   0.64
BPR           0.357   0.400   0.54
RankALS       1.126   1.114   0.93
WRMF          0.247   0.311   0.15
BiasedMF      3.089   3.636   0.39
SVD++         2.394   2.778   0.41

Step 2: General Bias Disparity. As shown in Figure 4, the pattern for Crime and Sci-Fi over the whole population is consistent with Action and Romance. The neighborhood-based algorithms, UserKNN and ItemKNN, show an increased output preference ratio for the more preferred genre (Sci-Fi) and a decreased PR for the less preferred genre (Crime). The matrix factorization algorithms show patterns different from the neighborhood-based algorithms but very similar to experiment 1. BiasedMF and SVD++ have the most significant effects on the preference ratio, increasing the preference ratio of the less favored category and decreasing that of the more favored category. WRMF shows good calibration here. The bias disparity shown in Figure 5 is also consistent with the bias disparity shown in Figure 2 of experiment one.

Step 3: Users with Extreme Preferences. We had 37 users with a preference ratio of zero on Crime movies, meaning that they only watched Sci-Fi movies. The trends that we see in Figure 6 for this group are quite similar to Figure 3. Similarly to experiment 1, algorithms such as UserKNN, BPR, and WRMF provide recommendations well-calibrated to the users' initial preferences, whereas BiasedMF and SVD++ significantly dampen the initial preference biases.

Figure 4: Output preference ratio for Crime and Sci-Fi
Figure 5: Bias disparity for Crime and Sci-Fi
Figure 6: Output Preference Ratio for Crime and Sci-Fi of the Extreme Group
5 CONCLUSION AND FUTURE WORK

Although we focused here on a handful of the more common movie genres, some important patterns can be seen. Both of the neighborhood-based models show a similar trend towards popularity, consistent with the findings of [13]. With these models, we might expect that a dominant group would contribute more neighbors in recommendation generation and would influence predictions by virtue of its presence in these groupings. These methods not only prioritize the preference of the dominant group, but they also amplify the biases of the dominant group across all users.

In contrast to previous research on the bias amplification of matrix factorization methods [7], we observed that different matrix factorization models influence preference biases differently. SVD++ and BiasedMF both dampen the preference bias for the different movie genres for both men and women. The WRMF algorithm is well-calibrated for the Crime/Sci-Fi genres for both men and women, but its behavior is inconsistent for the Action/Romance genres.

Each of these model-based algorithms produces a low-rank approximation of the input rating data, but they do so in slightly different ways. Jannach et al. [13] found that model-based algorithms generally have less popularity bias, so it might be expected that such algorithms would not show as much bias disparity as the memory-based ones. However, further study will be required to understand the interactions between input biases and each algorithm's learning objective. Interestingly, parameter tuning of these algorithms, which produced better accuracy, did not change the bias disparity pattern.

As we have discovered in our experiments, recommendation algorithms generally distort preference biases present in the input data, and do so in sometimes unpredictable ways. Different groups of users may be treated in quite different ways as a result. Bias disparity analysis is a useful tool in understanding how aspects of the input data are reflected in an algorithm's output.

REFERENCES

[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 42–46.
[2] Solon Barocas and Andrew D Selbst. 2016. Big data's disparate impact. Calif. L. Rev. 104 (2016), 671.
[3] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced neighborhoods for multi-sided fairness in recommendation. In Conference on Fairness, Accountability and Transparency. 202–214.
[4] Robin D Burke, Himan Abdollahpouri, Bamshad Mobasher, and Trinadh Gupta. 2016. Towards Multi-Stakeholder Utility Evaluation of Recommender Systems. In UMAP (Extended Proceedings).
[5] Òscar Celma and Pedro Cano. 2008. From hits to niches?: or how popular artists can bias music recommendation and discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition. ACM, 5.
[6] Sushma Channamsetty and Michael D Ekstrand. 2017. Recommender response to diversity and popularity bias in user profiles. In The Thirtieth International Flairs Conference.
[7] Michael D Ekstrand, Mucun Tian, Mohammed R Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring author gender in book rating and recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 242–250.
[8] Guibing Guo, Jie Zhang, Zhu Sun, and Neil Yorke-Smith. 2015. LibRec: A Java Library for Recommender Systems. In UMAP Workshops, Vol. 4.
[9] Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems. 3315–3323.
[10] Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
[11] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In ICDM, Vol. 8. Citeseer, 263–272.
[12] Neil Hurley and Mi Zhang. 2011. Novelty and diversity in top-n recommendation: analysis and evaluation. ACM Transactions on Internet Technology (TOIT) 10, 4 (2011), 14.
[13] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.
[14] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2014. Correcting Popularity Bias by Enhancing Recommendation Neutrality. In RecSys Posters.
[15] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[16] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
[17] Masoud Mansoury, Robin Burke, Aldo Ordonez-Gauger, and Xavier Sepulveda. 2018. Automating recommender systems experimentation with librec-auto. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 500–501.
[18] Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.
[19] Eli Pariser. 2011. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. Penguin.
[20] Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, Vol. 2007. 5–8.
[21] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
[22] Harald Steck. 2018. Calibrated recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 154–162.
[23] Gábor Takács and Domonkos Tikk. 2012. Alternating least squares for personalized ranking. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 83–90.
[24] Virginia Tsintzou, Evaggelia Pitoura, and Panayiotis Tsaparas. 2018. Bias Disparity in Recommendation Systems. arXiv preprint arXiv:1811.01461 (2018).
[25] Sirui Yao and Bert Huang. 2017. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems. 2921–2930.
[26] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457 (2017).