=Paper=
{{Paper
|id=Vol-2440/short4
|storemode=property
|title= Bias Disparity in Recommendation Systems
|pdfUrl=https://ceur-ws.org/Vol-2440/short4.pdf
|volume=Vol-2440
|authors=Virginia Tsintzou,Evaggelia Pitoura,Panayiotis Tsaparas
|dblpUrl=https://dblp.org/rec/conf/recsys/TsintzouPT19
}}
== Bias Disparity in Recommendation Systems ==
Virginia Tsintzou, Evaggelia Pitoura, Panayiotis Tsaparas
Department of Computer Science and Engineering, University of Ioannina
vtsintzou@cs.uoi.gr, pitoura@cs.uoi.gr, tsap@cs.uoi.gr

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.

ABSTRACT

Recommender systems have been applied successfully in a number of different domains, such as entertainment, commerce, and employment. Their success lies in their ability to exploit the collective behavior of users in order to deliver highly targeted, personalized recommendations. Given that recommenders learn from user preferences, they incorporate the different biases that users exhibit in the input data. More importantly, there are cases where recommenders may amplify such biases, leading to the phenomenon of bias disparity. In this short paper, we present a preliminary experimental study on synthetic data, where we investigate different conditions under which a recommender exhibits bias disparity, and the long-term effect of recommendations on data bias. We also consider a simple re-ranking algorithm for reducing bias disparity, and present some observations on bias disparity on real data.

1 INTRODUCTION

Recommender systems have found applications in a wide range of domains, including e-commerce, entertainment and social media, news portals, and employment sites [12]. They have proven to be extremely effective in predicting the preferences of users, filtering the available content to provide a highly personalized and targeted experience.

One of the most popular classes of recommendation systems is collaborative filtering. Collaborative Filtering (CF) uses the collective behavior of all users over all items to infer the preferences of individual users for specific items [12]. However, given the reliance of CF algorithms on user preferences, they are susceptible to biases that may appear in the input data. In this work we consider biases with respect to the preferences of specific groups of users (e.g., men and women) towards specific categories of items (e.g., different movie genres).
Bias in recommendations is not necessarily always problematic. For example, it is natural to expect gender bias when recommending clothes. However, gender bias is undesirable when recommending job postings or information content. Furthermore, we want to avoid the case where the recommender system introduces bias in the data, by amplifying existing biases and reinforcing stereotypes. We refer to this phenomenon, where there is a difference between the input and the recommendation bias, as bias disparity.

In this paper, we consider the problem of bias disparity in recommendation systems, and we make the following contributions: (a) we define notions of bias and bias disparity for recommender systems; (b) using synthetic data, we study different conditions under which bias disparity may appear, and we consider the effect of the iterative application of recommendation algorithms on the bias of the data; (c) we present some observations on bias disparity on real data, using the MovieLens 1M dataset (https://grouplens.org/datasets/movielens/1m/); (d) we consider a simple re-ranking algorithm for correcting bias disparity, and we study it experimentally.

2 RELATED WORK

The problem of algorithmic bias, and its flip side, fairness in algorithms, has attracted considerable attention in recent years [4, 5]. Most existing work focuses on classification systems, while there is limited work on recommendation systems.

One type of recommendation bias that has been considered in the literature is popularity bias [3, 6]. It has been observed that under some conditions popular items are more likely to be recommended, leading to a rich-get-richer effect, and there are some attempts to correct this bias [6, 7]. Related to this is also the quest for diversity [8], where the goal is to include different types of items in the recommendations, or to provide additional exposure for specific classes of items [1].

These notions of fairness do not take into account the presence of the different (protected) groups of users and the different item categories that we consider in this work. This setting is considered in [2], where they define two types of bias and propose a modification of the recommendation algorithm in [9] to ensure a fair output. Their work focuses on fairness, rather than bias disparity, and works with a specific algorithm. The notion of bias disparity is examined in [14], but in a classification setting. Closely related to our work is the work in [11], which considers a similar notion of bias disparity and proposes calibrated recommendations for mitigating its effect. That work assumes a single class of users, and treats users individually, rather than as a group.

Fairness in terms of correcting rating errors for specific groups of users was studied in [13] for a matrix factorization CF recommender. A similar setting is considered in [10], where they provide a general framework for defining fairness (either individual or group fairness), and a methodology for enforcing fairness by inserting "antidote data" in the dataset. A notion of fairness for tensor-based recommendations that relies on statistical parity is explored in [15].
3 MODEL

3.1 Definitions

We consider a set of n users U and a set of m items I. We are given a binary n × m matrix S, where S(u, i) = 1 if user u has selected item i, and zero otherwise. Selection may mean that user u liked post i, or that u purchased product i, or that u watched video i.

We assume that users are associated with an attribute A_U that partitions them into groups depending on their value for the attribute. For example, the attribute A_U may be the gender of the user, partitioning the users into men and women. We will typically assume that we have two groups, and that one of the groups is the protected group. Similarly, we assume that items are associated with an attribute A_I, e.g., the genre of a movie, which partitions them into categories, e.g., action and romance movies.

Given the association matrix S, we define the input preference ratio PR_S(G, C) of group G for category C as the fraction of selections from group G that are in category C. Formally:

\mathrm{PR}_S(G, C) = \frac{\sum_{u \in G} \sum_{i \in C} S(u, i)}{\sum_{u \in G} \sum_{i \in I} S(u, i)} \qquad (1)

This is essentially the conditional probability that a selection is in category C given that it comes from a user in group G.

To assess the importance of this probability, we compare it against the probability P(C) = |C|/m of selecting from category C when selecting uniformly at random. We define the bias B_S(G, C) of group G for category C as:

B_S(G, C) = \frac{\mathrm{PR}_S(G, C)}{P(C)} \qquad (2)

Bias values less than 1 denote negative bias, that is, group G on average tends to select less often from category C, while bias values greater than 1 denote positive bias, that is, group G favors category C disproportionately to its size.

We assume that the recommendation algorithm outputs for each user u a ranked list of r items R_u. The collection of all recommendations can be represented as a binary matrix R, where R(u, i) = 1 if item i is recommended for user u and zero otherwise. Given the matrix R, we can compute the output preference ratio PR_R(G, C) of group G for category C using Eq. (1), and the output bias B_R(G, C) of group G for category C using Eq. (2).

To compare the bias of a group G for a category C in the input data S and in the recommendations R, we define the bias disparity, that is, the relative change of the bias value:

BD(G, C) = \frac{B_R(G, C) - B_S(G, C)}{B_S(G, C)} \qquad (3)

Our definitions of preference ratios and bias are motivated by concepts of group proportionality and group fairness considered in the literature [4, 5].
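To make Eqs. (1)–(3) concrete, the following minimal NumPy sketch (not part of the original paper) computes the preference ratio, bias, and bias disparity from binary matrices S and R; the function names and the representation of groups/categories as index lists are illustrative choices of ours.

```python
import numpy as np

def preference_ratio(S, group, category):
    """PR_S(G, C), Eq. (1): fraction of the selections made by users in
    `group` that fall in `category`. S is a binary n x m matrix,
    `group` a list of user indices, `category` a list of item indices."""
    in_category = S[np.ix_(group, category)].sum()
    total = S[group, :].sum()
    return in_category / total

def bias(S, group, category):
    """B_S(G, C), Eq. (2): preference ratio divided by P(C) = |C| / m,
    the probability of hitting the category under uniform random selection."""
    m = S.shape[1]
    return preference_ratio(S, group, category) / (len(category) / m)

def bias_disparity(S, R, group, category):
    """BD(G, C), Eq. (3): relative change between the input bias (on S)
    and the output bias (on the binary recommendation matrix R)."""
    b_s = bias(S, group, category)
    b_r = bias(R, group, category)
    return (b_r - b_s) / b_s
```

For instance, with two equal-size categories P(C) = 0.5, so a preference ratio of 0.7 corresponds to a bias of 0.7/0.5 = 1.4, which is the factor-of-two scaling mentioned in Section 4.2.1.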
3.2 The Recommendation Algorithm

For the recommendations in our experiments, we use a user-based K-Nearest-Neighbors (UserKNN) algorithm. The UserKNN algorithm first computes, for each user u, the set N_K(u) of the K most similar users to u. For similarity, it uses the Jaccard similarity, JSim, computed on the rows of the matrix S. For user u and item i not selected by u, the algorithm computes the utility value

V(u, i) = \frac{\sum_{n \in N_K(u)} \mathrm{JSim}(u, n)\, S(n, i)}{\sum_{n \in N_K(u)} \mathrm{JSim}(u, n)} \qquad (4)

The utility value V(u, i) is the fraction of the similarity scores of the top-K most similar users to u that have selected item i. To recommend r items to a user, the r items with the highest utility values are selected.
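A minimal sketch of this UserKNN scheme, assuming the binary matrix S described in Section 3.1; it is our illustration of Eq. (4), not the authors' implementation, and the function names are ours.

```python
import numpy as np

def jaccard_matrix(S):
    """Pairwise Jaccard similarity between the rows (users) of the
    binary selection matrix S."""
    inter = (S @ S.T).astype(float)                  # |A ∩ B| for every pair of users
    rows = S.sum(axis=1)
    union = rows[:, None] + rows[None, :] - inter    # |A ∪ B|
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

def userknn_utilities(S, K=50):
    """Utility matrix V of Eq. (4): V(u, i) is the similarity-weighted
    fraction of u's K most similar users that selected item i.
    Items already selected by u are masked out."""
    sim = jaccard_matrix(S)
    np.fill_diagonal(sim, 0.0)                       # a user is not its own neighbor
    V = np.zeros(S.shape)
    for u in range(S.shape[0]):
        nbrs = np.argsort(sim[u])[-K:]               # N_K(u): the K most similar users
        w = sim[u, nbrs]
        V[u] = (w @ S[nbrs]) / max(w.sum(), 1e-12)
    V[S == 1] = -np.inf                              # recommend only unseen items
    return V

def userknn_recommend(S, K=50, r=10):
    """Binary recommendation matrix R containing the top-r items per user."""
    V = userknn_utilities(S, K)
    R = np.zeros_like(S)
    for u in range(S.shape[0]):
        R[u, np.argsort(V[u])[-r:]] = 1
    return R
```

With K in {10, 20, 50} and r = 10, this matches the configuration reported for the experiments in Section 4.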
4 BIAS DISPARITY ON SYNTHETIC DATA

In this section, we present experiments with synthetic data. Our goal is to study the conditions under which UserKNN exhibits bias disparity.

4.1 Synthetic data generation

Users are split into two groups G_1 and G_2 of size n_1 and n_2 respectively, and items are partitioned into two categories C_1 and C_2 of size m_1 and m_2 respectively. We assume that users in G_1 tend to favor items in category C_1, while users in group G_2 tend to favor items in category C_2. To quantify this preference, we give as input to the data generator two parameters ρ_1, ρ_2, where parameter ρ_i determines the preference ratio PR_S(G_i, C_i) of group G_i for category C_i. For example, ρ_1 = 0.7 means that 70% of the ratings of group G_1 are in category C_1. The datasets we create consist of 1,000 users and 1,000 items. We assume that each user selects 5% of the items in expectation, and we recommend r = 10 items per user.

We perform two different sets of experiments. In the first set, we examine the effect of the preference ratios, and in the second set, the effect of the group and category sizes. All reported values are averages over 10 experiments.
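One way to generate data matching this description is sketched below. The paper specifies only the group/category sizes, the preference ratios ρ_i, and the 5% selection density, so the sampling details here (a binomial split per user, uniform choice within each category) are our assumptions.

```python
import numpy as np

def generate_synthetic(n1=500, n2=500, m1=500, m2=500,
                       rho1=0.7, rho2=0.7, density=0.05, seed=0):
    """Binary selection matrix S for two user groups and two item categories.
    Each user makes about density * m selections; a fraction rho_i of a
    G_i user's selections lands in C_i. (Sampling details are assumed,
    not taken from the paper.)"""
    rng = np.random.default_rng(seed)
    n, m = n1 + n2, m1 + m2
    S = np.zeros((n, m), dtype=int)
    per_user = int(density * m)
    cat1, cat2 = np.arange(m1), np.arange(m1, m)
    for u in range(n):
        rho, own, other = (rho1, cat1, cat2) if u < n1 else (rho2, cat2, cat1)
        k_own = rng.binomial(per_user, rho)                     # selections in the favored category
        picks_own = rng.choice(own, size=min(k_own, len(own)), replace=False)
        picks_other = rng.choice(other, size=min(per_user - k_own, len(other)),
                                 replace=False)
        S[u, picks_own] = 1
        S[u, picks_other] = 1
    return S
```

With the defaults above, bias(S, list(range(500)), list(range(500))) from the earlier sketch should come out close to 0.7/0.5 = 1.4.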
4.2 Varying the preference ratios

In these experiments, we create datasets with equal-size groups G_1 and G_2 and equal-size item categories C_1 and C_2, and we vary the preference ratios of the groups.

4.2.1 Symmetric Preferences: In the first experiment, we assume that the two groups G_1 and G_2 have the same preference ratios by setting ρ_1 = ρ_2 = ρ, where ρ takes values from 0.5 to 1, in increments of 0.05. In Figure 1(a), we plot the output preference ratio PR_R(G_1, C_1) (equivalently, PR_R(G_2, C_2)) as a function of ρ. Note that in this experiment, bias is the preference ratio scaled by a factor of two; we report preference ratios to be more interpretable. The dashed line shows when the output ratio is equal to the input ratio and thus there is no bias disparity. We consider different values for K, the number of neighbors.

A first observation is that when the input bias is small (PR_S ≤ 0.6), the output bias decreases or stays the same. In this case, users have neighbors from both groups. For higher input bias (PR_S > 0.6), we have a sharp increase of the output bias, which reaches its peak for PR_S = 0.8. In these cases, the recommender polarizes the two groups, recommending items only from their favored category.

In Figure 1(b), we report the preference ratio for all candidate items for recommendation for each user (i.e., as if the system recommended all items with non-zero utility). Surprisingly, the candidate items are less biased even for high values of the input bias. This shows that (a) utility proportional to user similarity increases bias, since the top-r recommendations with the highest utility are significantly more biased, and (b) it is possible to reduce bias by re-ranking the candidate items.

Increasing the value of K increases the output bias. Adding neighbors increases the strength of the signal, and the algorithm discriminates better between the items in the different categories, causing it to favor the preferred category. Understanding the role of K is a subject for future study.

Figure 1: Experiment with different preference ratios, for K = 10, 20, 50. (a) PR_R, symmetric case; (b) ratio of candidate items; (c) PR_R(G_1, C_1), asymmetric case; (d) PR_R(G_2, C_2), asymmetric case.

4.2.2 Asymmetric Preferences: In this experiment, group G_1 has preference ratio ρ_1 ranging from 0.5 to 1, while G_2 is unbiased with fixed preference ratio ρ_2 = 0.5. In Figure 1, we show the recommendation preference ratio for groups G_1 (Figure 1(c)) and G_2 (Figure 1(d)) as a function of ρ_1.

We observe that the output bias of group G_1 is amplified at a rate much higher than in Figure 1(a), while group G_2 becomes biased towards category C_1. Surprisingly, the presence of the unbiased group G_2 has an amplifying effect on the bias of G_1, rather than a moderating one, more so than an oppositely-biased group. This is due to the fact that the users in the unbiased group G_2 provide a stronger signal in favor of category C_1, compared to the symmetric case where group G_2 is biased towards C_2. This reinforces the bias in favor of category C_1. As expected, the unbiased group adopts the biases of the biased group.

4.3 Varying group and category sizes

In this experiment we examine bias disparity with unbalanced groups and categories.

4.3.1 Varying Group Sizes: We first consider groups of uneven size. We set the size n_1 of G_1 to be a fraction ϕ of the number of all users n, ranging from 5% to 95%. Both groups have fixed preference ratio ρ_1 = ρ_2 = 0.7. Figure 2(a) shows the output recommendation preference ratio PR_R(G_1, C_1) as a function of ϕ. The plot of PR_R(G_2, C_2) is the mirror image of this one, so we do not report it.

We observe that for ϕ ≤ 0.3, group G_1 has negative bias disparity (PR_R(G_1, C_1) < 0.7). That is, the small group is drawn by the larger group. For medium values of ϕ in [0.35, 0.5], the bias of both groups is amplified, despite the fact that G_1 is smaller than G_2. The increase is larger for the larger group, but there is an increase for the smaller group as well.

We also experimented with the case where G_2 is unbiased. In this case, G_2 becomes biased towards C_1 even for ϕ = 0.05, while the point at which the bias disparity for G_1 becomes positive comes much earlier (ϕ ≈ 0.2). This indicates that a small biased group can have a stronger impact than a large unbiased one.

Figure 2: (a) Unbalanced group sizes; (b) unbalanced category sizes. Input preference ratio PR_S(G_i, C_i) = 0.7; curves for K = 10, 20, 50.

4.3.2 Varying Category Sizes: We now consider categories of uneven size. We set the size m_1 of C_1 to be a fraction θ of the number of items m, ranging from 10% to 90%. We assume that both groups have fixed preference ratio ρ_1 = ρ_2 = 0.7. Figure 2(b) shows the recommendation preference ratio PR_R(G_1, C_1) as a function of θ. The plot of PR_R(G_2, C_2) is again the mirror image of this one.

Note that as long as θ ≤ 0.7, group G_1 has positive bias (greater than 1) for category C_1, since its bias is equal to ρ_1/θ. However, the bias decreases as the size of the category increases. When the category size is not very large (θ ≤ 0.5), the output bias is amplified regardless of the category size. For θ > 0.7, G_1 is actually biased in favor of C_2, and this is reflected in the output. There is an interesting range [0.6, 0.7] where G_1 is positively biased towards C_1 but its bias is weak, and thus the recommendation output is drawn to category C_2 by the more biased group.

4.4 Iterative Application of Recommendations

We observed bias disparity in the output of the recommendation algorithm. However, how does this affect the bias in the data? To study this, we consider a scenario where the users accept (some of) the recommendations of the algorithm, and we study the long-term effect of the iterative application of the algorithm on the bias of the data. More precisely, at each iteration, we consider the top-r recommendations of the algorithm (r = 10) to a user u, and we normalize their utility values by the utility value of the top recommendation. We then assume that the user accepts a recommendation with probability equal to the normalized score. The accepted recommendations are added to the data, and they are fed as input to the next iteration of the recommendation algorithm.

We apply this iterative algorithm on a dataset with two equally but oppositely biased groups, as described in Section 4.2.1. The results of this iterative experiment are shown in Figure 3(a), where we plot the average preference ratio for each iteration. Iteration 0 corresponds to the input data. In our experiment, a user accepts on average 7 recommendations. For this experiment we set the number of neighbors K to 50.

We observe that even with the probabilistic acceptance of recommendations, there is a clear long-term effect of the recommendation bias. For small values of input bias, we observe a decrease, in line with the observations in Figure 1(a); for these values of bias, the recommender will result in reducing bias and smoothing out differences. The value of preference ratio 0.6 remains more or less constant, while for larger values the bias in the data increases. Therefore, for large values of bias the recommender has a reinforcing effect, which in the long term will lead to polarized groups of users.

Figure 3: The evolution of the preference ratio in the data for different input preference ratios PR_S, after 5 iterations of (a) UserKNN and (b) GULM (Section 6). Iteration 0 shows the original preference ratio of each experiment.
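A sketch of this feedback loop under the stated acceptance model; the loop structure follows the description above, while the independent per-item coin flips and the `score_fn` interface (assumed to return the utility matrix V with already-selected items masked, e.g. `userknn_utilities` from the earlier sketch) are our assumptions.

```python
import numpy as np

def run_feedback_loop(S, score_fn, iterations=5, r=10, seed=0):
    """Feedback loop of Section 4.4: at each iteration, take each user's
    top-r items by utility, normalize their utilities by the top one,
    accept each item with that probability, and add accepted items to S.
    Returns the data matrix after each iteration (iteration 0 = input)."""
    rng = np.random.default_rng(seed)
    S = S.copy()
    history = [S.copy()]
    for _ in range(iterations):
        V = score_fn(S)                           # utilities, selected items masked out
        for u in range(S.shape[0]):
            top = np.argsort(V[u])[-r:]           # top-r recommendations for user u
            v_top = V[u, top].max()
            if not np.isfinite(v_top) or v_top <= 0:
                continue
            accept_prob = V[u, top] / v_top       # normalize by the top utility
            accepted = top[rng.random(r) < accept_prob]
            S[u, accepted] = 1                    # accepted items become selections
        history.append(S.copy())
    return history
```

Each entry of the returned history can then be scored with the bias functions above to trace curves analogous to Figure 3(a).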
5 BIAS DISPARITY ON REAL DATA

In this experiment, we use the MovieLens 1M dataset. We consider as categories the genres Action and Romance, with 468 and 463 movies respectively. We extract the subset of users U that have at least 90 ratings in these categories, resulting in 1,259 users. The users in U consist of 981 males and 278 females.

In Table 1, we show the input/output bias and, in parentheses, the bias disparity for each group-category combination. The right part of the table reports these numbers when the user groups are balanced, by selecting a random sample of 278 males. We observe that males are biased in favor of Action movies, while females prefer Romance movies. The application of UserKNN increases the strong input bias for males in the output. Females are only moderately biased in favor of Romance movies; hence, their output bias is drawn towards Action items. We observe a very similar picture for the balanced data, indicating that the changes in bias are not due to the group imbalance.

Table 1: Gender bias on Action and Romance (input bias / output bias, with bias disparity in parentheses).

          Unbalanced Groups                        Balanced Groups
          Action             Romance               Action             Romance
  M       1.39/1.67 (0.2)    0.58/0.28 (-0.51)     1.40/1.66 (0.18)   0.57/0.29 (-0.49)
  F       0.97/1.14 (0.17)   1.03/0.85 (-0.17)     0.97/1.08 (0.11)   1.03/0.92 (-0.10)

6 CORRECTING BIAS DISPARITY

To address the problem of bias disparity, we consider an algorithm that performs post-processing of the recommendations. Our goal is to adjust the set of recommended items so as to ensure that there is no bias disparity. In addition, we would like the new recommendation set to have the maximum possible utility.

Abusing notation, let R* denote the set of user-item pairs produced by our recommendation algorithm, where (u, i) ∈ R* denotes that u was recommended item i. We will refer to the pair (u, i) as a recommendation. The set R* contains r recommendations for each user, thus rn recommendations in total. Let V(R*) = \sum_{(u,i) \in R*} V(u, i) denote the total utility of the recommendations in the set R*. Since R* contains for each user u the top-r items with the highest utility, R* has the maximum possible utility among all sets with r recommendations per user.

However, as we have seen in our experiments, the set R* may have high bias disparity. We will adjust the recommendations in the set R* to produce a new set of recommendations R, with r recommendations per user, with zero bias disparity. Clearly, this will come at the expense of utility. Our goal is to find the set R with the minimum utility loss.

Since we have two categories, to achieve zero bias disparity it suffices to have B_R(G_i, C_i) = B_S(G_i, C_i). Without loss of generality, assume that B_{R*}(G_i, C_i) > B_S(G_i, C_i), and let \bar{C}_i denote the category other than C_i. We decrease the output bias B_R by swapping recommendations (u, i) of category C_i with recommendations (u, j) of category \bar{C}_i. Given a target bias value, we can compute the number of swaps needed to achieve zero bias disparity. The utility loss incurred by swapping (u, i) with (u, j) is V(u, i) − V(u, j). The goal is to find the swaps with the minimum utility loss.

We present a simple and efficient greedy algorithm for this task. Let N_S denote the desired number of swaps. The algorithm starts with the set R = R* and performs N_S steps. At each step it computes a set of candidate swaps by pairing, for each user u, the lowest-ranked recommendation (u, i) in R from category C_i with the highest-ranked recommendation (u, j) not in R from category \bar{C}_i, and it performs the swap with the minimum utility loss. It is easy to show that the algorithm is optimal, that is, it achieves the minimum utility loss. We refer to this algorithm as the GULM (Group Utility Loss Minimization) algorithm.

By design, when we apply the GULM algorithm on the output of the recommendation algorithm, we eliminate bias disparity (modulo rounding errors) in the recommendations. We consider the iterative application of the recommendation algorithm, in the setting described in Section 4.4, assuming again that the probability of a recommendation being accepted depends on its utility. The results are shown in Figure 3(b). For values of the preference ratio up to 0.65, we observe that bias remains more or less constant after re-ranking. For larger values, there is some noticeable increase in the bias, albeit significantly smaller than before re-ranking. The increase is due to the fact that the recommendations introduced by GULM have a low probability of being accepted.
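A sketch of the greedy swap step as we read the description above; it is not the authors' code. In this reading, the re-ranking would be run on the users of one group at a time, with `over_cat` being that group's over-recommended category and `num_swaps` derived from the target bias value.

```python
import numpy as np

def gulm_swaps(V, R, over_cat, under_cat, num_swaps):
    """Greedy re-ranking: perform `num_swaps` swaps, each time replacing,
    for some user, their lowest-utility recommended item from `over_cat`
    with their highest-utility non-recommended item from `under_cat`,
    always choosing the swap with the smallest utility loss.
    V: utility matrix, R: binary recommendation matrix (modified in place)."""
    over_cat, under_cat = np.asarray(over_cat), np.asarray(under_cat)
    for _ in range(num_swaps):
        best = None                                      # (loss, user, drop_item, add_item)
        for u in range(R.shape[0]):
            rec_over = over_cat[R[u, over_cat] == 1]     # recommended items in over_cat
            new_under = under_cat[R[u, under_cat] == 0]  # unused items in under_cat
            if len(rec_over) == 0 or len(new_under) == 0:
                continue
            drop = rec_over[np.argmin(V[u, rec_over])]   # lowest-utility kept item
            add = new_under[np.argmax(V[u, new_under])]  # best unused item
            loss = V[u, drop] - V[u, add]
            if best is None or loss < best[0]:
                best = (loss, u, drop, add)
        if best is None:
            break
        _, u, drop, add = best
        R[u, drop], R[u, add] = 0, 1                     # perform the cheapest swap
    return R
```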
7 CONCLUSIONS

In this short paper, we performed a preliminary study of bias disparity in recommender systems, and of the conditions under which it may appear. Using synthetic data, we observed that recommendation algorithms can introduce bias disparity even for moderately biased groups. We view this analysis as a first step towards a systematic analysis of the factors that cause bias disparity. We intend to investigate more recommendation algorithms and more datasets, including the case of numerical, rather than binary, ratings. It is also interesting to examine this work in the context of other definitions of bias and fairness.

REFERENCES

[1] Alex Beutel, Ed H. Chi, Zhiyuan Cheng, Hubert Pham, and John Anderson. 2017. Beyond Globally Optimal: Focused Learning for Improved Recommendations. In Proceedings of the 26th International Conference on World Wide Web.
[2] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced Neighborhoods for Multi-sided Fairness in Recommendation. In Conference on Fairness, Accountability and Transparency, FAT.
[3] Òscar Celma and Pedro Cano. 2008. From Hits to Niches?: Or How Popular Artists Can Bias Music Recommendation and Discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition (NETFLIX). ACM.
[4] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through Awareness. In Innovations in Theoretical Computer Science.
[5] Sara Hajian, Francesco Bonchi, and Carlos Castillo. 2016. Algorithmic Bias: From Discrimination Discovery to Fairness-aware Data Mining. In ACM SIGKDD.
[6] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What Recommenders Recommend: An Analysis of Recommendation Biases and Possible Countermeasures. User Modeling and User-Adapted Interaction 25, 5 (Dec. 2015), 427–491.
[7] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2014. Correcting Popularity Bias by Enhancing Recommendation Neutrality. In Poster Proceedings of the 8th ACM Conference on Recommender Systems, RecSys 2014.
[8] Matevž Kunaver and Tomaž Požrl. 2017. Diversity in Recommender Systems: A Survey. Knowledge-Based Systems (2017).
[9] Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In 2011 IEEE 11th International Conference on Data Mining.
[10] Bashir Rastegarpanah, Krishna P. Gummadi, and Mark Crovella. 2019. Fighting Fire with Fire: Using Antidote Data to Improve Polarization and Fairness of Recommender Systems. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019. 231–239.
[11] Harald Steck. 2018. Calibrated Recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, 154–162.
[12] Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering Techniques. Advances in Artificial Intelligence (2009).
[13] Sirui Yao and Bert Huang. 2017. Beyond Parity: Fairness Objectives for Collaborative Filtering. In NIPS.
[14] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In EMNLP.
[15] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based Recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). ACM, 1153–1162.