Crank up the volume: preference bias amplification in collaborative recommendation∗

Kun Lin†, DePaul University, Chicago, USA, linkun.nicole@gmail.com
Nasim Sonboli†, University of Colorado Boulder, Boulder, USA, nasim.sonboli@colorado.edu
Bamshad Mobasher, DePaul University, Chicago, USA, mobasher@cs.depaul.edu
Robin Burke, University of Colorado Boulder, Boulder, USA, robin.burke@colorado.edu

∗ Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.
† Both authors contributed equally to this research.

ABSTRACT
Recommender systems are personalized: we expect the results given to a particular user to reflect that user's preferences. Some researchers have studied the notion of calibration, how well recommendations match users' stated preferences, and bias disparity, the extent to which mis-calibration affects different user groups. In this paper, we examine bias disparity over a range of different algorithms and for different item categories and demonstrate significant differences between model-based and memory-based algorithms.

KEYWORDS
algorithmic bias, bias amplification, collaborative filtering, bias disparity, calibration, fairness, recommendation algorithms

1 INTRODUCTION
Recommender systems have become ubiquitous and increasingly influence our daily decisions in a variety of online domains. Recently, there has been a shift of focus from achieving the best accuracy [10] in recommendation to other important measures such as diversity and novelty, as well as socially-sensitive concerns such as fairness [12, 13]. One of the key issues with which to contend is that biases in the input data (used for training predictive models) are reflected, and in some cases amplified, in the results of recommender system algorithms. This is especially important in contexts where fairness and equity matter or are required by laws and regulations, such as in lending (Equal Credit Opportunity Act), education (Civil Rights Act of 1964; Education Amendments of 1972), housing (Fair Housing Act), and employment (Civil Rights Act of 1964), with similar provisions in effect in other countries.

The biases in the outputs of recommendation algorithms can be due to a variety of factors in the input data that is fed to the algorithms. As the saying goes: "garbage in, garbage out". These underlying factors include sample size disparity, limited features for protected groups, features that are proxies of demographic attributes, human factors, or skewed findings [2]. These causes are not mutually exclusive, can be present at the same time, and can result in disparate negative outcomes.

In this paper, we model bias as the preferences of users and their tendency to choose one type of item over another. In and of itself, this type of bias is not necessarily a negative phenomenon. In fact, patterns in preference bias are a key ingredient that recommendation algorithms use to construct predictive models and provide users with personalized outputs. However, in certain contexts the propagation of preference biases can be problematic. For example, in the news recommendation domain, preference biases can cause filter bubbles [19] and limit the exposure of users to diversified items. And in the job recommendation and lending domains, existing biases in the input data may reflect historical societal biases against protected groups, which must be accounted for by learning systems [18].

Our main goal in this paper is to study how different collaborative filtering algorithms might propagate or amplify existing preference biases in the input data, and the different kinds of impact such disparity between input and output might have on users. For the purpose of this analysis, we use bias disparity, a recently introduced group-based metric [24, 26]. This metric considers biases with respect to the preferences of specific user groups, such as men or women, towards specific item categories, such as different movie genres. It evaluates and compares the preference ratio in both the input and the output data and measures the degree to which recommendation algorithms may propagate these biases, in some cases dampening them and in others amplifying them.


Throughout this paper we use the notions of preference bias and preference ratio interchangeably.

Our preliminary experiments on a movie rating dataset show that different types of algorithms behave quite differently in the way in which they propagate preference biases in the input data. These findings may be especially important for system designers in determining the choice of algorithms and parameter settings in critical domains where the output of the system must conform to legal and ethical standards, or where discriminatory behavior by the system must be prevented. As far as we know, this paper is among the first works to have observed this phenomenon in recommendation algorithms.

We are specifically interested in answering the following research questions:

    • RQ1 How do different recommendation algorithms propagate existing preference biases in the input data to the generated recommendation lists?
    • RQ2 How does the bias disparity between the input and the output differ for different user groups (e.g., men versus women)?
    • RQ3 How does bias disparity impact individual users with extreme preferences (positive or negative) with respect to particular categories of items?

2 RELATED WORK
As the authors in [3] point out, fairness can be a multi-sided notion. Recommender systems often involve multiple stakeholders, including consumers and providers [4], and fairness can be sought for these different stakeholders. In general, fairness is a system goal, as neither side has a good view of the ecosystem and the distribution of resources. Fairness for users/consumers could mean providing similar recommendations to similar users without considering their protected attributes, such as certain demographic features. Methods that seek fairness for consumers of a system fall under the category of consumer-side fairness (C-fairness). Fairness to item providers (for example, sellers on Amazon) may mean providing their items a reasonable chance of being exposed/recommended to consumers. This kind of fairness is called provider-side fairness (P-fairness).

Various metrics have been introduced for detecting model biases. The metrics presented in [25], such as absolute unfairness, value unfairness, and underestimation and overestimation unfairness, focus on the discrepancies between the predicted scores and the true scores across protected and unprotected groups, and consider the results to be unfair if the model consistently deviates (overestimates or underestimates) from the true ratings for specific groups. These metrics detect unfairness towards consumers.

Equality of opportunity, discussed in [9], detects whether there are equal proportions of individuals from the qualified fractions of each group (equality in true positive rate). This metric can be used to detect unfairness for both consumers and providers.

Steck [22] has proposed an approach for calibrating recommender systems to reflect the various interests of users relative to their initial preference proportions. The degree of calibration is quantified using the Kullback-Leibler (KL) divergence. This metric compares the distribution over genres of the set of movies played by a user with the same distribution in that user's recommendation list. A post-processing re-ranking algorithm is then used to adjust the degree of calibration in the recommendation list.

The authors in [5] have discussed another type of bias called popularity bias. Many e-commerce domains exhibit this kind of bias, where a small set of popular items, such as those from established sellers, may dominate recommendation lists, while newly-arrived or niche items receive less attention. In this situation, the likelihood of being recommended will be considerably higher for popular items than for the rest of the (long-tail) items, potentially resulting in unfair treatment of some sellers. The methods presented in [1, 14] have tried to break this feedback loop and mitigate the issue. These methods generally try to increase fairness for item providers (P-fairness) by diversifying users' recommendation lists.

The authors in [6] have looked into the influence of algorithms on the output data; they tracked the extent to which the diversity in user profiles changes in the output recommendations. [7] has also looked into the author gender distribution in user profiles in the BookCrossing dataset (BX) and compared it with that of the output recommendations. According to their results, the nearest neighbor methods propagate and strengthen the biases, and matrix factorization methods strengthen the biases even more. Interestingly, our results for matrix factorization methods show the opposite trend, possibly indicating different behavior of the algorithms in different domains and datasets.

The work by Tsintzou et al. [24] sought to demonstrate unfairness for consumers/users by modeling bias as the preferences of users. Their proposed metric is called bias disparity, and it is similar in logic to the metric proposed in Steck's work. Both have a user-centric point of view and aim at group fairness, and both calculate the difference between the preference of the user in the input data and the preference predicted for the user by the recommendation algorithm. The bias disparity metric looks at these differences in a more fine-grained way, evaluating the preferences of specific user groups for specific item categories. The KL divergence used in Steck's approach measures more generally the difference in preference distributions across genres. The sign of the bias disparity, on the other hand, gives us information about how input and output biases differ relative to specific categories: negative values indicate that the bias has been reversed and positive values indicate that it has been amplified. KL divergence, in contrast, produces non-negative values and cannot differentiate between these two cases.


One of the limitations of the work of Tsintzou et al. [24] is that they perform their analysis only for k-nearest-neighbor models. In this paper, we build on their work by considering a variety of recommendation algorithms. We are also interested in understanding how bias affects female and male user groups separately and how it might affect individual users.

3 METHODOLOGY
Bias Disparity
Let U be the set of n users, I be the set of m items, and S be the n × m input matrix, where S(u, i) = 1 if user u has selected item i, and zero otherwise.

Let A_U be an attribute that is associated with users and partitions them into groups that have that attribute in common, such as gender. Similarly, let A_I be the attribute that is associated with items and partitions the items into categories, e.g., movie genres.

Given matrix S, the input preference ratio for user group G on item category C is the fraction of the items liked by group G that fall in category C:

    PR_S(G, C) = \frac{\sum_{u \in G} \sum_{i \in C} S(u, i)}{\sum_{u \in G} \sum_{i \in I} S(u, i)}    (1)

Eq. (1) is essentially the conditional probability of selecting an item from category C given that the selection is made by a user in group G.

The bias disparity is the relative difference of the preference bias value between the input S and the output of a recommendation algorithm R, and is defined as follows:

    BD(G, C) = \frac{PR_R(G, C) - PR_S(G, C)}{PR_S(G, C)}    (2)

We assume that a recommendation algorithm provides each user u with a list of r ranked items R_u. Let R be the collection of all the recommendations to all users, represented as a binary matrix where R(u, i) = 1 if item i is recommended to user u, and zero otherwise. The overall bias disparity for a category C is obtained by averaging bias disparities across all users regardless of group. For more details on this metric, interested readers can refer to [24].

In this paper, we use the bias disparity metric on two levels: (1) group-based bias disparity, calculated based on Eq. (2) separately for the two user groups of women and men, and (2) general bias disparity, also calculated based on Eq. (2) but over all the users in the dataset regardless of their group membership.

Here we assume that PR_S(G, C) > 0 and PR_R ≥ 0. A bias disparity of zero or near zero means that the input and output of the algorithm are almost the same with respect to the prevalence of the chosen category: the algorithm reflects the users' preferences quite closely. A negative bias disparity means that the output preference bias is less than that of the input; in other words, the preference bias towards the given category is dampened. The extreme value, BD = −1, would indicate that a category important in a user's profile is completely missing from the system's recommendations (PR_R = 0). If the bias disparity value is positive, the output preference bias towards an item category is higher than that of the input, indicating that the importance of the given category has been amplified by the algorithm.
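To make the metric concrete, the following minimal Python/NumPy sketch computes the preference ratio of Eq. (1) and the bias disparity of Eq. (2) from binary selection and recommendation matrices. The names (S, R, group_mask, category_mask) and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def preference_ratio(M, group_mask, category_mask):
    """Eq. (1): share of the group's selections in binary matrix M
    that fall inside the given item category."""
    group_rows = M[group_mask]                    # selections by users in G
    in_category = group_rows[:, category_mask].sum()
    total = group_rows.sum()
    return in_category / total if total > 0 else 0.0

def bias_disparity(S, R, group_mask, category_mask):
    """Eq. (2): relative change of the preference ratio between
    input matrix S and recommendation matrix R (assumes PR_S > 0)."""
    pr_s = preference_ratio(S, group_mask, category_mask)
    pr_r = preference_ratio(R, group_mask, category_mask)
    return (pr_r - pr_s) / pr_s

# Toy example: 3 users x 4 items; items 0-1 form category C.
S = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 0]])
R = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0]])
women = np.array([True, True, False])            # hypothetical group split
action = np.array([True, True, False, False])    # hypothetical category
print(bias_disparity(S, R, women, action))       # 0.6: PR rises from 0.5 to 0.8
```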
Algorithms
The experiments were performed using the librec-auto experimentation platform [17], which is a Python wrapper built around the Java-based LibRec [8] recommendation library. All experiments were performed using a 5-fold cross-validation setting in which 80% of each user's rating data is used for training and the rest as the test data (LibRec's userfixed configuration).

We tested four groups of algorithms: memory-based, model-based (ranking), model-based (rating), and a baseline. We selected both user-based and item-based k-nearest-neighbor methods from the memory-based category. BPR [21] and RankALS [23] were selected from the learning-to-rank category. From the rating-oriented latent factor models [16], we chose Biased Matrix Factorization (BiasedMF) [20], SVD++ [15], and Weighted Regularized Matrix Factorization (WRMF) [11]. We used a most-popular recommender as a baseline, as this algorithm would be expected to maximally amplify the popularity bias in the recommendation outputs.

For each algorithm, we tuned the parameters and picked the setting that gives the best performance in terms of normalized Discounted Cumulative Gain (nDCG) over the top 10 listed items. The nDCG values of the algorithms for the two experiments in the paper are shown in Table 1.

    Algorithm      Experiment 1    Experiment 2
    MostPopular    0.480           0.460
    ItemKNN        0.524           0.515
    UserKNN        0.572           0.559
    BPR            0.616           0.588
    RankALS        0.446           0.374
    BiasedMF       0.200           0.200
    SVD++          0.167           0.239
    WRMF           0.507           0.498
    Table 1: nDCG values with selected parameters for the two experiments
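For reference, the sketch below shows one common way to compute nDCG@10 with binary relevance, the criterion used above for parameter selection. LibRec's internal implementation may differ in details (e.g., gain definition or log base), so this is only an illustration.

```python
import math

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """nDCG@k with binary relevance: DCG of the ranked list divided by
    the DCG of an ideal ranking that places all relevant items first."""
    dcg = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            dcg += 1.0 / math.log2(rank + 1)
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: items 3 and 7 are the user's held-out test items.
print(ndcg_at_k([3, 5, 7, 9], {3, 7}, k=10))  # ~0.92
```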


Dataset
We ran our experiments on the MovieLens 1M dataset (ML)¹, a publicly available dataset for movie recommendation which is widely used in recommender systems experimentation. ML contains 6,040 users, 3,702 movies, and 1M ratings. The sparsity of ratings in this dataset is about 96%.

¹ https://grouplens.org/datasets/movielens

Experiment Design
In this section, we look to address these questions:

    • What values of bias disparity are produced by different recommendation algorithms? (RQ1)
    • Do bias disparity values differ across male and female users in the dataset? (RQ2)
    • How are users with extreme initial preference ratios affected by bias disparities? (RQ3)

We addressed these questions in three steps. Initially, we selected a subset of the ML dataset consisting of male and female user groups and two movie genres as our item groups. Then, in the first step, we separately calculated the preference ratio (Eq. 1) of males and females (user groups) on these genres and computed the corresponding bias disparity values (Eq. 2). In the second step, we calculated the preference ratios and bias disparities for our movie genres on the whole user data (without partitioning into separate user groups). In the third step, we looked into users with a zero initial preference ratio on one of the genres to see the effects of different algorithms on bias disparity. Our goal was to determine whether input preference ratios were significantly different from the output preference ratios in the recommendations (i.e., whether bias disparity was significantly different from 0, due to the dampening or amplification of preference biases).

In the first step of the experiments, we calculated the group-based bias disparity. As bias disparity represents a form of inaccuracy (users getting results different from their interests), bias disparity differences between groups represent a form of unfairness, as the system is working better for some users than for others.

In the second step of the experiment, we calculated the general bias disparity for the whole population. Comparing the bias disparity of the whole population (step 2) with that of specific user sub-groups (step 1) can help us understand how algorithms differ in terms of bias disparity across the whole user population.

We ran two sets of experiments, first with Action and Romance genre movies as our item groups, and then with Crime and Sci-Fi genre movies. More details are given in each experiment.

4 EXPERIMENTAL RESULTS
Experiment 1: Action and Romance Categories
In this experiment, we keep the number of items in the item groups approximately the same while the user group sizes are unbalanced. The Action and Romance genres are taken as item categories, with 468 and 436 movies in each group respectively. We have 278 women and 981 men in our user groups, and each user has at least 90 ratings. After filtering the dataset, we ended up with 207,002 ratings from 1,259 users on 904 items, with a density of 18%, for experiment one.
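A minimal sketch of how such a subset might be built from the raw MovieLens 1M files with pandas is shown below. The file paths, column names, and the reading of the "at least 90 ratings" constraint as 90 ratings within the two genres are our assumptions; the exact filtering that yields 468 Action and 436 Romance movies is not specified in the paper.

```python
import pandas as pd

# MovieLens 1M raw files ("::"-separated, no header). Paths are placeholders.
ratings = pd.read_csv("ratings.dat", sep="::", engine="python",
                      names=["user", "movie", "rating", "ts"])
movies = pd.read_csv("movies.dat", sep="::", engine="python",
                     names=["movie", "title", "genres"], encoding="latin-1")
users = pd.read_csv("users.dat", sep="::", engine="python",
                    names=["user", "gender", "age", "occupation", "zip"])

# Keep only movies tagged Action or Romance (how movies carrying both tags
# are handled is not stated in the paper; here they are simply kept).
movies["is_action"] = movies.genres.str.contains("Action")
movies["is_romance"] = movies.genres.str.contains("Romance")
subset = movies[movies.is_action | movies.is_romance]

# Restrict ratings to those movies and to sufficiently active users.
r = ratings[ratings.movie.isin(subset.movie)]
active = r.groupby("user").size()
r = r[r.user.isin(active[active >= 90].index)]

# Attach gender for the group-based analysis.
r = r.merge(users[["user", "gender"]], on="user")
print(r.user.nunique(), "users,", r.movie.nunique(), "movies,", len(r), "ratings")
```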
As we see in Table 2, the preference ratio of male users is higher for the Action genre (≈ 0.70) than for the Romance genre (≈ 0.30), whereas female users have a more balanced preference ratio (≈ 0.50) over these two movie genres. Comparing the preference ratios of the whole population and the sub-groups (Table 2), we observe an overall tendency to prefer the Action genre over the Romance genre. This overall bias mainly comes from the preference ratio of the majority male user group.

    Genre      Whole Population    Male     Female
    Action          0.675          0.721    0.502
    Romance         0.325          0.279    0.498
    Table 2: Input preference ratio for Action and Romance

Step 1: Group-based Bias Disparity. According to the results shown in Figure 1, both neighborhood-based methods, UserKNN and ItemKNN, show an increased output preference ratio (PR) for both the male and female user groups on the Action genre, by 50% and around 20% respectively. While both of these algorithms increase the preference ratio on the Action genre, they dramatically decrease it for the Romance genre, even though the preference ratio of women over the two genres in the input data was balanced. Accordingly, as we see in Figure 2, both of these algorithms show negative bias disparities (BD) on Romance for both men and women.

These results show different outcomes for the two groups because of their different input preference ratios. For the female group, the neighborhood-based algorithms induced a bias towards Action not present in the input; for the male group, the algorithms tend to perpetuate and amplify the existing biases in the input data.

The matrix factorization algorithms show different tendencies. In BiasedMF, the output preference ratio is much lower than the input preference ratio for male users in the Action genre (the opposite of what we observed for the neighborhood-based methods). The PR for the female group is approximately the same. With BiasedMF, the preference ratios of both female and male groups are pushed close to 0.5. We have a negative bias disparity, as we see in Figure 2, which means that the original preference ratio is underestimated.


Interestingly, this algorithm strengthens the bias disparity of both men and women on the Romance genre, which is an overestimation of their actual preference. We see a similar pattern in SVD++ as well.

WRMF, the other latent factor model, gives results inconsistent with BiasedMF and SVD++. It slightly decreased the preference ratio of women on Action and increased it for men on Action. We see the opposite trend on the Romance genre: the output preference ratio for women on Romance is slightly higher, while for men it is lower.

Generally, the absolute values of the BD for the two user groups are not similar. Men have higher absolute values of BD on Romance, while women have higher absolute values of BD on Action. As we see in Figure 2, different algorithms affect women and men differently. ItemKNN affects women more than men in both genres, while UserKNN amplifies the bias more for men than women in both genres. BiasedMF and SVD++ increase bias more for men; WRMF increases the bias slightly more for women.

    Algorithm      Women    Men      P-value
    MostPopular    2.519    0.659    4.29e-05
    ItemKNN        2.100    0.994    8.89e-25
    UserKNN        0.749    1.091    2.44e-19
    BPR            0.285    0.678    7.79e-03
    RankALS        0.368    0.306    9.99e-01
    BiasedMF       1.230    2.660    1.35e-02
    SVD++          0.803    2.364    6.66e-04
    WRMF           0.585    0.523    1.76e-05
    Table 3: Bias disparity absolute value sum over Action and Romance
bias more for men than women in both genres. BiasedMF                        based methods, UserKNN and ItemKNN, both increase the
and SVD++ increase bias more for men; WRMF, increases                        general preference ratio significantly. Our latent factor mod-
the bias slightly more for women.                                            els (BiasedMF, SVD++, WRMF) show different effects on
   In this experiment, women had an almost balanced pref-                    the preference ratio. None of the matrix factorization algo-
erence over Action and Romance movies, while men prefer                      rithms significantly increase the original input preference
Action movies to Romance movies. A well-calibrated algo-                     ratio in the Action genre. BiasedMF and SVD++ significantly
rithm would preserve these tendencies. However, with the                     decreases the output preference ratio in Action genre, while
influence of the male group, most of the recommender algo-                   WRMF keep the the output preference ratio close to the
rithms provide an unbalanced recommendation list specially                   initial preference.
for women (the minority group). However, BiasedMF and                           The Romance category has lower input preference ratio
SVD++ run counter to this trend, reversing the bias dispar-                  than Action genre, which means that in the input dataset,
ity for both genres. The influence of men’s preferences for                  the population on average prefers Action to Romance. The
Action in the overall data is reduced, resulting in fewer un-                output preference ratios for this genre show a reverse pat-
wanted Action movie recommendations for women which                          tern compared to the Action genre. The neighborhood-based
is fairer for this group. These two algorithms balance out                   algorithms decrease the preference ratio and most of the
the exposure of Action and Romance genres for both user                      matrix factorization algorithms don’t change the preference
groups.                                                                      ratio by much except for BiasedMF and SVD++, which sig-
   K-nearest-neighbor methods amplify the bias significantly                 nificantly increases the preference ratio. We can see the bias
and this behavior could be due to their sensitivity to the pop-              disparity change in Figure 2 as well.
ularity bias. Both of the neighborhood-based models show
a similar trend to the most-popular recommender (the light                   Step 3: Users with Extreme Preferences. To examine extreme
blue bar). Romance genre is less favored by the majority                     preference cases, we concentrated on users with very low
group (981 men vs 278 women) in the dataset compared to                      preference ratios across the genres we studied (We excluded
the Action genre. So, we end up having more neighbors from                   the users that had a zero preference ratio on both genres).
the majority group as the nearest neighbors (user-knn) or                    There were 10 men who had zero preference ratios on the
having more ratings from the majority group on a specific                    Romance genre, which means that they only watched Ac-
genre (item-knn). So, their preference will dominate the pref-               tion movies. In Figure 3, it shows the preference ratio in the
erence of the other group on both genres. These methods                      recommendation. Some algorithms, like UserKNN, BPR, and
not only prioritize the preference of the majority group to                  WRMF, recommend all Action movies, which is totally con-
the minority group, but they also amplify this bias.                         sistent with these users’ initial preference. Other algorithms,
                                                                             including BiasedMF and SVD++, de-amplify the effects of
                                                                             the preference and show a more diverse recommendation set.
Step 2: General Bias Disparity. In Figure 1, the bar shows                   When analyzing the preference ratio of the extreme group,
the preference ratio in the recommendation output and the                    the effects of some algorithms become more clearer because
dashed line shows the input preference ratio for related cat-                of the consistency of the general population and extreme
egories.                                                                     group.




Figure 1: Output Preference Ratio for Action and Romance

Figure 2: Bias disparity for Action and Romance

Figure 3: Output PR for users with extreme preferences for Action and Romance

Experiment 2: Crime and Sci-Fi
In this experiment, our item groups were Crime and Sci-Fi, with 211 and 276 movies in each group respectively. The user group sizes were again unbalanced: 259 female users and 1,335 male users. All of the users had at least 50 ratings on the two genres, which leaves us with 37,897 ratings from 1,594 users on 487 items. The sparsity of this dataset was around 95%.

As shown in Table 4, the preference ratios of male and female users for Crime and Sci-Fi movies are similar. Both men and women have a preference ratio of around 0.7 on Sci-Fi and around 0.3 on Crime. According to Table 4, the whole population prefers Sci-Fi movies to Crime movies, and we see a similar trend in both user groups, male and female.


    Genre      Whole Population    Male     Female
    Crime           0.317          0.302    0.334
    Sci-Fi          0.683          0.698    0.666
    Table 4: Input Preference Ratio for Crime and Sci-Fi

Step 1: Group-Based Bias Disparity. Overall, the group-based bias disparity is very similar to the pattern seen in the whole population. Based on the patterns shown in Figure 4, the difference between the patterns that we see in the Crime genre for men and women is minimal, and the same holds for the Sci-Fi genre. The difference in the absolute values of bias disparity between groups is not as large as the difference that we saw for Action and Romance (Figure 2), which is partly because the two groups have similar preferences over the two categories.

Neighborhood-based algorithms amplify the existing preference bias for both groups. The matrix factorization algorithms either dampen the input bias, like BiasedMF and SVD++, or do not change the input preference ratio significantly, like WRMF.

    Algorithm      Women    Men      P-value
    MostPopular    0.740    0.951    0.73
    ItemKNN        1.137    0.898    0.08
    UserKNN        0.818    0.792    0.64
    BPR            0.357    0.400    0.54
    RankALS        1.126    1.114    0.93
    WRMF           0.247    0.311    0.15
    BiasedMF       3.089    3.636    0.39
    SVD++          2.394    2.778    0.41
    Table 5: Bias disparity absolute value sum over Crime and Sci-Fi

Step 2: General Bias Disparity. As shown in Figure 4, the pattern for Crime and Sci-Fi over the whole population is consistent with that for Action and Romance. The neighborhood-based algorithms, UserKNN and ItemKNN, show an increased output preference ratio for the more preferred genre (Sci-Fi) and a decreased PR for the less preferred genre (Crime). The matrix factorization algorithms show different patterns from the neighborhood-based algorithms but a pattern very similar to experiment 1. BiasedMF and SVD++ have the most significant effects on the preference ratio, increasing the preference ratios of the less favored category and decreasing those of the more favored category. WRMF shows good calibration here.

The bias disparity shown in Figure 5 is also consistent with the bias disparity shown in Figure 2 of experiment one.

Figure 4: Output preference ratio for Crime and Sci-Fi

Step 3: Users with Extreme Preferences. We had 37 users with a preference ratio of zero on Crime movies, meaning that they only watched Sci-Fi movies. The trends that we see in Figure 6 for this group are quite similar to those in Figure 3. As in experiment 1, algorithms such as UserKNN, BPR, and WRMF provide recommendations well calibrated to the users' initial preferences, whereas BiasedMF and SVD++ significantly dampen the initial preference biases.


Figure 5: Bias disparity for Crime and Sci-Fi

Figure 6: Output Preference Ratio for Crime and Sci-Fi of Extreme Group

5 CONCLUSION AND FUTURE WORK
Although we focused here on a handful of the more common movie genres, some important patterns can be seen. Both of the neighborhood-based models show a similar trend towards popularity, consistent with the findings of [13]. With these models, we might expect that a dominant group would contribute more neighbors in recommendation generation and would influence predictions by virtue of its presence in these groupings. These methods not only prioritize the preferences of the dominant group, but they also amplify the biases of the dominant group across all users.

In contrast to previous research on the bias amplification of matrix factorization methods [7], we observed that different matrix factorization models influence preference biases differently. SVD++ and BiasedMF both dampen the preference bias for different movie genres for both men and women. The WRMF algorithm is well calibrated for the Sci-Fi/Crime genres for both men and women, but its behavior is inconsistent for the Action/Romance genres.

Each of these model-based algorithms produces a low-rank approximation of the input rating data, but they do so in slightly different ways. Jannach et al. [13] found that model-based algorithms generally have less popularity bias, so it may be expected that such algorithms would not show as much bias disparity as the memory-based ones. However, further study will be required to understand the interactions between input biases and each algorithm's learning objective. Interestingly, parameter tuning of these algorithms, which produced better accuracy, did not change the bias disparity pattern.

As we have discovered in our experiments, recommendation algorithms generally distort preference biases present in the input data, and do so in sometimes unpredictable ways. Different groups of users may be treated in quite different ways as a result. Bias disparity analysis is a useful tool in understanding how aspects of the input data are reflected in an algorithm's output.

REFERENCES
[1] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 42–46.
[2] Solon Barocas and Andrew D Selbst. 2016. Big data's disparate impact. Calif. L. Rev. 104 (2016), 671.
[3] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced neighborhoods for multi-sided fairness in recommendation. In Conference on Fairness, Accountability and Transparency. 202–214.
[4] Robin D Burke, Himan Abdollahpouri, Bamshad Mobasher, and Trinadh Gupta. 2016. Towards multi-stakeholder utility evaluation of recommender systems. In UMAP (Extended Proceedings).
[5] Òscar Celma and Pedro Cano. 2008. From hits to niches? Or how popular artists can bias music recommendation and discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition. ACM, 5.
[6] Sushma Channamsetty and Michael D Ekstrand. 2017. Recommender response to diversity and popularity bias in user profiles. In The Thirtieth International FLAIRS Conference.


[7] Michael D Ekstrand, Mucun Tian, Mohammed R Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring author gender in book rating and recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 242–250.
[8] Guibing Guo, Jie Zhang, Zhu Sun, and Neil Yorke-Smith. 2015. LibRec: A Java library for recommender systems. In UMAP Workshops, Vol. 4.
[9] Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems. 3315–3323.
[10] Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
[11] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In ICDM, Vol. 8. Citeseer, 263–272.
[12] Neil Hurley and Mi Zhang. 2011. Novelty and diversity in top-n recommendation: analysis and evaluation. ACM Transactions on Internet Technology (TOIT) 10, 4 (2011), 14.
[13] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.
[14] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2014. Correcting popularity bias by enhancing recommendation neutrality. In RecSys Posters.
[15] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[16] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
[17] Masoud Mansoury, Robin Burke, Aldo Ordonez-Gauger, and Xavier Sepulveda. 2018. Automating recommender systems experimentation with librec-auto. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 500–501.
[18] Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.
[19] Eli Pariser. 2011. The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. Penguin.
[20] Arkadiusz Paterek. 2007. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, Vol. 2007. 5–8.
[21] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
[22] Harald Steck. 2018. Calibrated recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 154–162.
[23] Gábor Takács and Domonkos Tikk. 2012. Alternating least squares for personalized ranking. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 83–90.
[24] Virginia Tsintzou, Evaggelia Pitoura, and Panayiotis Tsaparas. 2018. Bias disparity in recommendation systems. arXiv preprint arXiv:1811.01461 (2018).
[25] Sirui Yao and Bert Huang. 2017. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems. 2921–2930.
[26] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457 (2017).