<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crank up the volume: preference bias amplification in collaborative recommendation∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kun Lin†</string-name>
          <email>linkun.nicole@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bamshad Mobasher</string-name>
          <email>mobasher@cs.depaul.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nasim Sonboli∗</string-name>
          <email>nasim.sonboli@colorado.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robin Burke</string-name>
          <email>robin.burke@colorado.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DePaul University</institution>
          ,
          <addr-line>Chicago</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Colorado Boulder</institution>
          ,
          <addr-line>Boulder</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Recommender systems are personalized: we expect the results given to a particular user to reflect that user's preferences. Some researchers have studied the notion of calibration, how well recommendations match users' stated preferences, and bias disparity, the extent to which mis-calibration affects different user groups. In this paper, we examine bias disparity over a range of different algorithms and for different item categories and demonstrate significant differences between model-based and memory-based algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        Recommender systems have become ubiquitous and are
increasingly influencing our daily decisions in a variety of
online domains. Recently, there has been a shift of focus
from achieving the best accuracy [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] in recommendation to
other important measures such as diversity and novelty, as well
as socially sensitive concerns such as fairness [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. One
of the key issues with which to contend is that biases in the
input data (used for training predictive models) are reflected,
and in some cases amplified, in the results of recommender
system algorithms. This is especially important in contexts
where fairness and equity matter or are required by laws and
regulations, such as in lending (Equal Credit Opportunity
Act), education (Civil Rights Act of 1964; Education
Amendments of 1972), housing (Fair Housing Act), and employment
(Civil Rights Act of 1964), with similar provisions in effect
in other countries.
      </p>
      <p>∗Copyright 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
Presented at the RMSE workshop held in conjunction with the 13th ACM
Conference on Recommender Systems (RecSys), 2019, in Copenhagen,
Denmark.</p>
      <p>†Both authors contributed equally to this research.</p>
      <p>
        The biases in the outputs of recommendation algorithms
can be due to a variety of factors in the input data that is fed
to the algorithms. As the saying goes: “garbage in, garbage
out”. These underlying factors include sample size disparity,
having limited features for protected groups, features that are
proxies of demographic attributes, human factors or skewed
findings [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These causes are not mutually exclusive and can
be present at the same time and they can result in disparate
negative outcomes.
      </p>
      <p>
        In this paper, we model bias as the preferences of users
and their tendency to choose one type of item over another.
In and of itself, this type of bias is not necessarily a
negative phenomenon. In fact, patterns in preference bias are a
key ingredient that recommendation algorithms use to
construct predictive models and provide users with personalized
outputs. However, in certain contexts the propagation of
preference biases can be problematic. For example, in the
news recommendation domain, preference biases can cause
filter bubbles [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and limit the exposure of users to
diversified items. In job recommendation and lending domains,
existing biases in the input data may reflect historical societal
biases against protected groups, which must be accounted
for by learning systems [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Our main goal in this paper is to study how different
collaborative filtering algorithms might propagate or amplify
existing preference biases in the input data and the
different kinds of impact that such disparity between the input and the
output might have on users. For the purpose of this
analysis, we use bias disparity, a recently introduced group-based
metric [
        <xref ref-type="bibr" rid="ref24 ref26">24, 26</xref>
        ]. This metric considers biases with respect to
the preferences of specific user groups such as men or women
towards specific item categories such as different movie
genres. This metric evaluates and compares the preference ratio
in both the input and the output data and measures the
degree to which recommendation algorithms may propagate
these biases, in some cases dampening them and in others
amplifying them. Throughout this paper we use the notions
of preference bias and preference ratio interchangeably.
      </p>
      <p>Our preliminary experiments on a movie rating dataset
show that different types of algorithms behave quite
differently in the way in which they propagate preference biases
in the input data. These findings may be especially important
for system designers in determining the choice of algorithms
and parameter settings in critical domains where the output
of the system must conform to legal and ethical standards
or must avoid discriminatory behavior by the system. As far
as we know, this paper is among the first works to have
observed this phenomenon in recommendation algorithms.</p>
      <p>We are specifically interested in answering the following
research questions:</p>
      <list list-type="bullet">
        <list-item><p>RQ1 How do different recommendation algorithms
propagate existing preference biases in the input data
to the generated recommendation lists?</p></list-item>
        <list-item><p>RQ2 How does the bias disparity between the input
and the output differ for different user groups (e.g.,
men versus women)?</p></list-item>
        <list-item><p>RQ3 How does bias disparity impact individual users
with extreme preferences (positive or negative) with
respect to particular categories of items?</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2 RELATED WORK</title>
      <p>
        As the authors of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] mention, fairness can be a multi-sided
notion. Recommender systems often involve multiple
stakeholders, including consumers and providers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and fairness
can be sought for each of these different stakeholders. In
general, fairness is a system goal, as neither side has a good
view of the ecosystem and the distribution of resources.
Fairness for users/consumers could mean providing similar
recommendations to similar users without considering their
protected attributes, such as certain demographic features.
Methods that seek fairness for consumers of a system fall
under the category of consumer-side fairness (C-fairness).
Fairness to item providers (for example, sellers on Amazon)
may mean giving their items a reasonable chance of
being exposed or recommended to consumers. This kind of
fairness is called provider-side fairness (P-fairness).
      </p>
      <p>
        Various metrics have been introduced for detecting model
biases. The metrics presented in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], such as absolute
unfairness, value unfairness, underestimation and overestimation
unfairness focus on the discrepancies between the predicted
scores and the true scores across protected and unprotected
groups and consider the results to be unfair if the model
consistently deviates (overestimates or underestimates) from
the true ratings for specific groups. These metrics show
unfairness towards consumers.
      </p>
      <p>
        Equality of opportunity discussed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] detects whether
there are equal proportions of individuals from the qualified
fractions of each group (equality in true positive rate). This
metric can be used to detect unfairness for both consumers
and providers.
      </p>
      <p>
        Steck [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] has proposed an approach for calibrating
recommender systems to reflect the various interests of users
relative to their initial preference proportions. The degree
of calibration is quantified using the Kullback-Leibler (KL)
divergence. This metric compares the distribution over all
the genres of the set of movies played by the user and the
same distribution in a user’s recommendation list. A
postprocessing re-ranking algorithm is then used to adjust the
calibration degree in the recommendation list.
      </p>
      <p>
        The authors in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have discussed another type of bias
called popularity bias. Many e-commerce domains exhibit
this kind of bias where a small set of popular items, such
as those from established sellers, may dominate
recommendation lists, while newly-arrived or niche items receive less
attention. In this situation, the likelihood of being
recommended for popular items will be considerably higher than
the rest of the (long-tail) items, potentially resulting in an
unfair treatment of some sellers. The methods presented in
[
        <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
        ] have tried to break the feedback loop and mitigate
this issue. These methods generally try to increase fairness
for item providers (P-fairness) in the system by diversifying
the recommendation list of users.
      </p>
      <p>
        The authors in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have looked into the influence of
algorithms on the output data; they tracked the extent to which
the diversity in user profiles changes in the output
recommendations. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] has also looked into the author gender
distribution in user profiles in the BookCrossing dataset (BX) and
has compared it with that of the output recommendations.
According to their results, the nearest neighbor methods
propagate the biases and strengthen them, and matrix
factorization methods strengthen the biases more. Interestingly,
our results for matrix factorization methods show the
opposite trend, possibly indicating that algorithms behave
differently across domains and datasets.
      </p>
      <p>
        The work by Tsintzou et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] sought to demonstrate
unfairness for consumers/users by modeling the bias as the
preferences of users. Their proposed metric is called bias
disparity and is similar in logic to the metric proposed in
Steck’s work. Both take a user-centric point of view
and aim for group fairness, and both calculate the
difference between the preference of the user in the input
data and the preference predicted for the user by the
recommendation algorithm. The bias disparity metric looks at these
differences in a more fine-grained way, evaluating the
preferences of specific user groups for specific item categories.
The KL divergence used in Steck’s approach measures more
generally the difference in preference distributions across genres.
The sign of the bias disparity gives
us information about how input and output biases differ
relative to specific categories: negative values indicate that the
bias has been dampened and positive values indicate that it has
been amplified. KL divergence, on the other hand, produces
non-negative values and cannot differentiate between these
two cases.
      </p>
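      <p>The contrast between the two measures can be illustrated with a minimal Python sketch. The two-category profiles below are hypothetical, the function names are illustrative, and the KL computation omits the smoothing that a practical calibration metric would need when a category is absent from the output.</p>
      <preformat>
```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bias_disparity(pr_in, pr_out):
    """Relative change of a single category's preference ratio."""
    return (pr_out - pr_in) / pr_in


# Hypothetical input profile over two categories, e.g. [Action, Romance].
p_in = [0.7, 0.3]
amplified = [0.9, 0.1]  # output pushes further towards the first category
dampened = [0.5, 0.5]   # output pulls back towards balance

kl_amplified = kl_divergence(p_in, amplified)
kl_dampened = kl_divergence(p_in, dampened)
bd_amplified = bias_disparity(p_in[0], amplified[0])
bd_dampened = bias_disparity(p_in[0], dampened[0])
```
      </preformat>
      <p>In this toy case both KL values are positive, whereas the bias disparity on the first category is positive for the amplified output and negative for the dampened one.</p>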
      <p>
        One of the limitations of the work of Tsintzou et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is
that they perform their analysis only for K-nearest-neighbor
models. In this paper, we build on their work by considering
a variety of recommendation algorithms. We are also
interested in understanding how bias affects female and male user
groups separately and how it might affect individual users.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 METHODOLOGY</title>
    </sec>
    <sec id="sec-4">
      <title>Bias Disparity</title>
      <p>Let U be the set of n users, I be the set of m items, and
S be the n × m input matrix, where S(u, i) = 1 if user u has
selected item i, and zero otherwise.</p>
      <p>Let A<sub>U</sub> be an attribute that is associated with users and
partitions them into groups that have that attribute in
common, such as gender. Similarly, let A<sub>I</sub> be the attribute that
is associated with items and that partitions the items into
categories, e.g. movie genres.</p>
      <p>Given matrix S, the input preference ratio for user group
G on item category C is the fraction of liked items by group
G in category C:</p>
      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>PR_S(G, C) = \frac{\sum_{u \in G} \sum_{i \in C} S(u, i)}{\sum_{u \in G} \sum_{i \in I} S(u, i)}</tex-math>
      </disp-formula>
      <p>Eq. (1) is essentially the conditional probability of selecting
an item from category C given that this selection is done by
a user in group G.</p>
      <p>The bias disparity is the relative difference of the
preference bias value between the input S and the output of a
recommendation algorithm R, and is defined as follows:</p>
      <disp-formula id="eq-2">
        <label>(2)</label>
        <tex-math>BD(G, C) = \frac{PR_R(G, C) - PR_S(G, C)}{PR_S(G, C)}</tex-math>
      </disp-formula>
      <p>
        We assume that a recommendation algorithm provides
each user u with a list of r ranked items R<sub>u</sub>. Let R be the
collection of all the recommendations to all the users
represented as a binary matrix, where R(u, i) = 1 if item i is
recommended to user u, and zero otherwise. The overall
bias disparity for a category C is obtained by averaging bias
disparities across all users regardless of the group. For more
details on this metric, interested readers can refer to [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
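      <p>A minimal Python sketch of Eqs. (1) and (2) follows, assuming that S and R are stored as dictionaries mapping each user to a set of items. This representation, the function names and the toy data are illustrative assumptions, not the code used in the experiments.</p>
      <preformat>
```python
def preference_ratio(matrix, group, category):
    """Eq. (1): share of the group's selections that fall in the category.

    matrix: dict mapping user -> set of selected (or recommended) items.
    group: iterable of users in G; category: set of items in C.
    """
    in_category = sum(len(matrix[u].intersection(category)) for u in group)
    total = sum(len(matrix[u]) for u in group)
    return in_category / total


def bias_disparity(input_matrix, output_matrix, group, category):
    """Eq. (2): relative change of the preference ratio from S to R."""
    pr_s = preference_ratio(input_matrix, group, category)
    pr_r = preference_ratio(output_matrix, group, category)
    return (pr_r - pr_s) / pr_s


# Toy data (hypothetical): two users, four items; 'a' and 'b' are Action.
S = {"u1": {"a", "b", "c"}, "u2": {"a", "d"}}
R = {"u1": {"a", "b"}, "u2": {"a", "b"}}
action = {"a", "b"}
group = ["u1", "u2"]

pr_in = preference_ratio(S, group, action)   # 3 of 5 selections
pr_out = preference_ratio(R, group, action)  # 4 of 4 recommendations
bd = bias_disparity(S, R, group, action)
```
      </preformat>
      <p>Passing all users as the group gives the general bias disparity, while passing only the members of one group gives the group-based value.</p>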
      <p>In this paper, we use the bias disparity metric on two levels:
1. Group-based bias disparity, which is calculated based on
Eq. (2) separately for the two user groups
of women and men. 2. General bias disparity, which is also
calculated based on Eq. (2) but for all the users in the dataset
regardless of their group membership.</p>
      <p>Here we assume that PR<sub>S</sub>(G, C) &gt; 0 and PR<sub>R</sub>(G, C) ≥ 0. A
bias disparity of zero or near zero means that the input and
output of the algorithm are almost the same with respect to
the prevalence of the chosen category: the algorithm reflects
the users’ preferences quite closely. A negative bias disparity
means that the output preference bias is less than that of
the input. In other words, the preference bias towards a
given category is dampened. The extreme value, BD = −1,
would indicate that a category important in a user’s profile
is completely missing from the system’s recommendations
(PR<sub>R</sub> = 0). If the bias disparity value is positive, the output
preference bias towards an item category is higher than that
of the input, indicating that the importance of the given
category has been amplified by the algorithm.</p>
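      <p>As a worked illustration of Eq. (2), consider hypothetical values chosen only for arithmetic convenience (they are not results from the experiments): an input preference ratio of 0.6 and three possible output preference ratios of 0.9, 0.3 and 0.</p>
      <disp-formula>
        <tex-math>\begin{gathered} \frac{0.9 - 0.6}{0.6} = 0.5 \\ \frac{0.3 - 0.6}{0.6} = -0.5 \\ \frac{0 - 0.6}{0.6} = -1 \end{gathered}</tex-math>
      </disp-formula>
      <p>The first case corresponds to amplification, the second to dampening, and the third to the extreme case in which the category disappears from the recommendations.</p>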
    </sec>
    <sec id="sec-5">
      <title>Algorithms</title>
      <p>
        The experiments were performed using the librec-auto
experimentation platform [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which is a Python wrapper
built around the Java-based LibRec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] recommendation
library. All experiments were performed using a 5-fold cross
validation setting where 80% of each user’s rating data is
used for the training dataset and the rest as the test dataset
(LibRec’s userfixed configuration).
      </p>
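      <p>The per-user splitting idea can be sketched in Python as follows. This is only an illustration under our own assumptions (ratings held in a dictionary per user, a fixed random seed, hypothetical function names); it is not LibRec's implementation of the userfixed configuration.</p>
      <preformat>
```python
import random


def user_folds(ratings, k=5, seed=0):
    """Assign each user's ratings to k folds of nearly equal size.

    ratings: dict mapping user -> list of (item, rating) pairs.
    Returns a list of k dicts, each mapping user -> that fold's ratings.
    """
    rng = random.Random(seed)
    folds = [dict() for _ in range(k)]
    for user, rows in ratings.items():
        rows = list(rows)
        rng.shuffle(rows)
        for j in range(k):
            folds[j][user] = rows[j::k]
    return folds


def train_test(folds, held_out):
    """Use one fold as test data and the union of the others for training."""
    test = folds[held_out]
    train = {}
    for j, fold in enumerate(folds):
        if j == held_out:
            continue
        for user, rows in fold.items():
            train.setdefault(user, []).extend(rows)
    return train, test


# Hypothetical toy profile with ten ratings for a single user.
toy = {"u1": [("i%d" % n, 4) for n in range(10)]}
folds = user_folds(toy)
train, test = train_test(folds, held_out=0)
```
      </preformat>
      <p>With five folds, each held-out fold contains roughly 20% of every user's ratings, so the remaining 80% form the training data for that fold.</p>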
      <p>
        We ran our experiments on four groups of algorithms:
memory-based, model-based (ranking), model-based (rating)
and baseline. We selected both user-based and item-based
k-nearest-neighbor methods from the memory-based category.
BPR [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and RankALS [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] were selected from the
learning-to-rank category. From the rating-oriented latent factor models
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we chose Biased Matrix Factorization (BiasedMF) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
SVD++ [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and Weighted Regularized Matrix Factorization
(WRMF) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We used a most-popular recommender as a
baseline as this algorithm would be expected to maximally
amplify the popularity bias in the recommendation outputs.
      </p>
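      <p>To make the baseline concrete, a most-popular recommender can be sketched in a few lines of Python. The data layout, the function name and the tie-breaking rule are illustrative assumptions; this is not LibRec's MostPopular implementation.</p>
      <preformat>
```python
from collections import Counter


def most_popular(train, n=10):
    """Recommend to each user the n most-selected items not yet in the profile.

    train: dict mapping user -> set of selected items.
    """
    counts = Counter(item for items in train.values() for item in items)
    # Sort by descending count, then by item id for a deterministic order.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return {
        user: [item for item in ranked if item not in seen][:n]
        for user, seen in train.items()
    }


# Hypothetical toy profiles.
train = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"a", "b", "d"}}
recs = most_popular(train, n=2)
```
      </preformat>
      <p>Because every user receives items from the same global ranking, the genre mix of the output follows the most-selected items rather than any individual profile.</p>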
      <p>For each algorithm, we tuned the parameters and picked
the configuration that gave the best performance in terms of
normalized Discounted Cumulative Gain (nDCG) of the top 10
listed items. The nDCG values of the algorithms over the two
experiments in the paper are shown in Table 1.</p>
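      <p>For reference, nDCG over the top 10 items can be sketched as below, assuming binary relevance (an item counts as relevant if it appears in the user's test set). The function name and the example ranking are hypothetical, and LibRec's evaluator may differ in its details.</p>
      <preformat>
```python
import math


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """nDCG@k with binary relevance for a single user's ranked list."""
    top = ranked_items[:k]
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(top)
        if item in relevant_items
    )
    # The ideal list places all relevant items at the top of the ranking.
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: the two relevant items appear at ranks 1 and 3.
score = ndcg_at_k(["a", "x", "b", "y"], {"a", "b"}, k=10)
perfect = ndcg_at_k(["a", "b", "x"], {"a", "b"}, k=10)
```
      </preformat>
      <p>A system-level score would average this per-user value over all users in the test data.</p>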
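      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Algorithm</th></tr>
          </thead>
          <tbody>
            <tr><td>MostPopular</td></tr>
            <tr><td>ItemKNN</td></tr>
            <tr><td>UserKNN</td></tr>
            <tr><td>BPR</td></tr>
            <tr><td>RankALS</td></tr>
            <tr><td>BiasedMF</td></tr>
            <tr><td>SVD++</td></tr>
            <tr><td>WRMF</td></tr>
          </tbody>
        </table>
      </table-wrap>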
    </sec>
    <sec id="sec-6">
      <title>Dataset</title>
      <p>We ran our experiments on the MovieLens 1M<sup>1</sup> dataset (ML), a
publicly available dataset for movie recommendation which
is widely used in recommender systems experimentation.
ML contains 6,040 users, 3,702 movies, and 1M ratings.
The sparsity of ratings in this dataset is about 96%.</p>
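      <p>The sparsity figure follows directly from the counts above. A small Python check, where the function name is ours and 1M is taken as exactly 1,000,000 for the sake of the arithmetic:</p>
      <preformat>
```python
def sparsity(num_ratings, num_users, num_items):
    """Fraction of the user-item matrix that holds no rating."""
    return 1.0 - num_ratings / (num_users * num_items)


# Counts quoted in the text for MovieLens 1M (1M rounded to 1,000,000).
ml_sparsity = sparsity(1_000_000, 6040, 3702)
```
      </preformat>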
    </sec>
    <sec id="sec-7">
      <title>Experiment Design</title>
      <p>In this section, we look to address these questions:</p>
      <list list-type="bullet">
        <list-item><p>What values of the bias disparity are produced by
different recommendation algorithms? (RQ1)</p></list-item>
        <list-item><p>Do bias disparity values differ across male and female
users in the dataset? (RQ2)</p></list-item>
        <list-item><p>How are users with an extreme initial preference ratio
affected by bias disparities? (RQ3)</p></list-item>
      </list>
      <p>We addressed these questions in three steps. Initially, we
selected a subset of the ML dataset consisting of male and
female user groups and two movie genres as our item groups.
Then, in the first step, we separately calculated the preference
ratio (Eq. 1) of males and females (user groups) on these genres
and computed the corresponding bias disparity values
(Eq. 2). In the second step, we calculated the preference ratios
and bias disparities for our movie genres on the whole user
data (without partitioning into separate user groups). In the
third step, we looked into users with zero initial preference
ratio on one of the genres to see the effects of different
algorithms on bias disparity. Our goal was to determine if input
preference ratios were significantly different from the output
preference ratios in the recommendations (i.e., if bias
disparity was significantly different from 0, due to the dampening
or amplification of preference biases).</p>
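      <p>The selection of users for the third step can be sketched as follows. The per-user preference ratio applies Eq. (1) to a single profile; the dictionary layout, the function names and the toy profiles are illustrative assumptions.</p>
      <preformat>
```python
def user_preference_ratio(profile, category):
    """Share of one user's selected items that belong to the category."""
    return len(profile.intersection(category)) / len(profile)


def users_with_zero_preference(profiles, category):
    """Users who selected at least one item but none from the category."""
    return sorted(
        user
        for user, profile in profiles.items()
        if profile and user_preference_ratio(profile, category) == 0
    )


# Hypothetical toy profiles restricted to two genres.
romance = {"r1", "r2"}
profiles = {"u1": {"a1", "a2"}, "u2": {"a1", "r1"}, "u3": {"a2"}}
extreme = users_with_zero_preference(profiles, romance)
```
      </preformat>
      <p>Here u1 and u3 are selected because their profiles contain no item from the hypothetical Romance set.</p>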
      <p>In the first step of the experiments, we calculated the
group-based bias disparity. As bias disparity represents a
form of inaccuracy (users getting results different from their
interests), bias disparity differences between groups
represent a form of unfairness as the system is working better for
some than for others.</p>
      <p>In the second step of the experiment, we calculated the
general bias disparity for the whole population. Comparing
the bias disparity for the whole population (step 2)
with that of specific user sub-groups (step 1) can help us
understand how algorithms differ in terms of bias disparity
across the whole user population.</p>
      <p>We ran two sets of experiments, first with Action and
Romance genre movies as our item groups, and then with Crime
and Sci-Fi genre movies. More details are given in the description of
each experiment.</p>
      <p><sup>1</sup>https://grouplens.org/datasets/movielens</p>
    </sec>
    <sec id="sec-8">
      <title>4 EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-9">
      <title>Experiment 1: Action and Romance Categories</title>
      <p>In this experiment, we keep the number of items in the item
groups approximately the same while we create unbalanced
user group sizes. The Action and Romance genres are taken
as item categories, with 468 and 436 movies in each group
respectively. Our user groups consist of 278 women and 981 men,
each with at least 90 ratings. After filtering
the dataset, we ended up with 207,002 ratings from 1,259
users on 904 items for experiment one, so about 18% of the
user-item matrix is filled (a sparsity of about 82%).</p>
      <p>As we see in Table 2, the preference ratio of male users is
higher for the Action genre (≈ 0.70) than for the Romance
genre (≈ 0.30), whereas female users have a more balanced
preference ratio (≈ 0.50) over these two movie genres. From
comparing the preference ratios of the whole population
and the sub-groups (Table 2), we observe an overall tendency to
prefer the Action genre over the Romance genre. This overall
bias mainly comes from the preference ratio of the majority
male user group.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <caption><p>Input preference ratios for the Action and Romance genres.</p></caption>
        <table>
          <thead>
            <tr><th>Genre</th><th>Whole Population</th><th>Male</th><th>Female</th></tr>
          </thead>
          <tbody>
            <tr><td>Action</td><td>0.675</td><td>0.721</td><td>0.502</td></tr>
            <tr><td>Romance</td><td>0.325</td><td>0.279</td><td>0.498</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-9-4">
        <p>Step 1: Group-based Bias Disparity. According to the results
shown in Figure 1, both of the neighborhood-based
methods, UserKNN and ItemKNN, increase the output
preference ratio (PR) of both male and female user groups
on the Action genre, by 50% and around 20% respectively. While
both of these algorithms increase the preference ratio
on the Action genre, they dramatically decrease it for the
Romance genre, although the preference ratios of women on
both genres in the input data were balanced. Accordingly,
we see in Figure 2 that both of these algorithms show negative
bias disparities (BD) on Romance for both men and women.</p>
        <p>These results show different outcomes for the two groups
because of the different input preference ratios. For the female
group, the neighborhood-based algorithms induced a bias
towards Action not present in the input; for the male group,
the algorithms tend to perpetuate and amplify the existing
biases in the input data.</p>
        <p>The matrix factorization algorithms show different
tendencies. In BiasedMF, the output preference ratio is much
lower than the input preference ratio for male users in the
Action genre (the opposite of what we observed for the
neighborhood-based methods). The PR for the female group
is approximately the same. With BiasedMF, the preference
ratios of both female and male groups are pushed close to 0.5.
The resulting bias disparity is negative, as we see in Figure 2, which
means that the original preference ratio is underestimated.
Interestingly, this algorithm strengthens the bias disparity of
both men and women on the Romance genre, which is an
overestimation of their actual preference. We see a similar pattern
in SVD++ as well.</p>
        <p>WRMF, the other latent factor model, behaves differently
from BiasedMF and SVD++. It slightly decreased the
preference ratio of women on Action and increased it for
men on Action. We see the opposite trend on the Romance genre:
the output preference ratio on
Romance is slightly higher for women and lower for men.</p>
        <p>Generally, the absolute values of the BD for the two user
groups are not similar. Men have higher absolute values of
BD on Romance while women have higher absolute values
of BD on Action. As we see in Figure 2, different algorithms
affect women and men differently. ItemKNN affects women
more than men in both genres, while UserKNN amplifies the
bias more for men than for women in both genres. BiasedMF
and SVD++ increase bias more for men; WRMF increases
the bias slightly more for women.</p>
        <p>In this experiment, women had an almost balanced
preference over Action and Romance movies, while men preferred
Action movies to Romance movies. A well-calibrated
algorithm would preserve these tendencies. However, under the
influence of the male group, most of the recommender
algorithms provide an unbalanced recommendation list, especially
for women (the minority group). BiasedMF and
SVD++ run counter to this trend, reversing the bias
disparity for both genres. The influence of men’s preferences for
Action in the overall data is reduced, resulting in fewer
unwanted Action movie recommendations for women, which
is fairer for this group. These two algorithms balance out
the exposure of the Action and Romance genres for both user
groups.</p>
        <p>K-nearest-neighbor methods amplify the bias significantly,
and this behavior could be due to their sensitivity to
popularity bias. Both of the neighborhood-based models show
a similar trend to the most-popular recommender (the light
blue bar). The Romance genre is less favored than the Action
genre by the majority group in the dataset (981 men vs 278
women). As a result, we end up having more neighbors from
the majority group as the nearest neighbors (UserKNN) or
having more ratings from the majority group on a specific
genre (ItemKNN), so the majority's preference dominates the
preference of the other group on both genres. These methods
not only prioritize the preference of the majority group over
that of the minority group, but they also amplify this bias.</p>
        <p>Step 2: General Bias Disparity. In Figure 1, the bars show
the preference ratio in the recommendation output and the
dashed lines show the input preference ratio for the related
categories.</p>
        <p>In general, the Action genre (with a preference ratio of 0.675)
is preferred to the Romance genre (preference ratio of 0.325).
As seen in the previous step, the two
neighborhood-based methods, UserKNN and ItemKNN, both increase the
general preference ratio significantly. Our latent factor
models (BiasedMF, SVD++, WRMF) show different effects on
the preference ratio. None of the matrix factorization
algorithms significantly increases the original input preference
ratio in the Action genre. BiasedMF and SVD++ significantly
decrease the output preference ratio in the Action genre, while
WRMF keeps the output preference ratio close to the
initial preference.</p>
        <p>The Romance category has a lower input preference ratio
than the Action genre, which means that in the input dataset,
the population on average prefers Action to Romance. The
output preference ratios for this genre show a reverse
pattern compared to the Action genre. The neighborhood-based
algorithms decrease the preference ratio, and most of the
matrix factorization algorithms do not change the preference
ratio by much, except for BiasedMF and SVD++, which
significantly increase the preference ratio. We can see the bias
disparity change in Figure 2 as well.</p>
        <p>Step 3: Users with Extreme Preferences. To examine extreme
preference cases, we concentrated on users with very low
preference ratios across the genres we studied (we excluded
the users that had a zero preference ratio on both genres).
There were 10 men who had zero preference ratios on the
Romance genre, which means that they only watched
Action movies. Figure 3 shows the preference ratio in the
recommendations for these users. Some algorithms, like UserKNN, BPR, and
WRMF, recommend only Action movies, which is fully
consistent with these users’ initial preference. Other algorithms,
including BiasedMF and SVD++, dampen the effects of
the preference and show a more diverse recommendation set.
When analyzing the preference ratio of the extreme group,
the effects of some algorithms become clearer because
of the consistency between the general population and the extreme
group.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Experiment 2: Crime and Sci-Fi</title>
      <p>In this experiment, our item groups were Crime and Sci-Fi,
with 211 and 276 movies in each group respectively. The
sizes of the two user groups were still unbalanced, with
259 female users and 1,335 male users. All of the users had at
least 50 ratings from both genres, which leaves us with 37,897
ratings from the 1,594 users on 487 items. The sparsity of
the dataset was around 95%.</p>
      <p>As shown in Table 4, the preference ratios of
male and female users on Crime and Sci-Fi movies are similar.
Both men and women have a preference ratio of around 0.7
on Sci-Fi and around 0.3 on the Crime genre. According to
Table 4, the whole population prefers Sci-Fi movies to Crime
movies, and we see a similar trend in both user groups, male
and female.</p>
      <table-wrap id="tab-4">
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>Genre</th></tr>
          </thead>
          <tbody>
            <tr><td>Crime</td></tr>
            <tr><td>Sci-Fi</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-10-1">
        <p>Step 1: Group-Based Bias Disparity. Overall, the group-based
bias disparity is very similar to the pattern seen in the whole
population. Based on the patterns shown in Figure 4, the
difference between men and women in the Crime genre
is minimal, and the same
is true for the Sci-Fi genre. The difference in the absolute
values of bias disparity between groups is not as large
as the difference that we saw for Action and Romance
(Figure 2), which is partly because the two groups have similar
preferences over the two categories.</p>
        <p>Neighborhood-based algorithms amplify the existing
preference bias for both groups. The matrix factorization
algorithms either dampen the input bias, like BiasedMF and
SVD++, or do not change the input preference ratio
significantly, like WRMF.</p>
        <p>Step 2: General Bias Disparity. As shown in Figure 4, the
pattern for Crime and Sci-Fi over the whole population is
consistent with that for Action and Romance. The neighborhood-based
algorithms, UserKNN and ItemKNN, show an increased
output preference ratio for the more preferred genre (Sci-Fi)
and a decreased PR for the less preferred genre (Crime). The
matrix factorization algorithms show different patterns from the
neighborhood-based algorithms but a very similar pattern to
experiment 1. BiasedMF and SVD++ have the most
significant effects on the preference ratio, increasing the preference
ratios of the less favored category and decreasing those of
the more favored category. WRMF shows good calibration
here.</p>
        <p>The bias disparity shown in Figure 5 is also consistent
with the bias disparity shown in Figure 2 of experiment
one.</p>
        <p>Step 3: Users with Extreme Preferences. We had 37 users with
a preference ratio of zero on Crime movies, meaning
that they only watched Sci-Fi movies. The trends that we see
in Figure 6 for this group are similar to those in Figure 3.</p>
        <p>As in experiment 1, algorithms such as UserKNN,
BPR, and WRMF provide recommendations that are well calibrated
to the users’ initial preferences, whereas BiasedMF and SVD++
significantly dampen the initial preference biases.</p>
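        <p>Selecting the users with extreme preferences can be sketched as follows. This is a minimal illustration, assuming that each profile is a list of (user, item) pairs restricted to the two genres of the experiment; the function names and the toy data are hypothetical.</p>
        <preformat preformat-type="code">
```python
from collections import defaultdict


def per_user_preference_ratio(interactions, category_of, category):
    """Map each user to the share of their profile that is in `category`."""
    in_category = defaultdict(int)
    total = defaultdict(int)
    for user, item in interactions:
        total[user] += 1
        if category_of[item] == category:
            in_category[user] += 1
    return {user: in_category[user] / total[user] for user in total}


def users_with_zero_ratio(interactions, category_of, category):
    """Users whose profile contains no item from `category`."""
    ratios = per_user_preference_ratio(interactions, category_of, category)
    return sorted(user for user, ratio in ratios.items() if ratio == 0)


# Hypothetical toy data: u1 has only Sci-Fi items, u2 has both genres.
category_of = {"m1": "Sci-Fi", "m2": "Sci-Fi", "m3": "Crime"}
profiles = [("u1", "m1"), ("u1", "m2"), ("u2", "m1"), ("u2", "m3")]

extreme_users = users_with_zero_ratio(profiles, category_of, "Crime")
```
        </preformat>
        <p>For such users the input preference ratio is zero, so a relative measure of change is undefined; comparing the output preference ratio directly with zero is one way to inspect them.</p>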
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5 CONCLUSION AND FUTURE WORK</title>
      <p>
        Although we focused here on a handful of the more common
movie genres, some important patterns can be seen. Both
of the neighborhood-based models show a similar trend
towards popularity, consistent with the findings of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. With
these models, we might expect that a dominant group would
contribute more neighbors in recommendation generation
and would influence predictions by virtue of its presence
in these groupings. These methods not only prioritize the
preference of the dominant group, but they also amplify the
biases for the dominant group across all users.
      </p>
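      <p>The expectation that a dominant group contributes more neighbors can be checked with a short sketch. This is only an illustration under stated assumptions: Jaccard similarity over item sets stands in for whatever similarity a given neighborhood model uses, and the function names, group labels, and toy profiles are hypothetical.</p>
      <preformat preformat-type="code">
```python
def jaccard(a, b):
    """Jaccard similarity between two sets of item ids."""
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 0.0


def neighbor_group_share(profiles, group_of, k):
    """For each user, the fraction of their k nearest neighbors per group."""
    shares = {}
    for user, items in profiles.items():
        others = [other for other in profiles if other != user]
        # Sort by decreasing similarity; ties are broken by user id.
        others.sort(key=lambda other: (-jaccard(items, profiles[other]), other))
        neighbors = others[:k]
        counts = {}
        for other in neighbors:
            counts[group_of[other]] = counts.get(group_of[other], 0) + 1
        shares[user] = {g: c / len(neighbors) for g, c in counts.items()}
    return shares


# Hypothetical toy data: group "A" has three users, group "B" has one.
profiles = {
    "a1": {1, 2, 3},
    "a2": {1, 2},
    "a3": {2, 3},
    "b1": {3, 4},
}
group_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}

shares = neighbor_group_share(profiles, group_of, k=2)
```
      </preformat>
      <p>In this toy example every neighbor of the single user in the smaller group comes from the larger group, which is the situation the paragraph above describes.</p>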
      <p>
        Unlike previous research on the bias
amplification of matrix factorization methods [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we observed
that different matrix factorization models influence
preference biases differently. SVD++ and BiasedMF both dampen
the preference bias for different movie genres for both men
and women. The WRMF algorithm is well calibrated for the
Sci-Fi/Crime genres for both men and women, but its behavior
is inconsistent for the Action/Romance genres.
      </p>
      <p>
        Each of these model-based algorithms produces a
low-rank approximation of the input rating data, but each does so in
a slightly different way. Jannach et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] found that
model-based algorithms generally have less popularity bias, so it
might be expected that such algorithms would not show as
much bias disparity as the memory-based ones. However,
further study will be required to understand the interactions
between input biases and each algorithm’s learning objective.
Interestingly, parameter tuning of these algorithms, which
produced better accuracy, did not change the bias disparity
pattern.
      </p>
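      <p>What a low-rank approximation means can be illustrated with a rank-1 fit by alternating least squares on a small, fully observed matrix. This sketch is not how BiasedMF, SVD++, or WRMF are trained: those models use more factors, regularization, bias terms or confidence weights, and only the observed entries. The function name, the starting point, and the toy ratings are assumptions made for the illustration.</p>
      <preformat preformat-type="code">
```python
def rank_one_approximation(matrix, iterations=50):
    """Fit matrix ~ outer(u, v) by alternating least squares.

    Assumes a fully observed, non-zero matrix given as a list of rows.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols  # Arbitrary non-zero starting point (an assumption).
    u = [0.0] * n_rows
    for _ in range(iterations):
        # With v fixed, the least-squares optimum for each u[i] is closed-form.
        vv = sum(x * x for x in v)
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) / vv
             for i in range(n_rows)]
        # With u fixed, update each v[j] in the same way.
        uu = sum(x * x for x in u)
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) / uu
             for j in range(n_cols)]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical toy ratings: two users with similar tastes, one who differs.
ratings = [
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]
approximation = rank_one_approximation(ratings)
```
      </preformat>
      <p>A single factor cannot reproduce all three rows of this toy matrix exactly, so the reconstruction necessarily differs from the input. How each learning objective resolves that loss is where the algorithms differ.</p>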
      <p>As we have discovered in our experiments,
recommendation algorithms generally distort preference biases present
in the input data and sometimes do so in unpredictable ways.
Different groups of users may be treated quite differently
as a result. Bias disparity analysis is a useful tool for
understanding how aspects of the input data are reflected in
an algorithm’s output.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Himan</given-names>
            <surname>Abdollahpouri</surname>
          </string-name>
          , Robin Burke, and
          <string-name>
            <given-names>Bamshad</given-names>
            <surname>Mobasher</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Controlling popularity bias in learning-to-rank recommendation</article-title>
          .
          <source>In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM</source>
          ,
          <fpage>42</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Solon</given-names>
            <surname>Barocas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew D</given-names>
            <surname>Selbst</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Big data's disparate impact</article-title>
          . Calif. L. Rev.
          <volume>104</volume>
          (
          <year>2016</year>
          ),
          <fpage>671</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Burke</surname>
          </string-name>
          , Nasim Sonboli, and
          <string-name>
            <given-names>Aldo</given-names>
            <surname>Ordonez-Gauger</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Balanced neighborhoods for multi-sided fairness in recommendation</article-title>
          .
          <source>In Conference on Fairness, Accountability and Transparency</source>
          .
          <fpage>202</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Robin D</given-names>
            <surname>Burke</surname>
          </string-name>
          , Himan Abdollahpouri, Bamshad Mobasher, and
          <string-name>
            <given-names>Trinadh</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Towards Multi-Stakeholder Utility Evaluation of Recommender Systems</article-title>
          .
          <source>In UMAP (Extended Proceedings).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Òscar</given-names>
            <surname>Celma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Cano</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>From hits to niches?: or how popular artists can bias music recommendation and discovery</article-title>
          .
          <source>In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition . ACM</source>
          ,
          <volume>5</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Sushma</given-names>
            <surname>Channamsetty</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael D</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Recommender response to diversity and popularity bias in user profiles</article-title>
          .
          <source>In The Thirtieth International Flairs Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Michael D</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          , Mucun Tian, Mohammed R Imran Kazi, Hoda Mehrpouyan, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Kluver</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Exploring author gender in book rating and recommendation</article-title>
          .
          <source>In Proceedings of the 12th ACM Conference on Recommender Systems. ACM</source>
          ,
          <fpage>242</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Guibing</given-names>
            <surname>Guo</surname>
          </string-name>
          , Jie Zhang, Zhu Sun, and
          <string-name>
            <given-names>Neil</given-names>
            <surname>Yorke-Smith</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>LibRec: A Java Library for Recommender Systems</article-title>
          .
          <source>In UMAP Workshops</source>
          , Vol.
          <volume>4</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Moritz</given-names>
            <surname>Hardt</surname>
          </string-name>
          , Eric Price,
          <string-name>
            <given-names>Nati</given-names>
            <surname>Srebro</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Equality of opportunity in supervised learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>3315</fpage>
          -
          <lpage>3323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jonathan L</given-names>
            <surname>Herlocker</surname>
          </string-name>
          , Joseph A Konstan, Loren G Terveen, and John T Riedl.
          <year>2004</year>
          .
          <article-title>Evaluating collaborative filtering recommender systems</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 22</source>
          ,
          <issue>1</issue>
          (
          <year>2004</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yifan</given-names>
            <surname>Hu</surname>
          </string-name>
          , Yehuda Koren, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Collaborative Filtering for Implicit Feedback Datasets</article-title>
          .
          <source>In ICDM</source>
          , Vol.
          <volume>8</volume>
          . Citeseer,
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Neil</given-names>
            <surname>Hurley</surname>
          </string-name>
          and Mi Zhang.
          <year>2011</year>
          .
          <article-title>Novelty and diversity in top-n recommendation-analysis and evaluation</article-title>
          .
          <source>ACM Transactions on Internet Technology (TOIT) 10</source>
          ,
          <issue>4</issue>
          (
          <year>2011</year>
          ),
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Dietmar</given-names>
            <surname>Jannach</surname>
          </string-name>
          , Lukas Lerche, Iman Kamehkhosh, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Jugovac</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>What recommenders recommend: an analysis of recommendation biases and possible countermeasures</article-title>
          .
          <source>User Modeling and User-Adapted Interaction 25</source>
          ,
          <issue>5</issue>
          (
          <year>2015</year>
          ),
          <fpage>427</fpage>
          -
          <lpage>491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Toshihiro</given-names>
            <surname>Kamishima</surname>
          </string-name>
          , Shotaro Akaho, Hideki Asoh, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Sakuma</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Correcting Popularity Bias by Enhancing Recommendation Neutrality</article-title>
          . In RecSys Posters.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Factorization meets the neighborhood: a multifaceted collaborative filtering model</article-title>
          .
          <source>In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM</source>
          ,
          <fpage>426</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          , Robert Bell, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          .
          <source>Computer</source>
          <volume>8</volume>
          (
          <year>2009</year>
          ),
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Masoud</given-names>
            <surname>Mansoury</surname>
          </string-name>
          , Robin Burke, Aldo Ordonez-Gauger, and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Sepulveda</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Automating recommender systems experimentation with librec-auto</article-title>
          .
          <source>In Proceedings of the 12th ACM Conference on Recommender Systems. ACM</source>
          ,
          <fpage>500</fpage>
          -
          <lpage>501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Safiya Umoja</given-names>
            <surname>Noble</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Algorithms of oppression: How search engines reinforce racism</article-title>
          . NYU Press.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Eli</given-names>
            <surname>Pariser</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>The filter bubble: How the new personalized web is changing what we read and how we think</article-title>
          .
          <source>Penguin.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Arkadiusz</given-names>
            <surname>Paterek</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Improving regularized singular value decomposition for collaborative filtering</article-title>
          .
          <source>In Proceedings of KDD cup and workshop</source>
          , Vol.
          <volume>2007</volume>
          .
          <fpage>5</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Rendle</surname>
          </string-name>
          , Christoph Freudenthaler, Zeno Gantner, and
          <string-name>
            <given-names>Lars</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>BPR: Bayesian personalized ranking from implicit feedback</article-title>
          .
          <source>In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence</source>
          . AUAI Press,
          <fpage>452</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Harald</given-names>
            <surname>Steck</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Calibrated recommendations</article-title>
          .
          <source>In Proceedings of the 12th ACM conference on recommender systems. ACM</source>
          ,
          <fpage>154</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Gábor</given-names>
            <surname>Takács</surname>
          </string-name>
          and
          <string-name>
            <given-names>Domonkos</given-names>
            <surname>Tikk</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Alternating least squares for personalized ranking</article-title>
          .
          <source>In Proceedings of the sixth ACM conference on Recommender systems. ACM</source>
          ,
          <fpage>83</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Virginia</given-names>
            <surname>Tsintzou</surname>
          </string-name>
          , Evaggelia Pitoura, and
          <string-name>
            <given-names>Panayiotis</given-names>
            <surname>Tsaparas</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bias Disparity in Recommendation Systems</article-title>
          .
          <source>arXiv preprint arXiv:1811.01461</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Sirui</given-names>
            <surname>Yao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bert</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Beyond parity: Fairness objectives for collaborative filtering</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          .
          <fpage>2921</fpage>
          -
          <lpage>2930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Jieyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tianlu</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Yatskar</surname>
          </string-name>
          , Vicente Ordonez, and
          <string-name>
            <given-names>Kai-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Men also like shopping: Reducing gender bias amplification using corpus-level constraints</article-title>
          .
          <source>arXiv preprint arXiv:1707.09457</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>