=Paper= {{Paper |id=Vol-2440/paper6 |storemode=property |title= Bias Disparity in Collaborative Recommendation: Algorithmic Evaluation and Comparison |pdfUrl=https://ceur-ws.org/Vol-2440/paper6.pdf |volume=Vol-2440 |authors=Masoud Mansoury,Bamshad Mobasher,Robin Burke,Mykola Pechenizkiy |dblpUrl=https://dblp.org/rec/conf/recsys/MansouryMBP19 }} == Bias Disparity in Collaborative Recommendation: Algorithmic Evaluation and Comparison== https://ceur-ws.org/Vol-2440/paper6.pdf
Bias Disparity in Collaborative Recommendation: Algorithmic Evaluation and Comparison∗

Masoud Mansoury† (Eindhoven University of Technology, Eindhoven, the Netherlands; m.mansoury@tue.nl)
Bamshad Mobasher (DePaul University, Chicago, USA; mobasher@cs.depaul.edu)
Robin Burke (University of Colorado Boulder, Boulder, USA; robin.burke@colorado.edu)
Mykola Pechenizkiy (Eindhoven University of Technology, Eindhoven, the Netherlands; m.pechenizkiy@tue.nl)
ABSTRACT
Research on fairness in machine learning has recently been extended to recommender systems. One of the factors that may impact fairness is bias disparity, the degree to which a group's preferences on various item categories fail to be reflected in the recommendations they receive. In some cases, biases in the original data may be amplified or reversed by the underlying recommendation algorithm. In this paper, we explore how different recommendation algorithms reflect the tradeoff between ranking quality and bias disparity. Our experiments include neighborhood-based, model-based, and trust-aware recommendation algorithms.

KEYWORDS
Recommender systems, Trust ratings, Fairness, Bias disparity

1   INTRODUCTION
Recommender systems are powerful tools for extracting users' preferences and suggesting desired items. These systems, while accurate, may suffer from a lack of fairness toward specific groups of users. Research in fairness-aware recommender systems has shown that the outputs of recommendation algorithms are, in some cases, biased against protected groups [7]. Such discrimination degrades users' satisfaction, loyalty, and the effectiveness of recommender systems, and at worst it can lead to or perpetuate undesirable social dynamics.

Discrimination in recommendation output can originate from different sources. It may stem from underlying biases in the input data [4, 25] used for training. Alternatively, the discriminative behavior may be the result of the recommendation algorithms themselves [13, 27, 28].

In this paper, we examine the effectiveness of recommendation algorithms in capturing different groups' interests across item categories. We compare different recommendation algorithms in terms of how they capture the categorical preferences of users and reflect them in the recommendations delivered.

It is important to note that, although we do not directly measure the fairness of recommendation algorithms in this paper, we study the bias disparity of recommendation algorithms as an important factor that affects fairness. The benefit of studying bias disparity in recommender systems is that, depending on the domain, knowing which algorithms produce more or less disparity from users' stated preferences can allow system designers to better control the recommendation output. In our analysis of bias disparity, we also take into account item coverage in recommended lists. A recommendation algorithm with higher item coverage gives the majority of item providers in the system a more equal chance of being shown to users.

Our analysis includes a variety of recommendation algorithms: neighborhood models, factorization models, and trust-aware recommendation algorithms. In particular, we investigate the performance of trust-aware recommendation algorithms. In these algorithms, besides item ratings, explicit trust ratings are used as side information to enhance the quality of the input to the recommender system. It has been shown that using explicit trust ratings provides advantages for recommender systems [20]. First, since trust ratings can be propagated, they can help overcome the cold-start issue. Second, trust-aware methods are robust against shilling attacks [16]. In this paper, we also analyze the performance of these algorithms in addressing bias disparity.

The motivation behind this research is to analyze how recommendation algorithms deviate from the preferences of specific groups of users (e.g., male vs. female) across item categories. Given protected and unprotected groups, we aim to compare the ability of recommendation algorithms to generate recommendations equally well for each group based on their preferences in the training data. Therefore, no matter what the context of the dataset is, given protected/unprotected groups and item categories, we are interested in comparing recommendation algorithms on their ability to recommend preferred item categories to these groups of users.

For our experiments, we prepared a sample of the publicly available Yelp dataset for research on fairness-aware recommender systems. Our experiments are performed on multiple recommendation algorithms, and the results are evaluated in terms of bias disparity and average disparity, along with ranking quality and item coverage.

∗ Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Presented at the RMSE workshop held in conjunction with the 13th ACM Conference on Recommender Systems (RecSys), 2019, in Copenhagen, Denmark.
† This author also has an affiliation with the School of Computing, DePaul University, Chicago, USA; mmansou4@depaul.edu.


2   BACKGROUND
The problem of unfair outputs in machine learning applications is well studied [3, 6, 12], and this line of work has been extended to recommender systems. Various studies have considered fairness in recommendation results [4].

One research direction in fairness-aware recommender systems is providing fair recommendations for consumers. Burke et al. [4] have shown that adding a balanced neighborhood regularization to the SLIM algorithm can improve the equity of recommendations for protected and unprotected groups. Based on their definition of protected and unprotected groups, their solution takes into account the group fairness of recommendation outputs. Analogously, Yao and Huang [27] improved the equity of recommendation results by adding fairness terms to the objective function of model-based recommendation algorithms. They proposed four fairness metrics that capture the degree of unfairness in recommendation outputs and added these metrics to the learning objective function to optimize it for fair results.

Zhu et al. [29] proposed a fairness-aware tensor-based recommender system to improve the equity of recommendations while maintaining recommendation quality. The idea in their paper is to isolate sensitive information from the latent factor matrices of the tensor model and then use this information to generate fairness-aware recommendations.

Besides consumer fairness, provider fairness is another research direction in fairness-aware recommender systems. Provider fairness refers to items belonging to each provider having an equal chance of being shown in the recommended lists. This is known as popularity bias and is usually measured by item coverage.

Abdollahpouri et al. [2] addressed popularity bias in learning-to-rank algorithms by including a fairness-aware regularization term in the objective function. They showed that the fairness-aware regularization term controls how strongly the recommendations skew toward popular items.

Jannach et al. [11] conducted a comprehensive analysis of the popularity bias of several recommendation algorithms. They analyzed the items recommended by different recommendation algorithms in terms of their average ratings and their popularity. While the results depend strongly on the characteristics of the datasets, they found that some algorithms (e.g., SlopeOne, KNN techniques, and the ALS variants of factorization models) focus mostly on high-rated items, which biases them toward a small set of items (low coverage). They also found that some algorithms (e.g., the ALS variants of factorization models) tend to recommend popular items, while other algorithms (e.g., UserKNN and SlopeOne) tend to recommend less popular items.

Multi-stakeholder recommender systems simultaneously take into account the fairness of all stakeholders or entities in a multi-sided platform. The main goal of multi-stakeholder recommendation is maximizing the fairness of all stakeholders. Consumers and providers are the major stakeholders in most multi-sided platforms [1, 5].

Sürer et al. [30] proposed a multi-stakeholder optimization model that works as a post-processing approach for standard recommendation algorithms. In this model, a set of constraints for providers is considered when generating recommendation lists for end users. Also, Liu and Burke [17] proposed a fairness-aware re-ranking approach that iteratively balances ranking quality and provider fairness. In this post-processing approach, users' tolerance for list diversity is also considered to find a trade-off between accuracy and provider fairness.

3   FAIRNESS METRICS
In this paper, we compare the performance of state-of-the-art recommendation algorithms in terms of bias disparity in recommended lists. We also consider the ranking quality and item coverage of recommendation algorithms as two important additional metrics.

We use two metrics to measure changes in bias for groups of users given item categories: bias disparity and average disparity.

Bias disparity measures how much a group's recommendation lists deviate from its original preferences in the training set [25]. Given a group of users, G, and an item category, C, bias disparity is defined as follows:

    BD(G, C) = \frac{B_R(G, C) - B_T(G, C)}{B_T(G, C)}    (1)

where B_T (B_R) is the bias value of group G on category C in the training data (recommendation lists). B_T is defined by:

    B_T(G, C) = \frac{PR_T(G, C)}{P(C)}    (2)

where P(C) is the fraction of items of category C in the dataset, defined as |C|/|m| (the number of items in C over the total number of items m). PR_T is the preference ratio of group G on category C, calculated as:

    PR_T(G, C) = \frac{\sum_{u \in G} \sum_{i \in C} T(u, i)}{\sum_{u \in G} \sum_{i \in I} T(u, i)}    (3)

where T is the binarized user-item matrix: if user u has rated item i, then T(u, i) = 1, otherwise T(u, i) = 0.

The bias value of group G on category C in the recommendation lists, B_R, is defined similarly.

On the other hand, average disparity measures how much the preference disparity between the training data and the recommendation lists for one group of users (e.g., the unprotected group) differs from that of another group of users (e.g., the protected group). Inspired by the value unfairness metric proposed by Yao and Huang [27], we introduce average disparity as:

    disparity = \frac{1}{|C|} \sum_{i=1}^{|C|} \left| \left( N_R(G_U, C_i) - N_T(G_U, C_i) \right) - \left( N_R(G_P, C_i) - N_T(G_P, C_i) \right) \right|    (4)

where G_U and G_P are the unprotected and protected groups, respectively. N_R(G, C) and N_T(G, C) return the number of items from category C in the recommendation lists and the training data, respectively, that are rated by users in group G.

As part of our analysis, we also measure the item coverage of recommended lists, which is an important consideration in provider-side fairness. Given the whole set of items in the system, I, and the whole set of recommendation lists for all users, R_{all}, item coverage measures what percentage of the items in the system appear in the recommendation lists and can be calculated as:

    coverage = 100 \cdot \frac{|\{i : i \in (R_{all} \cap I)\}|}{|I|}    (5)
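To make the metric definitions above concrete, the following is a minimal Python sketch of Equations 1-5; it is not taken from the paper's librec-auto pipeline, and the data structures (a dict mapping each user to a set of rated or recommended items, a dict mapping each category to its set of items) and all function names are our own illustrative choices.

```python
def preference_ratio(interactions, group, category_items):
    """Eq. 3: share of the group's interactions that fall inside the category.
    `interactions` maps user -> set of items (the binarized matrix T)."""
    in_cat = total = 0
    for u in group:
        items = interactions.get(u, set())
        total += len(items)
        in_cat += len(items & category_items)
    return in_cat / total if total else 0.0

def bias(interactions, group, category_items, all_items):
    """Eq. 2: preference ratio normalized by the category's share of the catalog."""
    p_c = len(category_items) / len(all_items)
    return preference_ratio(interactions, group, category_items) / p_c

def bias_disparity(train, recs, group, category_items, all_items):
    """Eq. 1: relative change in bias between training data and recommendations."""
    b_t = bias(train, group, category_items, all_items)
    b_r = bias(recs, group, category_items, all_items)
    return (b_r - b_t) / b_t

def average_disparity(train, recs, unprotected, protected, categories):
    """Eq. 4: mean absolute difference, over categories, between the two groups'
    deviations in category counts from training data to recommendations."""
    def n(interactions, group, category_items):
        return sum(len(interactions.get(u, set()) & category_items) for u in group)
    total = 0.0
    for cat_items in categories.values():
        dev_u = n(recs, unprotected, cat_items) - n(train, unprotected, cat_items)
        dev_p = n(recs, protected, cat_items) - n(train, protected, cat_items)
        total += abs(dev_u - dev_p)
    return total / len(categories)

def item_coverage(recs, all_items):
    """Eq. 5: percentage of catalog items appearing in any recommendation list."""
    recommended = set().union(*recs.values()) if recs else set()
    return 100.0 * len(recommended & all_items) / len(all_items)
```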

4   EXPERIMENTS

4.1   Experimental setup
To compare the effects of recommendation algorithms on bias and on item coverage, we performed extensive experiments on state-of-the-art recommendation algorithms. Experiments were performed on model-based, neighborhood-based, and trust-aware recommendation algorithms.

Our experiments on neighborhood-based recommendation algorithms include user-based collaborative filtering (UserKNN) [22] and item-based collaborative filtering (ItemKNN) [23]. Our experiments on model-based recommendation algorithms include biased matrix factorization (BiasedMF) [15], the combined explicit and implicit model (SVD++) [14], list-wise matrix factorization (ListRankMF) [24], and the sparse linear method (SLIM) [21]. Finally, our experiments on trust-aware recommendation algorithms include the trust-aware neighborhood model (TrustKNN) [20], trust-based singular value decomposition (TrustSVD) [9], the social regularization-based method (SoReg) [18], trust-based matrix factorization (TrustMF) [26], and social matrix factorization (SocialMF) [10]. Besides these well-known recommendation algorithms, we also performed experiments on two naive algorithms: random and most popular.

For sensitivity analysis, we performed extensive experiments with different parameter configurations for each algorithm. Table 1 shows the parameter configurations used in our experiments.

Table 1: Parameter configuration

    parameter                  values
    #neighbors                 {10, 20, 30, 40, 50, 70, 100, 200}
    shrinkage                  {10, 30, 50, 100, 200}
    similarity                 {pcc, cos}
    user regularization        {0.0001, 0.001, 0.005, 0.01}
    item regularization        {0.0001, 0.001, 0.005, 0.01}
    bias regularization        {0.0001, 0.001, 0.005, 0.01}
    implicit regularization    {0.0001, 0.001, 0.005, 0.01}
    learning rate              {0.0001, 0.001, 0.005, 0.01}
    #iterations                {10, 30, 50, 100}
    #factors                   {10, 30, 50, 100, 150, 200, 300}
    ℓ1-norm                    {0.005, 0.05, 0.5, 2, 5}
    ℓ2-norm                    {0.005, 0.05, 0.5, 2, 5}

We performed 5-fold cross-validation and, in the test condition, generated recommendation lists of size 10 for each user. We then evaluated nDCG, item coverage, bias disparity, and average disparity at list size 10. Results were averaged over all users and then over all folds. We used librec-auto and LibRec 2.0 for all experiments [8, 19].
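As a rough illustration of how the value sets in Table 1 could be expanded for the sensitivity analysis, the following Python sketch enumerates per-run configurations with a Cartesian product; it is not part of the librec-auto configuration format, and the grouping of parameters by algorithm family is our assumption.

```python
from itertools import product

# Hypothetical subsets of Table 1 relevant to two of the algorithm families.
grids = {
    "UserKNN": {
        "neighbors": [10, 20, 30, 40, 50, 70, 100, 200],
        "shrinkage": [10, 30, 50, 100, 200],
        "similarity": ["pcc", "cos"],
    },
    "BiasedMF": {
        "factors": [10, 30, 50, 100, 150, 200, 300],
        "learning_rate": [0.0001, 0.001, 0.005, 0.01],
        "user_reg": [0.0001, 0.001, 0.005, 0.01],
        "iterations": [10, 30, 50, 100],
    },
}

def expand(grid):
    """Yield one configuration dict per point in the product of the value sets."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

for algo, grid in grids.items():
    print(algo, len(list(expand(grid))), "configurations")
```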
4.2   Yelp dataset
For our experiments, we use a subset of the Yelp dataset from round 12 of the Yelp Challenge¹. In this sample, each user has rated at least 40 businesses and each business has been rated by at least 40 users. Thus, there are 1,355 users who provided 100,409 ratings on 1,272 businesses. The range of ratings is 1 (not preferred) to 5 (preferred). The density of the rating matrix is 5.826%.

This Yelp dataset also contains information about users' friendships: each user has selected a set of other users as friends. We interpret these relationships as a trust network. When user A selects user B as a friend, it means that user A trusts user B with respect to the corresponding domain or category. In this dataset, 919 users have expressed trust toward 1,172 users, and there are 26,453 trust relationships between users. With respect to the number of users, the density of the trust matrix is 2.456%.

In order to evaluate the recommendation outputs in terms of bias disparity and average disparity, specific information about users and items is needed. First, we need to define user groups based on users' demographic information and item categories based on item content. The Yelp dataset contains no useful information for defining user groups. To overcome this issue, we prepared the dataset by inferring users' genders from users' names. To do this, we use an existing online tool² which, for each user name given as input, returns the predicted gender, the number of samples used for the prediction, and the prediction accuracy. This enables us to increase the reliability of the extracted genders by keeping only outputs with high accuracy and a fair number of samples.

Moreover, information about items' categories is provided in the dataset: each business in the Yelp dataset is assigned multiple relevant categories.

Overall, the prepared dataset has four separate parts:
    1. The rating data that each user provided for businesses.
    2. Explicit trust data indicating which users each user has selected as trusted (friends).
    3. User information consisting of users' gender.
    4. Item categories consisting of several categories for each business.

Using this dataset, we define the set G = <male, female> and the set C as the categories assigned to each business. The dataset is available at https://github.com/masoudmansoury/yelp_core40.

¹ https://www.yelp.com/dataset
² https://gender-api.com
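As a rough illustration of the gender-filtering step described above, here is a minimal Python sketch; the CSV layout (user_id, name, gender, samples, accuracy) and the threshold values are our assumptions, since the paper only states that predictions with high accuracy and a fair number of samples were kept.

```python
import csv

# Assumed thresholds for keeping a predicted gender (not reported in the paper).
MIN_ACCURACY = 90   # percent
MIN_SAMPLES = 100

def load_reliable_genders(path):
    """Read predictions saved from the gender-inference tool (hypothetical CSV
    with columns user_id, name, gender, samples, accuracy) and keep only the
    predictions that look reliable enough to define the user groups."""
    genders = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if (int(row["samples"]) >= MIN_SAMPLES
                    and float(row["accuracy"]) >= MIN_ACCURACY):
                genders[row["user_id"]] = row["gender"]
    return genders

def build_groups(genders):
    """Split users into the two groups G = <male, female> used in the analysis."""
    male = {u for u, g in genders.items() if g == "male"}
    female = {u for u, g in genders.items() if g == "female"}
    return male, female
```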




Figure 1 (panels (a) Male and (b) Female): Bias disparity for model-based recommendation algorithms. The x-axis is the top 10 most preferred categories for male and female users in the training data and the y-axis is the bias value computed by Equation 2. The number on each bar shows the bias disparity computed by Equation 1. Numbers in bold show the lowest bias disparity for each category.


4.3   Experimental results
In this section, we compare the performance of the recommendation algorithms across the different metrics discussed earlier. First, we show the bias disparity of the recommendation results on the top 10 most preferred item categories. Second, we show the average disparity of each algorithm over all categories. For a sensible comparison, we also take into account ranking quality and item coverage.

4.3.1 Bias disparity. Results for the model-based recommendation algorithms on the top 10 most preferred item categories for male and female users are shown in Figure 1. Figure 1a shows the bias disparity for male users and Figure 1b shows the bias disparity for female users. Since there is always a trade-off between accuracy and non-accuracy metrics (e.g., nDCG vs. fairness), the fairness analysis is conducted on recommendation outputs that give the same (highest possible) nDCG for all recommendation algorithms. For the model-based recommendation algorithms, the nDCG value is set to 0.023 ± 0.001. This setting guarantees that the fairness of the recommendation algorithms is compared under the same conditions for all algorithms.

As shown in Figure 1, in most cases SoReg provides lower bias disparity on the top 10 most preferred categories for the male and female groups. For males, in Figure 1a, SoReg and SLIM generated more stable outputs compared to the other algorithms, achieving the lowest bias disparity in 40% of the cases. For females, SoReg and ListRankMF generated the recommendations with the lowest bias disparity in 50% and 40% of the cases, respectively, compared to the other recommendation algorithms.

In Figure 1, we did not report results for BiasedMF, SVD++, SocialMF, TrustMF, or the random and most popular item recommendations, because these algorithms either did not recommend any items from the top 10 most preferred categories or their ranking quality was lower than the value specified for the other algorithms.

Results for the neighborhood-based recommendation algorithms for the male and female groups are shown in Figure 2. The nDCG values for the neighborhood algorithms are all set to 0.074 ± 0.01. Figure 2a shows the bias disparity of the neighborhood models for males. TrustKNN generated more stable recommendations compared to the other algorithms on 50% of the top 10 most preferred categories, and for the other categories its output is very close to the best one. An even better result in terms of bias disparity can be observed in Figure 2b for females: on 60% of the top 10 most preferred categories, TrustKNN worked better than the other neighborhood algorithms.
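One way to read the matched-nDCG protocol described above is the following Python sketch: for each algorithm, among its hyperparameter runs, keep only those whose nDCG falls inside the shared tolerance band and then compare their fairness metrics. This is our interpretation; the run records and field names are assumed, not taken from the paper.

```python
def select_comparable_runs(runs, target_ndcg, tol):
    """Keep, per algorithm, the run whose nDCG is closest to the shared target,
    provided it falls inside the tolerance band (e.g., 0.023 +/- 0.001 for the
    model-based algorithms). `runs` is a list of dicts with at least the keys
    'algorithm' and 'ndcg', plus the fairness metrics of interest."""
    best = {}
    for run in runs:
        gap = abs(run["ndcg"] - target_ndcg)
        if gap > tol:
            continue
        algo = run["algorithm"]
        if algo not in best or gap < abs(best[algo]["ndcg"] - target_ndcg):
            best[algo] = run
    return best
```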




Figure 2 (panels (a) Male and (b) Female): Bias disparity for memory-based recommendation algorithms. The x-axis is the top 10 most preferred categories for male and female users in the training data and the y-axis is the bias value computed by Equation 2. The number on each bar shows the bias disparity computed by Equation 1. Numbers in bold show the lowest bias disparity for each category.


4.3.2 Average disparity. Figure 3 compares the performance of the recommendation algorithms with respect to two criteria: 1) how well they generate stable (i.e., low-disparity) recommendations for the unprotected and protected groups, and 2) how well they recommend items belonging to all providers equally when generating recommendations (provider-side fairness).

Across all the experiments that we performed with different hyperparameters, the best and worst nDCG for each algorithm are reported in Figure 3.

The random guess algorithm is a naive approach that randomly recommends a list of items to each user. Although this algorithm has low accuracy, it has the highest item coverage and lower average disparity compared to the other recommendation algorithms. This algorithm does not take any preferences into account and is unlikely to provide good results for any user. Most popular item recommendation is another naive, non-personalized algorithm that only recommends the items with the highest number of ratings to each user. Although it has high ranking quality and average disparity similar to the model-based recommendation algorithms, it has the lowest item coverage. These algorithms provide baselines that the other algorithms should be expected to beat.

Among the neighborhood models, TrustKNN showed better performance. Although it has lower ranking quality than UserKNN and ItemKNN, it has significantly better item coverage and average disparity. One possible reason for the low nDCG of TrustKNN is the high sparsity of the trust matrix; using a propagation model to reduce the sparsity of the trust matrix may increase the ranking quality of TrustKNN. Overall, the neighborhood algorithms worked better than the model-based algorithms in terms of all metrics. This is due to the fact that the rating data for these experiments is very dense and all users are heavy raters.

Among the model-based algorithms, SLIM shows better performance compared to the other algorithms. As shown in Figure 3a, while achieving high nDCG, it has the lowest average disparity, and in terms of item coverage it is comparable to the other model-based algorithms.




Figure 3 (panels (a) nDCG vs. average disparity and (b) nDCG vs. item coverage): Comparison of recommendation algorithms by ranking quality and item coverage/average disparity.


This result is also consistent with the definition of the SLIM algorithm, which is an extension of ItemKNN; analogous to the neighborhood algorithms, it showed strong performance.

In addition, ListRankMF is another model-based algorithm that, although having high accuracy and item coverage, has average disparity as high as the other algorithms. Also, among the model-based trust-aware recommendation algorithms, although SoReg showed a significant reduction in bias disparity on the top 10 most preferred categories, it did not improve the average disparity over all categories.

5   CONCLUSION
In this paper, we examined the effectiveness of recommendation algorithms in generating outputs with lower bias disparity for different groups of users across item categories. We measured the performance of recommendation algorithms in terms of bias disparity on the top 10 most preferred item categories, average disparity, ranking quality, and item coverage. A comprehensive set of experiments showed that neighborhood models work significantly better than the other algorithms; in particular, the trust-aware neighborhood model outperformed the other algorithms. We also observed that, in most cases, having additional information along with the rating data can enhance the performance of recommender systems.

For future work, we would like to investigate individual fairness by considering the performance of recommendation algorithms in capturing individual users' interests across different item categories. We are also interested in repeating the experiments in this paper on another sample of the Yelp dataset with sparser rating data and denser trust data to see how well recommendation algorithms are able to control bias disparity.

REFERENCES
[1] Himan Abdollahpouri, Gediminas Adomavicius, Robin Burke, Ido Guy, Dietmar Jannach, Toshihiro Kamishima, Jan Krasnodebski, and Luiz Augusto Pizzato. 2019. Beyond Personalization: Research Directions in Multistakeholder Recommendation. CoRR abs/1905.01986 (2019). arXiv:1905.01986 http://arxiv.org/abs/1905.01986
[2] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys '17). 42–46.
[3] Engin Bozdag. 2013. Bias in algorithmic filtering and personalization. Ethics and Information Technology 15, 3 (2013), 209–227.
[4] Robin Burke, Nasim Sonboli, Masoud Mansoury, and Aldo Ordoñez-Gauger. 2017. Balanced neighborhoods for fairness-aware collaborative recommendation. In RecSys Workshop on Fairness, Accountability and Transparency in Recommender Systems.
[5] Robin D. Burke, Himan Abdollahpouri, Bamshad Mobasher, and Trinadh Gupta. 2016. Towards Multi-Stakeholder Utility Evaluation of Recommender Systems. In UMAP (Extended Proceedings).
[6] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 214–226.
[7] Michael D. Ekstrand, Mucun Tian, Ion Madrazo Azpiazu, Jennifer D. Ekstrand, Oghenemaro Anuyah, David McNeill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In?: Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Conference on Fairness, Accountability and Transparency. 172–186.
[8] Guibing Guo, Jie Zhang, Zhu Sun, and Neil Yorke-Smith. 2015. LibRec: A Java Library for Recommender Systems. In UMAP Workshops.
[9] Guibing Guo, Jie Zhang, and Neil Yorke-Smith. 2015. TrustSVD: collaborative filtering with both the explicit and implicit influence of user trust and of item ratings. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[10] Mohsen Jamali and Martin Ester. 2010. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the Fourth ACM Conference on Recommender Systems. 135–142.
[11] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.
[12] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. 2010. Discrimination aware decision tree learning. In 2010 IEEE International Conference on Data Mining. 869–874.
[13] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. 2011. Fairness-aware learning through regularization approach. In 11th International Conference on Data Mining Workshops. 643–650.
[14] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[15] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[16] Shyong K. Lam and John Riedl. 2004. Shilling recommender systems for fun and profit. In Proceedings of the 13th International Conference on World Wide Web. ACM, 393–402.
[17] Weiwen Liu and Robin Burke. 2018. Personalizing Fairness-aware Re-ranking. CoRR abs/1809.02921 (2018). arXiv:1809.02921 http://arxiv.org/abs/1809.02921
[18] Hao Ma, Dengyong Zhou, Chao Liu, Michael R. Lyu, and Irwin King. 2011. Recommender systems with social regularization. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. 287–296.
[19] Masoud Mansoury, Robin Burke, Aldo Ordonez-Gauger, and Xavier Sepulveda. 2018. Automating recommender systems experimentation with librec-auto. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 500–501.
[20] Paolo Massa and Paolo Avesani. 2007. Trust-aware recommender systems. In Proceedings of the 2007 ACM Conference on Recommender Systems. ACM, 17–24.


[21] Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In 2011 IEEE 11th International Conference on Data Mining (ICDM). IEEE, 497–506.
[22] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work. ACM, 175–186.
[23] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW '01). 285–295.
[24] Yue Shi, Martha Larson, and Alan Hanjalic. 2010. List-wise learning to rank with matrix factorization for collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 269–272.
[25] Virginia Tsintzou, Evaggelia Pitoura, and Panayiotis Tsaparas. 2018. Bias Disparity in Recommendation Systems. CoRR abs/1811.01461 (2018). arXiv:1811.01461 http://arxiv.org/abs/1811.01461
[26] Bo Yang, Yu Lei, Jiming Liu, and Wenjie Li. 2017. Social collaborative filtering by trust. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 8 (2017), 1633–1647.
[27] Sirui Yao and Bert Huang. 2017. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems. 2921–2930.
[28] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In International Conference on Machine Learning. 325–333.
[29] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-aware tensor-based recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1153–1162.
[30] Özge Sürer, Robin Burke, and Edward C. Malthouse. 2018. Multistakeholder recommendation with provider constraints. In Proceedings of the 12th ACM Conference on Recommender Systems. 54–62.