Bias mitigation in recommender systems to improve
diversity
Du Cheng1 , Doruk Kilitçioğlu1 and Serdar Kadıoğlu1,2
1 AI Center of Excellence, Fidelity Investments, 245 Summer St, Boston MA 02210
2 Computer Science Department, Brown University, 115 Waterman St, Providence, RI 02906

EvalRS Workshop at CIKM'22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management
du.cheng@fmr.com (D. Cheng); doruk.kilitcioglu@fmr.com (D. Kilitçioğlu); serdar.kadioglu@fmr.com (S. Kadıoğlu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract
Evaluating and mitigating bias in recommendation systems is of great practical interest in many real-world applications. This motivates the community to call for a more rounded evaluation of recommendation solutions that measures not only performance based on standard success metrics, such as hit rate and ranking, but also quality across different user groups. To this end, we propose integrating post-processing techniques to mitigate bias in recommendations and measure the effectiveness of our approach in the CIKM 2022 EvalRS Challenge.

Keywords
Recommender Systems, CIKM 2022 EvalRS Challenge, Algorithmic Fairness, Equalized Odds



1. Introduction

Recommendation systems are ubiquitous in several applications, and their success is often measured by point-wise engagement metrics. Unfortunately, this might not only hide important information when evaluating model performance but might also suffer from unwanted bias across different user and item groups.

In response to the CIKM 2022 EvalRS Challenge [1], we propose an approach (code available at https://github.com/fidelity/jurity/tree/master/evalrs) that adds both activity-based averaging and post-processing steps to a collaborative filtering baseline model for bias mitigation. Specifically, we use equalized odds calibration [2] to perturb the decisions of the recommender conditioned on protected classes to enhance fairness.

We tested this approach using the public Last.fm dataset by applying bias mitigation to improve diversity metrics. We then measure the difference between the performance over the entire test set versus the "protected" classes, such as song popularity and short user history. In the following, we provide further details on the approach taken for the EvalRS Challenge.

2. Our Methodology

Our work revolves around a standard approach as a baseline, commonly known as Collaborative Filtering [3]. While sophisticated techniques exist, e.g., based on recent advances in Deep Neural Networks [4, 5] and Transformer models [6, 7], our choice of a classical method is motivated by the desire to quantify the attribution of our post-processing approach for bias mitigation. We choose to build the model based on implicit feedback from the interaction data, given that we do not have access to direct input from users, and item features are limited.

2.1. Alternating Least Squares (ALS)

Alternating Least Squares [8] is a classical method that treats interaction data as an indication of user preferences and takes into account the associated confidence levels. It computes a user factor x_u ∈ ℝ^f and a vector y_i ∈ ℝ^f for each item such that the user preference p_ui can be expressed as x_u^T y_i. Consequently, the following cost function is minimized:

$$ \min_{x^\star, y^\star} \sum_{u,i} c_{ui}\,(p_{ui} - x_u^T y_i)^2 \;+\; \lambda \Big( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Big) \qquad (1) $$

where the second term serves to regularize the model, and c_ui denotes the confidence level of user u's preference towards item i. Here, c_ui can be computed as 1 + α r_ui, where α is the weight given to positive feedback and r_ui is the raw binarized interaction.
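To make Eq. (1) concrete, the following is a minimal, dense NumPy sketch of the alternating updates. It is illustrative only, not the implementation behind our submission, and the default hyperparameter values simply mirror the ranges explored later in Table 1.

```python
import numpy as np

def als_implicit(raw_interactions, factors=50, regularization=0.1, alpha=40, iterations=15, seed=0):
    """Dense, illustrative ALS for implicit feedback, minimizing Eq. (1).

    raw_interactions: (num_users, num_items) array of raw interaction counts r_ui.
    Returns user factors X and item factors Y such that p_ui is approximated by x_u^T y_i.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = raw_interactions.shape
    P = (raw_interactions > 0).astype(float)   # binarized preference p_ui
    C = 1.0 + alpha * raw_interactions         # confidence c_ui = 1 + alpha * r_ui
    X = 0.01 * rng.standard_normal((n_users, factors))
    Y = 0.01 * rng.standard_normal((n_items, factors))
    reg_eye = regularization * np.eye(factors)

    for _ in range(iterations):
        # Fix Y and solve the regularized least-squares problem for each user factor x_u.
        for u in range(n_users):
            Cu = np.diag(C[u])
            X[u] = np.linalg.solve(Y.T @ Cu @ Y + reg_eye, Y.T @ Cu @ P[u])
        # Fix X and solve for each item factor y_i.
        for i in range(n_items):
            Ci = np.diag(C[:, i])
            Y[i] = np.linalg.solve(X.T @ Ci @ X + reg_eye, X.T @ Ci @ P[:, i])
    return X, Y

# Ranking scores: higher x_u^T y_i means a stronger predicted preference.
# scores = X @ Y.T
```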
2.2. Beyond ALS

We propose two directions to extend the ALS model for better performance. The first direction is to train the model on the entire dataset as well as on categorized subsets. In the Last.fm dataset, the users are categorized into three groups according to their user activity level. As such, an alternating least squares model specific to the user activity level is trained on each group. We take the top n_sum items from each model and, for the final results, recommend the top-k items with the highest average preference score across the overall model and the categorized model, as sketched below.
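As a sketch only (the function and variable names are ours, not from the released code), the averaging step for a single user could look as follows, assuming each model returns its top n_sum items with their preference scores and that items missing from one model default to a score of zero:

```python
def blended_top_k(overall_top, group_top, k=100):
    """Average the preference scores of the overall and the activity-specific ALS models.

    overall_top, group_top: dicts mapping item_id -> preference score, holding the
    top n_sum items of each model for one user.
    Returns the k item ids with the highest averaged score.
    """
    candidates = set(overall_top) | set(group_top)
    averaged = {
        item: (overall_top.get(item, 0.0) + group_top.get(item, 0.0)) / 2.0
        for item in candidates
    }
    # Recommend the k items with the highest averaged preference score.
    return sorted(candidates, key=averaged.get, reverse=True)[:k]
```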
The second direction is post-processing. The important insight is to view features such as gender, country, user activity, artist, and track popularity as criteria to create the "protected" groups. The problem can then be formulated as adjusting the model such that discrimination is mitigated from the recommendation results. One approach to achieve this is by optimizing Equalized Odds [9]. Equalized odds is defined as the independence of the predictor Ŷ and the protected membership A conditional on the true outcome Y. More formally,

$$ \Pr(\hat{Y} = 1 \mid A = 0, Y = y) \;=\; \Pr(\hat{Y} = 1 \mid A = 1, Y = y) \qquad (2) $$

given y ∈ {0, 1}.

Conceptually, the equalized odds method works as follows. First, we find the convex hull of the ROC curves of the contrasted groups such that any false-positive rate (FPR), true-positive rate (TPR) pair can be satisfied by either protected-group-conditional predictor. During training, we obtain four probabilities of flipping the likelihood of a positive prediction. Then, during prediction, we apply these learned mixing rates to the new data. The open-source Jurity library [10] offers an implementation of this method and helps us achieve our goal (https://github.com/fidelity/jurity).
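For reference, the snippet below follows Jurity's documented quick-start pattern for binary equalized odds mitigation. The toy inputs are purely illustrative, and the exact call signature may differ slightly across Jurity versions.

```python
from jurity.mitigation import BinaryMitigation

# Toy data: ground-truth labels, binary predictions, predicted likelihoods,
# and protected-group membership (e.g., 1 = low-activity user).
labels      = [1, 1, 0, 1, 0, 0, 1, 0]
predictions = [0, 0, 0, 1, 1, 1, 1, 0]
likelihoods = [0.72, 0.6, 0.6, 0.9, 0.9, 0.4, 0.7, 0.3]
is_member   = [0, 0, 0, 0, 1, 1, 1, 1]

# Training: learn the four mixing rates from labeled data.
mitigation = BinaryMitigation.EqualizedOdds()
mitigation.fit(labels, predictions, likelihoods, is_member)

# Prediction: apply the learned mixing rates to (new) data.
fair_predictions, fair_likelihoods = mitigation.transform(predictions, likelihoods, is_member)
```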
2.3. Beyond Binary Fairness Metrics

Notice, however, that there is still a gap between generating recommendations, which can be seen as multi-class, multi-label prediction, and binary mitigation techniques such as equalized odds. To bridge this gap, we propose the following approach (sketched in code at the end of this section):

      • Obtain the user-item-score matrix for user-item pairs and run softmax on it.
      • Calculate a binary cutoff point per item based on the 80% quantile of the scores.
      • Binarize the results item-wise and run equalized odds, using user activity as the protected class.
      • In case the application of equalized odds changes the binary label, go back to the softmax scores and use the complement, i.e., (1 - softmax).
      • Re-order recommendations using the new scores.

The result of this process is a normalized set of scores where, for each item, the decision of whether that item is recommended to a user is now unbiased between high-activity and low-activity users. As such, we would expect to see a lower difference in metrics when comparing high-activity and low-activity users, which is exactly what MRED_USER_ACTIVITY measures. This technique can be applied to both user activity and item popularity.

In practice, binarization across all users (items) and equalizing odds per user (item) is costly. We would need one mitigation model per user to mitigate differences in track popularity and, analogously, one mitigation model per item when balancing user activity. To simplify the process, we restrict the mitigation to the top-m items with the highest engagement and focus it on user activity.
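The sketch below spells out the five steps above under our own illustrative names (scores, is_low_activity, and mitigate_item are assumptions, not identifiers from the released code). The per-item mitigation is passed in as a callable, for instance one backed by a fitted Jurity equalized odds model.

```python
import numpy as np

def debias_rankings(scores, is_low_activity, mitigate_item, top_k=100):
    """Illustrative version of the bridging procedure in Section 2.3.

    scores: (num_users, num_items) raw user-item scores from the recommender.
    is_low_activity: (num_users,) binary protected-class membership per user.
    mitigate_item: callable(binary_labels, likelihoods, is_member) -> debiased binary labels
                   for a single item, e.g. backed by an equalized odds mitigation model.
    Returns the per-user indices of the top_k items after mitigation.
    """
    # Step 1: softmax the scores per user so they behave like likelihoods.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    soft = exp / exp.sum(axis=1, keepdims=True)

    adjusted = soft.copy()
    for item in range(soft.shape[1]):
        col = soft[:, item]
        # Step 2: binary cutoff per item at the 80% quantile of its scores.
        cutoff = np.quantile(col, 0.8)
        # Step 3: binarize item-wise and run equalized odds with user activity as the protected class.
        binary = (col >= cutoff).astype(int)
        debiased = np.asarray(mitigate_item(binary, col, is_low_activity))
        # Step 4: where the binary label flipped, fall back to the complement (1 - softmax).
        flipped = debiased != binary
        adjusted[flipped, item] = 1.0 - col[flipped]
    # Step 5: re-order recommendations using the adjusted scores.
    return np.argsort(-adjusted, axis=1)[:, :top_k]
```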
3. Experiments

3.1. The Challenge Data

Our initial analysis is based on the transformed version of the LFM-1b dataset. This dataset contains >119K distinct users, >820K tracks, and >37M listening events. Our primary data source remains the implicit user feedback from the interaction data. Based on MRED_USER_ACTIVITY, we separate users into the [1, 100, 1000] activity groups and evaluate recommendation results across the groups.
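As a small illustration (the variable names are ours), this activity grouping can be expressed as a binning step over each user's listening-event count:

```python
import numpy as np

# Number of listening events per user (illustrative values).
user_listen_counts = np.array([3, 57, 240, 1800, 12, 950])

# Assign each user to an activity group using the [1, 100, 1000] bin edges,
# mirroring the MRED_USER_ACTIVITY grouping described above.
activity_group = np.digitize(user_listen_counts, bins=[1, 100, 1000])
# activity_group -> array([1, 1, 2, 3, 1, 2])
```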
3.2. Additional Testing

We extend the testing suite provided by RecList [11] with a custom test aimed at fairness in recommender systems. More specifically, we look at the intersection between user activity and track popularity, evaluating whether there is a material difference in the popularity of the tracks that are recommended to users with differing activity.

To evaluate our new metric, inspired by MRED_USER_ACTIVITY, we first binarize users into high- and low-activity groups (using 1000 listens as a cut-off). We then bin tracks into [1, 10, 100, 1000] groups, similar to MRED_TRACK_POPULARITY. For each user, we look at which track popularity groups they are recommended and the activity group they belong to. We then utilize the multi-class statistical parity measure from Jurity [10] to measure fairness, as illustrated below.
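The snippet below is a hand-rolled sketch of what this comparison captures: for each popularity group, the gap between how often that group is recommended to low- versus high-activity users. It is conceptually aligned with, but not identical to, Jurity's multi-class statistical parity implementation; all names are illustrative.

```python
import numpy as np

def popularity_parity_gaps(rec_popularity_bins, user_is_low_activity, popularity_groups=(0, 1, 2, 3)):
    """Custom fairness check across user activity and track popularity.

    rec_popularity_bins: (num_users, k) popularity-group index of each recommended track.
    user_is_low_activity: (num_users,) array with 1 for users below the 1000-listen cutoff.
    Returns, per popularity group, the difference between the recommendation rates
    for low- and high-activity users (the statistical parity gap).
    """
    low = rec_popularity_bins[user_is_low_activity == 1].ravel()
    high = rec_popularity_bins[user_is_low_activity == 0].ravel()
    gaps = {}
    for group in popularity_groups:
        gaps[group] = np.mean(low == group) - np.mean(high == group)
    return gaps
```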
3.3. Numerical Results

We focus on optimizing the model performance in two aspects: i) the standard performance metrics and ii) the diversity metrics. The metrics are calculated using the RecList [11] library provided by the organizers.

Table 1 presents the set of hyperparameters considered when training the model on the whole dataset to find the configuration with the highest performance score.

                      alpha    Regularization    Factors     Score     Hit Rate
                       0.1          0.1             50       -21.82     0.046
                        1           0.05            50       -27.19     0.062
                       10           0.1             50       -27.41     0.075
                       20           0.2             50       -26.75     0.076
                       40           0.1             50       -23.54     0.075

Table 1
Hyperparameter tuning of the ALS algorithm. We list 5 out of 100 configurations for brevity.
                      Model                     Score     Hit Rate    MRED_USER_ACTIVITY              Runtime
                 CBOW Baseline                  -1.212      0.036               -0.022                  N/A
                       ALS                     -21.823      0.046               -0.007              3 min per fold
            User Activity Specific ALS           -100       0.004               -0.001             10 min per fold
       ALS + Averaging (with n_sum = 500)       -11.31      0.027              -0.0086             19 min per fold
       ALS + Averaging (with n_sum = 1000)      -6.670      0.017               -0.005             25 min per fold
              ALS + Post-processing            -18.761      0.042               -0.006              4 min per fold

Table 2
Comparison of post-processing with hyper-parameter tuned ALS model. The parameter n_sum specifies the number of top
items from each model that is used in the calculation. Our final challenge submission is ALS + Averaging with n_sum = 500.


  Table 2 summarizes our attempts that involve training and evaluating ALS, User Activity Specific ALS, the Averaged models, and the Post-processing algorithm for bias mitigation to balance performance and diversity metrics.


4. Discussion

4.1. The Impact of Averaging

We compare our work with the CBOW baseline the challenge organizers provided and with other solutions. Our overall score is lower than the CBOW baseline, while our hit rate is lower but similar. Remember that our approach for averaging specifically targets the MRED_USER_ACTIVITY metric. Hence, as expected, our score on this metric is better than the baseline.

One immediate observation from the challenge results presented in Leaderboard - II (https://reclist.io/cikm2022-cup/leaderboard.html) is that, due to the aggregated scoring scheme, it is non-trivial to compare algorithms. Different methods exhibit different strengths. Notice that scores in Leaderboard - II are calculated based on the previous statistics achieved in Leaderboard - I.

In terms of traditional performance metrics, it is worth noting that our hit-rate performance is within the top-5 solutions. This is interesting, given that we only utilized a standard recommendation algorithm. Without post-processing, our hit rate would be even higher.

In line with the rounded evaluation objective of the competition, the top-scoring solution does not strike a high hit rate either. This is even the case for the two-tower deep neural networks: as evident in Leaderboard - I, their performance trails behind the classical CBOW.

In terms of extended metrics, our results on the MRED_USER_ACTIVITY metric considerably improve over the CBOW baseline and rank 7th out of 14, discarding solutions with -100 performance scores. Our relatively good performance on MRED_USER_ACTIVITY, even when our overall scores are not the best, is empirical evidence for the efficacy of the ensemble modeling approach. However, when targeting one specific metric, our overall results have suffered.

An area for future improvement is to focus on how to utilize averaging in such a way that it benefits more than a single protected class.

4.2. The Impact of Post-Processing

We compare our post-processing results with the baseline CBOW model and our baseline ALS model. Our overall score is lower than the CBOW baseline model, while our hit rate is higher. Our MRED_USER_ACTIVITY is again higher than this baseline since our post-processing targets this metric. Compared to our baseline ALS model, our overall score increases due to an increase in MRED_USER_ACTIVITY, and our hit rate worsens. Further, compared to our averaging model, the post-processing sacrifices less hit rate at the expense of achieving less improvement in MRED_USER_ACTIVITY.

Since the post-processing involves fitting an equalized odds model per item, we quickly hit high runtimes when using more than 1000 items. By utilizing the most popular items, we increase the impact we get from each trained model. However, our results show that even though the post-processing accomplishes an improvement directionally, our solution based on averaging performs better. Combining post-processing with averaging leads to the best result on MRED_USER_ACTIVITY in the leaderboard at the expense of hit rate. Therefore, our final submission for the CIKM 2022 EvalRS Challenge is the averaging model with n_sum = 500.


5. Conclusion

In this work, we augmented the well-known collaborative filtering algorithm with ensembles and bias mitigation to strike a balance between performance and diversity. This carefully crafted CIKM Challenge goes beyond standard metrics, provides the easy-to-use RecList library, and raises awareness for a rounded evaluation. In the same spirit, we focused on mitigating bias with respect to diversity metrics, leveraged the Jurity library, and demonstrated encouraging results. We showed how to use existing algorithmic fairness metrics for recommendations and extended equalized odds beyond binary classification.
References

[1]  J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.
[2]  M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, 2016. URL: https://arxiv.org/abs/1610.02413. doi:10.48550/ARXIV.1610.02413.
[3]  B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, WWW '01, Association for Computing Machinery, New York, NY, USA, 2001, pp. 285–295. URL: https://doi.org/10.1145/371920.372071. doi:10.1145/371920.372071.
[4]  M. Naumov, D. Mudigere, DLRM: An advanced, open source deep learning recommendation model, 2020.
[5]  X. Yi, J. Yang, L. Hong, D. Z. Cheng, L. Heldt, A. Kumthekar, Z. Zhao, L. Wei, E. Chi, Sampling-bias-corrected neural modeling for large corpus item recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 269–277. URL: https://doi.org/10.1145/3298689.3346996. doi:10.1145/3298689.3346996.
[6]  G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4Rec: Bridging the gap between NLP and sequential / session-based recommendation, in: Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 143–153. URL: https://doi.org/10.1145/3460231.3474255. doi:10.1145/3460231.3474255.
[7]  F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1441–1450.
[8]  Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.
[9]  M. Hardt, E. Price, E. Price, N. Srebro, Equality of opportunity in supervised learning, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
[10] F. Michalský, S. Kadıoğlu, Surrogate ground truth generation to enhance binary fairness evaluation in uplift modeling, in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021, pp. 1654–1659. doi:10.1109/ICMLA52953.2021.00264.
[11] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: Behavioral testing of recommender systems with RecList, WWW '22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 99–104. URL: https://doi.org/10.1145/3487553.3524215. doi:10.1145/3487553.3524215.