<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bias mitigation in recommender systems to improve diversity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Du Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Doruk Kilitçioğlu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serdar Kadıoğlu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AI Center of Excellence</institution>
          ,
          <addr-line>Fidelity Investments, 245 Summer St, Boston MA 02210</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, Brown University</institution>
          ,
          <addr-line>115 Waterman St, Providence, RI 02906</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluating and mitigating bias in recommendation systems is of great practical interest in many real-world applications. This motivates the community to call for a more rounded evaluation of recommendation solutions that not only measures performance based on standard success metrics, such as hit rate and ranking, but also the quality across different user groups. To this end, we propose integrating post-processing techniques to mitigate bias in recommendations and measure the effectiveness of our approach in the CIKM 2022 EvalRS Challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>CIKM 2022 EvalRS Challenge</kwd>
        <kwd>Algorithmic Fairness</kwd>
        <kwd>Equalized Odds</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recommendation systems are ubiquitous in many applications, and their success is often measured by point-wise engagement metrics. Unfortunately, this might not only hide important information when evaluating model performance but might also suffer from unwanted bias across different user and item groups.</p>
      <p>In response to the CIKM 2022 EvalRS Challenge [1], we propose1 an approach that adds both activity-based averaging and post-processing steps with equalized odds calibration [2] to a collaborative filtering baseline for bias mitigation. Specifically, we perturb the decisions of the recommender conditioned on protected classes to enhance fairness.</p>
      <p>We tested this approach using the public Last.fm dataset by applying bias mitigation to improve diversity metrics. We then measure the difference between the performance over the entire test set versus the "protected" classes, such as song popularity and short user history. In the following, we provide further details on the approach taken for the EvalRS Challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Methodology</title>
      <sec id="sec-2-0">
        <title>2.1. Alternating Least Squares (ALS)</title>
        <p>Alternating Least Squares (ALS) [8] is a classical method that treats interaction data as an indication of user preference levels. It computes a user factor x_u ∈ R^f and an item vector y_i ∈ R^f for each user u and item i such that the user preference p_ui can be expressed as x_u^T y_i. Consequently, the following cost function is minimized:</p>
        <p>min_{x⋆,y⋆} Σ_{u,i} c_ui (p_ui − x_u^T y_i)² + λ (Σ_u ||x_u||² + Σ_i ||y_i||²)   (1)</p>
        <p>where the second term serves to regularize the model, and c_ui denotes the confidence level of user u's preference towards item i. Here, c_ui can be computed as 1 + α r_ui, where α is the weight given to positive feedback, and r_ui is the raw binarized interaction.</p>
      </sec>
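      <p>To make the baseline concrete, the implicit-feedback ALS updates described in Section 2.1 can be sketched as follows. This is a minimal, self-contained illustration written for this note; the function name and hyperparameter defaults are ours and it is not the implementation used in the challenge.</p>

```python
import numpy as np

def implicit_als(r, factors=8, alpha=40.0, lam=0.1, iterations=10, seed=0):
    """Minimal ALS for implicit feedback in the spirit of Hu et al. [8].

    r: (n_users, n_items) raw interaction matrix. The preference p_ui is the
    binarized interaction, and the confidence is c_ui = 1 + alpha * r_ui with
    r_ui binarized, following the text. Defaults are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = r.shape
    p = (r > 0).astype(float)            # binarized preference p_ui
    c = 1.0 + alpha * p                  # confidence c_ui = 1 + alpha * r_ui
    x = rng.normal(scale=0.1, size=(n_users, factors))   # user factors x_u
    y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors y_i
    reg = lam * np.eye(factors)
    for _ in range(iterations):
        # Fix item factors and solve a weighted least-squares problem per
        # user, then do the symmetric update for the items.
        for u in range(n_users):
            cu = c[u][:, None]
            x[u] = np.linalg.solve((y * cu).T @ y + reg, (y * cu).T @ p[u])
        for i in range(n_items):
            ci = c[:, i][:, None]
            y[i] = np.linalg.solve((x * ci).T @ x + reg, (x * ci).T @ p[:, i])
    return x, y
```

      <p>Recommendation scores are then x @ y.T, ranked per user in descending order.</p>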
      <sec id="sec-2-1">
        <title>2.2. Beyond ALS</title>
        <p>Our work revolves around a standard baseline commonly known as Collaborative Filtering [3]. While sophisticated techniques exist, e.g., based on recent advances in Deep Neural Networks [4, 5] and Transformer models [6, 7], our choice of a classical method is motivated by the desire to quantify the attribution of our post-processing approach for bias mitigation. We choose to build the model based on implicit feedback from the interaction data, given that we do not have access to direct input from users, and item features are limited.</p>
        <p>The second direction is post-processing. The important insight is to view features such as gender, country, user activity, artist, and track popularity as criteria to create the "protected" groups. The problem can then be formulated as adjusting the model such that discrimination is mitigated from the recommendation results. One approach to achieve this is by optimizing Equalized Odds [9]. Equalized odds is defined as the independence of the predictor Ŷ and the protected membership A conditional on the true outcome Y. More formally, given y ∈ {0, 1}:</p>
        <p>P(Ŷ = 1 | Y = y, A = 0) = P(Ŷ = 1 | Y = y, A = 1)   (2)</p>
        <p>Conceptually, the equalized odds method works as
follows. First, we find the convex hull of the ROC curves of
the contrasted groups such that any false-positive rate
(FPR), true-positive rate (TPR) pair can be satisfied by
either protected-group-conditional predictor. During
training, we obtain four probabilities of flipping the likelihood
of a positive prediction. Then during prediction, we
apply these learned mixing rates on the new data. The
open-source Jurity library [10] offers an implementation
of this method and helps us achieve our goal2.</p>
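        <p>To illustrate the mechanics above, the following is a from-scratch sketch of binary equalized-odds post-processing: each group's predictions are mixed with a biased coin so that both groups end up at one common (FPR, TPR) operating point. It is a simplified stand-in written for this note, not Jurity's actual API, and it assumes two groups encoded 0/1 whose predictors beat random guessing.</p>

```python
import numpy as np

def fit_equalized_odds(y_true, y_pred, group):
    """Derive per-group mixing rates so both groups share one (FPR, TPR) point.

    Simplified sketch of equalized-odds post-processing [9]: mixing a group's
    predictor with a constant coin moves its (FPR, TPR) toward the diagonal,
    so a target inside both groups' achievable regions can be reached exactly.
    Assumes TPR is larger than FPR for each group.
    """
    rates = {}
    for g in (0, 1):
        m = group == g
        yp, yt = y_pred[m], y_true[m]
        rates[g] = (yp[yt == 0].mean(), yp[yt == 1].mean())  # (FPR, TPR)
    # Common target: the largest FPR paired with the smallest TPR lies inside
    # both groups' achievable regions under the assumption above.
    fpr_t = max(r[0] for r in rates.values())
    tpr_t = min(r[1] for r in rates.values())
    params = {}
    for g, (fpr, tpr) in rates.items():
        a = (tpr_t - fpr_t) / (tpr - fpr)  # probability of keeping the prediction
        b = 0.0 if a >= 1.0 else (fpr_t - a * fpr) / (1.0 - a)  # coin bias
        params[g] = (a, b)
    return params

def apply_equalized_odds(y_pred, group, params, rng):
    """With probability a keep the prediction; otherwise draw a coin of bias b.

    Mixed rates become a*rate + (1 - a)*b, which equals the fitted target.
    """
    out = y_pred.astype(float).copy()
    for g, (a, b) in params.items():
        m = group == g
        n = int(m.sum())
        use_coin = rng.random(n) >= a
        coin = (b > rng.random(n)).astype(float)
        vals = out[m]
        vals[use_coin] = coin[use_coin]
        out[m] = vals
    return out
```

        <p>After this adjustment, both groups exhibit (approximately, up to sampling noise of the coin flips) the same true-positive and false-positive rates.</p>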
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Beyond Binary Fairness Metrics</title>
        <p>In practice, binarization across all users (items) and equalizing odds per user (item) is costly. We need one mitigation model per user to mitigate differences in track popularity. Analogously, we need one mitigation model per item when balancing user activity. To simplify the process, we choose the top-n items with the highest engagement when the mitigation is focused on user activity.</p>
        <p>[Table 1: performance scores over the hyperparameter grid of alpha (10, 20, 40) and regularization factors (0.1, 1).]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. The Challenge Data</title>
        <p>Our initial analysis is based on the transformed version
of the LFM-1b dataset. This dataset contains &gt;119K distinct
users, &gt;820K tracks and &gt;37M listening events. Our
primary data source remains the implicit user feedback from
the interaction data. Based on MRED_USER_ACTIVITY,
we separate users into the [1, 100, 1000] activity groups,
and evaluate recommendation results across the groups.</p>
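        <p>As a minimal illustration, the activity groups can be derived with a digitize over the [1, 100, 1000] edges mentioned above; the right-open binning scheme is our assumption.</p>

```python
import numpy as np

# Hypothetical sketch: place each user into an activity group using the
# [1, 100, 1000] bin edges from the text (right-open bins assumed).
listen_counts = np.array([0, 5, 99, 100, 250, 1000, 50000])
activity_group = np.digitize(listen_counts, bins=[1, 100, 1000])
# e.g., 5 listening events fall into group 1, 250 into group 2
```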
        <p>Notice, however, that there is still a gap between generating recommendations, which can be seen as multi-class, multi-label prediction, and binary mitigation techniques, such as equalized odds. To bridge this gap, we propose the following approach:</p>
        <p>• Obtain the user-item score matrix for user-item pairs and run softmax on it.
• Calculate a binary cutoff point per item based on the 80% quantile of the scores.
• Binarize the results item-wise and run equalized odds, using user activity as a protected class.
• In case the application of equalized odds changes the binary label, go back to the softmax scores and use the complement, i.e., (1 - softmax).
• Re-order the recommendations using the new scores.</p>
        <p>The result of this process is a normalized set of scores where, for each item, the decision of whether that item is recommended to a user is unbiased between high-activity and low-activity users. As such, we would expect to see a lower difference in metrics when comparing high-activity and low-activity users, which is exactly what MRED_USER_ACTIVITY measures. This technique can be applied to both user activity and item popularity.
2https://github.com/fidelity/jurity</p>
      </sec>
      <sec id="sec-3-1b">
        <title>3.2. Additional Testing</title>
        <p>We extend the testing suite provided by RecList [11] with a custom test aimed at fairness in recommender systems. More specifically, we look at the intersection between user activity and track popularity, evaluating whether there is a material difference in the popularity of the tracks recommended to users with differing activity. To evaluate our new metric, inspired by MRED_USER_ACTIVITY, we first binarize users into high- and low-activity groups (using 1000 listens as a cut-off). We then bin tracks into [1, 10, 100, 1000] groups, similar to MRED_TRACK_POPULARITY. For each user, we look at which track popularity groups they are recommended and the activity group they belong to. We then utilize the multi-class statistical parity measure from Jurity [10] to measure fairness.</p>
      </sec>
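      <p>The five re-ranking steps listed in Section 3.1 can be sketched end-to-end as follows. The equalized-odds flip decisions enter as a plain flip_mask input, a placeholder of our own, since the per-item mitigation models are omitted from this sketch.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rerank_with_mitigation(raw_scores, flip_mask):
    """Sketch of the five-step recipe: softmax, per-item 80% quantile cutoff,
    item-wise binarization, complement scores for flipped labels, re-ranking.

    flip_mask marks the user-item pairs whose binary label the equalized-odds
    step changed; it is supplied directly here because the per-item
    mitigation models themselves are not part of this sketch.
    """
    probs = softmax(raw_scores, axis=1)             # step 1: softmax per user
    cutoff = np.quantile(probs, 0.80, axis=0)       # step 2: 80% quantile per item
    binary = probs >= cutoff                        # step 3: item-wise binarization
    adjusted = np.where(flip_mask, 1.0 - probs, probs)  # step 4: complement scores
    ranking = np.argsort(-adjusted, axis=1)         # step 5: re-order recommendations
    return binary, adjusted, ranking
```

      <p>With an all-false mask, the ranking reduces to the original score order; a flipped low-scoring item receives the complement score and moves up, mirroring how the mitigation reshuffles recommendations.</p>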
      <sec id="sec-3-2">
        <title>3.3. Numerical Results</title>
        <p>We focus on optimizing the model performance in two
aspects, i) the standard performance metrics and ii) the
diversity metrics. The metrics are calculated using the
RecList [11] library provided by the organizers.</p>
        <p>Table 1 presents the set of hyperparameters considered
when training the model on the whole dataset to find the
configuration with the highest performance score.</p>
        <p>[Table: overall and per-metric scores for the CBOW Baseline, ALS, User Activity Specific ALS, ALS + Averaging (with n_sum = 500 and n_sum = 1000), and ALS + Post-processing models.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. The Impact of Averaging</title>
        <p>We compare our work with the CBOW baseline provided by the challenge organizers and with other solutions. Our overall score is lower than the CBOW baseline, and our hit rate is lower but comparable. Recall that our averaging approach specifically targets the MRED_USER_ACTIVITY metric. Hence, as expected, our score on this metric is better than the baseline.</p>
        <p>One immediate observation from the challenge results presented in Leaderboard - II3 is that, due to the aggregated scoring scheme, it is non-trivial to compare algorithms. Different methods exhibit different strengths. Notice that scores in Leaderboard - II are calculated based on the previous statistics achieved in Leaderboard - I.</p>
        <p>In terms of traditional performance metrics, it is worth noting that our hit-rate performance is within the top-5 solutions. This is interesting, given that we only utilized a standard recommendation algorithm. Without post-processing, our hit rate would be even higher.</p>
        <p>In line with the rounded evaluation objective of the competition, the top-scoring solution does not strike a high hit rate either. This is even the case for the two-tower deep neural networks, as evident in Leaderboard - I, where their performance trails behind the classical CBOW.</p>
        <p>In terms of extended metrics, our results on the MRED_USER_ACTIVITY metric improve considerably over the CBOW baseline and rank 7th out of 14, discarding solutions with -100 performance scores. Our relatively good performance on MRED_USER_ACTIVITY, even when our overall scores are not the best, is empirical evidence for the efficacy of the ensemble modeling approach. However, when targeting one specific metric, our overall results have suffered.
3https://reclist.io/cikm2022-cup/leaderboard.html</p>
        <p>An area for future improvement is to focus on how to
utilize averaging in such a way that benefits beyond a
single protected class.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. The Impact of Post-Processing</title>
        <p>We compare our post-processing results with the baseline CBOW model and with our baseline ALS model. Our overall score is lower than the CBOW baseline model, while our hit rate is higher. Our MRED_USER_ACTIVITY is again better than this baseline since our post-processing targets this metric. Compared to our baseline ALS model, our overall score increases due to an improvement in MRED_USER_ACTIVITY, while our hit rate worsens. Further, compared to our averaging model, the post-processing sacrifices less hit rate at the expense of achieving less improvement in MRED_USER_ACTIVITY.</p>
        <p>Since the post-processing involves fitting an equalized
odds model per item, we quickly hit high runtimes when
using more than 1000 items. By utilizing the most popular
items, we increase the impact we get from each trained
model. However, our results show that even though the
post-processing accomplishes an improvement
directionally, our solution based on averaging performs better.
Combining post-processing with averaging leads to the
best result on MRED_USER_ACTIVITY in the
leaderboard at the expense of hit rate. Therefore, our final
submission for the CIKM 2022 EvalRS Challenge is the
averaging model with n_sum = 500.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we augmented the well-known collaborative filtering algorithm with ensembles and bias mitigation to strike a balance between performance and diversity.</p>
      <p>This carefully crafted CIKM Challenge goes beyond standard metrics, provides the easy-to-use RecList library, and raises awareness for a rounded evaluation. In the same spirit, we focused on mitigating bias on diversity metrics, leveraged the Jurity library, and demonstrated encouraging results. We showed how to use existing algorithmic fairness metrics for recommendations and extended equalized odds beyond binary classification.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.</p>
      <p>[2] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, 2016. URL: https://arxiv.org/abs/1610.02413. doi:10.48550/ARXIV.1610.02413.</p>
      <p>[3] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, WWW '01, Association for Computing Machinery, New York, NY, USA, 2001, pp. 285–295. URL: https://doi.org/10.1145/371920.372071. doi:10.1145/371920.372071.</p>
      <p>[4] M. Naumov, D. Mudigere, DLRM: An advanced, open source deep learning recommendation model, 2020.</p>
      <p>[5] X. Yi, J. Yang, L. Hong, D. Z. Cheng, L. Heldt, A. Kumthekar, Z. Zhao, L. Wei, E. Chi, Sampling-bias-corrected neural modeling for large corpus item recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 269–277. URL: https://doi.org/10.1145/3298689.3346996. doi:10.1145/3298689.3346996.</p>
      <p>[6] G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4Rec: Bridging the gap between NLP and sequential / session-based recommendation, in: Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 143–153. URL: https://doi.org/10.1145/3460231.3474255. doi:10.1145/3460231.3474255.</p>
      <p>[7] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1441–1450.</p>
      <p>[8] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.</p>
      <p>[9] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.</p>
      <p>[10] F. Michalský, S. Kadıoğlu, Surrogate ground truth generation to enhance binary fairness evaluation in uplift modeling, in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021, pp. 1654–1659. doi:10.1109/ICMLA52953.2021.00264.</p>
      <p>[11] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: Behavioral testing of recommender systems with RecList, WWW '22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 99–104. URL: https://doi.org/10.1145/3487553.3524215. doi:10.1145/3487553.3524215.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>