<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bias mitigation in recommender systems to improve diversity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Du Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Doruk Kilitçioğlu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serdar Kadıoğlu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AI Center of Excellence</institution>
          ,
          <addr-line>Fidelity Investments, 245 Summer St, Boston MA 02210</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, Brown University</institution>
          ,
          <addr-line>115 Waterman St, Providence, RI 02906</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluating and mitigating bias in recommendation systems is of great practical interest in many real-world applications. This motivates the community to call for a more rounded evaluation of recommendation solutions that not only measures performance based on standard success metrics, such as hit rate and ranking, but also the quality across different user groups. To this end, we propose integrating post-processing techniques to mitigate bias in recommendations and measure the effectiveness of our approach in the CIKM 2022 EvalRS Challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>CIKM 2022 EvalRS Challenge</kwd>
        <kwd>Algorithmic Fairness</kwd>
        <kwd>Equalized Odds</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recommendation systems are ubiquitous in many applications, and their success is often measured by point-wise engagement metrics. Unfortunately, this might not only hide important information when evaluating model performance but might also suffer from unwanted bias across different user and item groups.</p>
      <p>In response to the CIKM 2022 EvalRS Challenge [1], we propose1 an approach that adds both activity-based averaging and post-processing steps with equalized odds calibration [2] to a collaborative filtering baseline for bias mitigation. Specifically, we perturb the decisions of the recommender conditioned on protected classes to enhance fairness.</p>
      <p>We tested this approach using the public Last.fm dataset by applying bias mitigation to improve diversity metrics. We then measure the difference between the performance over the entire test set versus the "protected" classes, such as song popularity and short user history. In the following, we provide further details on the approach taken for the EvalRS Challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Methodology</title>
      <sec id="sec-2-0">
        <title>2.1. Alternating Least Squares (ALS)</title>
        <p>Alternating Least Squares (ALS) [8] is a classical method that treats interaction data as an indication of user preference levels. It computes a user factor x_u ∈ R^f and an item vector y_i ∈ R^f for each user u and item i such that the user preference p_ui can be expressed as x_u^T y_i. Consequently, the following cost function is minimized:</p>
        <p>min_{x⋆,y⋆} Σ_{u,i} c_ui (p_ui − x_u^T y_i)² + λ (Σ_u ||x_u||² + Σ_i ||y_i||²)   (1)</p>
        <p>where the second term serves to regularize the model, and c_ui denotes the confidence level of user u's preference towards item i. Here, c_ui can be computed as 1 + α r_ui, where α is the weight given to positive feedback, and r_ui is the raw binarized interaction.</p>
      </sec>
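      <p>To make the baseline concrete, the implicit-feedback ALS updates described in Section 2.1 can be sketched as follows. This is a minimal, self-contained illustration written for this note; the function name and hyperparameter defaults are ours and it is not the implementation used in the challenge.</p>

```python
import numpy as np

def implicit_als(r, factors=8, alpha=40.0, lam=0.1, iterations=10, seed=0):
    """Minimal ALS for implicit feedback in the spirit of Hu et al. [8].

    r: (n_users, n_items) raw interaction matrix. The preference p_ui is the
    binarized interaction, and the confidence is c_ui = 1 + alpha * r_ui with
    r_ui binarized, following the text. Defaults are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = r.shape
    p = (r > 0).astype(float)            # binarized preference p_ui
    c = 1.0 + alpha * p                  # confidence c_ui = 1 + alpha * r_ui
    x = rng.normal(scale=0.1, size=(n_users, factors))   # user factors x_u
    y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors y_i
    reg = lam * np.eye(factors)
    for _ in range(iterations):
        # Fix item factors and solve a weighted least-squares problem per
        # user, then do the symmetric update for the items.
        for u in range(n_users):
            cu = c[u][:, None]
            x[u] = np.linalg.solve((y * cu).T @ y + reg, (y * cu).T @ p[u])
        for i in range(n_items):
            ci = c[:, i][:, None]
            y[i] = np.linalg.solve((x * ci).T @ x + reg, (x * ci).T @ p[:, i])
    return x, y
```

      <p>Recommendation scores are then x @ y.T, ranked per user in descending order.</p>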
      <sec id="sec-2-1">
        <title>2.2. Beyond ALS</title>
        <p>Our work revolves around a standard baseline commonly known as Collaborative Filtering [3]. While sophisticated techniques exist, e.g., based on recent advances in Deep Neural Networks [4, 5] and Transformer models [6, 7], our choice of a classical method is motivated by the desire to quantify the attribution of our post-processing approach for bias mitigation. We choose to build the model based on implicit feedback from the interaction data, given that we do not have access to direct input from users, and item features are limited.</p>
        <p>The second direction is post-processing. The important insight is to view features such as gender, country, user activity, artist, and track popularity as criteria to create the "protected" groups. The problem can then be formulated as adjusting the model such that discrimination is mitigated from the recommendation results. One approach to achieve this is by optimizing Equalized Odds [9]. Equalized odds is defined as the independence of the predictor Ŷ and the protected membership A conditional on the true outcome Y. More formally, given y ∈ {0, 1}:</p>
        <p>P(Ŷ = 1 | Y = y, A = 0) = P(Ŷ = 1 | Y = y, A = 1)   (2)</p>
        <p>Conceptually, the equalized odds method works as
follows. First, we find the convex hull of the ROC curves of
the contrasted groups such that any false-positive rate
(FPR), true-positive rate (TPR) pair can be satisfied by
either protected-group-conditional predictor. During
training, we obtain four probabilities of flipping the likelihood
of a positive prediction. Then during prediction, we
apply these learned mixing rates on the new data. The
open-source Jurity library [10] offers an implementation
of this method and helps us achieve our goal2.</p>
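        <p>To illustrate the mechanics above, the following is a from-scratch sketch of binary equalized-odds post-processing: each group's predictions are mixed with a biased coin so that both groups end up at one common (FPR, TPR) operating point. It is a simplified stand-in written for this note, not Jurity's actual API, and it assumes two groups encoded 0/1 whose predictors beat random guessing.</p>

```python
import numpy as np

def fit_equalized_odds(y_true, y_pred, group):
    """Derive per-group mixing rates so both groups share one (FPR, TPR) point.

    Simplified sketch of equalized-odds post-processing [9]: mixing a group's
    predictor with a constant coin moves its (FPR, TPR) toward the diagonal,
    so a target inside both groups' achievable regions can be reached exactly.
    Assumes TPR is larger than FPR for each group.
    """
    rates = {}
    for g in (0, 1):
        m = group == g
        yp, yt = y_pred[m], y_true[m]
        rates[g] = (yp[yt == 0].mean(), yp[yt == 1].mean())  # (FPR, TPR)
    # Common target: the largest FPR paired with the smallest TPR lies inside
    # both groups' achievable regions under the assumption above.
    fpr_t = max(r[0] for r in rates.values())
    tpr_t = min(r[1] for r in rates.values())
    params = {}
    for g, (fpr, tpr) in rates.items():
        a = (tpr_t - fpr_t) / (tpr - fpr)  # probability of keeping the prediction
        b = 0.0 if a >= 1.0 else (fpr_t - a * fpr) / (1.0 - a)  # coin bias
        params[g] = (a, b)
    return params

def apply_equalized_odds(y_pred, group, params, rng):
    """With probability a keep the prediction; otherwise draw a coin of bias b.

    Mixed rates become a*rate + (1 - a)*b, which equals the fitted target.
    """
    out = y_pred.astype(float).copy()
    for g, (a, b) in params.items():
        m = group == g
        n = int(m.sum())
        use_coin = rng.random(n) >= a
        coin = (b > rng.random(n)).astype(float)
        vals = out[m]
        vals[use_coin] = coin[use_coin]
        out[m] = vals
    return out
```

        <p>After this adjustment, both groups exhibit (approximately, up to sampling noise of the coin flips) the same true-positive and false-positive rates.</p>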
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Beyond Binary Fairness Metrics</title>
        <p>In practice, binarization across all users (items) and equalizing odds per user (item) is costly. We need one mitigation model per user to mitigate differences in track popularity. Analogously, we need one mitigation model per item when balancing user activity. To simplify the process, we choose the top-n items with the highest engagement when the mitigation is focused on user activity.</p>
        <p>[Table 1: performance scores over the hyperparameter grid of alpha (10, 20, 40) and regularization factors (0.1, 1).]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. The Challenge Data</title>
        <p>Our initial analysis is based on the transformed version
of the LFM-1b dataset. This dataset contains &gt;119K distinct
users, &gt;820K tracks and &gt;37M listening events. Our
primary data source remains the implicit user feedback from
the interaction data. Based on MRED_USER_ACTIVITY,
we separate users into the [1, 100, 1000] activity groups,
and evaluate recommendation results across the groups.</p>
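        <p>As a minimal illustration, the activity groups can be derived with a digitize over the [1, 100, 1000] edges mentioned above; the right-open binning scheme is our assumption.</p>

```python
import numpy as np

# Hypothetical sketch: place each user into an activity group using the
# [1, 100, 1000] bin edges from the text (right-open bins assumed).
listen_counts = np.array([0, 5, 99, 100, 250, 1000, 50000])
activity_group = np.digitize(listen_counts, bins=[1, 100, 1000])
# e.g., 5 listening events fall into group 1, 250 into group 2
```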
        <p>Notice, however, that there is still a gap between generating recommendations, which can be seen as multi-class, multi-label prediction, and binary mitigation techniques, such as equalized odds. To bridge this gap, we propose the following approach:</p>
        <p>• Obtain the user-item score matrix for user-item pairs and run softmax on it.
• Calculate a binary cutoff point per item based on the 80% quantile of the scores.
• Binarize the results item-wise and run equalized odds, using user activity as a protected class.
• In case the application of equalized odds changes the binary label, go back to the softmax scores and use the complement, i.e., (1 - softmax).
• Re-order the recommendations using the new scores.</p>
        <p>The result of this process is a normalized set of scores where, for each item, the decision of whether that item is recommended to a user is unbiased between high-activity and low-activity users. As such, we would expect to see a lower difference in metrics when comparing high-activity and low-activity users, which is exactly what MRED_USER_ACTIVITY measures. This technique can be applied to both user activity and item popularity.
2https://github.com/fidelity/jurity</p>
      </sec>
      <sec id="sec-3-1b">
        <title>3.2. Additional Testing</title>
        <p>We extend the testing suite provided by RecList [11] with a custom test aimed at fairness in recommender systems. More specifically, we look at the intersection between user activity and track popularity, evaluating whether there is a material difference in the popularity of the tracks recommended to users with differing activity. To evaluate our new metric, inspired by MRED_USER_ACTIVITY, we first binarize users into high- and low-activity groups (using 1000 listens as a cut-off). We then bin tracks into [1, 10, 100, 1000] groups, similar to MRED_TRACK_POPULARITY. For each user, we look at which track popularity groups they are recommended and the activity group they belong to. We then utilize the multi-class statistical parity measure from Jurity [10] to measure fairness.</p>
      </sec>
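      <p>The five re-ranking steps listed in Section 3.1 can be sketched end-to-end as follows. The equalized-odds flip decisions enter as a plain flip_mask input, a placeholder of our own, since the per-item mitigation models are omitted from this sketch.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def rerank_with_mitigation(raw_scores, flip_mask):
    """Sketch of the five-step recipe: softmax, per-item 80% quantile cutoff,
    item-wise binarization, complement scores for flipped labels, re-ranking.

    flip_mask marks the user-item pairs whose binary label the equalized-odds
    step changed; it is supplied directly here because the per-item
    mitigation models themselves are not part of this sketch.
    """
    probs = softmax(raw_scores, axis=1)             # step 1: softmax per user
    cutoff = np.quantile(probs, 0.80, axis=0)       # step 2: 80% quantile per item
    binary = probs >= cutoff                        # step 3: item-wise binarization
    adjusted = np.where(flip_mask, 1.0 - probs, probs)  # step 4: complement scores
    ranking = np.argsort(-adjusted, axis=1)         # step 5: re-order recommendations
    return binary, adjusted, ranking
```

      <p>With an all-false mask, the ranking reduces to the original score order; a flipped low-scoring item receives the complement score and moves up, mirroring how the mitigation reshuffles recommendations.</p>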
      <sec id="sec-3-2">
        <title>3.3. Numerical Results</title>
        <p>We focus on optimizing the model performance in two
aspects, i) the standard performance metrics and ii) the
diversity metrics. The metrics are calculated using the
RecList [11] library provided by the organizers.</p>
        <p>Table 1 presents the set of hyperparameters considered
when training the model on the whole dataset to find the
configuration with the highest performance score.</p>
        <p>[Table: overall and per-metric scores for the CBOW Baseline, ALS, User Activity Specific ALS, ALS + Averaging (with n_sum = 500 and n_sum = 1000), and ALS + Post-processing models.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. The Impact of Averaging</title>
        <p>We compare our work with the CBOW baseline provided by the challenge organizers and with other solutions. Our overall score is lower than the CBOW baseline, and our hit rate is lower but comparable. Recall that our averaging approach specifically targets the MRED_USER_ACTIVITY metric. Hence, as expected, our score on this metric is better than the baseline.</p>
        <p>One immediate observation from the challenge results presented in Leaderboard - II3 is that, due to the aggregated scoring scheme, it is non-trivial to compare algorithms. Different methods exhibit different strengths. Notice that scores in Leaderboard - II are calculated based on the previous statistics achieved in Leaderboard - I.</p>
        <p>In terms of traditional performance metrics, it is worth noting that our hit-rate performance is within the top-5 solutions. This is interesting, given that we only utilized a standard recommendation algorithm. Without post-processing, our hit rate would be even higher.</p>
        <p>In line with the rounded evaluation objective of the competition, the top-scoring solution does not strike a high hit rate either. This is even the case for the two-tower deep neural networks, as evident in Leaderboard - I, where their performance trails behind the classical CBOW.</p>
        <p>In terms of extended metrics, our results on the MRED_USER_ACTIVITY metric improve considerably over the CBOW baseline and rank 7th out of 14, discarding solutions with -100 performance scores. Our relatively good performance on MRED_USER_ACTIVITY, even when our overall scores are not the best, is empirical evidence for the efficacy of the ensemble modeling approach. However, when targeting one specific metric, our overall results have suffered.
3https://reclist.io/cikm2022-cup/leaderboard.html</p>
        <p>An area for future improvement is to focus on how to
utilize averaging in such a way that benefits beyond a
single protected class.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. The Impact of Post-Processing</title>
        <p>We compare our post-processing results with the baseline CBOW model and with our baseline ALS model. Our overall score is lower than the CBOW baseline model, while our hit rate is higher. Our MRED_USER_ACTIVITY is again better than this baseline since our post-processing targets this metric. Compared to our baseline ALS model, our overall score increases due to an improvement in MRED_USER_ACTIVITY, while our hit rate worsens. Further, compared to our averaging model, the post-processing sacrifices less hit rate at the expense of achieving less improvement in MRED_USER_ACTIVITY.</p>
        <p>Since the post-processing involves fitting an equalized
odds model per item, we quickly hit high runtimes when
using more than 1000 items. By utilizing the most popular
items, we increase the impact we get from each trained
model. However, our results show that even though the
post-processing accomplishes an improvement
directionally, our solution based on averaging performs better.
Combining post-processing with averaging leads to the
best result on MRED_USER_ACTIVITY in the
leaderboard at the expense of hit rate. Therefore, our final
submission for the CIKM 2022 EvalRS Challenge is the
averaging model with n_sum = 500.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we augmented the well-known collaborative filtering algorithm with ensembles and bias mitigation to strike a balance between performance and diversity.</p>
      <p>This carefully crafted CIKM Challenge goes beyond standard metrics, provides the easy-to-use RecList library, and raises awareness for a rounded evaluation. In the same spirit, we focused on mitigating bias on diversity metrics, leveraged the Jurity library, and demonstrated encouraging results. We showed how to use existing algorithmic fairness metrics for recommendations and extended equalized odds beyond binary classification.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.</p>
      <p>[2] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, 2016. URL: https://arxiv.org/abs/1610.02413. doi:10.48550/ARXIV.1610.02413.</p>
      <p>[3] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: Proceedings of the 10th International Conference on World Wide Web, WWW '01, Association for Computing Machinery, New York, NY, USA, 2001, pp. 285–295. URL: https://doi.org/10.1145/371920.372071. doi:10.1145/371920.372071.</p>
      <p>[4] M. Naumov, D. Mudigere, DLRM: An advanced, open source deep learning recommendation model, 2020.</p>
      <p>[5] X. Yi, J. Yang, L. Hong, D. Z. Cheng, L. Heldt, A. Kumthekar, Z. Zhao, L. Wei, E. Chi, Sampling-bias-corrected neural modeling for large corpus item recommendations, in: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 269–277. URL: https://doi.org/10.1145/3298689.3346996. doi:10.1145/3298689.3346996.</p>
      <p>[6] G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4Rec: Bridging the gap between NLP and sequential / session-based recommendation, in: Proceedings of the 15th ACM Conference on Recommender Systems, RecSys '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 143–153. URL: https://doi.org/10.1145/3460231.3474255. doi:10.1145/3460231.3474255.</p>
      <p>[7] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2019, pp. 1441–1450.</p>
      <p>[8] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263–272. doi:10.1109/ICDM.2008.22.</p>
      <p>[9] M. Hardt, E. Price, N. Srebro, Equality of opportunity in supervised learning, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.</p>
      <p>[10] F. Michalský, S. Kadıoğlu, Surrogate ground truth generation to enhance binary fairness evaluation in uplift modeling, in: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), 2021, pp. 1654–1659. doi:10.1109/ICMLA52953.2021.00264.</p>
      <p>[11] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: Behavioral testing of recommender systems with RecList, WWW '22 Companion, Association for Computing Machinery, New York, NY, USA, 2022, pp. 99–104. URL: https://doi.org/10.1145/3487553.3524215. doi:10.1145/3487553.3524215.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>