<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Item-based variational auto-encoder for fair music recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinhyeok Park</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dain Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dongwoo Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pohang University of Science and Technology</institution>
          ,
          <addr-line>Pohang</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We present our solution for the EvalRS DataChallenge. The EvalRS DataChallenge aims to build a more realistic recommender system by considering accuracy, fairness, and diversity in evaluation. Our proposed system is based on an ensemble between an item-based variational auto-encoder (VAE) and Bayesian personalized ranking matrix factorization (BPRMF). To mitigate popularity bias, we use an item-based VAE for each popularity group with an additional fairness regularization. To make a reasonable recommendation even when the predictions are inaccurate, we combine the recommended list of BPRMF and that of the item-based VAE. Through the experiments, we demonstrate that the item-based VAE with fairness regularization significantly reduces popularity bias compared to the user-based VAE. The ensemble between the item-based VAE and BPRMF makes the top-1 item similar to the ground truth even when the predictions are inaccurate. Finally, we propose 'Coefficient of Variance based Fairness' as a novel evaluation metric based on our reflections from the extensive experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>fairness</kwd>
        <kwd>variational auto-encoder</kwd>
        <kwd>collaborative filtering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>EvalRS 2022: CIKM EvalRS 2022 DataChallenge, October 21, 2022, Atlanta, GA. * Corresponding author. † These authors contributed equally. jinhyeok1234@postech.ac.kr (J. Park); dain5832@postech.ac.kr (D. Kim); dongwookim@postech.ac.kr (D. Kim)</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
      <sec id="sec-1-1">
        <title>Evaluation metrics</title>
        <p>We describe the evaluation metrics used in the EvalRS DataChallenge [10]. The evaluation metrics can be categorized into three different measures:</p>
        <p>• Accuracy metrics: accuracy metrics indicate the predictive performance of a model. They include hit rate (HR) and mean reciprocal rank (MRR), which are widely used in recommender systems.</p>
        <p>• Accuracy metrics on a per-group basis: group-based metrics are designed to evaluate the fairness and robustness of the model. The challenge adopts the miss rate equality difference (MRED), which measures the average difference between the miss rate (MR) of each group and the MR of the entire dataset. The metrics are evaluated across five different groups: gender, country, user history, artist popularity, and track popularity.</p>
        <p>• Behavioral tests: behavioral tests measure the similarity between recommended and ground truth items and the diversity of recommended items. Behavioral tests consist of two metrics: 'be less wrong' and 'latent diversity.' Be less wrong measures the distance between the embeddings of the ground truth and the predicted result. Latent diversity indicates a model's density in the latent space of tracks.</p>
        <p>Let x_u = [x_u1, x_u2, ...] be the implicit feedback of user u, where x_ui is a binary indicator specifying whether user u interacted with item i. The likelihood function p(x_u | z_u) is then modeled via a multinomial distribution conditioned on the latent vector z_u. A multivariate normal distribution is used as the variational distribution q(z_u | x_u). During training, one can optimize the parameters to maximize the ELBO. After training, the recommended items are chosen based on the multinomial distribution among the items that have not been interacted with so far.</p>
        <p>Item-based VAE: Although using the implicit feedback of a user, i.e., x_u, as the input of a VAE is a common approach (user-based VAE), one can alternatively use the implicit feedback of an item as the input (item-based VAE). The implicit feedback vector of item i can be constructed as x_i = [x_1i, x_2i, ...], where x_ui indicates the interaction between item i and user u.</p>
        <p>To recommend items with the item-based VAE, the model infers logits over all items to complete the user-item interaction matrix and recommends the top-k items for each user. Empirically, we find that the item-based VAE tends to recommend unpopular items compared to the user-based VAE.</p>
        <p>Bayesian personalized ranking matrix factorization: We use Bayesian personalized ranking matrix factorization (BPRMF) [16] as the second baseline model. BPRMF estimates the posterior distribution over the likelihood of pair-wise ranking between items with a prior distribution.</p>
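        <p>To make the group-based metric concrete, the sketch below computes MR and MRED from per-user hit indicators. The function names and the sign convention (MRED reported as a negative score, matching the negative values in our result tables) are our own illustration, not the official challenge implementation.
```python
import numpy as np

def miss_rate(hits):
    # hits: 1 if the held-out item appeared in the top-k list, else 0
    return 1.0 - float(np.mean(hits))

def mred(hits, groups):
    # MRED: average absolute difference between each group's miss rate
    # and the miss rate of the entire dataset, negated so that scores
    # closer to zero indicate a fairer model.
    hits = np.asarray(hits, dtype=float)
    groups = np.asarray(groups)
    overall = miss_rate(hits)
    gaps = [abs(miss_rate(hits[groups == g]) - overall)
            for g in np.unique(groups)]
    return -float(np.mean(gaps))
```
A model whose hits concentrate in one popularity group receives a large absolute MRED even when its overall hit rate is high.</p>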
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Baseline Models</title>
        <sec id="sec-2-1-1">
          <p>We use the variational auto-encoder (VAE) and Bayesian personalized ranking matrix factorization (BPRMF) as our backbone methods. In this section, we describe the backbone methods and explain how we use these backbones to curate the final recommendation list.</p>
          <p>Variational auto-encoders for collaborative filtering: In this work, we employ the variational auto-encoder (VAE) for collaborative filtering [13] as the first backbone model. The objective of the VAE [14] is to maximize the evidence lower bound (ELBO) for each data point u:</p>
          <p>L(x_u; θ, φ) = E_{q_φ(z_u|x_u)}[log p_θ(x_u|z_u)] − β · KL(q_φ(z_u|x_u) ‖ p(z_u)),</p>
          <p>where z_u is the latent variable, β measures the importance of the KL divergence, and the likelihood function p_θ and the variational distribution q_φ are parameterized by θ and φ, respectively.</p>
          <p>There have been multiple approaches to employing the VAE framework for collaborative filtering [13, 15]. In this work, we follow the framework proposed by [13].</p>
          <p>2.2. Model Optimization</p>
          <p>In this section, we introduce the methods used to improve the performance of the item-based VAE for phase 2. Our approach mainly targets group-based metrics and behavioral tests rather than accuracy metrics.</p>
          <p>Popularity-aware training based on items: We aim to improve the MRED between track popularity groups and artist popularity groups, which are significant factors in phase 2. Based on the item-based VAE, to reduce the performance gap between artist popularity groups, we divide items by artist popularity group and train a VAE for each group separately. After training, we find that the least popular artist group is underfitted compared to the other groups. Therefore, we train two more epochs for this group. Then, we pick a certain number of items from each group to make a recommendation. Please check the details of this process in the Final Recommendation part.</p>
          <p>The MRED between track popularity groups is also an important factor for phase 2. Although we divided items by artist groups, the MRED between the track popularity groups is also reduced. However, the least popular track group, whose playcount is less than ten, is still not recommended well, as shown in Figure 1. To recommend an item from the least popular track group, we additionally train a separate VAE for this group and include at least one item from this group.</p>
          <p>Fairness Regularization: Fairness regularization aims to introduce an additional regularizer term to the objective to narrow the gap between group losses. The approach has been widely adopted in fields such as computer vision, natural language processing, and sound [17, 18]. Many recommender systems also employ regularizers to improve group fairness [5, 19].</p>
          <p>We incorporate a fairness regularization into the VAE based on the work of [20]. The regularization term computes the average difference between the group reconstruction loss and the entire reconstruction loss as</p>
          <p>R(x; θ, φ) = E_g[ | (1/|I_g|) Σ_{i∈I_g} E[log p_θ(x_i|z_i)] − (1/|I|) Σ_{i∈I} E[log p_θ(x_i|z_i)] | ],</p>
          <p>where I is the set of all items, and I_g is the set of items that belong to group g. Groups are divided into 1, 10, 100, and 1000 based on track popularity, and each item is assigned to a group according to its total play counts.</p>
          <p>Our final objective can be expressed as follows:</p>
          <p>L_final(x; θ, φ) = L(x; θ, φ) − γ · R,</p>
          <p>where the hyperparameter γ controls the weight of the regularizer. A higher value of γ indicates that the model takes a greater proportion of fairness into account during the optimization process.</p>
          <p>Final Recommendation: Since there are four artist popularity groups, we train four separate VAEs, each of which is designated for one group. From the four VAEs, we first create a list of 98 items to be recommended. First, we take 38/20/20/20 items from artist groups 0, 1, 2, and 3, respectively, where group 0 indicates the least popular group. Among those selected items, we take the five most probable items from each group and curate a list of the top 20 items. The 20 items are ordered with 2, 1, 3, 0 (5/5/5/5) in descending order of the number of items in each group. The remaining 78 items are listed after, with the same order (15/15/15/33). One additional item recommended from the least popular track group is added at the end of the list.</p>
          <p>In addition, we find that the item-based VAE often fails to achieve good performance in the behavioral tests, especially for 'be less wrong'. Therefore, we ensemble the item-based VAE with BPRMF, which shows good performance in 'be less wrong'. Since the metric only considers the top-1 item, we put the most probable item from the BPRMF model at the top of our previous recommendation list, resulting in 100 recommended items.</p>
          <p>Table 2: MR for each model at each track popularity group.
Model | 1 | 10 | 100 | 1000 | total
VAE (item) | 0.8946 | 0.7865 | 0.7770 | 0.8803 | 0.7879
VAE (user) | 0.9398 | 0.8861 | 0.8062 | 0.6448 | 0.8407
BPRMF | 0.9965 | 0.9830 | 0.9387 | 0.9487 | 0.9628</p>
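          <p>As a minimal numpy sketch of the regularizer R and the final objective above, assuming the per-item expected reconstruction log-likelihoods E[log p_θ(x_i|z_i)] have already been computed by the VAE (the helper names are ours):
```python
import numpy as np

def fairness_regularizer(recon_ll, groups):
    # R: average absolute gap between each popularity group's mean
    # reconstruction log-likelihood and the global mean.
    recon_ll = np.asarray(recon_ll, dtype=float)
    groups = np.asarray(groups)
    overall = recon_ll.mean()
    gaps = [abs(recon_ll[groups == g].mean() - overall)
            for g in np.unique(groups)]
    return float(np.mean(gaps))

def regularized_objective(elbo, recon_ll, groups, gamma=0.003):
    # L_final = L - gamma * R, maximized during training;
    # gamma = 0.003 is the value found by our parameter search.
    return elbo - gamma * fairness_regularizer(recon_ll, groups)
```
A higher gamma pushes the optimizer to equalize reconstruction quality across the four track popularity groups at some cost in raw likelihood.</p>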
        </sec>
        <sec id="sec-2-1-2">
          <title>3. Experiments</title>
          <p>3.1. Dataset</p>
          <p>The LFM-1b dataset [11] is provided for the challenge. The dataset consists of the listening history of users, demographic information such as gender and nationality, and metadata of the items. The dataset includes 119,555 users, 820,998 tracks, and 37,926,429 interactions. The test set was generated based on the leave-one-out framework by randomly masking one item from each user's history. Please check the detailed pre-processing steps of the dataset for the challenge in [10].</p>
          <p>3.2. Phase 1</p>
          <p>In phase 1, we conduct an experiment to check the performance of our baseline models: the item-based VAE, the user-based VAE, and BPRMF. For all experiments with VAEs, we adopt the same architecture as [13]. We set the batch size to 32. The latent dimension is set to 500, and the size of the hidden layer is set to 300. We train for 5 epochs using the Adam [21] optimizer with a learning rate of 0.001.</p>
          <p>Table 3 (phase 2 score per fold): Fold1 -0.2924, Fold2 -0.3000, Fold3 -0.2948, Fold4 -0.2984.</p>
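          <p>The leave-one-out split described in 3.1 can be sketched as follows; the data layout is hypothetical, and the official split is produced by the challenge tooling [10]:
```python
import random

def leave_one_out_split(history, seed=0):
    # history: dict mapping a user id to the list of track ids the user
    # interacted with; one random item per user is held out for testing.
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in history.items():
        idx = rng.randrange(len(items))
        test[user] = items[idx]
        train[user] = items[:idx] + items[idx + 1:]
    return train, test
```
</p>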
        </sec>
        <sec id="sec-2-1-3">
          <title>3.3. Phase 2</title>
          <p>Based on the preliminary results shown in phase 1, we combine the results of the item-based VAE and the BPRMF to curate the recommendation list for phase 2.</p>
          <p>In phase 1, the overall score is determined by a simple average of each test. However, in phase 2, the importance of each metric is adjusted based on the performance of participants in phase 1. As we replicate the process described in the challenge to analyze the relative values of the weights, we observe an enormous gap between the weight of artist popularity and that of HR. Thus, we mainly focused on mitigating the bias of 'artist popularity'.</p>
          <p>Figure 1: Popularity distributions of all items and items recommended through the model. The x-axis represents track popularity groups, and the y-axis represents the proportion of each group. We use a logarithmic scale on the y-axis due to the skewed distributions. The result shows that the recommended items from the item-based VAE follow a distribution similar to the total popularity distribution of all items.</p>
          <p>For the final experiments, we set the latent dimension of the item-based VAE to 17 and the batch size to 32. The dropout rate is set to 0.2. For BPRMF, we set the batch size to 8192 and the dimension to 64. We train for 10 epochs with a learning rate of 0.001. Instead of using weight regularization, we normalize the vector if the maximum value of the user vectors and item vectors is greater than 1.</p>
          <p>Then, we train the model using the Adam optimizer with a learning rate of 1e-3 for two epochs. As the unpopular artist group does not fit well, we find that training the VAE for the unpopular artist group for two additional epochs generally helps to improve the performance. When we train the least popular item group, we set the latent dimension to 15 and train for 2 epochs. β and γ, which are the coefficient of the KL divergence term and the coefficient of the regularizer, are set to 0.0001 and 0.003, respectively, after parameter searching. For BPRMF, we set the latent dimension to 200, and the other conditions are set the same as the baseline. Then, we make the final recommendation described in 2.2.</p>
          <p>Table 1 reveals that variational auto-encoders for collaborative filtering achieve good performance when applying a simple average of all metrics. In particular, the item-based VAE shows good performance not only in hit rate and MRR but also in MRED between user activity, track popularity, and artist popularity groups. Table 2 shows the MR of each track popularity group for the baseline models. We can observe that the item-based VAE has better accuracy on unpopular item groups.</p>
          <p>We observe that the item-based VAE recommends as many unpopular items as popular ones. Figure 1 shows the popularity distributions of the recommended items and all items. The result indicates that the recommended items from the item-based VAE follow a distribution similar to the total item popularity computed on the training set. Meanwhile, the user-based VAE tends to recommend more popular items than the item-based VAE. We find that although the BPRMF method does not show good overall performance and tends to recommend popular items, it performs well on 'be less wrong', which motivates the ensemble described in Section 2.</p>
          <p>Table 3 shows our results for all folds with the phase 2 score and the baseline score. The results show that our model successfully reduces the gap between the artist popularity groups and between the track popularity groups. Furthermore, our model shows lower MREDs between user activity, gender, and country groups than those of the baseline provided by the challenge organizers.</p>
          <p>The source code for reproducing the experiments is available at https://github.com/ParkJinHyeock/evalRS-submission.</p>
          <p>Table: Miss rate and proposed metric of each model at each track popularity group (top) and artist popularity group (bottom); the compared models are VAE (item), VAE (user), VAE (final), and BPRMF, with columns Hit, MRED, and CV (ours). VAE (final) denotes our final submission.</p>
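          <p>The curation scheme described in 2.2 (per-group quotas of 38/20/20/20, the 2, 1, 3, 0 ordering, one item from the least popular track group, and the BPRMF top-1 at the head) can be sketched as follows; the inputs are placeholders for the ranked lists produced by the trained models:
```python
def curate_final_list(group_ranked, least_popular_track_item, bprmf_top1):
    # group_ranked: dict g -> ranked items from the VAE of artist
    # popularity group g, where group 0 is the least popular.
    take = {0: 38, 1: 20, 2: 20, 3: 20}
    pool = {g: group_ranked[g][:take[g]] for g in take}
    order = [2, 1, 3, 0]
    head = [x for g in order for x in pool[g][:5]]     # top 20 (5/5/5/5)
    tail = [x for g in order for x in pool[g][5:]]     # next 78 (15/15/15/33)
    ranked = head + tail + [least_popular_track_item]  # 99 items
    return [bprmf_top1] + ranked                       # 100 items
```
Placing the single BPRMF prediction first targets 'be less wrong', which scores only the top-1 item.</p>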
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Discussion and Reflection</title>
      <sec id="sec-3-1">
        <title>4.1. User Fairness</title>
        <p>As shown in the experiments, our main approach focuses on balancing the HR of the artist and track popularity groups. However, our method also yields fair performance on user-related fairness metrics. A previous study [22] shows that there are no significant differences in the performance of the model between the gender groups. The authors also identify a negative relationship between user activity and performance. We observe a similar phenomenon in our results: there are small differences between gender and country groups, and a negative correlation between user activity and performance. However, with the item-based VAE, reducing the gap between item groups also reduces the gap between user activity groups.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Reflection on Evaluation Metric</title>
        <sec id="sec-3-2-1">
          <p>The EvalRS DataChallenge [10] evaluates the fairness of the model using the average difference of MR between groups. In this section, we analyze the weakness of this approach and propose a novel metric that improves upon it.</p>
        </sec>
        <sec id="sec-3-2-2">
          <p>We first analyze the limitation of the current fairness metric. Suppose there is a model with a hit rate of 0.2 and another with a hit rate of 0.02. If the average deviation of HR is 0.01 in both cases, the two models would be considered to produce equivalent performance regarding fairness.</p>
        </sec>
        <sec id="sec-3-2-3">
          <p>However, in terms of relative ratios, the same deviation accounts for 5% of the former but 50% of the latter. From this perspective, using MRED to measure fairness might lead to unreasonable comparisons.</p>
        </sec>
        <sec id="sec-3-2-4">
          <p>With this intuition, we propose a 'Coefficient of Variance (CV) [23] based fairness', which is less sensitive to scale. The coefficient of variation is defined as the standard deviation divided by the mean, multiplied by 100: CV = (σ / μ) × 100.</p>
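          <p>A short sketch contrasting the two models from the example above under the proposed metric (toy per-group hit rates of our own choosing):
```python
import statistics

def cv_fairness(group_rates):
    # Coefficient of Variation: population standard deviation divided
    # by the mean, multiplied by 100; scale-free, unlike an absolute
    # deviation such as MRED.
    return statistics.pstdev(group_rates) / statistics.fmean(group_rates) * 100.0

strong = [0.21, 0.19]  # per-group HR around 0.2  -> CV about 5
weak = [0.03, 0.01]    # per-group HR around 0.02 -> CV about 50
```
The same absolute spread of 0.01 yields a CV near 5 for the stronger model but near 50 for the weaker one, matching the 5% versus 50% intuition above.</p>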
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this work, we propose a fairness-aware variational auto-encoder for recommender systems. Our approach shows that the item-based VAE significantly reduces the popularity bias of the model. Moreover, we conclude that obtaining the recommendation results from various artist groups and adopting a regularizer further improves the fairness of the model. Finally, we suggest the notion of 'Coefficient of Variance based Fairness' for model evaluation and demonstrate that it reasonably measures the fairness of the model.</p>
      <p>References</p>
      <p>[1] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender system: A survey and future directions, arXiv preprint arXiv:2010.03240 (2020).
[2] Y. Wang, W. Ma, M. Zhang, Y. Liu, S. Ma, A survey on the fairness of recommender systems, ACM Journal of the ACM (JACM) (2022).
[3] M. Morik, A. Singh, J. Hong, T. Joachims, Controlling fairness and bias in dynamic learning-to-rank, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 429–438.
[4] A. Singh, T. Joachims, Fairness of exposure in rankings, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 2219–2228.
[5] S. Yao, B. Huang, Beyond parity: Fairness objectives for collaborative filtering, Advances in Neural Information Processing Systems 30 (2017).
[6] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, 2011, pp. 109–116.
[7] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in: Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, pp. 257–260.
[8] Y. Li, H. Chen, Z. Fu, Y. Ge, Y. Zhang, User-oriented fairness in recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 624–632.
[9] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval, 2018, pp. 405–414.
[10] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, arXiv preprint arXiv:2207.05772 (2022).
[11] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016, pp. 103–110.
[12] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99–104.
[13] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 689–698.
[14] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[15] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 111–112.
[16] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[17] K. Zhao, J. Xu, M.-M. Cheng, RegularFace: Deep face recognition via exclusive regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1136–1144.
[18] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, C. Dwork, Learning fair representations, in: International Conference on Machine Learning, PMLR, 2013, pp. 325–333.
[19] A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, et al., Fairness in recommendation ranking through pairwise comparisons, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2019, pp. 2212–2220.
[20] R. Borges, K. Stefanidis, F2VAE: a framework for mitigating user unfairness in recommendation systems, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, 2022, pp. 1391–1398.
[21] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[22] M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill, M. S. Pera, All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 172–186.
[23] C. E. Brown, Coefficient of variation, in: Applied Multivariate Statistics in Geohydrology and Related Sciences, Springer, 1998, pp. 155–157.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>