Item-based variational auto-encoder for fair music recommendation

Jinhyeok Park¹,†, Dain Kim¹,† and Dongwoo Kim¹,*
¹ Pohang University of Science and Technology, Pohang, Republic of Korea

Abstract
We present our solution for the EvalRS DataChallenge. The EvalRS DataChallenge aims to build a more realistic recommender system by considering accuracy, fairness, and diversity in evaluation. Our proposed system is based on an ensemble of an item-based variational auto-encoder (VAE) and Bayesian personalized ranking matrix factorization (BPRMF). To mitigate popularity bias, we use an item-based VAE for each popularity group with an additional fairness regularization. To make reasonable recommendations even when the predictions are inaccurate, we combine the recommended list of BPRMF with that of the item-based VAE. Through the experiments, we demonstrate that the item-based VAE with fairness regularization significantly reduces popularity bias compared to the user-based VAE. The ensemble of the item-based VAE and BPRMF makes the top-1 item similar to the ground truth even when the predictions are inaccurate. Finally, we propose 'Coefficient of Variance based Fairness' as a novel evaluation metric based on our reflections from the extensive experiments.

Keywords
recommender systems, fairness, variational auto-encoder, collaborative filtering

1. Introduction

Recommender systems are rising as a powerful tool that predicts user preferences based on past interactions between users and items. Industries such as e-commerce, music, and social media adopt recommender systems to provide users with a more personalized experience and to foster a marketplace. However, several works have shown that excessive emphasis on user utility alone may result in problems like the Matthew effect and filter bubbles [1, 2, 3].

Utility-focused model selection is undesirable since it may lead to inequality in distribution, which eventually suppresses market diversity [4]. For this reason, many previous studies have argued for the necessity of metrics beyond accuracy, such as fairness, diversity, and serendipity [5, 6, 7]. For instance, Li et al. [8] demonstrate that there exists a performance gap between inactive and active user groups and suggest a definition of user-oriented group fairness. Biega et al. [9] propose equity of attention, which requires exposure to be proportional to the relevance of an item.

The EvalRS DataChallenge is designed to emphasize the importance of measuring recommendation performance from various perspectives, including accuracy, fairness, and diversity [10]. Using the LFM-1b dataset [11], participants are asked to recommend top-k items for each user. The performance of recommendations is evaluated through accuracy metrics (e.g., hit rate, mean reciprocal rank), accuracy metrics on a per-group basis to measure fairness, and behavioral tests that measure the diversity of recommended items using RecList [12]. The challenge is divided into two phases depending on how each metric is aggregated. In phase 1, the final evaluation score is computed using a simple average, and in phase 2, the weight of each metric is adjusted according to the difficulty observed during phase 1 before aggregation.

In this work, we propose a framework that satisfies various evaluation metrics comprehensively. We adopt the variational auto-encoder for collaborative filtering [13] as our baseline, which models the likelihood of the user-item interaction matrix with a multinomial distribution via an auto-encoding architecture. Through extensive model evaluation, we found three strategies that mitigate potential biases while keeping a relatively high utility. First, we found that the item-based VAE helps to alleviate the popularity bias of recommendations compared to the user-based VAE. Second, we found that training separate VAE models for artist popularity groups can mitigate the popularity bias. Lastly, we found that a fairness regularizer, designed to minimize the gap between the losses of different groups, further improves fairness across item groups.
The rest of the paper is structured as follows. In section 2, we describe our model architecture and strategies to improve model fairness. In section 3, we show the experimental results with a discussion. In section 4, we reflect on our findings and propose a new metric that can better measure the fairness of a model by taking accuracy into account.

EvalRS 2022: CIKM EvalRS 2022 DataChallenge, October 21, 2022, Atlanta, GA
* Corresponding author.
† These authors contributed equally.
jinhyeok1234@postech.ac.kr (J. Park); dain5832@postech.ac.kr (D. Kim); dongwookim@postech.ac.kr (D. Kim)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Evaluation metrics   We describe the evaluation metrics used in the EvalRS DataChallenge [10]. They can be categorized into three different measures:

• Accuracy metrics: accuracy metrics indicate the predictive performance of a model. They include hit rate (HR) and mean reciprocal rank (MRR), which are widely used in recommender systems.

• Accuracy metrics on a per-group basis: group-based metrics are designed to evaluate the fairness and robustness of a model. The challenge adopts the miss rate equality difference (MRED), which measures the average difference between the miss rate (MR) of each group and the MR of the entire dataset. These metrics are evaluated across five different groupings: gender, country, user history, artist popularity, and track popularity.

• Behavioral tests: behavioral tests measure the similarity between recommended and ground-truth items and the diversity of the recommended items. They consist of two metrics, 'be less wrong' and 'latent diversity'. Be less wrong measures the distance between the embeddings of the ground truth and the predicted result; latent diversity indicates the model density in the latent space of tracks.

2. Method

2.1. Baseline Models

We use variational auto-encoders (VAE) and Bayesian personalized ranking matrix factorization (BPRMF) as our backbone methods. In this section, we describe the backbone methods and explain how we use them to curate the final recommendation list.

Variational auto-encoders for collaborative filtering   We employ the variational auto-encoder (VAE) for collaborative filtering [13] as the first backbone model. The objective of a VAE [14] is to maximize the evidence lower bound (ELBO) for each data point x_i:

L_β(x_i; θ, φ) = E_{q_φ(z_i | x_i)}[log p_θ(x_i | z_i)] − β · KL(q_φ(z_i | x_i) ‖ p(z_i)),

where z_i is the latent variable, β controls the importance of the KL divergence, and the likelihood function p_θ and the variational distribution q_φ are parameterized by θ and φ, respectively. There have been multiple approaches to employing the VAE framework for collaborative filtering [13, 15]; in this work, we follow the framework proposed by [13].

Let x_u = [x_u1, x_u2, ..., x_uI] be the implicit feedback of user u, where x_ui is a binary indicator specifying whether user u interacted with item i. The likelihood function p(x_u | z_u) is modeled via a multinomial distribution conditioned on the latent vector z_u, and a multivariate normal distribution is used as the variational distribution q(z_u | x_u). During training, one can optimize the parameters to maximize the ELBO. After training, the recommended items are chosen based on the multinomial distribution among the items that the user has not interacted with so far.

Item-based VAE   Although using the implicit feedback of a user, i.e., x_u, as the input of the VAE is the common approach (user-based VAE), one can alternatively use the implicit feedback of an item as the input (item-based VAE). The implicit feedback vector of item i is constructed as x_i = [x_i1, x_i2, ..., x_iU], where x_ij indicates the interaction between item i and user j. To recommend items with the item-based VAE, the model infers logits over all items to complete the user-item interaction matrix and recommends the top-N items for each user. Empirically, we find that the item-based VAE tends to recommend unpopular items compared to the user-based VAE.

Bayesian personalized ranking matrix factorization   We use Bayesian personalized ranking matrix factorization (BPRMF) [16] as the second baseline model. BPRMF estimates the posterior distribution over the likelihood of pair-wise rankings between items with a prior distribution.
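To make the VAE training objective concrete, the sketch below computes the β-weighted ELBO for a batch of implicit-feedback rows, using a multinomial log-likelihood over a log-softmax of decoder logits and the analytic KL divergence between a diagonal-Gaussian posterior and a standard normal prior. This is a minimal NumPy illustration with names of our own choosing, not the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, computed stably by subtracting the row max."""
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

def beta_elbo(x, logits, mu, logvar, beta=1.0):
    """Per-row beta-ELBO: multinomial log-likelihood of implicit feedback x
    under the decoder logits, minus beta * KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    log_lik = (x * log_softmax(logits)).sum(axis=1)
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(axis=1)
    return log_lik - beta * kl
```

With mu = 0 and logvar = 0 the KL term vanishes, so the per-row value reduces to the multinomial log-likelihood; maximizing the mean of `beta_elbo` over batches corresponds to the objective L_β above.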
2.2. Model Optimization

In this section, we introduce the methods used to improve the performance of the item-based VAE for phase 2. Our approach mainly targets the group-based metrics and the behavioral tests rather than the accuracy metrics.

Popularity-aware training based on items   We aim to improve the MRED between track popularity groups and between artist popularity groups, which are significant factors in phase 2. To reduce the performance gap between artist popularity groups, we divide items by artist popularity group and train an item-based VAE for each group separately. After training, we find that the least popular artist group is underfitted compared to the other groups; therefore, we train this group for two more epochs. We then pick a certain number of items from each group to make a recommendation; the details of this process are given in the Final Recommendation paragraph below.

The MRED between track popularity groups is also an important factor in phase 2. Although we divided items by artist groups, the MRED between track popularity groups is also reduced. However, the least popular track group, whose play count is less than ten, is still not recommended well, as shown in Figure 1. To recommend an item from the least popular track group, we additionally train a separate VAE for this group and include at least one item from it.

Final Recommendation   Since there are four artist popularity groups, we train four separate VAEs, each designated for one group. From the four VAEs, we first create a list of 98 items to be recommended: we take 38/20/20/20 items from artist groups 0, 1, 2, and 3, respectively, where group 0 indicates the least popular group. Among those selected items, we take the five most probable items from each group and curate a list of the top 20 items. These 20 items are ordered by group as 2, 1, 3, 0 (5/5/5/5), in descending order of the number of items in each group. The remaining 78 items are listed afterwards in the same group order (15/15/15/33). One additional item recommended from the least popular track group is added at the end of the list.

Table 1
Phase 1 results of our baseline models obtained by a simple average of nine metrics.

Model      | HR     | MRR    | Country (MRED) | User (MRED) | TrackPop (MRED) | ArtistPop (MRED) | Gender (MRED) | Be less Wrong | Latent Diversity | Score
VAE (item) | 0.2121 | 0.0399 | -0.0248 | -0.0287 | -0.0529 | -0.0216 | -0.0144 | 0.3189 | -0.3041 | 0.0138
VAE (user) | 0.1593 | 0.0256 | -0.0161 | -0.0323 | -0.0937 | -0.0430 | -0.0044 | 0.3512 | -0.2726 | 0.0082
BPRMF      | 0.0372 | 0.0025 | -0.0098 | -0.0163 | -0.0230 | -0.0102 | -0.0070 | 0.3721 | -0.2948 | 0.0056

Table 2
MR of each model for each track popularity group.

Model      | 1      | 10     | 100    | 1000   | Total
VAE (item) | 0.8946 | 0.7865 | 0.7770 | 0.8803 | 0.7879
VAE (user) | 0.9398 | 0.8861 | 0.8062 | 0.6448 | 0.8407
BPRMF      | 0.9965 | 0.9830 | 0.9387 | 0.9487 | 0.9628
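The list construction above can be sketched as follows. We assume a hypothetical `group_candidates` mapping from each artist popularity group to its ranked candidate items (names are ours, not the authors' code); the function returns the 99-item list to which the ensemble step later prepends the top-1 BPRMF item, giving 100 items.

```python
def curate_recommendations(group_candidates, least_popular_track_item):
    """Build the 99-item list: 38/20/20/20 items from artist groups 0..3,
    arranged in group order 2, 1, 3, 0 (the top 20 takes 5 per group, the
    remaining 78 follow as 15/15/15/33), plus one least-popular-track item."""
    quota = {0: 38, 1: 20, 2: 20, 3: 20}
    picked = {g: list(group_candidates[g][:n]) for g, n in quota.items()}
    order = [2, 1, 3, 0]  # group order used for the final list
    head = [item for g in order for item in picked[g][:5]]   # top 20 items
    tail = [item for g in order for item in picked[g][5:]]   # remaining 78 items
    return head + tail + [least_popular_track_item]          # 99 items total
```

The group quotas reproduce the counts stated above: 20 + 78 = 98 items from the four VAEs, with group 0 contributing 5 + 33 = 38.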
Fairness Regularization   Fairness regularization introduces an additional regularizer term into the objective to narrow the gap between group losses. This approach has been widely adopted in fields such as computer vision, natural language processing, and sound [17, 18], and many recommender systems also employ regularizers to improve group fairness [5, 19]. We incorporate a fairness regularization into the VAE based on the work of [20]. The regularization term computes the average difference between the group reconstruction loss and the entire reconstruction loss:

F_φ(x; θ, φ) = E_j [ | (1/|G_j|) Σ_{c ∈ G_j} E[log p_θ(x_c | z_c)] − (1/|I|) Σ_{i ∈ I} E[log p_θ(x_i | z_i)] | ],

where I is the set of all items and G_j is the set of items that belong to group j. Groups are divided into 1, 10, 100, and 1000 based on track popularity, and each item is assigned to a group according to its total play count. Our final objective can then be expressed as

L^R_β(x_i; θ, φ) = L_β(x_i; θ, φ) − γ · F_φ,

where the hyperparameter γ controls the weight of the regularizer. A higher value of γ means that the model takes a greater proportion of fairness into account during the optimization process.

In addition, we find that the item-based VAE often fails to achieve good performance in the behavioral tests, especially for 'be less wrong'. Therefore, we ensemble the item-based VAE with BPRMF, which shows good performance in 'be less wrong'. Since the metric only considers the top-1 item, we put the most probable item from the BPRMF model at the top of our previous recommendation list, resulting in 100 recommended items.

3. Experiments

3.1. Dataset

The LFM-1b dataset [11] is provided for the challenge. The dataset consists of the listening history of users with demographic information, such as gender and nationality, and metadata of the items. The dataset includes 119,555 users, 820,998 tracks, and 37,926,429 interactions. The test set was generated with a leave-one-out framework by randomly masking one item from each user's history. Please refer to [10] for the detailed pre-processing steps of the dataset.

3.2. Phase 1

In phase 1, we conduct experiments to check the performance of our baseline models: the item-based VAE, the user-based VAE, and BPRMF. For all experiments with VAEs, we adopt the same architecture as [13]. We set the batch size to 32, the latent dimension to 500, and the hidden layer size to 300. We train for 5 epochs using the Adam [21] optimizer with a learning rate of 0.001.

Table 3
Our final results for the four folds and their average. Baseline denotes 'CBOWRecSysBaseline' provided by the challenge organizers.

Model    | HR     | MRR    | Country (MRED) | User (MRED) | TrackPop (MRED) | ArtistPop (MRED) | Gender (MRED) | Be less Wrong | Latent Diversity | Score
Fold1    | 0.0154 | 0.0015 | -0.0030 | -0.0035 | -0.0021 | -0.0007 | -0.0003 | 0.3661 | -0.2924 | –
Fold2    | 0.0151 | 0.0016 | -0.0036 | -0.0021 | -0.0024 | -0.0021 | -0.0012 | 0.3602 | -0.3000 | –
Fold3    | 0.0169 | 0.0021 | -0.0047 | -0.0044 | -0.0023 | -0.0005 | -0.0004 | 0.3685 | -0.2948 | –
Fold4    | 0.0169 | 0.0017 | -0.0036 | -0.0017 | -0.0024 | -0.0010 | -0.0008 | 0.3609 | -0.2984 | –
Average  | 0.0161 | 0.0017 | -0.0037 | -0.0029 | -0.0023 | -0.0010 | -0.0007 | 0.3639 | -0.2964 | 1.553
Baseline | 0.0363 | 0.0037 | -0.0090 | -0.0224 | -0.0111 | -0.0072 | -0.0061 | 0.3758 | -0.3080 | -1.212
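The per-group columns of Tables 1 and 3 report MREDs. Under our reading of the challenge metric (the negated mean absolute gap between each group's miss rate and the overall miss rate; the function name is ours), MRED can be reproduced from per-group MRs such as those in Table 2:

```python
def mred(group_miss_rates, overall_miss_rate):
    """Miss rate equality difference: negated average absolute gap between
    each group's miss rate and the overall miss rate (0 means perfectly fair)."""
    gaps = [abs(mr - overall_miss_rate) for mr in group_miss_rates]
    return -sum(gaps) / len(gaps)

# Track-popularity MRs of the item-based VAE from Table 2 (overall MR 0.7879)
# reproduce its TrackPop MRED of about -0.0529 reported in Table 1.
print(round(mred([0.8946, 0.7865, 0.7770, 0.8803], 0.7879), 4))
```

That the Table 1 entry is recovered from the Table 2 row supports this reading, though the exact sign convention is our assumption.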
Figure 1: Popularity distributions of all items and of the items recommended by the model. The x-axis represents track popularity groups, and the y-axis represents the proportion of each group; we use a logarithmic scale on the y-axis due to the skewed distributions. The result shows that the items recommended by the item-based VAE follow a distribution similar to the overall popularity distribution of all items.

For BPRMF in phase 1, we set the batch size to 8192 and the dimension to 64, and train for 10 epochs with a learning rate of 0.001. Instead of using weight regularization, we normalize the user and item vectors whenever their maximum value is greater than 1.

Table 1 reveals that the variational auto-encoders for collaborative filtering achieve good performance under a simple average of all metrics. In particular, the item-based VAE shows good performance not only in hit rate and MRR but also in the MRED between user activity, track popularity, and artist popularity groups. Table 2 shows the MR of each track popularity group for the baseline models. We can observe that the item-based VAE has better accuracy on unpopular item groups.

We observe that the item-based VAE recommends as many unpopular items as popular ones. Figure 1 shows the popularity distributions of the recommended items and of all items. The result indicates that the recommended items from the item-based VAE follow a distribution similar to the total item popularity computed on the training set. Meanwhile, the user-based VAE tends to recommend more popular items than the item-based VAE. We also find that although BPRMF does not show a good overall performance and tends to recommend famous items, it outperforms the other methods in 'be less wrong'; Table 1 summarizes these preliminary results.

3.3. Phase 2

Based on the preliminary results of phase 1, we combine the results of the item-based VAE and BPRMF to curate the recommendation list for phase 2. In phase 1, the overall score is determined by a simple average of each test; in phase 2, however, the importance of each metric is adjusted based on the performance of the participants in phase 1. As we replicate the process described in the challenge to analyze the relative values of the weights, we observe an enormous gap between the weight of artist popularity and that of HR. We therefore mainly focused on mitigating the bias of 'artist popularity'.

For the final experiments, we set the latent dimension of the item-based VAE to 17 and the batch size to 32, with a dropout rate of 0.2. We then train the model using the Adam optimizer with a learning rate of 1e-3 for two epochs. As the unpopular artist group does not fit well, we find that training the VAE for this group for two additional epochs generally helps to improve the performance. When we train the least popular item group, we set the latent dimension to 15 and train for 2 epochs. β and γ, the coefficients of the KL divergence term and of the regularizer, are set to 0.0001 and 0.003, respectively, after a parameter search. For BPRMF, we set the latent dimension to 200, and the other conditions are the same as the baseline. We then make the final recommendation as described in section 2.2.¹

Table 3 shows our results for all folds together with the phase 2 score and the baseline score. The results show that our model successfully reduces the gap between the artist popularity groups and between the track popularity groups. Furthermore, our model shows lower MREDs between user activity, gender, and country groups than those of the baseline provided by the challenge organizers.
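The group-gap regularizer F weighted by γ in these runs (section 2.2) can be sketched numerically as the mean absolute gap between each group's average reconstruction log-likelihood and the overall average. This is a NumPy sketch under that reading, with hypothetical names; the authors' implementation may differ.

```python
import numpy as np

def fairness_penalty(log_liks, group_ids):
    """Mean absolute gap between each group's average reconstruction
    log-likelihood and the overall average (the F term of section 2.2)."""
    log_liks = np.asarray(log_liks, dtype=float)
    group_ids = np.asarray(group_ids)
    overall = log_liks.mean()
    gaps = [abs(log_liks[group_ids == g].mean() - overall)
            for g in np.unique(group_ids)]
    return float(np.mean(gaps))

def regularized_objective(log_liks, kl_terms, group_ids, beta=1e-4, gamma=3e-3):
    """The final objective L_beta - gamma * F (to be maximized), with the
    beta = 0.0001 and gamma = 0.003 reported for the final experiments."""
    elbo = np.mean(log_liks) - beta * np.mean(kl_terms)
    return elbo - gamma * fairness_penalty(log_liks, group_ids)
```

When every group reconstructs equally well, the penalty is zero and the objective reduces to the plain β-ELBO.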
¹ The source code for reproducing the experiments is available at https://github.com/ParkJinHyeock/evalRS-submission

Table 4
Miss rate and the proposed metric of each model for each track popularity group (top) and artist popularity group (bottom). VAE (final) denotes our final submission.

Track popularity
Model       | 1      | 10     | 100    | 1000   | Hit    | MRED    | CV (ours)
VAE (item)  | 0.8946 | 0.7865 | 0.7770 | 0.8803 | 0.2121 | -0.0529 | 0.2559
VAE (user)  | 0.9398 | 0.8861 | 0.8063 | 0.6448 | 0.1593 | -0.0937 | 0.7022
VAE (final) | 0.9858 | 0.9851 | 0.9821 | 0.9867 | 0.0161 | -0.0023 | 0.0701
BPRMF       | 0.9965 | 0.9831 | 0.9387 | 0.9487 | 0.0372 | -0.0230 | 0.6436

Artist popularity
Model       | 1      | 100    | 1000   | 10000  | Hit    | MRED    | CV (ours)
VAE (item)  | 0.8259 | 0.8107 | 0.7688 | 0.7942 | 0.2121 | -0.0216 | 0.1019
VAE (user)  | 0.8962 | 0.8887 | 0.8556 | 0.7870 | 0.1593 | -0.0430 | 0.2716
VAE (final) | 0.9831 | 0.9848 | 0.9835 | 0.9841 | 0.0161 | -0.0010 | 0.1459
BPRMF       | 0.9850 | 0.9721 | 0.9629 | 0.9546 | 0.0372 | -0.0102 | 0.3070

4. Discussion and Reflection

4.1. User Fairness

As shown in the experiments, our main approach focuses on balancing the HR of the artist and track popularity groups. However, our method also yields fair performance on user-related fairness metrics. A previous study [22] shows that there are no significant differences in model performance between gender groups; the authors also identify a negative relationship between user activity and performance. We observe a similar phenomenon in our results: there are small differences between gender and country groups, and a negative correlation between user activity and performance. However, with the item-based VAE, reducing the gap between item groups also reduces the gap between user activity groups.

4.2. Reflection on Evaluation Metric

The EvalRS DataChallenge [10] evaluates the fairness of a model using the average difference of MR between groups. In this section, we analyze the weakness of this approach and propose a novel metric that improves on it.

We first analyze the limitation of the current fairness metric. Suppose there is a model with a hit rate of 0.2 and another with a hit rate of 0.02. If the average deviation of HR is 0.01 for both, the two models would be considered to produce equivalent performance regarding fairness. However, in terms of relative ratios, the same deviation accounts for 5% of the former but 50% of the latter. From this perspective, using MRED to measure fairness might lead to unreasonable comparisons.

With this intuition, we propose a 'Coefficient of Variance (CV) [23] based fairness', which is less sensitive to scale. The Coefficient of Variance is defined as the standard deviation divided by the mean, multiplied by 100:

CV = (σ / m) × 100    (1)

By dividing by the mean, the metric indicates the relative ratio of the deviation to the performance. Inspired by this, our proposed metric can be expressed as:

CV_HR = sqrt( Σ_i (HR_avg − HR_group_i)² / N_groups ) / HR_avg    (2)

The proposed metric quantifies the fairness of the model by taking the average HR into account when measuring the deviation; a lower value indicates higher fairness. Our metric reasonably evaluates fairness through a relative ratio.
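Eq. (2) can be computed directly from per-group hit rates. The sketch below (function name ours) uses toy numbers mirroring the example above: two models with the same absolute deviation of 0.01 but different mean HRs receive very different fairness scores.

```python
import numpy as np

def cv_fairness(group_hit_rates):
    """Coefficient-of-variance based fairness (Eq. 2): the population standard
    deviation of per-group hit rates divided by their mean. Lower is fairer."""
    hr = np.asarray(group_hit_rates, dtype=float)
    avg = hr.mean()
    return float(np.sqrt(np.mean((avg - hr) ** 2)) / avg)

# Same absolute deviation, very different relative unfairness:
print(cv_fairness([0.21, 0.19]))  # deviation is 5% of the mean HR
print(cv_fairness([0.03, 0.01]))  # deviation is 50% of the mean HR
```

Note that, unlike Eq. (1), Eq. (2) omits the factor of 100; the sketch follows Eq. (2).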
Even if a model achieves a low absolute deviation, the proposed metric penalizes it when the deviation is relatively large compared to the HR. Table 4 shows the MR of each group, the MRED, and our proposed metric. We observe that for artist popularity groups, the item-based VAE outperforms the final model, as it has a relatively low deviation together with a high HR, which is consistent with our intuition. Models with a relatively large deviation between groups receive a high penalty, while models with a relatively low deviation receive a low one.

5. Conclusion

In this work, we propose a fairness-aware variational auto-encoder for recommender systems. Our approach shows that the item-based VAE significantly reduces the popularity bias of the model. Moreover, we conclude that obtaining recommendation results from various artist groups and adopting a regularizer further improves the fairness of the model. Finally, we suggest the notion of 'Coefficient of Variance based Fairness' for model evaluation and demonstrate that it reasonably measures the fairness of a model.
References

[1] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender system: A survey and future directions, arXiv preprint arXiv:2010.03240 (2020).
[2] Y. Wang, W. Ma, M. Zhang, Y. Liu, S. Ma, A survey on the fairness of recommender systems, ACM Journal of the ACM (JACM) (2022).
[3] M. Morik, A. Singh, J. Hong, T. Joachims, Controlling fairness and bias in dynamic learning-to-rank, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 429–438.
[4] A. Singh, T. Joachims, Fairness of exposure in rankings, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2219–2228.
[5] S. Yao, B. Huang, Beyond parity: Fairness objectives for collaborative filtering, Advances in Neural Information Processing Systems 30 (2017).
[6] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, 2011, pp. 109–116.
[7] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in: Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, pp. 257–260.
[8] Y. Li, H. Chen, Z. Fu, Y. Ge, Y. Zhang, User-oriented fairness in recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 624–632.
[9] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 405–414.
[10] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, arXiv preprint arXiv:2207.05772 (2022).
[11] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016, pp. 103–110.
[12] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99–104.
[13] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 689–698.
[14] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[15] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 111–112.
[16] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[17] K. Zhao, J. Xu, M.-M. Cheng, RegularFace: Deep face recognition via exclusive regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1136–1144.
[18] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, C. Dwork, Learning fair representations, in: International Conference on Machine Learning, PMLR, 2013, pp. 325–333.
[19] A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, et al., Fairness in recommendation ranking through pairwise comparisons, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2212–2220.
[20] R. Borges, K. Stefanidis, F2VAE: a framework for mitigating user unfairness in recommendation systems, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, 2022, pp. 1391–1398.
[21] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[22] M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill, M. S. Pera, All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 172–186.
[23] C. E. Brown, Coefficient of variation, in: Applied Multivariate Statistics in Geohydrology and Related Sciences, Springer, 1998, pp. 155–157.