Item-based variational auto-encoder for fair music recommendation

Jinhyeok Park¹,†, Dain Kim¹,† and Dongwoo Kim¹,*
¹ Pohang University of Science and Technology, Pohang, Republic of Korea

Abstract
We present our solution for the EvalRS DataChallenge. The EvalRS DataChallenge aims to build a more realistic recommender system by considering accuracy, fairness, and diversity in evaluation. Our proposed system is based on an ensemble of an item-based variational auto-encoder (VAE) and Bayesian personalized ranking matrix factorization (BPRMF). To mitigate popularity bias, we use an item-based VAE for each popularity group with an additional fairness regularization. To make reasonable recommendations even when the predictions are inaccurate, we combine the recommended list of BPRMF with that of the item-based VAE. Through the experiments, we demonstrate that the item-based VAE with fairness regularization significantly reduces popularity bias compared to the user-based VAE. The ensemble of the item-based VAE and BPRMF makes the top-1 item similar to the ground truth even when the predictions are inaccurate. Finally, we propose 'Coefficient of Variance based Fairness' as a novel evaluation metric based on our reflections from the extensive experiments.

Keywords
recommender systems, fairness, variational auto-encoder, collaborative filtering

1. Introduction

Recommender systems are rising as a powerful tool that predicts user preferences based on past interactions between users and items. Industries such as e-commerce, music, and social media adopt recommender systems to provide users with a more personalized experience and to foster a marketplace. However, several works have shown that excessive emphasis on user utility alone may result in problems like the Matthew effect and filter bubbles [1, 2, 3].

Utility-focused model selection is undesirable since it may lead to inequality in distribution, which eventually suppresses market diversity [4]. For this reason, many previous studies have argued for the necessity of metrics beyond accuracy, such as fairness, diversity, and serendipity [5, 6, 7]. For instance, Li et al. [8] demonstrate that there exists a performance gap between inactive and active user groups and suggest a definition of user-oriented group fairness. Biega et al. [9] propose equity of attention, which requires exposure to be proportional to the relevance of an item.

The EvalRS DataChallenge is designed to emphasize the importance of measuring recommendation performance from various perspectives, including accuracy, fairness, and diversity [10]. Using the LFM-1b dataset [11], participants are asked to recommend top-k items for each user. The performance of recommendations is evaluated through accuracy metrics (e.g., hit rate, mean reciprocal rank), accuracy metrics on a per-group basis to measure fairness, and behavioral tests that measure the diversity of recommended items using RecList [12]. The challenge is divided into two phases depending on how each metric is aggregated. In phase 1, the final evaluation score is computed using a simple average, and in phase 2, the weight of each metric is adjusted according to the difficulty observed during phase 1 before aggregation.

In this work, we propose a framework that satisfies various evaluation metrics comprehensively. We adopt the variational auto-encoder for collaborative filtering [13] as our baseline, which models the likelihood of the user-item interaction matrix with a multinomial distribution via an auto-encoding architecture. Through extensive model evaluation, we found three strategies that mitigate potential biases while keeping a relatively high utility. First, we found that the item-based VAE helps to alleviate the popularity bias of recommendations compared to the user-based VAE. Second, we found that training separate VAE models for artist popularity groups can mitigate the popularity bias. Lastly, we found that a fairness regularizer, designed to minimize the gap between the losses of different groups, further improves fairness across item groups.
The rest of the paper is structured as follows. In section 2, we describe our model architecture and strategies to improve model fairness. In section 3, we show the experimental results with a discussion. In section 4, we reflect on our findings and propose a new metric that can better measure the fairness of a model by taking accuracy into account.

EvalRS 2022: CIKM EvalRS 2022 DataChallenge, October 21, 2022, Atlanta, GA
* Corresponding author.
† These authors contributed equally.
jinhyeok1234@postech.ac.kr (J. Park); dain5832@postech.ac.kr (D. Kim); dongwookim@postech.ac.kr (D. Kim)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Evaluation metrics   We describe the evaluation metrics used in the EvalRS DataChallenge [10]. They can be categorized into three different measures:

• Accuracy metrics: accuracy metrics indicate the predictive performance of a model. They include hit rate (HR) and mean reciprocal rank (MRR), which are widely used in recommender systems.

• Accuracy metrics on a per-group basis: group-based metrics are designed to evaluate the fairness and robustness of a model. The challenge adopts the miss rate equality difference (MRED), which measures the average difference between the miss rate (MR) of each group and the MR of the entire dataset. These metrics are evaluated across five different groupings: gender, country, user history, artist popularity, and track popularity.

• Behavioral tests: behavioral tests measure the similarity between recommended and ground-truth items and the diversity of the recommended items. They consist of two metrics, 'be less wrong' and 'latent diversity'. Be less wrong measures the distance between the embeddings of the ground truth and the predicted result; latent diversity indicates the model density in the latent space of tracks.

2. Method

2.1. Baseline Models

We use variational auto-encoders (VAE) and Bayesian personalized ranking matrix factorization (BPRMF) as our backbone methods. In this section, we describe the backbone methods and explain how we use them to curate the final recommendation list.

Variational auto-encoders for collaborative filtering   We employ the variational auto-encoder (VAE) for collaborative filtering [13] as the first backbone model. The objective of a VAE [14] is to maximize the evidence lower bound (ELBO) for each data point x_i:

L_β(x_i; θ, φ) = E_{q_φ(z_i | x_i)}[log p_θ(x_i | z_i)] − β · KL(q_φ(z_i | x_i) ‖ p(z_i)),

where z_i is the latent variable, β controls the importance of the KL divergence, and the likelihood function p_θ and the variational distribution q_φ are parameterized by θ and φ, respectively. There have been multiple approaches to employing the VAE framework for collaborative filtering [13, 15]; in this work, we follow the framework proposed by [13].

Let x_u = [x_u1, x_u2, ..., x_uI] be the implicit feedback of user u, where x_ui is a binary indicator specifying whether user u interacted with item i. The likelihood function p(x_u | z_u) is modeled via a multinomial distribution conditioned on the latent vector z_u, and a multivariate normal distribution is used as the variational distribution q(z_u | x_u). During training, one can optimize the parameters to maximize the ELBO. After training, the recommended items are chosen based on the multinomial distribution among the items that the user has not interacted with so far.

Item-based VAE   Although using the implicit feedback of a user, i.e., x_u, as the input of the VAE is the common approach (user-based VAE), one can alternatively use the implicit feedback of an item as the input (item-based VAE). The implicit feedback vector of item i is constructed as x_i = [x_i1, x_i2, ..., x_iU], where x_ij indicates the interaction between item i and user j. To recommend items with the item-based VAE, the model infers logits over all items to complete the user-item interaction matrix and recommends the top-N items for each user. Empirically, we find that the item-based VAE tends to recommend unpopular items compared to the user-based VAE.

Bayesian personalized ranking matrix factorization   We use Bayesian personalized ranking matrix factorization (BPRMF) [16] as the second baseline model. BPRMF estimates the posterior distribution over the likelihood of pair-wise rankings between items with a prior distribution.
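To make the VAE training objective concrete, the sketch below computes the β-weighted ELBO for a batch of implicit-feedback rows, using a multinomial log-likelihood over a log-softmax of decoder logits and the analytic KL divergence between a diagonal-Gaussian posterior and a standard normal prior. This is a minimal NumPy illustration with names of our own choosing, not the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, computed stably by subtracting the row max."""
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

def beta_elbo(x, logits, mu, logvar, beta=1.0):
    """Per-row beta-ELBO: multinomial log-likelihood of implicit feedback x
    under the decoder logits, minus beta * KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    log_lik = (x * log_softmax(logits)).sum(axis=1)
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(axis=1)
    return log_lik - beta * kl
```

With mu = 0 and logvar = 0 the KL term vanishes, so the per-row value reduces to the multinomial log-likelihood; maximizing the mean of `beta_elbo` over batches corresponds to the objective L_β above.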
2.2. Model Optimization

In this section, we introduce the methods used to improve the performance of the item-based VAE for phase 2. Our approach mainly targets the group-based metrics and the behavioral tests rather than the accuracy metrics.

Popularity-aware training based on items   We aim to improve the MRED between track popularity groups and between artist popularity groups, which are significant factors in phase 2. To reduce the performance gap between artist popularity groups, we divide items by artist popularity group and train an item-based VAE for each group separately. After training, we find that the least popular artist group is underfitted compared to the other groups; therefore, we train this group for two more epochs. We then pick a certain number of items from each group to make a recommendation; the details of this process are given in the Final Recommendation paragraph below.

The MRED between track popularity groups is also an important factor in phase 2. Although we divided items by artist groups, the MRED between track popularity groups is also reduced. However, the least popular track group, whose play count is less than ten, is still not recommended well, as shown in Figure 1. To recommend an item from the least popular track group, we additionally train a separate VAE for this group and include at least one item from it.

Final Recommendation   Since there are four artist popularity groups, we train four separate VAEs, each designated for one group. From the four VAEs, we first create a list of 98 items to be recommended: we take 38/20/20/20 items from artist groups 0, 1, 2, and 3, respectively, where group 0 indicates the least popular group. Among those selected items, we take the five most probable items from each group and curate a list of the top 20 items. These 20 items are ordered by group as 2, 1, 3, 0 (5/5/5/5), in descending order of the number of items in each group. The remaining 78 items are listed afterwards in the same group order (15/15/15/33). One additional item recommended from the least popular track group is added at the end of the list.

Table 1
Phase 1 results of our baseline models obtained by a simple average of nine metrics.

Model      | HR     | MRR    | Country (MRED) | User (MRED) | TrackPop (MRED) | ArtistPop (MRED) | Gender (MRED) | Be less Wrong | Latent Diversity | Score
VAE (item) | 0.2121 | 0.0399 | -0.0248 | -0.0287 | -0.0529 | -0.0216 | -0.0144 | 0.3189 | -0.3041 | 0.0138
VAE (user) | 0.1593 | 0.0256 | -0.0161 | -0.0323 | -0.0937 | -0.0430 | -0.0044 | 0.3512 | -0.2726 | 0.0082
BPRMF      | 0.0372 | 0.0025 | -0.0098 | -0.0163 | -0.0230 | -0.0102 | -0.0070 | 0.3721 | -0.2948 | 0.0056

Table 2
MR of each model for each track popularity group.

Model      | 1      | 10     | 100    | 1000   | Total
VAE (item) | 0.8946 | 0.7865 | 0.7770 | 0.8803 | 0.7879
VAE (user) | 0.9398 | 0.8861 | 0.8062 | 0.6448 | 0.8407
BPRMF      | 0.9965 | 0.9830 | 0.9387 | 0.9487 | 0.9628
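The list construction above can be sketched as follows. We assume a hypothetical `group_candidates` mapping from each artist popularity group to its ranked candidate items (names are ours, not the authors' code); the function returns the 99-item list to which the ensemble step later prepends the top-1 BPRMF item, giving 100 items.

```python
def curate_recommendations(group_candidates, least_popular_track_item):
    """Build the 99-item list: 38/20/20/20 items from artist groups 0..3,
    arranged in group order 2, 1, 3, 0 (the top 20 takes 5 per group, the
    remaining 78 follow as 15/15/15/33), plus one least-popular-track item."""
    quota = {0: 38, 1: 20, 2: 20, 3: 20}
    picked = {g: list(group_candidates[g][:n]) for g, n in quota.items()}
    order = [2, 1, 3, 0]  # group order used for the final list
    head = [item for g in order for item in picked[g][:5]]   # top 20 items
    tail = [item for g in order for item in picked[g][5:]]   # remaining 78 items
    return head + tail + [least_popular_track_item]          # 99 items total
```

The group quotas reproduce the counts stated above: 20 + 78 = 98 items from the four VAEs, with group 0 contributing 5 + 33 = 38.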
Fairness Regularization   Fairness regularization introduces an additional regularizer term into the objective to narrow the gap between group losses. This approach has been widely adopted in fields such as computer vision, natural language processing, and sound [17, 18], and many recommender systems also employ regularizers to improve group fairness [5, 19]. We incorporate a fairness regularization into the VAE based on the work of [20]. The regularization term computes the average difference between the group reconstruction loss and the entire reconstruction loss:

F_φ(x; θ, φ) = E_j [ | (1/|G_j|) Σ_{c ∈ G_j} E[log p_θ(x_c | z_c)] − (1/|I|) Σ_{i ∈ I} E[log p_θ(x_i | z_i)] | ],

where I is the set of all items and G_j is the set of items that belong to group j. Groups are divided into 1, 10, 100, and 1000 based on track popularity, and each item is assigned to a group according to its total play count. Our final objective can then be expressed as

L^R_β(x_i; θ, φ) = L_β(x_i; θ, φ) − γ · F_φ,

where the hyperparameter γ controls the weight of the regularizer. A higher value of γ means that the model takes a greater proportion of fairness into account during the optimization process.

In addition, we find that the item-based VAE often fails to achieve good performance in the behavioral tests, especially for 'be less wrong'. Therefore, we ensemble the item-based VAE with BPRMF, which shows good performance in 'be less wrong'. Since the metric only considers the top-1 item, we put the most probable item from the BPRMF model at the top of our previous recommendation list, resulting in 100 recommended items.

3. Experiments

3.1. Dataset

The LFM-1b dataset [11] is provided for the challenge. The dataset consists of the listening history of users with demographic information, such as gender and nationality, and metadata of the items. The dataset includes 119,555 users, 820,998 tracks, and 37,926,429 interactions. The test set was generated with a leave-one-out framework by randomly masking one item from each user's history. Please refer to [10] for the detailed pre-processing steps of the dataset.

3.2. Phase 1

In phase 1, we conduct experiments to check the performance of our baseline models: the item-based VAE, the user-based VAE, and BPRMF. For all experiments with VAEs, we adopt the same architecture as [13]. We set the batch size to 32, the latent dimension to 500, and the hidden layer size to 300. We train for 5 epochs using the Adam [21] optimizer with a learning rate of 0.001.

Table 3
Our final results for the four folds and their average. Baseline denotes 'CBOWRecSysBaseline' provided by the challenge organizers.

Model    | HR     | MRR    | Country (MRED) | User (MRED) | TrackPop (MRED) | ArtistPop (MRED) | Gender (MRED) | Be less Wrong | Latent Diversity | Score
Fold1    | 0.0154 | 0.0015 | -0.0030 | -0.0035 | -0.0021 | -0.0007 | -0.0003 | 0.3661 | -0.2924 | –
Fold2    | 0.0151 | 0.0016 | -0.0036 | -0.0021 | -0.0024 | -0.0021 | -0.0012 | 0.3602 | -0.3000 | –
Fold3    | 0.0169 | 0.0021 | -0.0047 | -0.0044 | -0.0023 | -0.0005 | -0.0004 | 0.3685 | -0.2948 | –
Fold4    | 0.0169 | 0.0017 | -0.0036 | -0.0017 | -0.0024 | -0.0010 | -0.0008 | 0.3609 | -0.2984 | –
Average  | 0.0161 | 0.0017 | -0.0037 | -0.0029 | -0.0023 | -0.0010 | -0.0007 | 0.3639 | -0.2964 | 1.553
Baseline | 0.0363 | 0.0037 | -0.0090 | -0.0224 | -0.0111 | -0.0072 | -0.0061 | 0.3758 | -0.3080 | -1.212
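The per-group columns of Tables 1 and 3 report MREDs. Under our reading of the challenge metric (the negated mean absolute gap between each group's miss rate and the overall miss rate; the function name is ours), MRED can be reproduced from per-group MRs such as those in Table 2:

```python
def mred(group_miss_rates, overall_miss_rate):
    """Miss rate equality difference: negated average absolute gap between
    each group's miss rate and the overall miss rate (0 means perfectly fair)."""
    gaps = [abs(mr - overall_miss_rate) for mr in group_miss_rates]
    return -sum(gaps) / len(gaps)

# Track-popularity MRs of the item-based VAE from Table 2 (overall MR 0.7879)
# reproduce its TrackPop MRED of about -0.0529 reported in Table 1.
print(round(mred([0.8946, 0.7865, 0.7770, 0.8803], 0.7879), 4))
```

That the Table 1 entry is recovered from the Table 2 row supports this reading, though the exact sign convention is our assumption.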
Figure 1: Popularity distributions of all items and of the items recommended by the model. The x-axis represents track popularity groups, and the y-axis represents the proportion of each group; we use a logarithmic scale on the y-axis due to the skewed distributions. The result shows that the items recommended by the item-based VAE follow a distribution similar to the overall popularity distribution of all items.

For BPRMF in phase 1, we set the batch size to 8192 and the dimension to 64, and train for 10 epochs with a learning rate of 0.001. Instead of using weight regularization, we normalize the user and item vectors whenever their maximum value is greater than 1.

Table 1 reveals that the variational auto-encoders for collaborative filtering achieve good performance under a simple average of all metrics. In particular, the item-based VAE shows good performance not only in hit rate and MRR but also in the MRED between user activity, track popularity, and artist popularity groups. Table 2 shows the MR of each track popularity group for the baseline models. We can observe that the item-based VAE has better accuracy on unpopular item groups.

We observe that the item-based VAE recommends as many unpopular items as popular ones. Figure 1 shows the popularity distributions of the recommended items and of all items. The result indicates that the recommended items from the item-based VAE follow a distribution similar to the total item popularity computed on the training set. Meanwhile, the user-based VAE tends to recommend more popular items than the item-based VAE. We also find that although BPRMF does not show a good overall performance and tends to recommend famous items, it outperforms the other methods in 'be less wrong'; Table 1 summarizes these preliminary results.

3.3. Phase 2

Based on the preliminary results of phase 1, we combine the results of the item-based VAE and BPRMF to curate the recommendation list for phase 2. In phase 1, the overall score is determined by a simple average of each test; in phase 2, however, the importance of each metric is adjusted based on the performance of the participants in phase 1. As we replicate the process described in the challenge to analyze the relative values of the weights, we observe an enormous gap between the weight of artist popularity and that of HR. We therefore mainly focused on mitigating the bias of 'artist popularity'.

For the final experiments, we set the latent dimension of the item-based VAE to 17 and the batch size to 32, with a dropout rate of 0.2. We then train the model using the Adam optimizer with a learning rate of 1e-3 for two epochs. As the unpopular artist group does not fit well, we find that training the VAE for this group for two additional epochs generally helps to improve the performance. When we train the least popular item group, we set the latent dimension to 15 and train for 2 epochs. β and γ, the coefficients of the KL divergence term and of the regularizer, are set to 0.0001 and 0.003, respectively, after a parameter search. For BPRMF, we set the latent dimension to 200, and the other conditions are the same as the baseline. We then make the final recommendation as described in section 2.2.¹

Table 3 shows our results for all folds together with the phase 2 score and the baseline score. The results show that our model successfully reduces the gap between the artist popularity groups and between the track popularity groups. Furthermore, our model shows lower MREDs between user activity, gender, and country groups than those of the baseline provided by the challenge organizers.
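The group-gap regularizer F weighted by γ in these runs (section 2.2) can be sketched numerically as the mean absolute gap between each group's average reconstruction log-likelihood and the overall average. This is a NumPy sketch under that reading, with hypothetical names; the authors' implementation may differ.

```python
import numpy as np

def fairness_penalty(log_liks, group_ids):
    """Mean absolute gap between each group's average reconstruction
    log-likelihood and the overall average (the F term of section 2.2)."""
    log_liks = np.asarray(log_liks, dtype=float)
    group_ids = np.asarray(group_ids)
    overall = log_liks.mean()
    gaps = [abs(log_liks[group_ids == g].mean() - overall)
            for g in np.unique(group_ids)]
    return float(np.mean(gaps))

def regularized_objective(log_liks, kl_terms, group_ids, beta=1e-4, gamma=3e-3):
    """The final objective L_beta - gamma * F (to be maximized), with the
    beta = 0.0001 and gamma = 0.003 reported for the final experiments."""
    elbo = np.mean(log_liks) - beta * np.mean(kl_terms)
    return elbo - gamma * fairness_penalty(log_liks, group_ids)
```

When every group reconstructs equally well, the penalty is zero and the objective reduces to the plain β-ELBO.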
¹ The source code for reproducing the experiments is available at https://github.com/ParkJinHyeock/evalRS-submission

Table 4
Miss rate and the proposed metric of each model for each track popularity group (top) and artist popularity group (bottom). VAE (final) denotes our final submission.

Track popularity
Model       | 1      | 10     | 100    | 1000   | Hit    | MRED    | CV (ours)
VAE (item)  | 0.8946 | 0.7865 | 0.7770 | 0.8803 | 0.2121 | -0.0529 | 0.2559
VAE (user)  | 0.9398 | 0.8861 | 0.8063 | 0.6448 | 0.1593 | -0.0937 | 0.7022
VAE (final) | 0.9858 | 0.9851 | 0.9821 | 0.9867 | 0.0161 | -0.0023 | 0.0701
BPRMF       | 0.9965 | 0.9831 | 0.9387 | 0.9487 | 0.0372 | -0.0230 | 0.6436

Artist popularity
Model       | 1      | 100    | 1000   | 10000  | Hit    | MRED    | CV (ours)
VAE (item)  | 0.8259 | 0.8107 | 0.7688 | 0.7942 | 0.2121 | -0.0216 | 0.1019
VAE (user)  | 0.8962 | 0.8887 | 0.8556 | 0.7870 | 0.1593 | -0.0430 | 0.2716
VAE (final) | 0.9831 | 0.9848 | 0.9835 | 0.9841 | 0.0161 | -0.0010 | 0.1459
BPRMF       | 0.9850 | 0.9721 | 0.9629 | 0.9546 | 0.0372 | -0.0102 | 0.3070

4. Discussion and Reflection

4.1. User Fairness

As shown in the experiments, our main approach focuses on balancing the HR of the artist and track popularity groups. However, our method also yields fair performance on user-related fairness metrics. A previous study [22] shows that there are no significant differences in model performance between gender groups; the authors also identify a negative relationship between user activity and performance. We observe a similar phenomenon in our results: there are small differences between gender and country groups, and a negative correlation between user activity and performance. However, with the item-based VAE, reducing the gap between item groups also reduces the gap between user activity groups.

4.2. Reflection on Evaluation Metric

The EvalRS DataChallenge [10] evaluates the fairness of a model using the average difference of MR between groups. In this section, we analyze the weakness of this approach and propose a novel metric that improves on it.

We first analyze the limitation of the current fairness metric. Suppose there is a model with a hit rate of 0.2 and another with a hit rate of 0.02. If the average deviation of HR is 0.01 for both, the two models would be considered to produce equivalent performance regarding fairness. However, in terms of relative ratios, the same deviation accounts for 5% of the former but 50% of the latter. From this perspective, using MRED to measure fairness might lead to unreasonable comparisons.

With this intuition, we propose a 'Coefficient of Variance (CV) [23] based fairness', which is less sensitive to scale. The Coefficient of Variance is defined as the standard deviation divided by the mean, multiplied by 100:

CV = (σ / m) × 100    (1)

By dividing by the mean, the metric indicates the relative ratio of the deviation to the performance. Inspired by this, our proposed metric can be expressed as:

CV_HR = sqrt( Σ_i (HR_avg − HR_group_i)² / N_groups ) / HR_avg    (2)

The proposed metric quantifies the fairness of the model by taking the average HR into account when measuring the deviation; a lower value indicates higher fairness. Our metric reasonably evaluates fairness through a relative ratio.
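Eq. (2) can be computed directly from per-group hit rates. The sketch below (function name ours) uses toy numbers mirroring the example above: two models with the same absolute deviation of 0.01 but different mean HRs receive very different fairness scores.

```python
import numpy as np

def cv_fairness(group_hit_rates):
    """Coefficient-of-variance based fairness (Eq. 2): the population standard
    deviation of per-group hit rates divided by their mean. Lower is fairer."""
    hr = np.asarray(group_hit_rates, dtype=float)
    avg = hr.mean()
    return float(np.sqrt(np.mean((avg - hr) ** 2)) / avg)

# Same absolute deviation, very different relative unfairness:
print(cv_fairness([0.21, 0.19]))  # deviation is 5% of the mean HR
print(cv_fairness([0.03, 0.01]))  # deviation is 50% of the mean HR
```

Note that, unlike Eq. (1), Eq. (2) omits the factor of 100; the sketch follows Eq. (2).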
Even if a model achieves a low absolute deviation, the proposed metric penalizes it when the deviation is relatively large compared to the HR. Table 4 shows the MR of each group, the MRED, and our proposed metric. We observe that for artist popularity groups, the item-based VAE outperforms the final model, as it has a relatively low deviation together with a high HR, which is consistent with our intuition. Models with a relatively large deviation between groups receive a high penalty, while models with a relatively low deviation receive a low one.

5. Conclusion

In this work, we propose a fairness-aware variational auto-encoder for recommender systems. Our approach shows that the item-based VAE significantly reduces the popularity bias of the model. Moreover, we conclude that obtaining recommendation results from various artist groups and adopting a regularizer further improves the fairness of the model. Finally, we suggest the notion of 'Coefficient of Variance based Fairness' for model evaluation and demonstrate that it reasonably measures the fairness of a model.
References

[1] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender system: A survey and future directions, arXiv preprint arXiv:2010.03240 (2020).
[2] Y. Wang, W. Ma, M. Zhang, Y. Liu, S. Ma, A survey on the fairness of recommender systems, ACM Journal of the ACM (JACM) (2022).
[3] M. Morik, A. Singh, J. Hong, T. Joachims, Controlling fairness and bias in dynamic learning-to-rank, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 429–438.
[4] A. Singh, T. Joachims, Fairness of exposure in rankings, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2219–2228.
[5] S. Yao, B. Huang, Beyond parity: Fairness objectives for collaborative filtering, Advances in Neural Information Processing Systems 30 (2017).
[6] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, 2011, pp. 109–116.
[7] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in: Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, pp. 257–260.
[8] Y. Li, H. Chen, Z. Fu, Y. Ge, Y. Zhang, User-oriented fairness in recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 624–632.
[9] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 405–414.
[10] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, arXiv preprint arXiv:2207.05772 (2022).
[11] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016, pp. 103–110.
[12] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99–104.
[13] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 689–698.
[14] D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[15] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 111–112.
[16] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[17] K. Zhao, J. Xu, M.-M. Cheng, RegularFace: Deep face recognition via exclusive regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1136–1144.
[18] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, C. Dwork, Learning fair representations, in: International Conference on Machine Learning, PMLR, 2013, pp. 325–333.
[19] A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, et al., Fairness in recommendation ranking through pairwise comparisons, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2212–2220.
[20] R. Borges, K. Stefanidis, F2VAE: a framework for mitigating user unfairness in recommendation systems, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, 2022, pp. 1391–1398.
[21] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[22] M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill, M. S. Pera, All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 172–186.
[23] C. E. Brown, Coefficient of variation, in: Applied Multivariate Statistics in Geohydrology and Related Sciences, Springer, 1998, pp. 155–157.