<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Item-based variational auto-encoder for fair music recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinhyeok Park</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dain Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dongwoo Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pohang University of Science and Technology</institution>
          ,
          <addr-line>Pohang</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We present our solution for the EvalRS DataChallenge. The EvalRS DataChallenge aims to build a more realistic recommender system by considering accuracy, fairness, and diversity in evaluation. Our proposed system is based on an ensemble between an item-based variational auto-encoder (VAE) and Bayesian personalized ranking matrix factorization (BPRMF). To mitigate popularity bias, we use an item-based VAE for each popularity group with an additional fairness regularization. To make a reasonable recommendation even when the predictions are inaccurate, we combine the recommended list of BPRMF and that of the item-based VAE. Through the experiments, we demonstrate that the item-based VAE with fairness regularization significantly reduces popularity bias compared to the user-based VAE. The ensemble between the item-based VAE and BPRMF makes the top-1 item similar to the ground truth even when the predictions are inaccurate. Finally, we propose 'Coefficient of Variance based Fairness' as a novel evaluation metric based on our reflections from the extensive experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>fairness</kwd>
        <kwd>variational auto-encoder</kwd>
        <kwd>collaborative filtering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>EvalRS 2022: CIKM EvalRS 2022 DataChallenge, October 21, 2022, Atlanta, GA. * Corresponding author. † These authors contributed equally. jinhyeok1234@postech.ac.kr (J. Park); dain5832@postech.ac.kr (D. Kim); dongwookim@postech.ac.kr (D. Kim)</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
      <sec id="sec-1-1">
        <title>Evaluation metrics</title>
        <p>We describe the evaluation metrics used in the EvalRS DataChallenge [10]. The evaluation metrics can be categorized into three different measures:</p>
        <p>• Accuracy metrics: accuracy metrics indicate the predictive performance of a model. They include hit rate (HR) and mean reciprocal rank (MRR), which are widely used in recommender systems.</p>
        <p>• Accuracy metrics on a per-group basis: group-based metrics are designed to evaluate the fairness and robustness of the model. The challenge adopts the miss rate equality difference (MRED), which measures the average difference between the miss rate (MR) of each group and the MR of the entire dataset. The metrics are evaluated across five different groups: gender, country, user history, artist popularity, and track popularity.</p>
        <p>• Behavioral tests: behavioral tests measure the similarity between recommended and ground truth items and the diversity of recommended items. Behavioral tests consist of two metrics: 'be less wrong' and 'latent diversity.' Be less wrong measures the distance between the embeddings of the ground truth and the predicted result. Latent diversity indicates a model's density in the latent space of tracks.</p>
        <p>Let x_u = [x_u1, x_u2, ...] be the implicit feedback of user u, where x_ui is a binary indicator specifying whether user u interacted with item i. The likelihood function p(x_u | z_u) is then modeled via a multinomial distribution conditioned on the latent vector z_u. A multivariate normal distribution is used as the variational distribution q(z_u | x_u). During training, one can optimize the parameters to maximize the ELBO. After training, the recommended items are chosen based on the multinomial distribution among the items that have not been interacted with so far.</p>
        <p>Item-based VAE: Although using the implicit feedback of a user, i.e., x_u, as the input of a VAE is a common approach (user-based VAE), one can alternatively use the implicit feedback of an item as the input (item-based VAE). The implicit feedback vector of item i can be constructed as x_i = [x_1i, x_2i, ...], where x_ui indicates the interaction between item i and user u.</p>
        <p>To recommend items with the item-based VAE, the model infers logits over all items to complete the user-item interaction matrix and recommends the top-k items for each user. Empirically, we find that the item-based VAE tends to recommend unpopular items compared to the user-based VAE.</p>
        <p>Bayesian personalized ranking matrix factorization: We use Bayesian personalized ranking matrix factorization (BPRMF) [16] as the second baseline model. BPRMF estimates the posterior distribution over the likelihood of pair-wise ranking between items with a prior distribution.</p>
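        <p>To make the group-based metric concrete, the sketch below computes MR and MRED from per-user hit indicators. The function names and the sign convention (MRED reported as a negative score, matching the negative values in our result tables) are our own illustration, not the official challenge implementation.
```python
import numpy as np

def miss_rate(hits):
    # hits: 1 if the held-out item appeared in the top-k list, else 0
    return 1.0 - float(np.mean(hits))

def mred(hits, groups):
    # MRED: average absolute difference between each group's miss rate
    # and the miss rate of the entire dataset, negated so that scores
    # closer to zero indicate a fairer model.
    hits = np.asarray(hits, dtype=float)
    groups = np.asarray(groups)
    overall = miss_rate(hits)
    gaps = [abs(miss_rate(hits[groups == g]) - overall)
            for g in np.unique(groups)]
    return -float(np.mean(gaps))
```
A model whose hits concentrate in one popularity group receives a large absolute MRED even when its overall hit rate is high.</p>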
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Baseline Models</title>
        <sec id="sec-2-1-1">
          <p>We use the variational auto-encoder (VAE) and Bayesian personalized ranking matrix factorization (BPRMF) as our backbone methods. In this section, we describe the backbone methods and explain how we use these backbones to curate the final recommendation list.</p>
          <p>Variational auto-encoders for collaborative filtering: In this work, we employ the variational auto-encoder (VAE) for collaborative filtering [13] as the first backbone model. The objective of the VAE [14] is to maximize the evidence lower bound (ELBO) for each data point u:</p>
          <p>L(x_u; θ, φ) = E_{q_φ(z_u|x_u)}[log p_θ(x_u|z_u)] − β · KL(q_φ(z_u|x_u) ‖ p(z_u)),</p>
          <p>where z_u is the latent variable, β measures the importance of the KL divergence, and the likelihood function p_θ and the variational distribution q_φ are parameterized by θ and φ, respectively.</p>
          <p>There have been multiple approaches to employing the VAE framework for collaborative filtering [13, 15]. In this work, we follow the framework proposed by [13].</p>
          <p>2.2. Model Optimization</p>
          <p>In this section, we introduce the methods used to improve the performance of the item-based VAE for phase 2. Our approach mainly targets group-based metrics and behavioral tests rather than accuracy metrics.</p>
          <p>Popularity-aware training based on items: We aim to improve the MRED between track popularity groups and artist popularity groups, which are significant factors in phase 2. Based on the item-based VAE, to reduce the performance gap between artist popularity groups, we divide items by artist popularity group and train a VAE for each group separately. After training, we find that the least popular artist group is underfitted compared to the other groups. Therefore, we train two more epochs for this group. Then, we pick a certain number of items from each group to make a recommendation. Please check the details of this process in the Final Recommendation part.</p>
          <p>The MRED between track popularity groups is also an important factor for phase 2. Although we divided items by artist groups, the MRED between the track popularity groups is also reduced. However, the least popular track group, whose playcount is less than ten, is still not recommended well, as shown in Figure 1. To recommend an item from the least popular track group, we additionally train a separate VAE for this group and include at least one item from this group.</p>
          <p>Fairness Regularization: Fairness regularization aims to introduce an additional regularizer term to the objective to narrow the gap between group losses. The approach has been widely adopted in fields such as computer vision, natural language processing, and sound [17, 18]. Many recommender systems also employ regularizers to improve group fairness [5, 19].</p>
          <p>We incorporate a fairness regularization into the VAE based on the work of [20]. The regularization term computes the average difference between the group reconstruction loss and the entire reconstruction loss as</p>
          <p>R(x; θ, φ) = E_g[ | (1/|I_g|) Σ_{i∈I_g} E[log p_θ(x_i|z_i)] − (1/|I|) Σ_{i∈I} E[log p_θ(x_i|z_i)] | ],</p>
          <p>where I is the set of all items, and I_g is the set of items that belong to group g. Groups are divided into 1, 10, 100, and 1000 based on track popularity, and each item is assigned to a group according to its total play counts.</p>
          <p>Our final objective can be expressed as follows:</p>
          <p>L_final(x; θ, φ) = L(x; θ, φ) − γ · R,</p>
          <p>where the hyperparameter γ controls the weight of the regularizer. A higher value of γ indicates that the model takes a greater proportion of fairness into account during the optimization process.</p>
          <p>Final Recommendation: Since there are four artist popularity groups, we train four separate VAEs, each of which is designated for one group. From the four VAEs, we first create a list of 98 items to be recommended. First, we take 38/20/20/20 items from artist groups 0, 1, 2, and 3, respectively, where group 0 indicates the least popular group. Among those selected items, we take the five most probable items from each group and curate a list of the top 20 items. The 20 items are ordered with 2, 1, 3, 0 (5/5/5/5) in descending order of the number of items in each group. The remaining 78 items are listed after, with the same order (15/15/15/33). One additional item recommended from the least popular track group is added at the end of the list.</p>
          <p>In addition, we find that the item-based VAE often fails to achieve good performance in the behavioral tests, especially for 'be less wrong'. Therefore, we ensemble the item-based VAE with BPRMF, which shows good performance in 'be less wrong'. Since the metric only considers the top-1 item, we put the most probable item from the BPRMF model at the top of our previous recommendation list, resulting in 100 recommended items.</p>
          <p>Table 2: MR for each model at each track popularity group.
Model | 1 | 10 | 100 | 1000 | total
VAE (item) | 0.8946 | 0.7865 | 0.7770 | 0.8803 | 0.7879
VAE (user) | 0.9398 | 0.8861 | 0.8062 | 0.6448 | 0.8407
BPRMF | 0.9965 | 0.9830 | 0.9387 | 0.9487 | 0.9628</p>
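          <p>As a minimal numpy sketch of the regularizer R and the final objective above, assuming the per-item expected reconstruction log-likelihoods E[log p_θ(x_i|z_i)] have already been computed by the VAE (the helper names are ours):
```python
import numpy as np

def fairness_regularizer(recon_ll, groups):
    # R: average absolute gap between each popularity group's mean
    # reconstruction log-likelihood and the global mean.
    recon_ll = np.asarray(recon_ll, dtype=float)
    groups = np.asarray(groups)
    overall = recon_ll.mean()
    gaps = [abs(recon_ll[groups == g].mean() - overall)
            for g in np.unique(groups)]
    return float(np.mean(gaps))

def regularized_objective(elbo, recon_ll, groups, gamma=0.003):
    # L_final = L - gamma * R, maximized during training;
    # gamma = 0.003 is the value found by our parameter search.
    return elbo - gamma * fairness_regularizer(recon_ll, groups)
```
A higher gamma pushes the optimizer to equalize reconstruction quality across the four track popularity groups at some cost in raw likelihood.</p>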
        </sec>
        <sec id="sec-2-1-2">
          <title>3. Experiments</title>
          <p>3.1. Dataset</p>
          <p>The LFM-1b dataset [11] is provided for the challenge. The dataset consists of the listening history of users, demographic information such as gender and nationality, and metadata of the items. The dataset includes 119,555 users, 820,998 tracks, and 37,926,429 interactions. The test set was generated based on the leave-one-out framework by randomly masking one item from each user's history. Please check the detailed pre-processing steps of the dataset for the challenge in [10].</p>
          <p>3.2. Phase 1</p>
          <p>In phase 1, we conduct an experiment to check the performance of our baseline models: the item-based VAE, the user-based VAE, and BPRMF. For all experiments with VAEs, we adopt the same architecture as [13]. We set the batch size to 32. The latent dimension is set to 500, and the size of the hidden layer is set to 300. We train for 5 epochs using the Adam [21] optimizer with a learning rate of 0.001.</p>
          <p>Table 3 (phase 2 score per fold): Fold1 -0.2924, Fold2 -0.3000, Fold3 -0.2948, Fold4 -0.2984.</p>
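          <p>The leave-one-out split described in 3.1 can be sketched as follows; the data layout is hypothetical, and the official split is produced by the challenge tooling [10]:
```python
import random

def leave_one_out_split(history, seed=0):
    # history: dict mapping a user id to the list of track ids the user
    # interacted with; one random item per user is held out for testing.
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in history.items():
        idx = rng.randrange(len(items))
        test[user] = items[idx]
        train[user] = items[:idx] + items[idx + 1:]
    return train, test
```
</p>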
        </sec>
        <sec id="sec-2-1-3">
          <title>3.3. Phase 2</title>
          <p>Based on the preliminary results shown in phase 1, we combine the results of the item-based VAE and the BPRMF to curate the recommendation list for phase 2.</p>
          <p>In phase 1, the overall score is determined by a simple average of each test. However, in phase 2, the importance of each metric is adjusted based on the performance of participants in phase 1. As we replicate the process described in the challenge to analyze the relative values of the weights, we observe an enormous gap between the weight of artist popularity and that of HR. Thus, we mainly focused on mitigating the bias of 'artist popularity'.</p>
          <p>Figure 1: Popularity distributions of all items and items recommended through the model. The x-axis represents track popularity groups, and the y-axis represents the proportion of each group. We use a logarithmic scale on the y-axis due to the skewed distributions. The result shows that the recommended items from the item-based VAE follow a distribution similar to the total popularity distribution of all items.</p>
          <p>For the final experiments, we set the latent dimension of the item-based VAE to 17 and the batch size to 32. The dropout rate is set to 0.2. For BPRMF, we set the batch size to 8192 and the dimension to 64. We train for 10 epochs with a learning rate of 0.001. Instead of using weight regularization, we normalize the vector if the maximum value of the user vectors and item vectors is greater than 1.</p>
          <p>Then, we train the model using the Adam optimizer with a learning rate of 1e-3 for two epochs. As the unpopular artist group does not fit well, we find that training the VAE for the unpopular artist group for two additional epochs generally helps to improve the performance. When we train the least popular item group, we set the latent dimension to 15 and train for 2 epochs. β and γ, which are the coefficient of the KL divergence term and the coefficient of the regularizer, are set to 0.0001 and 0.003, respectively, after parameter searching. For BPRMF, we set the latent dimension to 200, and the other conditions are set the same as the baseline. Then, we make the final recommendation described in 2.2.</p>
          <p>Table 1 reveals that variational auto-encoders for collaborative filtering achieve good performance when applying a simple average of all metrics. In particular, the item-based VAE shows good performance not only in hit rate and MRR but also in MRED between user activity, track popularity, and artist popularity groups. Table 2 shows the MR of each track popularity group for the baseline models. We can observe that the item-based VAE has better accuracy on unpopular item groups.</p>
          <p>We observe that the item-based VAE recommends as many unpopular items as popular ones. Figure 1 shows the popularity distributions of the recommended items and all items. The result indicates that the recommended items from the item-based VAE follow a distribution similar to the total item popularity computed on the training set. Meanwhile, the user-based VAE tends to recommend more popular items than the item-based VAE. We find that although the BPRMF method does not show good overall performance and tends to recommend popular items, it performs well on 'be less wrong', which motivates the ensemble described in Section 2.</p>
          <p>Table 3 shows our results for all folds with the phase 2 score and the baseline score. The results show that our model successfully reduces the gap between the artist popularity groups and between the track popularity groups. Furthermore, our model shows lower MREDs between user activity, gender, and country groups than those of the baseline provided by the challenge organizers.</p>
          <p>The source code for reproducing the experiments is available at https://github.com/ParkJinHyeock/evalRS-submission.</p>
          <p>Table: Miss rate and proposed metric of each model at each track popularity group (top) and artist popularity group (bottom); the compared models are VAE (item), VAE (user), VAE (final), and BPRMF, with columns Hit, MRED, and CV (ours). VAE (final) denotes our final submission.</p>
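          <p>The curation scheme described in 2.2 (per-group quotas of 38/20/20/20, the 2, 1, 3, 0 ordering, one item from the least popular track group, and the BPRMF top-1 at the head) can be sketched as follows; the inputs are placeholders for the ranked lists produced by the trained models:
```python
def curate_final_list(group_ranked, least_popular_track_item, bprmf_top1):
    # group_ranked: dict g -> ranked items from the VAE of artist
    # popularity group g, where group 0 is the least popular.
    take = {0: 38, 1: 20, 2: 20, 3: 20}
    pool = {g: group_ranked[g][:take[g]] for g in take}
    order = [2, 1, 3, 0]
    head = [x for g in order for x in pool[g][:5]]     # top 20 (5/5/5/5)
    tail = [x for g in order for x in pool[g][5:]]     # next 78 (15/15/15/33)
    ranked = head + tail + [least_popular_track_item]  # 99 items
    return [bprmf_top1] + ranked                       # 100 items
```
Placing the single BPRMF prediction first targets 'be less wrong', which scores only the top-1 item.</p>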
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Discussion and Reflection</title>
      <sec id="sec-3-1">
        <title>4.1. User Fairness</title>
        <p>As shown in the experiments, our main approach focuses on balancing the HR of the artist and track popularity groups. However, our method also yields fair performance on user-related fairness metrics. A previous study [22] shows that there are no significant differences in the performance of the model between the gender groups. The authors also identify a negative relationship between user activity and performance. We observe a similar phenomenon in our results: there are small differences between gender and country groups, and a negative correlation between user activity and performance. However, with the item-based VAE, reducing the gap between item groups also reduces the gap between user activity groups.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Reflection on Evaluation Metric</title>
        <sec id="sec-3-2-1">
          <p>The EvalRS DataChallenge [10] evaluates the fairness of the model using the average difference of MR between groups. In this section, we analyze the weakness of this approach and propose a novel metric that improves upon it.</p>
        </sec>
        <sec id="sec-3-2-2">
          <p>We first analyze the limitation of the current fairness metric. Suppose there is a model with a hit rate of 0.2 and another with a hit rate of 0.02. If the average deviation of HR is 0.01 in both cases, the two models would be considered to produce equivalent performance regarding fairness.</p>
        </sec>
        <sec id="sec-3-2-3">
          <p>However, in terms of relative ratios, the same deviation accounts for 5% of the former but 50% of the latter. From this perspective, using MRED to measure fairness might lead to unreasonable comparisons.</p>
        </sec>
        <sec id="sec-3-2-4">
          <p>With this intuition, we propose a 'Coefficient of Variance (CV) [23] based fairness', which is less sensitive to scale. The coefficient of variation is defined as the standard deviation divided by the mean, multiplied by 100: CV = (σ / μ) × 100.</p>
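          <p>A short sketch contrasting the two models from the example above under the proposed metric (toy per-group hit rates of our own choosing):
```python
import statistics

def cv_fairness(group_rates):
    # Coefficient of Variation: population standard deviation divided
    # by the mean, multiplied by 100; scale-free, unlike an absolute
    # deviation such as MRED.
    return statistics.pstdev(group_rates) / statistics.fmean(group_rates) * 100.0

strong = [0.21, 0.19]  # per-group HR around 0.2  -> CV about 5
weak = [0.03, 0.01]    # per-group HR around 0.02 -> CV about 50
```
The same absolute spread of 0.01 yields a CV near 5 for the stronger model but near 50 for the weaker one, matching the 5% versus 50% intuition above.</p>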
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this work, we propose a fairness-aware variational auto-encoder for recommender systems. Our approach shows that the item-based VAE significantly reduces the popularity bias of the model. Moreover, we conclude that obtaining the recommendation results from various artist groups and adopting a regularizer further improves the fairness of the model. Finally, we suggest the notion of 'Coefficient of Variance based Fairness' for model evaluation and demonstrate that it reasonably measures the fairness of the model.</p>
      <p>References</p>
      <p>[1] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender system: A survey and future directions, arXiv preprint arXiv:2010.03240 (2020).
[2] Y. Wang, W. Ma, M. Zhang, Y. Liu, S. Ma, A survey on the fairness of recommender systems, ACM Journal of the ACM (JACM) (2022).
[3] M. Morik, A. Singh, J. Hong, T. Joachims, Controlling fairness and bias in dynamic learning-to-rank, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 429–438.
[4] A. Singh, T. Joachims, Fairness of exposure in rankings, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2018, pp. 2219–2228.
[5] S. Yao, B. Huang, Beyond parity: Fairness objectives for collaborative filtering, Advances in Neural Information Processing Systems 30 (2017).
[6] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, 2011, pp. 109–116.
[7] M. Ge, C. Delgado-Battenfeld, D. Jannach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in: Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, pp. 257–260.
[8] Y. Li, H. Chen, Z. Fu, Y. Ge, Y. Zhang, User-oriented fairness in recommendation, in: Proceedings of the Web Conference 2021, 2021, pp. 624–632.
[9] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval, 2018, pp. 405–414.
[10] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, arXiv preprint arXiv:2207.05772 (2022).
[11] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016, pp. 103–110.
[12] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99–104.
[13] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 689–698.
[14] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[15] S. Sedhain, A. K. Menon, S. Sanner, L. Xie, AutoRec: Autoencoders meet collaborative filtering, in: Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 111–112.
[16] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[17] K. Zhao, J. Xu, M.-M. Cheng, RegularFace: Deep face recognition via exclusive regularization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1136–1144.
[18] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, C. Dwork, Learning fair representations, in: International Conference on Machine Learning, PMLR, 2013, pp. 325–333.
[19] A. Beutel, J. Chen, T. Doshi, H. Qian, L. Wei, Y. Wu, L. Heldt, Z. Zhao, L. Hong, E. H. Chi, et al., Fairness in recommendation ranking through pairwise comparisons, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining, 2019, pp. 2212–2220.
[20] R. Borges, K. Stefanidis, F2VAE: a framework for mitigating user unfairness in recommendation systems, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, 2022, pp. 1391–1398.
[21] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[22] M. D. Ekstrand, M. Tian, I. M. Azpiazu, J. D. Ekstrand, O. Anuyah, D. McNeill, M. S. Pera, All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness, in: Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 172–186.
[23] C. E. Brown, Coefficient of variation, in: Applied Multivariate Statistics in Geohydrology and Related Sciences, Springer, 1998, pp. 155–157.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>