Track2Vec: fairness music recommendation with a
GPU-free customizable-driven framework
Wei-Wei Du¹,*, Wei-Yao Wang¹ and Wen-Chih Peng¹
¹ Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan


Abstract
Recommendation systems have made significant progress in characterizing users' preferences based on
their past behaviors. Despite their effectiveness in recommending accurately, several factors that are
essential for evaluating various facets of recommendation systems remain unexplored, e.g., fairness,
diversity, and limited resources. To address these issues, we propose Track2Vec, a GPU-free
customizable-driven framework for fairness music recommendation. In order to take both accuracy and
fairness into account, our solution consists of three modules: customized fairness-aware groups for
modeling different features based on configurable settings, a track representation learning module for
learning better user embeddings, and an ensemble module for ranking the recommendation results from
different track representation learning modules. Moreover, inspired by TF-IDF, which has been widely
used in natural language processing, we introduce a metric called Miss Rate - Inverse Ground Truth
Frequency (MR-ITF) to measure fairness. Extensive experiments demonstrate that our model wins the
fourth prize in a GPU-free environment on the leaderboard of the EvalRS @ CIKM 2022 challenge, and is
superior to the official baseline by about 200% in terms of the official scores. In addition, the
ablation study illustrates the necessity of ensembling each group to acquire both accurate and fair
recommendations.

Keywords
recommendation system, ensemble methods, fairness metric


CIKM'22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management
* Corresponding author.
Email: wwdu.cs10@nycu.edu.tw (W. Du); sf1638.cs05@nctu.edu.tw (W. Wang); wcpeng@cs.nycu.edu.tw (W. Peng)
URL: https://wwweiwei.github.io/ (W. Du)
ORCID: 0000-0002-0627-0314 (W. Du); 0000-0002-6551-1720 (W. Wang); 0000-0002-0172-7311 (W. Peng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction
Recent years have seen a surge in research on recommendation systems (RSs) in different domains
(e.g., movies, videos, news, products), with the aim of increasing the likelihood that targeted users
view or buy recommended items based on their browsing history. These approaches filter the most
important and eye-catching information from the abundance of collected data to relieve the information
overload problem. In addition, Covington et al. [1] introduced a framework that first selects hundreds
of video candidates and then ranks these videos according to the user history and video content to
alleviate the data sparsity problem and to generate more accurate recommendations.

Figure 1: An example of a music recommendation system.

   However, most of the work has adopted accuracy-based metrics (e.g., hit-rate (HR), mean reciprocal
rank (MRR), and normalized discounted cumulative gain (nDCG)), which fail to consider other factors
that reflect the robustness of the models. Therefore, researchers from both academia and industry have
paid more attention to investigating the issues of model fairness and diversity in the past few years.
For instance, Yang and Stoyanovich [2] introduced fairness measures by generating synthetic data to
quantify statistical parity and biases in rankings. Chia et al. [3] proposed RecList, a general
plug-and-play framework to scale up behavioral testing.
   In this challenge hosted by EvalRS (https://reclist.io/cikm2022-cup/), given user listening history,
track metadata, and user metadata, the goal is to recommend K songs for each user, as shown in
Figure 1. The recommended predictions are evaluated by standard RSs metrics (HR and MRR), standard
metrics on a per-group or slice basis (gender balance, artist popularity, user country, song
popularity, and user history), and behavioral tests (be less wrong and latent diversity) [4]. To tackle
the shared task, we propose Track2Vec, a framework with three modules that serves as a fairness music
recommendation system. Specifically, our proposed Track2Vec is composed of customized fairness-aware
groups for dividing user history into multiple facets, a track representation learning module for
candidate matching, and an
ensemble module for better ranking the recommended results. In this manner, Track2Vec not only
achieves robust performance without auxiliary tasks, but can also be deployed with limited resources
(e.g., a GPU-free machine), which demonstrates the practicality of our framework.
   Although there are several metrics that evaluate model behavior with different numbers of bins, we
argue that existing metrics that rely on manual settings fail to distinguish the importance of
different numbers of classes automatically, which makes it hard to generalize to different tasks and
datasets with the same settings. To that end, we introduce a new metric called Miss Rate - Inverse
Ground Truth Frequency (MR-ITF), which pays more attention to the less represented categories to
prevent popular categories from dominating the results. Our proposed metric can be extended to any
existing metric directly by only computing the number of each class as the denominator. For instance,
the numerators can be replaced with MRR or nDCG for evaluating different aspects with different
numbers of classes.
   In summary, our contributions are four-fold:

    1. We propose Track2Vec as a fairness recommendation system with a customizable-driven
       framework, which achieves effective results (i.e., the fourth prize on the leaderboard) in a
       GPU-free environment.
    2. To tackle the class imbalance issue, we introduce customized fairness-aware groups to divide
       user history into different aspects based on customizable configurations.
    3. We introduce a novel metric, MR-ITF, to measure the predictive distribution of the model by
       weighting importance based on the number of predictions of each class, which can be
       generalized to any existing metrics.
    4. We conduct extensive experiments to demonstrate the effectiveness of Track2Vec, which
       outperforms the official baseline by about 200% in terms of the leaderboard (phase 2) score.
       Moreover, the ablation study verifies the capabilities of the proposed framework.

2. Related Work

2.1. Recommendation Systems
Recommendation systems help relieve the information overload problem by predicting a user's preference
based on the user history. In general, recommendation techniques can be divided into four categories:
content-based, collaborative filtering-based, knowledge-based, and hybrid recommendation systems [5].
Recently, deep learning approaches have led to state-of-the-art performance on several recommendation
system benchmarks. For instance, Covington et al. [1] introduced a two-stage framework, namely a deep
candidate generation model and a deep ranking model, for YouTube recommendation. PinSage [6] is a
data-efficient Graph Convolutional Network (GCN) algorithm for web-scale recommendation. Grbovic et al.
[7], Vasile et al. [8] and Bianchi et al. [9] adopted neural language-based algorithms for product
recommendation, and de Souza Pereira Moreira et al. [10] employed the transformer architecture for
session-based recommendation.
   In this shared task, one of the constraints is the limited resources and time for training and
inference. Therefore, we introduce a data-driven unsupervised approach instead of supervised deep
learning approaches to tackle the task with both accurate predictions and efficiency.

2.2. Fairness Metrics
Most RSs focus on the accuracy of recommended results; for instance, [11] introduces accuracy,
coverage, variety, recommender confidence, robustness, scalability and privacy from a common
RSs-centric perspective. On the other hand, it is also critical to evaluate trustworthiness, utility,
risk and usability from a user-centric perspective of RSs. In this EvalRS challenge, the organizers
provide various metrics for evaluating not only model performance but also model behavior: standard
RSs metrics, standard metrics on a per-group or slice basis, and behavioral tests.
   However, the metrics of model behavior require manual settings of divided bins and are hard to
generalize to different tasks. Thus, we propose a novel metric, MR-ITF, which assigns lower weights to
frequent categories and higher weights to rare categories so that the majority does not dominate the
predictions. This metric can also be extended to any existing metrics by only modifying the
numerators.

3. Method

3.1. Preliminary
The dataset of this task is based on the LFM-1b Dataset [12], a corpus of listening events for music
recommendation. It consists of 100M+ listening events and three types of data: users, with user
background information and patterns of consumption; tracks, with the artist and album each track
belongs to; and historical interactions, a collection of interactions between users and tracks.
Details of the data processing procedure can be found in [4].

Figure 2: The pipeline of our proposed framework. For every input user sequence (e.g., the green triangle), our model
separates it by three features according to the input configuration (e.g., User Playcount, User Gender and Track Count) and
adopts the corresponding track representation learning modules to encode track embeddings for recommending tracks. Then, the
outputs from these modules are aggregated by the voting technique to produce the final recommendations.
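
As a rough illustration of the pipeline in Figure 2, the sketch below routes a user to one sub-group per feature and then merges ranked lists from several modules by voting. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`bucket`, `assign_groups`, `ensemble_rerank`), the feature keys of the `user` dict, and the tie-breaking rule (best position in any module, then the earlier module) are hypothetical. The bucket divisions (10, 100, 1000 for playcount; 100, 1000 for track count) and the two voting priorities follow the description in Section 3.2.

```python
from bisect import bisect_right
from collections import Counter

# Divisions quoted in Section 3.2 (base-10 logarithmic bucketing).
PLAYCOUNT_DIVISIONS = (10, 100, 1000)   # yields four sub-groups
TRACK_COUNT_DIVISIONS = (100, 1000)     # yields three sub-groups


def bucket(value, divisions):
    """Return the index of the sub-group that `value` falls into."""
    return bisect_right(divisions, value)


def assign_groups(user):
    """Map one user's features to a sub-group key per fairness-aware group.

    `user` is a hypothetical dict with keys 'playcount', 'gender', 'track_count'.
    """
    return {
        "playcount": bucket(user["playcount"], PLAYCOUNT_DIVISIONS),
        "gender": user["gender"],
        "track_count": bucket(user["track_count"], TRACK_COUNT_DIVISIONS),
    }


def ensemble_rerank(module_rankings, k=100):
    """Merge ranked track lists from several modules by voting.

    Priority 1: tracks recommended by more modules come first.
    Priority 2: ties keep the original module ranking; this is assumed here
    to mean the best position in any module, then the earlier module.
    """
    votes = Counter()
    best_position = {}
    for module_index, ranking in enumerate(module_rankings):
        for position, track in enumerate(ranking):
            votes[track] += 1
            key = (position, module_index)
            if track not in best_position or key < best_position[track]:
                best_position[track] = key
    ordered = sorted(votes, key=lambda t: (-votes[t], best_position[t]))
    return ordered[:k]


if __name__ == "__main__":
    print(assign_groups({"playcount": 250, "gender": "female", "track_count": 50}))
    print(ensemble_rerank([["a", "b", "c"], ["d", "b", "e"], ["f", "e", "g"]], k=3))
```

In the toy call, tracks `b` and `e` are each recommended by two modules and therefore precede every track recommended once; `a` follows because it holds the top position in the first module. Using fixed division tuples with `bisect_right` avoids floating-point logarithms while producing the same base-10 buckets.
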


3.2. Our Recommendation System: Track2Vec
Figure 2 demonstrates the pipeline of our framework. Given multiple types of user history, the
customized fairness-aware groups route each sequence to a different track representation learning
module according to the configurable settings. For example, if the input configuration is selected to
focus on the user, a user sequence is first checked to decide which representation learning module in
each group it serves as a training instance for (e.g., if the user is male, then only the male module
in the gender group has this instance). Afterwards, the predictions are aggregated by using ensemble
techniques to generate the top K (K=100 in this paper) predictions for the corresponding user. In the
training phase, the user features are used to separate users into groups based on the input
configuration. In the testing phase, the user features are fetched from the training data (by user_id
in this dataset).
Customized Fairness-Aware Groups. To equip our model with fair behavior, we first discretized each
feature based on the feature distribution, and bunched users into different groups according to the
customizable input configuration to avoid the imbalance issue (e.g., the majority dominating the model
behavior).
   With the exploratory data analysis shown in Figure 3, we observed that user playcount, user gender
and track count are the three most important factors that affect the model behavior metrics. The
playcount group uses logarithmic bucketing in base 10 to divide users into four sub-groups (10, 100,
1000 as divisions in this paper), the gender group divides the sequences into male, female and
neutral, and the track count group also uses logarithmic bucketing in base 10 (100, 1000 as divisions
in this paper). Therefore, we chose these three factors in this work to divide users into the
corresponding groups. We note that these factors are configurable and can be changed to others in the
framework.
Track Representation Learning. One of the limitations in this task is the time constraint (22.5
minutes/fold on average); thus, it is challenging to learn a fine-grained track representation using
supervised deep learning approaches in a limited amount of time. Therefore, we focus on an
unsupervised method to meet the requirement for training and recommending tracks for users.
Specifically, we employ Word2Vec [13] to train track embeddings from the interactions between tracks,
which offers both low computational cost and high-quality embeddings. As there are two options in
Word2Vec (i.e., continuous bag-of-words (CBOW) and skip-gram), we experimented with both methods under
different negative sampling rates and window sizes to select the best one. The CBOW architecture
predicts the current token based on the surrounding context, and skip-gram predicts surrounding tokens
given the current token.
Ensemble Techniques. Ensembling prediction results has been shown to improve the robustness of models
in previous work [14, 15, 16], which motivated us to adopt ensemble techniques to produce more robust
and diverse results. To consider different factors and to recommend fairer tracks to users, voting is
used for ensembling the groups with different priorities. Specifically, the ensemble re-ranking
strategy is applied as follows after generating different predictions from each track representation
learning module:

      • Priority 1: Cumulative recommending times in
        descending order.
      • Priority 2: Original individual module ranking in order.

Figure 3: Data distributions of user gender, user playcount, and track count. Panels: (a) User
Gender. (b) User Playcount. (c) Track Count.

3.3. Our Fairness Metric: MR-ITF
Currently, HR, nDCG and MRR are the most used metrics in recommendation systems to evaluate the
effectiveness of models, but they fail to reflect the model behavior. To address the issue, MRED,
being less wrong and latent diversity are proposed as evaluation metrics by RecList [3]. However,
these metrics require human settings for the number of bins, and it is hard to generalize the same
configurations to other tasks and datasets. Inspired by term frequency - inverse document frequency
[17], which considers the frequency of words and lowers the importance of high-frequency words, we
designed a novel metric, miss rate - inverse ground truth frequency (MR-ITF), to aggregate all scores
with different importance weighting for each class.
   Formally, the computation of MR-ITF is as follows:

   $$\mathrm{MR\text{-}ITF} = \frac{\sum_{i=1}^{|C|} MR_i \times ITF_i}{\sum_{j=1}^{N} MR_j}, \qquad (1)$$

   $$ITF_i = \log\left(\frac{\#\,\mathrm{total\ predictions}}{\#\,\mathrm{track}_i}\right), \qquad (2)$$

where |C| is the number of classes (the number of tracks in this paper), N is the number of total
instances, and MR_i is the miss-rate of the i-th track, as in [3].
   If the predictions and ground truths are imbalanced, MR-ITF can attribute more importance to the
tracks that are underrepresented. That is, MR-ITF reduces the influence of the majority group on the
judgment of whether a model is good. As an extreme example with the LFM-1b Dataset, if a less popular
song appears only once and all other ground truths are "As It Was", the hit-rate of the model is
nearly perfect when all predictions recommend "As It Was", which is neither fair nor diverse in
real-world applications. In this scenario, MR-ITF can capture this unfair condition to evaluate the
model behavior. It is worth noting that the numerator can be changed to any existing metric, which
demonstrates not only the generalizability of our proposed metric but also its automatic adjustment
without manual settings.

4. Experiment

4.1. Experimental Setting
To implement our Track2Vec, we adopted Word2Vec [18] as the track representation learning module. The
dimension of each track embedding was set to 100, the window size to 60, the minimum track frequency
to 0, the number of negative samples to 5, the random seed to 27, and the number of training epochs to
10. All the training and evaluation phases were conducted on a machine with an AMD Ryzen Threadripper
3960X 24-Core Processor and 252GB RAM (we do not report our GPU as our approach does not require it).
The results of the ablative experiments are from 4-fold bootstrapped cross-validation (our code will
be available at https://github.com/wwweiwei/Track2Vec).

Table 1
Ablation study of Track2Vec. G: Gender. P: Playcount. U: User track count. Total score is computed as
((1) + (2) + (3)) / 3, the same as in Phase 1, since Phase 2 requires a minimum hit-rate threshold.

                                          G         P         U        G+P       G+U       P+U     Track2Vec (ours)
   Standard RSs metrics (1)             0.0103    0.0118    0.0128    0.0127    0.0136    0.0143       0.0146
   Standard metrics on a per-group (2) -0.0073   -0.0039   -0.0061   -0.0052   -0.0055   -0.0044      -0.0055
   Behavioral tests (3)                -0.0138    0.0014    0.0008    0.0188    0.0156    0.0223       0.0271
   MR-ITF (ours)                       -4.3862   -4.3863   -4.3861   -4.3862   -4.3861   -4.3860      -4.3861
   Total Score                         -0.0048    0.0008   -0.0003    0.0041    0.0035    0.0057       0.0062

4.2. Results
Offline Performance. We first conducted an ablation study to verify the design of our proposed
Track2Vec. As shown in Table 1, it is evident that the performance of the total score (i.e., the
average of (1) -
(3)) is degraded when any of the groups (gender, playcount, user track count) is removed, compared
with our Track2Vec. This result verifies the need to divide a user sequence into the corresponding
group based on the configurations. Moreover, G performs the worst on the behavioral tests, which
indicates that although using user gender as the splitting standard can achieve better results in
each group (i.e., (2)), it hinders the model's recommendation of accurate music tracks to users as
well as fair diversity. It is noted that the performances of MR-ITF (i.e., our proposed metric) are
similar across different models, which indicates that the ground truths of tracks are quite diverse in
the test set; thus, the ITF term of each model is nearly the same. In addition, our investigation
shows that the miss rates of our variants are similar across different tracks.
Testing Performance. Table 2 shows the performance on the EvalRS leaderboard. Our approach achieved a
total score of 1.1847, winning the fourth prize among 17 teams. In addition, these results illustrate
that our Track2Vec outperforms the official baseline (CBOWRecSysBaseline) by nearly 200%, which
demonstrates the robust capability of our model. Furthermore, our approach demonstrates that a
GPU-free framework with lower computation cost, utilizing only three features and track_id, can
achieve competitive performance.

Table 2
Performance of our Track2Vec on the leaderboard. The score is normalized with the official baseline
and the best score of Phase 1.

   Rank               Model                 Score     Standard RSs metrics   Standard metrics on a per-group   Behavioral tests
   4                  Track2Vec             1.1847          0.0088                      2.9481                      0.2050
   -                  CBOWRecSysBaseline   -1.2122          0.0512                     -3.7194                      0.4527
   Improvements (%)   -                      198             -83                         179                         -55

4.3. Case Study: Track2Vec Behavior
To analyze the behavior of each group, we further conducted a case study to demonstrate the
overlapping coverage of the top 100 recommended tracks from each group of Track2Vec. Table 3
illustrates parts of the recommended results and the overlapping ratio. We can observe that there is
little overlap among the recommendation results across the three groups, which demonstrates the
diversity of each group and shows that Track2Vec is capable of combining these predictions to achieve
both accurate and fair recommendations.

Table 3
Ensemble groups overlapping ratio.

   Groups              Recommended track_id
   Gender              Rolling in the Deep, Lights, Get Lucky, ...
   Playcount           Burn, Lights, We Found Love, ...
   User Track Count    Set Fire to the Rain, We Found Love, Titanium, ...
   Track2Vec           Lights, We Found Love, Burn, ...
   Overlapping ratio   12%

5. Conclusion
In this paper, we propose Track2Vec as a fairness recommendation system with customizable-driven
groups to achieve fair model behavior, track representation learning to capture different user
preferences, and an ensemble technique to aggregate different aspects. To mitigate the issue of
neglecting the minority groups, we introduce MR-ITF by weighting different degrees of importance for
each class based on the corresponding frequency, which can be extended to any existing metrics without
manual settings. In extensive experiments, our Track2Vec achieved superior performance compared to the
official baseline, which shows not only the capability of recommending fair music tracks but also an
efficient recommendation system without any GPU. In addition, our proposed MR-ITF is able to reflect
prediction bias, which uncovers the model behavior and encourages researchers to develop more advanced
systems.

References
 [1] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: RecSys,
     ACM, 2016, pp. 191–198.
 [2] K. Yang, J. Stoyanovich, Measuring fairness in ranked outputs, in: SSDBM, ACM, 2017, pp.
     22:1–22:6.
 [3] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of
     recommender systems with reclist, CoRR abs/2111.09963 (2021).
 [4] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. de Souza P. Moreira, P. J.
     Chia, Evalrs: a rounded evaluation of recommender systems, CoRR abs/2207.05772 (2022).
 [5] Y. Peng, A survey on modern recommendation system based on big data, CoRR abs/2206.02631 (2022).
 [6] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural
     networks for web-scale recommender systems, in: KDD, ACM, 2018, pp. 974–983.
 [7] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, D. Sharp,
     E-commerce in your inbox: Product recommendations at scale, in: Proceedings of the 21th ACM
     SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1809–1818.
 [8] F. Vasile, E. Smirnova, A. Conneau, Meta-prod2vec: Product embeddings using side-information for
     recommendation, in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp.
     225–232.
 [9] F. Bianchi, B. Yu, J. Tagliabue, Bert goes shopping: Comparing distributional models for product
     representations, arXiv preprint arXiv:2012.09807 (2020).
[10] G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4rec: Bridging
     the gap between nlp and sequential/session-based recommendation, in: Fifteenth ACM Conference on
     Recommender Systems, 2021, pp. 143–153.
[11] V. Ingale, S. Ellambotla, Literature review on performance evaluation of recommendation system
     with different dimensions of metrics, Available at SSRN 4140551 (2022).
[12] M. Schedl, The lfm-1b dataset for music retrieval and recommendation, in: ICMR, ACM, 2016, pp.
     103–110.
[13] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector
     space, in: ICLR (Workshop Poster), 2013.
[14] W. Wang, K. Chang, Y. Tang, Emotiongif-yankee: A sentiment classifier with robust model based
     ensemble methods, CoRR abs/2007.02259 (2020).
[15] W. Wang, W. Peng, Team yao at factify 2022: Utilizing pre-trained models and co-attention
     networks for multi-modal fact verification (short paper), in: DE-FACTIFY@AAAI, volume 3199 of
     CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[16] W. Wang, Y. Tang, W. Du, W. Peng, Nycu_twd@lt-edi-acl2022: Ensemble models with VADER and
     contrastive learning for detecting signs of depression from social media, in: LT-EDI,
     Association for Computational Linguistics, 2022, pp. 136–139.
[17] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process.
     Manag. 24 (1988) 513–523.
[18] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings
     of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp.
     45–50. http://is.muni.cz/publication/884893/en.