Track2Vec: fairness music recommendation with a GPU-free customizable-driven framework

Wei-Wei Du1,∗, Wei-Yao Wang1 and Wen-Chih Peng1
1 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Abstract
Recommendation systems have made significant progress in characterizing users' preferences based on their past behaviors. Despite the effectiveness of recommending accurately, several factors that are essential for evaluating various facets of recommendation systems remain unexplored, e.g., fairness, diversity, and limited resources. To address these issues, we propose Track2Vec, a GPU-free customizable-driven framework for fairness music recommendation. In order to take both accuracy and fairness into account, our solution consists of three modules: customized fairness-aware groups for modeling different features based on configurable settings, a track representation learning module for learning better user embeddings, and an ensemble module for ranking the recommendation results from the different track representation learning modules. Moreover, inspired by TF-IDF, which has been widely used in natural language processing, we introduce a metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF) to measure fairness. Extensive experiments demonstrate that our model achieves the fourth prize on the leaderboard of the EvalRS @ CIKM 2022 challenge in a GPU-free environment, which is superior to the official baseline by about 200% in terms of the official scores. In addition, the ablation study illustrates the necessity of ensembling each group to acquire both accurate and fair recommendations.

Keywords
recommendation system, ensemble methods, fairness metric

CIKM'22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management
∗ Corresponding author.
Email: wwdu.cs10@nycu.edu.tw (W. Du); sf1638.cs05@nctu.edu.tw (W. Wang); wcpeng@cs.nycu.edu.tw (W. Peng)
URL: https://wwweiwei.github.io/ (W. Du)
ORCID: 0000-0002-0627-0314 (W. Du); 0000-0002-6551-1720 (W. Wang); 0000-0002-0172-7311 (W. Peng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Nowadays, there has been a surge in research focusing on recommendation systems (RSs) in different domains (e.g., movies, videos, news, products) with the aim of increasing the possibility that targeted users view or buy recommended items based on their browsing history. These approaches build their recommendation systems by filtering the most important and eye-catching information from the abundance of collected data to relieve the information overload problem. In addition, Covington et al. [1] introduced a framework that first selects hundreds of video candidates and then ranks these videos according to the user history and video content to alleviate the data sparsity problem and to generate more accurate recommendations.

However, most of the work has adopted accuracy-based metrics (e.g., hit-rate (HR), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG)), which fail to consider other factors that reflect the robustness of the models. Therefore, researchers from both academia and industry have paid more attention to investigating the issues of model fairness and diversity in the past few years. For instance, Yang and Stoyanovich [2] introduced fairness measures by generating synthetic data to quantify statistical parity and biases in rankings. Chia et al. [3] proposed RecList, a general plug-and-play framework to scale up behavioral testing.

In this challenge hosted by EvalRS (https://reclist.io/cikm2022-cup/), given user listening history, track metadata, and user metadata, the goal is to recommend K songs for each user, as shown in Figure 1. The recommended predictions are evaluated by standard RSs metrics (HR and MRR), standard metrics on a per-group or slice basis (gender balance, artist popularity, user country, song popularity, and user history), and behavioral tests (be less wrong, and latent diversity) [4]. To tackle the shared task, we propose Track2Vec, a framework with three modules that serves as a fairness music recommendation system. Specifically, our proposed Track2Vec is composed of customized fairness-aware groups for dividing user history into multiple facets, a track representation learning module for candidate matching, and an ensemble module for better ranking the recommended results. In this manner, Track2Vec is able not only to achieve robust performance without auxiliary tasks, but also to be deployed with limited resources (e.g., a GPU-free machine), which demonstrates the practicality of our framework.

Figure 1: An example of a music recommendation system.

Although there are several metrics that evaluate model behavior with different numbers of bins, we argue that existing metrics that rely on manual settings fail to distinguish the importance of different numbers of classes automatically, which makes it hard to generalize to different tasks and datasets with the same settings. To that end, we introduce a new metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF), which pays more attention to the less represented categories to avoid popular categories dominating the results. Our proposed metric can be extended to any existing metric directly by only computing the number of each class as the denominator. For instance, the numerators can be replaced with MRR or nDCG for evaluating different aspects with different numbers of classes.

In summary, our contributions are four-fold:

1. We propose Track2Vec as a fairness recommendation system with a customizable-driven framework, which achieves effective results (i.e., the fourth prize on the leaderboard) in a GPU-free environment.
2. To tackle the class imbalance issue, we introduce customized fairness-aware groups to divide user history into different aspects based on customizable configurations.
3. We introduce a novel metric, MR-ITF, to measure the predictive distribution of the model by weighting importance based on the number of predictions of each class, which can be generalized to any existing metric.
4. We conduct extensive experiments to demonstrate the effectiveness of Track2Vec, which outperforms the official baseline by about 200% in terms of the leaderboard (Phase 2) score. Moreover, the ablation study verifies the capabilities of the proposed framework.

2. Related Work

2.1. Recommendation Systems

Nowadays, recommendation systems are able to address the information overload problem by predicting a user's preference based on the user history. In general, recommendation techniques can be divided into four categories: content-, collaborative filtering-, knowledge-, and hybrid-based recommendation systems [5]. Recently, deep learning approaches have led to state-of-the-art performance on several recommendation system benchmarks. For instance, Covington et al. [1] introduced a two-stage framework, namely a deep candidate generation model and a deep ranking model, for YouTube recommendation. PinSage [6] proposed a data-efficient Graph Convolutional Network (GCN) algorithm for web-scale recommendation. Grbovic et al. [7], Vasile et al. [8] and Bianchi et al. [9] adopted neural language-based algorithms for product recommendation, and de Souza Pereira Moreira et al. [10] employed the transformer architecture for session-based recommendation.

In this shared task, one of the constraints is the limited resources and time for training and inference. Therefore, we introduce a data-driven unsupervised approach instead of supervised deep learning approaches to tackle the task with both accurate predictions and efficiency.

2.2. Fairness Metrics

Most RSs focus on the accuracy of recommended results. For instance, [11] introduces accuracy, coverage, variety, recommender confidence, robustness, scalability and privacy from a common RSs-centric perspective. On the other hand, it is also critical to evaluate trustworthiness, utility, risk and usability from a user-centric perspective of RSs. In this EvalRS challenge, the organizers provide various metrics for evaluating not only model performance but also model behavior with standard RSs metrics, standard metrics on a per-group or slice basis, and behavioral tests.

However, the metrics of model behavior require manual settings of divided bins and are hard to generalize to different tasks. Thus, we propose a novel metric, MR-ITF, which assigns lower weights to frequent categories and higher weights to rare categories so that the majority does not dominate the predictions. This metric can also be extended to any existing metric by only modifying the numerators.

3. Method

3.1. Preliminary

The dataset of this task is based on the LFM-1b Dataset [12], a corpus of listening events for music recommendation. It consists of 100M+ listening events and three types of data: users, with user background information and patterns of consumption; tracks, with the artist and album each track belongs to; and historical interactions, a collection of interactions between users and tracks. The details of the data processing procedure can be found in [4].

Figure 2: The pipeline of our proposed framework.
For every input user sequence (e.g., the green triangle), our model separates it by three features according to the input configuration (e.g., User Playcount, User Gender and Track Count) and adopts the corresponding track representation learning modules to encode track embeddings for recommending tracks. Then, the outputs from these modules are aggregated by the voting technique to produce the final recommendations.

3.2. Our Recommendation System: Track2Vec

Figure 2 demonstrates the pipeline of our framework. Given multiple types of user history, the customized fairness-aware groups module routes each sequence to a different track representation learning module according to the configurable settings. For example, if the input configuration is selected to focus on the user, a user sequence is first checked to decide which representation learning module in each group it is a training instance of (e.g., if the user is male, then only the male module has this instance in the gender group). Afterwards, the predictions are aggregated by using ensemble techniques to generate the top K (K=100 in this paper) predictions for the corresponding user. In the training phase, the user features are used to separate users into groups based on the input configuration. In the testing phase, the user features are obtained by fetching them from the training data (by user_id in this dataset).

Customized Fairness-Aware Groups. To give our model fair behavior, we first discretized each feature based on the feature distribution, and bunched users into different groups by the customizable input configuration to avoid the imbalance issue (e.g., the majority dominating the model behavior). With the exploratory data analysis shown in Figure 3, we observed that user playcount, user gender and track count are the three most important factors that affect the model behavior metrics. Therefore, we chose these three factors in this work to divide users into the corresponding groups. The playcount group used logarithmic bucketing in base 10 to divide users into four sub-groups (10, 100 and 1000 as divisions in this paper), the gender group divides the sequences into male, female and neutral, and the track count group also used logarithmic bucketing in base 10 (100 and 1000 as divisions in this paper). We note that these factors are configurable and can be changed to others in the framework.

Figure 3: Data distributions of user gender, user playcount, and track count. (a) User Gender. (b) User Playcount. (c) Track Count.

Track Representation Learning. One of the limitations in this task is the time constraint (22.5 minutes/fold on average); thus it is challenging to learn a fine-grained track representation using supervised deep learning approaches in a limited amount of time. Therefore, we focus on an unsupervised method to meet the requirement for training and recommending tracks for users. Specifically, we employ Word2Vec [13] to train track embeddings by calculating the interactions between tracks, which has a low computational cost while still producing high-quality embeddings. As there are two options in Word2Vec (i.e., continuous bag-of-words (CBOW) and skip-gram), we experiment with both methods using different negative sampling rates and window sizes to select the best one. The CBOW architecture predicts the current token based on the whole context, and the skip-gram architecture predicts the surrounding tokens given the current token.

Ensemble Techniques. Ensembling prediction results has been shown to improve the robustness of models in previous work [14, 15, 16], which motivated us to adopt ensemble techniques to produce more robust and diverse results. To consider different factors and to recommend fairer tracks to users, voting is used for ensembling each group with different priorities. Specifically, the ensemble re-ranking strategy is applied as follows after generating different predictions from each track representation learning module:

• Priority 1: Cumulative recommending times in descending order.
• Priority 2: Original individual module ranking in order.

3.3. Our Fairness Metric: MR-ITF

Currently, HR, nDCG and MRR are the most used metrics in recommendation systems to evaluate the effectiveness of models, but they fail to reflect the model behavior. To address the issue, MRED, being less wrong and latent diversity are proposed as evaluation metrics by RecList [3]. However, these metrics require human settings for the number of bins, and it is hard to generalize the same configurations to other tasks and datasets. Inspired by term frequency - inverse document frequency [17], which considers the frequency of words and lowers the importance of high-frequency words, we designed a novel metric, miss rate - inverse ground truth frequency (MR-ITF), to aggregate all scores with a different importance weighting for each class.

Formally, the computation of MR-ITF is as follows:

$$\mathrm{MR\text{-}ITF} = \frac{\sum_{i=1}^{|C|} MR_i \times ITF_i}{\sum_{j=1}^{N} MR_j}, \tag{1}$$

$$ITF_i = \log\left(\frac{\#\,\text{total predictions}}{\#\,\text{track}_i}\right), \tag{2}$$

where $|C|$ is the number of classes (the number of tracks in this paper), $N$ is the number of total instances, and $MR_i$ is the miss rate of the $i$-th track, as in [3].

If the predictions and ground truths are imbalanced, MR-ITF can attribute more importance to the tracks that are underrepresented. That is, MR-ITF reduces the influence of the majority group on the judgment of whether a model is good. As an edge example with the LFM-1b Dataset, if a less popular song appears once and all the other instances are "As It Was", the hit-rate of the model is nearly perfect when all the predictions recommend "As It Was", which is unfair and homogeneous in real-world applications. In this scenario, MR-ITF can capture this unfair condition to evaluate the model behavior. It is worth noting that the numerator can be changed to any existing metric, which demonstrates not only the generalizability of our proposed metric but also its automatic adjustment without manual settings.

4. Experiment

4.1. Experimental Setting

To implement our Track2Vec, we adopted Word2Vec [18] as the track representation learning module. The dimension of each track embedding was set to 100, the window size was set to 60, the minimum track frequency was 0, the number of negative samples was 5, the random seed was set to 27 and the number of training epochs was set to 10. All the training and evaluation phases were conducted on a machine with an AMD Ryzen Threadripper 3960X 24-Core Processor and 252GB RAM (we do not report our GPU as our approach does not require it). The results of the ablation experiments are obtained with 4-fold bootstrapped cross-validation. (Our code will be available at https://github.com/wwweiwei/Track2Vec.)

4.2. Results

Offline Performance. We first conducted an ablation study to verify the design of our proposed Track2Vec. As shown in Table 1, the total score (i.e., the average of (1) - (3)) degrades when any of the groups (gender, playcount, user track count) is removed, compared with our full Track2Vec. This result verifies the need to assign a user sequence to the corresponding group based on the configurations. Moreover, the behavioral tests of G perform the worst, which indicates that using user gender as the splitting criterion can achieve better results in each group (i.e., (2)), but hinders the model's recommendation of accurate music tracks to users as well as fair diversity. It is noted that the MR-ITF values (i.e., our proposed metric) are similar across different models, which indicates that the ground truths of tracks are quite diverse in the test set; thus, the ITF terms of the models are nearly the same. In addition, our investigation showed that the miss rates of our variants are similar across different tracks.

Table 1: Ablation study of Track2Vec. G: Gender. P: Playcount. U: User track count. Total score is computed as ((1) + (2) + (3)) / 3, the same as in Phase 1, since Phase 2 requires a minimum hit-rate threshold.

Metric                                  G         P         U         G+P       G+U       P+U       Track2Vec (ours)
Standard RSs metrics (1)                0.0103    0.0118    0.0128    0.0127    0.0136    0.0143    0.0146
Standard metrics on a per-group (2)    -0.0073   -0.0039   -0.0061   -0.0052   -0.0055   -0.0044   -0.0055
Behavioral tests (3)                   -0.0138    0.0014    0.0008    0.0188    0.0156    0.0223    0.0271
MR-ITF (ours)                          -4.3862   -4.3863   -4.3861   -4.3862   -4.3861   -4.3860   -4.3861
Total Score                            -0.0048    0.0008   -0.0003    0.0041    0.0035    0.0057    0.0062

Testing Performance. Table 2 shows the performance on the EvalRS leaderboard. Our approach achieved a total score of 1.1847, earning the fourth prize among 17 teams. In addition, these results illustrate that our Track2Vec outperforms the official baseline (CBOWRecSysBaseline) by nearly 200% in terms of the total score, which demonstrates the robust capability of our model. Furthermore, our approach demonstrates that competitive performance can be achieved not only with less computation cost in a GPU-free framework, but also with only three features and track_id.

Table 2: Performance of our Track2Vec on the leaderboard. The formula of the score is normalized with the official baseline and the best score of Phase 1.

Rank               Model                 Score      Standard RSs metrics   Standard metrics on a per-group   Behavioral tests
4                  Track2Vec              1.1847    0.0088                  2.9481                            0.2050
-                  CBOWRecSysBaseline    -1.2122    0.0512                 -3.7194                            0.4527
Improvements (%)   -                      198       -83                     179                               -55

4.3. Case Study: Track2Vec Behavior

To analyze the behavior of each group, we further conducted a case study to demonstrate the overlapping coverage of the top 100 recommended tracks from each group of Track2Vec. Table 3 illustrates parts of the recommended results and the overlapping ratio. We can observe that there is little overlap of recommendation results across the three groups, which demonstrates the diversity of each group and shows that Track2Vec is capable of combining these predictions to achieve both accurate and fair recommendations.

Table 3: Ensemble groups overlapping ratio.

Groups               Recommended track_id
Gender               Rolling in the Deep, Lights, Get Lucky, ...
Playcount            Burn, Lights, We Found Love, ...
User Track Count     Set Fire to the Rain, We Found Love, Titanium, ...
Track2Vec            Lights, We Found Love, Burn, ...
Overlapping ratio    12%

5. Conclusion

In this paper, we propose Track2Vec as a fairness recommendation system that uses customizable-driven groups to achieve fair model behavior, track representation learning to capture different user preferences, and an ensemble technique to aggregate different aspects. To mitigate the issue of neglecting the minority groups, we introduce MR-ITF, which weights different degrees of importance for each class based on the corresponding frequency and can be extended to any existing metric without manual settings. In extensive experiments, our Track2Vec achieved superior performance compared to the official baseline, which shows not only the capability of recommending fair music tracks but also that of an efficient recommendation system without any GPU. In addition, our proposed MR-ITF is able to reflect prediction bias, which uncovers the model behavior and encourages researchers to develop more advanced systems.

References

[1] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: RecSys, ACM, 2016, pp. 191–198.
[2] K. Yang, J. Stoyanovich, Measuring fairness in ranked outputs, in: SSDBM, ACM, 2017, pp. 22:1–22:6.
[3] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with reclist, CoRR abs/2111.09963 (2021).
[4] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. de Souza P. Moreira, P. J. Chia, Evalrs: a rounded evaluation of recommender systems, CoRR abs/2207.05772 (2022).
[5] Y. Peng, A survey on modern recommendation system based on big data, CoRR abs/2206.02631 (2022).
[6] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: KDD, ACM, 2018, pp. 974–983.
[7] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, D. Sharp, E-commerce in your inbox: Product recommendations at scale, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1809–1818.
[8] F. Vasile, E. Smirnova, A. Conneau, Meta-prod2vec: Product embeddings using side-information for recommendation, in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 225–232.
[9] F. Bianchi, B. Yu, J. Tagliabue, Bert goes shopping: Comparing distributional models for product representations, arXiv preprint arXiv:2012.09807 (2020).
[10] G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4rec: Bridging the gap between nlp and sequential/session-based recommendation, in: Fifteenth ACM Conference on Recommender Systems, 2021, pp. 143–153.
[11] V. Ingale, S. Ellambotla, Literature review on performance evaluation of recommendation system with different dimensions of metrics, Available at SSRN 4140551 (2022).
[12] M. Schedl, The lfm-1b dataset for music retrieval and recommendation, in: ICMR, ACM, 2016, pp. 103–110.
[13] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: ICLR (Workshop Poster), 2013.
[14] W. Wang, K. Chang, Y. Tang, Emotiongif-yankee: A sentiment classifier with robust model based ensemble methods, CoRR abs/2007.02259 (2020).
[15] W. Wang, W. Peng, Team yao at factify 2022: Utilizing pre-trained models and co-attention networks for multi-modal fact verification (short paper), in: DE-FACTIFY@AAAI, volume 3199 of CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[16] W. Wang, Y. Tang, W. Du, W. Peng, Nycu_twd@lt-edi-acl2022: Ensemble models with VADER and contrastive learning for detecting signs of depression from social media, in: LT-EDI, Association for Computational Linguistics, 2022, pp. 136–139.
[17] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process. Manag. 24 (1988) 513–523.
[18] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. http://is.muni.cz/publication/884893/en.
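The two-priority ensemble re-ranking described under "Ensemble Techniques" in Section 3.2 could be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function name `ensemble_rerank`, the dictionary layout, and the reading of Priority 2 as "best position in any individual module" are assumptions, and the toy lists only borrow a few titles from Table 3.

```python
from collections import defaultdict


def ensemble_rerank(group_rankings, k=100):
    """Vote-based re-ranking over several ranked lists (hypothetical helper).

    group_rankings maps a group name (e.g. "gender") to that module's
    ranked list of track ids, best first.
    """
    votes = defaultdict(int)  # how many modules recommended each track
    best_rank = {}            # best (lowest) position of a track in any module
    for ranking in group_rankings.values():
        for position, track in enumerate(ranking):
            votes[track] += 1
            if track not in best_rank or position < best_rank[track]:
                best_rank[track] = position
    # Priority 1: cumulative recommending times, descending.
    # Priority 2 (assumed reading): original module ranking, ascending.
    ordered = sorted(votes, key=lambda t: (-votes[t], best_rank[t]))
    return ordered[:k]


# Toy, hypothetical top-3 lists per group (titles borrowed from Table 3).
toy_rankings = {
    "gender": ["Rolling in the Deep", "Lights", "Get Lucky"],
    "playcount": ["Burn", "Lights", "We Found Love"],
    "track_count": ["Set Fire to the Rain", "We Found Love", "Titanium"],
}
top_tracks = ensemble_rerank(toy_rankings, k=5)
```

Tracks recommended by two modules ("Lights", "We Found Love") move ahead of tracks recommended by only one, and remaining ties fall back to the position each track held in its own module.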
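One possible reading of the MR-ITF definition in Section 3.3 (Equations 1 and 2) is sketched below on the "As It Was" edge case. Several choices here are assumptions rather than details given in the paper: `#total predictions` is taken as the number of test instances, `#track_i` as the ground-truth count of track i, the per-instance `MR_j` as a 0/1 miss indicator, and the logarithm as natural. The values in Table 1 are negative, so the official computation evidently uses a different sign or normalization, and this sketch is not meant to reproduce them.

```python
import math
from collections import Counter


def mr_itf(ground_truth, predictions):
    """Miss rate weighted by inverse ground-truth frequency (illustrative).

    ground_truth: one track id per test instance.
    predictions: one list of recommended track ids per test instance.
    """
    n = len(ground_truth)
    gt_count = Counter(ground_truth)  # assumed meaning of #track_i
    misses = Counter()                # missed instances per ground-truth track
    for truth, recs in zip(ground_truth, predictions):
        if truth not in recs:
            misses[truth] += 1
    total_missed = sum(misses.values())  # assumed meaning of sum_j MR_j
    if total_missed == 0:
        return 0.0  # assumption: a model with no misses scores zero
    weighted = 0.0
    for track, count in gt_count.items():
        miss_rate = misses[track] / count  # MR_i
        itf = math.log(n / count)          # ITF_i, natural log assumed
        weighted += miss_rate * itf
    return weighted / total_missed


# Edge case from Section 3.3: nine "As It Was" instances and one rare track.
truth = ["As It Was"] * 9 + ["rare"]
always_popular = [["As It Was"]] * 10                          # misses only the rare track
misses_one_popular = [["As It Was"]] * 8 + [["rare"], ["rare"]]  # misses one popular instance
score_popular = mr_itf(truth, always_popular)
score_other = mr_itf(truth, misses_one_popular)
```

Both toy models have the same hit-rate (9 of 10), but under this reading the one that always recommends the popular track receives a much larger miss score, because the single rare ground truth carries the largest ITF weight.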