
music recommendation with a GPU-free customizable-driven framework

Wei-Wei Du

Wei-Yao Wang

Wen-Chih Peng

wcpeng@cs.nycu.edu.tw
Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Recommendation systems have made significant progress in characterizing users' preferences based on their past behaviors. Despite the effectiveness of recommending accurately, several factors that are essential for evaluating various facets of recommendation systems remain unexplored, e.g., fairness, diversity, and limited resources. To address these issues, we propose Track2Vec, a GPU-free customizable-driven framework for fairness music recommendation. In order to take both accuracy and fairness into account, our solution consists of three modules: a customized fairness-aware groups module for modeling different features based on configurable settings, a track representation learning module for learning better user embeddings, and an ensemble module for ranking the recommendation results from different track representation learning modules. Moreover, inspired by TF-IDF, which has been widely used in natural language processing, we introduce a metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF) to measure fairness. Extensive experiments demonstrate that our model achieves a fourth-prize ranking in a GPU-free environment on the leaderboard of the EvalRS @ CIKM 2022 challenge, which is superior to the official baseline by about 200% in terms of the official scores. In addition, the ablation study illustrates the necessity of ensembling each group to acquire both accurate and fair recommendations.

Keywords: recommendation system, ensemble methods, fairness metric

1. Introduction

Nowadays, there has been a surge in research focusing on recommendation systems (RSs) in different domains (e.g., movies, videos, news, products) with the aim of increasing the likelihood that targeted users view or buy recommended items based on their browsing history. These approaches build their recommendation systems by filtering the most important and eye-catching information from the abundance of collected data to relieve the information overload problem. In addition, Covington et al. [1] introduced a framework that first selects hundreds of video candidates and then ranks these videos according to the user history and video content to alleviate the data sparsity problem and to generate more accurate recommendations.

However, most of the work has adopted accuracy-based metrics (e.g., hit-rate (HR), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG)), which fail to consider other factors that reflect the robustness of the models. Therefore, researchers from both academia and industry have paid more attention to investigating the issues of model fairness and diversity. Yang and Stoyanovich [2] introduced fairness measures by generating synthetic data to quantify statistical parity and biases in rankings. Chia et al. [3] proposed RecList, a general plug-and-play framework to scale up behavioral testing.

CIKM'22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management
∗Corresponding author.
ORCID: 0000-0002-0627-0314 (W. Du); 0000-0002-6551-1720 (W. Wang)

In this challenge hosted by EvalRS¹, given user listening history, track metadata, and user metadata, the goal is to recommend K songs for each user, as shown in Figure 1. The recommended predictions are evaluated by standard RSs metrics (HR and MRR), standard metrics on a per-group or slice basis (gender balance, artist popularity, user country, song popularity, and user history), and behavioral tests (be less wrong and latent diversity) [4]. To tackle the shared task, we propose Track2Vec, a framework with three modules that serves as a fairness music recommendation system. Specifically, Track2Vec is composed of a customized fairness-aware groups module for dividing user history into multiple facets, a track representation learning module for candidate matching, and an ensemble module for better ranking the recommended results. In this manner, Track2Vec not only achieves robust performance without auxiliary tasks, but can also be deployed with limited resources (e.g., a GPU-free machine), which demonstrates the practicality of our framework.

¹ https://reclist.io/cikm2022-cup/

Although several metrics evaluate model behavior with different numbers of bins, we argue that existing metrics that rely on manual settings fail to distinguish the importance of different numbers of classes automatically, which makes it hard to generalize them to different tasks and datasets with the same settings. To that end, we introduce a new metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF), which pays more attention to the less represented categories so that popular categories do not dominate the results. Our proposed metric can be extended to any existing metric directly by only computing the number of instances of each class as the denominator. For instance, the numerators can be replaced with MRR or nDCG to evaluate different aspects with different numbers of classes.

In summary, our contributions are four-fold:

1. We propose Track2Vec as a fairness recommendation system with a customizable-driven framework, which achieves effective results (i.e., the fourth prize on the leaderboard) in a GPU-free environment.
2. To tackle the class imbalance issue, we introduce customized fairness-aware groups to divide user history into different aspects based on customizable configurations.
3. We introduce a novel metric, MR-ITF, to measure the predictive distribution of the model by weighting importance based on the number of predictions of each class, which can be generalized to any existing metric.
4. We conduct extensive experiments to demonstrate the effectiveness of Track2Vec, which outperforms the official baseline by about 200% in terms of the leaderboard (phase 2) score. Moreover, the ablation study verifies the capabilities of the proposed framework.

2. Related Work

2.1. Recommendation Systems

Nowadays, recommendation systems are able to solve the information overload problem by predicting a user's preference based on the user history. In general, recommendation techniques can be divided into four categories: content-, collaborative filtering-, knowledge-, and hybrid-based recommendation systems [5]. Recently, deep learning approaches have led to state-of-the-art performance on several recommendation system benchmarks. For instance, Covington et al. [1] introduced a two-stage framework, namely a deep candidate generation model and a deep ranking model, for YouTube recommendation. PinSage is a data-efficient Graph Convolutional Network (GCN) algorithm for web-scale recommendation [6]. Grbovic et al. [7], Vasile et al. [8] and Bianchi et al. [9] adopted neural language-based algorithms for product recommendation, and de Souza Pereira Moreira et al. [10] employed the transformer architecture for session-based recommendation.

In this shared task, one of the constraints is the limited resources and time for training and inference. Therefore, we introduce a data-driven unsupervised approach instead of supervised deep learning approaches to tackle the task with both accurate predictions and efficiency.

2.2. Fairness Metrics

Most RSs focus on the accuracy of recommended results; for instance, [11] introduces accuracy, coverage, variety, recommender confidence, robustness, scalability and privacy from a common RSs-centric perspective. On the other hand, it is also critical to evaluate trustworthiness, utility, risk and usability from a user-centric perspective of RSs. In this EvalRS challenge, the organizers provide various metrics for evaluating not only model performance but also model behavior with standard RSs metrics, standard metrics on a per-group or slice basis, and behavioral tests.

However, the metrics of model behavior require manual settings of divided bins and are hard to generalize to different tasks. Thus, we propose a novel metric, MR-ITF, which assigns lower weights to frequent categories and higher weights to rare categories so that the majority does not dominate the predictions. This metric can also be extended to any existing metric by only modifying the numerators.

3. Method

3.1. Preliminary

The dataset of this task is based on the LFM-1b Dataset [12], a corpus of listening events for music recommendation. It consists of 100M+ listening events and three types of data: users (user background information and patterns of consumption), tracks (the artist and album each track belongs to), and historical interactions (a collection of interactions between users and tracks). The details of the data processing procedure can be found in [4].

3.2. Our Recommendation System: Track2Vec

Figure 2 demonstrates the pipeline of our framework. Given multiple types of user history, the customized fairness-aware groups module assigns each sequence to a track representation learning module according to the configurable settings. For example, if the input configuration is selected to focus on the user, each user sequence is first checked to decide which representation learning module in each group receives it as a training instance (e.g., if the user is male, then only the male module has this instance in the gender group). Afterwards, the predictions are aggregated by using ensemble techniques to generate the top K (K=100 in this paper) predictions for the corresponding user. In the training phase, the user features are used to separate users into groups based on the input configuration. In the testing phase, the user features are obtained by fetching them from the training data (by user_id in this dataset).

Customized Fairness-Aware Groups. To equip our model with fair behavior, we first discretized each feature based on the feature distribution, and bunched users into different groups according to the customizable input configuration to avoid the imbalance issue (e.g., the majority dominating the model behavior).

With the exploratory data analysis shown in Figure 3, we observed that user playcount, user gender and track count are the three most important factors that affect the model behavior metrics. The playcount group used logarithmic bucketing in base 10 to divide users into four sub-groups (10, 100, 1000 as divisions in this paper), the gender group divides the sequences into male, female and neutral, and the track count group also used logarithmic bucketing in base 10 (100, 1000 as divisions in this paper). Therefore, we chose these three factors in this work to divide users into the corresponding groups. We note that these factors are configurable and can be changed to others in the framework.

Figure 3: (a) User Gender. (b) User Playcount. (c) Track Count.

Track Representation Learning. One of the limitations in this task is the time constraint (22.5 minutes/fold on average); thus, it is challenging to learn a fine-grained track representation using supervised deep learning approaches in a limited amount of time. Therefore, we focus on an unsupervised method to meet the requirement for training and recommending tracks for users. Specifically, we employ Word2Vec [13] to train track embeddings by calculating the interactions between tracks, which offers both low computational cost and high quality. As there are two options in Word2Vec (i.e., continuous bag-of-words (CBOW) and skip-gram), we experiment with both methods using different negative sampling rates and window sizes to select the best one. The CBOW architecture predicts the current token based on the whole context, and the skip-gram architecture predicts surrounding tokens given the current token.

Ensemble Techniques. Ensembling prediction results has demonstrated the robustness of models in previous work [14, 15, 16], which motivated us to adopt ensemble techniques to produce more robust and diverse results. To consider different factors and to recommend fairer tracks to users, voting is used for ensembling each group with different priorities. Specifically, after generating different predictions from each track representation learning module, the ensemble re-ranking strategy is applied as follows:

• Priority 1: Cumulative recommending times in descending order.
• Priority 2: Original individual module ranking in order.

3.3. Our Fairness Metric: MR-ITF

Currently, HR, nDCG and MRR are the most used metrics in recommendation systems to evaluate the effectiveness of models, but they fail to reflect the model behavior. To address the issue, MRED, being less wrong and latent diversity are proposed as evaluation metrics by RecList [3]. However, these metrics require human settings for the number of bins, and it is hard to generalize the same configurations to other tasks and datasets. Inspired by term frequency - inverse document frequency [17], which considers the frequency of the words and lowers the importance of the high-frequency words, we designed a novel metric, miss rate - inverse ground truth frequency (MR-ITF), to aggregate all scores with a different importance weighting for each class.

Formally, the computation of MR-ITF is as follows:

$$\mathrm{ITF}_i = \log\left(\frac{M}{\#_i}\right), \qquad M = \sum_{j=1}^{N} \#_j \tag{1}$$

$$\text{MR-ITF} = \sum_{i=1}^{N} \mathrm{MR}_i \times \mathrm{ITF}_i, \tag{2}$$

where $N$ is the number of classes (the number of tracks in this paper), $M$ is the number of total instances, $\#_i$ is the number of ground-truth instances of the $i$-th track, and $\mathrm{MR}_i$ is the miss-rate of the $i$-th track, as in [3].

4. Experiment

4.1. Experimental Setting

To implement our Track2Vec, we adopted Word2Vec [18] as the track representation learning module. The dimension of each track embedding was set to 100, the window size to 60, the minimum track frequency to 0, the number of negative samples to 5, the random seed to 27, and the number of training epochs to 10. All the training and evaluation phases were conducted on a machine with an AMD Ryzen Threadripper 3960X 24-Core Processor and 252GB RAM (we do not report our GPU as our approach does not require it). The results of the ablative experiments are from 4-fold bootstrapped cross-validation².

² Our code will be available at https://github.com/wwweiwei/Track2Vec.

4.2. Results

Table 1: Ablation study of Track2Vec. G: Gender. P: Playcount. U: User track count. Total score is computed as ((1) + (2) + (3)) / 3, the same as Phase 1, since Phase 2 requires a minimum hit-rate threshold. Columns: Standard RSs metrics (1); Standard metrics on a per-group basis (2); Behavioral tests (3); MR-ITF (ours); Total Score.

Offline Performance. We first conducted an ablation study to verify the design of our proposed Track2Vec. As shown in Table 1, it is evident that the total score (i.e., the average of (1), (2) and (3)) degrades without any of the groups (gender, playcount, user track count) compared with the full Track2Vec. This result verifies the need to assign a user sequence to the corresponding group based on the configurations. Moreover, the behavioral tests of G perform the worst, which indicates that although using user gender as the splitting criterion can achieve better per-group results (i.e., (2)), it hinders the model's ability to recommend accurate music tracks to users as well as fair diversity. We note that the MR-ITF values (i.e., our proposed metric) are similar across different models, which indicates that the ground truths of tracks are quite diverse in the test set; thus, the ITF terms of the models are nearly the same. In addition, our investigation found that the miss rates of our variants are similar across different tracks.
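As a rough illustration of the MR-ITF idea from Section 3.3, the sketch below weights each track's miss rate by an inverse ground-truth-frequency term. It assumes ITF_i = log(M / #_i), where #_i is the ground-truth count of track i and M is the total number of instances, by analogy with the IDF term of TF-IDF. The function name `mr_itf` and the toy inputs are hypothetical and are not taken from the authors' code.

```python
import math
from collections import Counter


def mr_itf(ground_truth, hits):
    """Sum of per-track miss rate times inverse ground-truth frequency.

    ground_truth: list of track ids, one per test instance.
    hits: list of bools, True when that instance's track was recommended.
    Assumption: ITF_i = log(M / count_i), mirroring the IDF term of TF-IDF.
    """
    if len(ground_truth) != len(hits):
        raise ValueError("ground_truth and hits must have equal length")
    total = len(ground_truth)
    counts = Counter(ground_truth)
    # Count, per track, how many of its test instances were missed.
    misses = Counter(t for t, hit in zip(ground_truth, hits) if not hit)
    score = 0.0
    for track, count in counts.items():
        miss_rate = misses[track] / count
        itf = math.log(total / count)
        score += miss_rate * itf
    return score


# Toy data: track "a" is frequent, track "b" is rare.
gt = ["a", "a", "a", "b"]
hit_flags = [True, True, False, False]
print(mr_itf(gt, hit_flags))
```

In this toy example the single miss on the rare track "b" contributes far more to the score than the miss on the frequent track "a", which is the weighting behavior the metric is meant to capture.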

Testing Performance. Table 2 shows the performance on the EvalRS leaderboard. Our approach achieved a total score of 1.1847, earning the fourth prize among 17 teams. In addition, these results illustrate that Track2Vec outperforms the official baseline (CBOWRecSysBaseline) by nearly 200%, which demonstrates the robust capability of our model. Furthermore, our approach demonstrates that competitive performance can be achieved with lower computation cost in a GPU-free framework while using only three features and track_id.
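The three features mentioned above are discretized as described in Section 3.2. A minimal sketch of that grouping follows, assuming the stated base-10 divisions (10, 100, 1000 for playcount; 100, 1000 for track count). The function names, the choice to send a value equal to a division into the higher bucket, and the mapping of any other gender value to neutral are illustrative assumptions rather than the authors' implementation.

```python
from bisect import bisect_right

PLAYCOUNT_DIVISIONS = (10, 100, 1000)   # yields four sub-groups
TRACK_COUNT_DIVISIONS = (100, 1000)     # yields three sub-groups


def bucket(value, divisions):
    """Return the index of the sub-group that value falls into.

    Assumption: a value equal to a division goes to the higher bucket.
    """
    return bisect_right(divisions, value)


def assign_groups(gender, playcount, track_count):
    """Map one user's features to a sub-group for each configured group."""
    # Assumption: anything other than male/female is treated as neutral.
    gender_key = gender if gender in ("male", "female") else "neutral"
    return {
        "gender": gender_key,
        "playcount": bucket(playcount, PLAYCOUNT_DIVISIONS),
        "track_count": bucket(track_count, TRACK_COUNT_DIVISIONS),
    }


print(assign_groups("male", 250, 40))
```

Each returned sub-group key would select which per-group track representation module receives the user's listening sequence; swapping the division tuples is one way the configuration could be changed.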


To analyze the behavior of each group, we further conducted a case study to demonstrate the overlapping coverage of the top 100 recommended tracks from each group of Track2Vec. Table 3 illustrates parts of the recommended results and the overlapping ratio. We can observe that there is little overlap among the recommendation results of the three groups, which demonstrates the diversity of each group and shows that Track2Vec is capable of drawing on these predictions to achieve both accurate and fair recommendations.
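To make the overlap analysis and the subsequent merging concrete, the sketch below computes the pairwise overlap ratio between per-group top-K lists and then merges the lists with the two-priority voting from Section 3.2. The tie-break used for Priority 2 (the best rank position a track reached in any individual module) is one possible reading of the text, and all names and toy lists are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def overlap_ratio(list_a, list_b):
    """Fraction of items shared by two equally long top-K lists."""
    if not list_a or len(list_a) != len(list_b):
        raise ValueError("lists must be non-empty and of equal length")
    return len(set(list_a) & set(list_b)) / len(list_a)


def vote_ensemble(predictions, k):
    """Merge ranked lists from several modules into one top-k list.

    Priority 1: tracks recommended by more modules come first.
    Priority 2 (assumed): ties are broken by the best rank position
    the track reached in any individual module.
    Each input list is assumed to contain no duplicate tracks.
    """
    votes = defaultdict(int)
    best_rank = {}
    for ranked in predictions.values():
        for rank, track in enumerate(ranked):
            votes[track] += 1
            best_rank[track] = min(rank, best_rank.get(track, rank))
    ordered = sorted(votes, key=lambda t: (-votes[t], best_rank[t]))
    return ordered[:k]


# Toy top-3 lists standing in for the top-100 lists of the three groups.
preds = {
    "gender": ["t1", "t2", "t3"],
    "playcount": ["t2", "t4", "t5"],
    "track_count": ["t6", "t2", "t1"],
}
for a, b in combinations(preds, 2):
    print(a, b, overlap_ratio(preds[a], preds[b]))
print(vote_ensemble(preds, k=3))
```

A low overlap ratio means each group contributes mostly distinct candidates, so the vote count in Priority 1 separates only a few tracks and the per-module rank in Priority 2 decides most of the final order.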

5. Conclusion

In this paper, we propose Track2Vec as a fairness recommendation system that uses customizable-driven groups to achieve fair model behavior, track representation learning to capture different user preferences, and an ensemble technique to aggregate different aspects. To mitigate the issue of neglecting the minority groups, we introduce MR-ITF, which weights each class with a different degree of importance based on the corresponding frequency and can be extended to any existing metric without manual settings. In extensive experiments, Track2Vec achieved superior performance compared to the official baseline, which shows not only the capability of recommending fair music tracks but also that an efficient recommendation system can be built without any GPU. In addition, our proposed MR-ITF is able to reflect prediction bias, which uncovers the model behavior and encourages researchers to develop more advanced systems.

References

[1] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: RecSys, ACM, 2016, pp. 191–198.
[2] K. Yang, J. Stoyanovich, Measuring fairness in ranked outputs, in: SSDBM, ACM, 2017, pp. 22:1–22:6.
[3] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with reclist, CoRR abs/2111.09963 (2021).
[4] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. de Souza P. Moreira, P. J. Chia, Evalrs: a rounded evaluation of recommender systems, CoRR abs/2207.05772 (2022).
[5] Y. Peng, A survey on modern recommendation system based on big data, CoRR abs/2206.02631 (2022).
[6] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: KDD, ACM, 2018, pp. 974–983.
[7] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, D. Sharp, E-commerce in your inbox: Product recommendations at scale, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1809–1818.
[8] F. Vasile, E. Smirnova, A. Conneau, Meta-prod2vec: Product embeddings using side-information for recommendation, in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 225–232.
[9] F. Bianchi, B. Yu, J. Tagliabue, Bert goes shopping: Comparing distributional models for product representations, arXiv preprint arXiv:2012.09807 (2020).
[10] G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4rec: Bridging the gap between nlp and sequential/session-based recommendation, in: Fifteenth ACM Conference on Recommender Systems, 2021, pp. 143–153.
[11] V. Ingale, S. Ellambotla, Literature review on performance evaluation of recommendation system with different dimensions of metrics, Available at SSRN 4140551 (2022).
[12] M. Schedl, The lfm-1b dataset for music retrieval and recommendation, in: ICMR, ACM, 2016, pp. 103–110.
[13] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: ICLR (Workshop Poster), 2013.
[14] W. Wang, K. Chang, Y. Tang, Emotiongif-yankee: A sentiment classifier with robust model based ensemble methods, CoRR abs/2007.02259 (2020).
[15] W. Wang, W. Peng, Team yao at factify 2022: Utilizing pre-trained models and co-attention networks for multi-modal fact verification (short paper), in: DE-FACTIFY@AAAI, volume 3199 of CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[16] W. Wang, Y. Tang, W. Du, W. Peng, Nycu_twd@ltedi-acl2022: Ensemble models with VADER and contrastive learning for detecting signs of depression from social media, in: LT-EDI, Association for Computational Linguistics, 2022, pp. 136–139.
[17] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process. Manag. 24 (1988) 513–523.
[18] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. http://is.muni.cz/publication/884893/en.