<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Music recommendation with a GPU-free customizable-driven framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei-Wei Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei-Yao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wen-Chih Peng</string-name>
          <email>wcpeng@cs.nycu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, National Yang Ming Chiao Tung University</institution>
          ,
          <addr-line>Hsinchu</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recommendation systems have made significant progress in characterizing users' preferences based on their past behaviors. Despite the effectiveness of recommending accurately, several factors that are essential for evaluating various facets of recommendation systems remain unexplored, e.g., fairness, diversity, and limited resources. To address these issues, we propose Track2Vec, a GPU-free customizable-driven framework for fairness music recommendation. To take both accuracy and fairness into account, our solution consists of three modules: customized fairness-aware groups for modeling different features based on configurable settings, a track representation learning module for learning better user embeddings, and an ensemble module for ranking the recommendation results from different track representation learning modules. Moreover, inspired by TF-IDF, which has been widely used in natural language processing, we introduce a metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF) to measure fairness. Extensive experiments demonstrate that our model achieves a fourth-prize ranking in a GPU-free environment on the leaderboard of the EvalRS @ CIKM 2022 challenge, outperforming the official baseline by about 200% in terms of the official scores. In addition, the ablation study illustrates the necessity of ensembling each group to acquire both accurate and fair recommendations.</p>
      </abstract>
      <kwd-group>
        <kwd>recommendation system</kwd>
        <kwd>ensemble methods</kwd>
        <kwd>fairness metric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Nowadays, there has been a surge in research focusing on recommendation systems (RSs) in different domains (e.g., movies, videos, news, products) with the aim of increasing the likelihood that targeted users view or buy recommended items based on their browsing history. These approaches build their recommendation systems by filtering the most important and eye-catching information from the abundance of collected data to relieve the information overload problem. In addition, Covington et al. [1] introduced a framework that first selects hundreds of video candidates and then ranks these videos according to the user history and video content to alleviate the data sparsity problem and to generate more accurate recommendations.</p>
        <p>However, most of the work has adopted accuracy-based metrics (e.g., hit-rate (HR), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG)), which fail to consider other factors that reflect the robustness of the models. Therefore, researchers from both academia and industry have paid more attention to investigating the issues of model fairness and diversity. Yang and Stoyanovich [2] introduced fairness measures by generating synthetic data to quantify statistical parity and biases in rankings.</p>
        <p>CIKM’22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management. ∗Corresponding author. ORCID: 0000-0002-0627-0314 (W. Du); 0000-0002-6551-1720 (W. Wang).</p>
      </sec>
      <sec id="sec-1-2">
        <p>Chia et al. [3] proposed RecList, a general plug-and-play framework to scale up behavioral testing.</p>
        <p>In this challenge hosted by EvalRS<sup>1</sup>, given user listening history, track metadata, and user metadata, the goal is to recommend K songs for each user, as shown in Figure 1.</p>
      </sec>
      <sec id="sec-1-3">
        <p>The recommended predictions are evaluated by standard RSs metrics (HR and MRR), standard metrics on a per-group or slice basis (gender balance, artist popularity, user country, song popularity, and user history), and behavioral tests (be less wrong and latent diversity) [4]. To tackle the shared task, we propose Track2Vec, a framework with three modules that serves as a fairness music recommendation system. Specifically, Track2Vec is composed of customized fairness-aware groups for dividing user history into multiple facets, a track representation learning module for candidate matching, and an ensemble module for better ranking the recommended results. In this manner, Track2Vec not only achieves robust performance without auxiliary tasks, but can also be deployed with limited resources (e.g., a GPU-free machine), which demonstrates the practicality of our framework.</p>
        <p>Although there are several metrics that evaluate model behavior with different numbers of bins, we argue that existing metrics that rely on manual settings fail to automatically distinguish the importance of different numbers of classes, which makes it hard to generalize to different tasks and datasets with the same settings. To that end, we introduce a new metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF), which pays more attention to the less represented categories to prevent popular categories from dominating the results. Our proposed metric can be extended to any existing metric directly by only computing the number of each class as the denominator. For instance, the numerators can be replaced with MRR or nDCG for evaluating different aspects with different numbers of classes.</p>
        <p>In summary, our contributions are four-fold:</p>
        <p>1. We propose Track2Vec as a fairness recommendation system with a customizable-driven framework, which achieves effective results (i.e., the fourth prize on the leaderboard) in a GPU-free environment.</p>
        <p>2. To tackle the class imbalance issue, we introduce customized fairness-aware groups to divide user history into different aspects based on customizable configurations.</p>
        <p>3. We introduce a novel metric, MR-ITF, to measure the predictive distribution of the model by weighting importance based on the number of predictions of each class, which can be generalized to any existing metric.</p>
        <p>4. We conduct extensive experiments to demonstrate the effectiveness of Track2Vec, which outperforms the official baseline by about 200% in terms of the leaderboard (phase 2) score. Moreover, the ablation study verifies the capabilities of the proposed framework.</p>
        <p><sup>1</sup> https://reclist.io/cikm2022-cup/</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Recommendation Systems</title>
        <p>Nowadays, recommendation systems are able to solve the information overload problem by predicting a user’s preferences based on the user history. In general, recommendation techniques can be divided into four categories: content-, collaborative filtering-, knowledge-, and hybrid-based recommendation systems [5]. Recently, deep learning approaches have led to state-of-the-art performance on several recommendation system benchmarks. For instance, Covington et al. [1] introduced a two-stage framework, namely a deep candidate generation model and a deep ranking model, for YouTube recommendation. PinSage [6] is a data-efficient Graph Convolutional Network (GCN) algorithm for web-scale recommendation. Grbovic et al. [7], Vasile et al. [8] and Bianchi et al. [9] adopted neural language-based algorithms for product recommendation, and de Souza Pereira Moreira et al. [10] employed the transformer architecture for session-based recommendation.</p>
        <p>In this shared task, one of the constraints is the limited resources and time for training and inference. Therefore, we introduce a data-driven unsupervised approach instead of supervised deep learning approaches to tackle the task with both accurate predictions and efficiency.</p>
      </sec>
      <sec>
        <title>2.2. Fairness Metrics</title>
        <p>Most RSs focus on the accuracy of recommended results; for instance, [11] introduces accuracy, coverage, variety, recommender confidence, robustness, scalability and privacy from a common RSs-centric perspective. On the other hand, it is also critical to evaluate trustworthiness, utility, risk and usability from a user-centric perspective of RSs. In this EvalRS challenge, the organizers provide various metrics for evaluating not only model performance but also model behavior with standard RSs metrics, standard metrics on a per-group or slice basis, and behavioral tests.</p>
        <p>However, the metrics of model behavior require manual settings of divided bins and are hard to generalize to different tasks. Thus, we propose a novel metric, MR-ITF, which assigns lower weights to frequent categories and higher weights to rare categories so that the majority does not dominate the predictions. This metric can also be extended to any existing metric by only modifying the numerators.</p>
      </sec>
    </sec>
    <sec>
      <title>3. Method</title>
      <sec>
        <title>3.1. Preliminary</title>
        <p>The dataset of this task is based on the LFM-1b dataset [12], a corpus of listening events for music recommendation. It consists of 100M+ listening events and three types of data: users, with user background information and patterns of consumption; tracks, with the artist and album each track belongs to; and historical interactions, a collection of interactions between users and tracks. The details of the data processing procedure can be found in [4].</p>
      </sec>
      <sec>
        <title>3.2. Our Recommendation System: Track2Vec</title>
        <p>Figure 2 demonstrates the pipeline of our framework. Given multiple types of user history, the customized fairness-aware groups module assigns each sequence to a different track representation learning module according to the configurable settings. For example, if the input configuration is selected to focus on the user, each user sequence is first checked to determine which representation learning module in each group it serves as a training instance for (e.g., if the user is male, then only the male module has this instance in the gender group). Afterwards, the predictions are aggregated using ensemble techniques to generate the top K (K=100 in this paper) predictions for the corresponding user. In the training phase, the user features are used to separate users into groups based on the input configuration. In the testing phase, the user features are fetched from the training data (by user_id in this dataset).</p>
        <p>Customized Fairness-Aware Groups. To endow our model with fair behavior, we first discretized each feature based on the feature distribution and grouped users into different groups according to the customizable input configuration to avoid the imbalance issue (e.g., the majority dominating the model behavior).</p>
        <p>With the exploratory data analysis shown in Figure 3, we observed that user playcount, user gender and track count are the three most important factors that affect the model behavior metrics. The playcount group uses logarithmic bucketing in base 10 to divide users into four sub-groups (10, 100, 1000 as divisions in this paper), the gender group divides the sequences into male, female and neutral, and the track count group also uses logarithmic bucketing in base 10 (100, 1000 as divisions in this paper). Therefore, we chose these three factors in this work to divide users into the corresponding groups. We note that these factors are configurable and can be changed to others in the framework.</p>
        <p>Figure 3: (a) User Gender. (b) User Playcount. (c) Track Count.</p>
        <p>Table 1: Ablation study of Track2Vec. G: Gender. P: Playcount. U: User track count. Total score is computed as ((1) + (2) + (3)) / 3, the same as Phase 1, since Phase 2 requires a minimum hit-rate threshold. Columns: Standard RSs metrics (1); Standard metrics on a per-group basis (2); Behavioral tests (3); MR-ITF (ours); Total Score.</p>
        <p>Track Representation Learning. One of the limitations in this task is the time constraint (22.5 minutes/fold on average); thus, it is challenging to learn a fine-grained track representation using supervised deep learning approaches in a limited amount of time. Therefore, we focus on an unsupervised method to meet the requirement for training and recommending tracks for users. Specifically, we employ Word2Vec [13] to train track embeddings from the interactions between tracks, which has a low computational cost and yields high-quality embeddings. As there are two options in Word2Vec (i.e., continuous bag-of-words (CBOW) and skip-gram), we experimented with both methods using different negative sampling rates and window sizes to select the best one. The CBOW architecture predicts the current token based on the whole context, and skip-gram predicts surrounding tokens given the current token.</p>
        <p>Ensemble Techniques. Ensembling prediction results has demonstrated the robustness of models in previous work [14, 15, 16], which motivated us to adopt ensemble techniques to produce more robust and diverse results. To consider different factors and to recommend fairer tracks to users, voting is used for ensembling each group with different priorities. Specifically, the ensemble re-ranking strategy is applied as follows after generating different predictions from each track representation learning module:</p>
        <p>• Priority 1: Cumulative recommending times in descending order.</p>
        <p>• Priority 2: Original individual module ranking in order.</p>
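        <p>The grouping and ensemble re-ranking steps of Track2Vec can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names <monospace>log10_bucket</monospace> and <monospace>ensemble_rerank</monospace>, the group names, the treatment of a value that falls exactly on a division, and the tie-breaking details (best original rank across modules, then track id) are all assumptions made for illustration. Each module's list is assumed to contain no duplicate tracks.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, List, Sequence


def log10_bucket(value: int, divisions: Sequence[int] = (10, 100, 1000)) -> int:
    """Return the sub-group index of `value` for ascending division points.

    With divisions (10, 100, 1000) this yields four sub-groups (0..3).
    Assumption: a value equal to a division falls into the upper sub-group.
    """
    bucket = 0
    for division in divisions:
        if value >= division:
            bucket += 1
    return bucket


def ensemble_rerank(group_predictions: Dict[str, List[str]], k: int = 100) -> List[str]:
    """Merge the ranked lists produced by several group modules.

    Priority 1: tracks recommended by more modules come first.
    Priority 2: ties are broken by the best original rank in any module
                (assumption), then by track id for determinism (assumption).
    """
    votes: Counter = Counter()
    best_rank: Dict[str, int] = {}
    for ranked in group_predictions.values():
        for rank, track in enumerate(ranked):
            votes[track] += 1
            best_rank[track] = min(rank, best_rank.get(track, rank))
    ordered = sorted(votes, key=lambda t: (-votes[t], best_rank[t], t))
    return ordered[:k]


if __name__ == "__main__":
    # Hypothetical per-group recommendations for one user.
    predictions = {
        "gender": ["a", "b", "c"],
        "playcount": ["b", "d", "a"],
        "track_count": ["b", "c", "e"],
    }
    print(log10_bucket(250))                # playcount sub-group index
    print(ensemble_rerank(predictions, k=3))
```
]]></preformat>
        <p>Sorting on a single composite key keeps the two priorities explicit: the vote count dominates, and the original module ranking only decides among tracks with equal votes.</p>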
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiment</title>
      <sec id="sec-3-1">
        <title>4.1. Experimental Setting</title>
        <p>To implement our Track2Vec, we adopted Word2Vec [18] as the track representation learning module. The dimension of each track embedding was set to 100, the window size was set to 60, the minimum track frequency was 0, the number of negative samples was 5, the random seed was set to 27, and the number of training epochs was set to 10.</p>
        <sec id="sec-3-1-1">
          <p>All the training and evaluation phases were conducted on a machine with an AMD Ryzen Threadripper 3960X 24-Core Processor and 252GB RAM (we do not report our GPU as our approach does not require it). The results of the ablative experiments are from 4-fold bootstrapped cross-validation<sup>2</sup>.</p>
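          <p>To make the role of the window size concrete, the following Python sketch enumerates the training examples that the two Word2Vec variants would draw from one listening session. It is a simplified illustration rather than the training code used in this work: it ignores negative sampling, subsampling and any window shrinking a library may apply, and the function names and track ids are hypothetical.</p>
          <preformat><![CDATA[
```python
from typing import List, Tuple


def skipgram_pairs(session: List[str], window: int) -> List[Tuple[str, str]]:
    """(center, context) pairs: skip-gram predicts each neighbour from the center track."""
    pairs = []
    for i, center in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, session[j]))
    return pairs


def cbow_examples(session: List[str], window: int) -> List[Tuple[Tuple[str, ...], str]]:
    """(context, target) examples: CBOW predicts the current track from its neighbours."""
    examples = []
    for i, target in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        context = tuple(session[j] for j in range(start, stop) if j != i)
        if context:  # a one-track session has no context to learn from
            examples.append((context, target))
    return examples


if __name__ == "__main__":
    session = ["t1", "t2", "t3", "t4"]  # hypothetical track ids in listening order
    print(skipgram_pairs(session, window=1))
    print(cbow_examples(session, window=1))
```
]]></preformat>
          <p>With a window as large as 60, most tracks in a typical session fall into each other's context, so the embeddings capture session-level co-listening rather than strict adjacency.</p>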
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Results</title>
        <p>Offline Performance. We first conducted an ablation study to verify the design of our proposed Track2Vec. As shown in Table 1, the total score (i.e., the average of (1), (2), and (3)) degrades when any of the groups (gender, playcount, user track count) is removed from Track2Vec. This result verifies the need to assign each user sequence to the corresponding group based on the configurations. Moreover, the variant G performs the worst on the behavioral tests, which indicates that although using user gender as the splitting criterion can achieve better results in each group (i.e., (2)), it hinders the model’s ability to recommend accurate music tracks to users as well as fair diversity. Note that the MR-ITF scores (i.e., our proposed metric) are similar across the different models, which indicates that the ground truths of tracks are quite diverse in the test set; thus, the ITF terms of the models are nearly the same. In addition, our investigation found that the miss rates of our variants are similar across different tracks.</p>
        <p>Testing Performance. Table 2 shows the performance on the EvalRS leaderboard. Our approach achieved a total score of 1.1847, winning the fourth prize among 17 teams. In addition, these results illustrate that our Track2Vec outperforms the official baseline (CBOWRecSysBaseline) by nearly 200%, which demonstrates the robust capability of our model. Furthermore, our approach demonstrates that competitive performance can be achieved not only with less computational cost in a GPU-free framework, but also with only three features and track_id.</p>
        <p>To analyze the behavior of each group, we further conducted a case study to demonstrate the overlapping coverage of the top 100 recommended tracks from each group of Track2Vec. Table 3 illustrates parts of the recommended results and the overlapping ratio. We can observe that there is little overlap in the recommendation results across the three groups, which demonstrates the diversity of each group and shows that Track2Vec is capable of drawing on these predictions to achieve both accurate and fair recommendations.</p>
        <p><sup>2</sup> Our code will be available at https://github.com/wwweiwei/Track2Vec.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Our Fairness Metric: MR-ITF</title>
        <p>Currently, HR, nDCG and MRR are the most used metrics in recommendation systems to evaluate the effectiveness of models, but they fail to reflect the model behavior. To address this issue, MRED, being less wrong and latent diversity were proposed as evaluation metrics by RecList [3]. However, these metrics require manual settings for the number of bins, and it is hard to generalize the same configurations to other tasks and datasets. Inspired by term frequency - inverse document frequency [17], which considers the frequency of words while lowering the importance of high-frequency words, we designed a novel metric, miss rate - inverse ground truth frequency (MR-ITF), to aggregate all scores with a different importance weighting for each class.</p>
        <p>Formally, the computation of MR-ITF is as follows:
          <disp-formula><label>(1)</label><tex-math><![CDATA[\text{MR-ITF} = \sum_{i=1}^{|C|} \text{MR}_i \times \text{ITF}_i,]]></tex-math></disp-formula>
          <disp-formula><label>(2)</label><tex-math><![CDATA[\text{ITF}_i = \log\left(\frac{N}{\#C_i}\right),]]></tex-math></disp-formula>
          where <italic>|C|</italic> is the number of classes (the number of tracks in this paper), <italic>N</italic> is the number of total instances, <italic>#C<sub>i</sub></italic> is the number of ground-truth instances of the <italic>i</italic>-th track, and MR<sub><italic>i</italic></sub> is the miss rate of the <italic>i</italic>-th track, as in [3].</p>
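        <p>As a rough illustration of how MR-ITF weights rarely occurring tracks more heavily, the Python sketch below computes a per-track miss rate and multiplies it by a TF-IDF-style inverse ground-truth frequency. The weight log(N / count) is an assumed form chosen by analogy with inverse document frequency, and the function name <monospace>mr_itf</monospace> and the toy data are hypothetical; the sketch should not be read as the official scoring code.</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict, List, Sequence


def mr_itf(ground_truth: Sequence[str], predictions: Sequence[List[str]]) -> float:
    """Sum over tracks of (miss rate) x (inverse ground-truth frequency).

    ground_truth[n] is the held-out track of test instance n and
    predictions[n] is the recommended list for that instance.
    Assumption: ITF_i = log(N / count_i), by analogy with IDF.
    """
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have the same length")
    n_total = len(ground_truth)
    counts: Dict[str, int] = {}
    misses: Dict[str, int] = {}
    for truth, recommended in zip(ground_truth, predictions):
        counts[truth] = counts.get(truth, 0) + 1
        if truth not in recommended:
            misses[truth] = misses.get(truth, 0) + 1
    score = 0.0
    for track, count in counts.items():
        miss_rate = misses.get(track, 0) / count
        itf = math.log(n_total / count)  # rare ground-truth tracks get larger weights
        score += miss_rate * itf
    return score


if __name__ == "__main__":
    # Toy example: track "a" is frequent, track "b" is rare and always missed.
    truths = ["a", "a", "a", "b"]
    recs = [["a"], ["a"], ["x"], ["y"]]
    print(mr_itf(truths, recs))
```
]]></preformat>
        <p>In the toy example, missing the single occurrence of the rare track contributes more to the score than missing one of three occurrences of the frequent track, which is the behavior the metric is designed to expose. A lower score is better under this formulation.</p>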
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this paper, we propose Track2Vec as a fairness recommendation system with customizable-driven groups to achieve fair model behavior, track representation learning to capture different user preferences, and an ensemble technique to aggregate different aspects. To mitigate the issue of neglecting minority groups, we introduce MR-ITF, which weights each class with a different degree of importance based on its frequency and can be extended to any existing metric without manual settings. In extensive experiments, Track2Vec achieved superior performance compared to the official baseline, which shows not only its capability of recommending fair music tracks but also that it is an efficient recommendation system that requires no GPU. In addition, our proposed MR-ITF is able to reflect prediction bias, which uncovers the model behavior and encourages researchers to develop more advanced systems.</p>
      <p>[3] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with reclist, CoRR abs/2111.09963 (2021).</p>
      <p>[4] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. de Souza P. Moreira, P. J. Chia, Evalrs: a rounded evaluation of recommender systems, CoRR abs/2207.05772 (2022).</p>
      <p>[5] Y. Peng, A survey on modern recommendation system based on big data, CoRR abs/2206.02631 (2022).</p>
      <p>[6] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: KDD, ACM, 2018, pp. 974–983.</p>
      <p>[7] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, D. Sharp, E-commerce in your inbox: Product recommendations at scale, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1809–1818.</p>
      <p>[8] F. Vasile, E. Smirnova, A. Conneau, Meta-prod2vec: Product embeddings using side-information for recommendation, in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 225–232.</p>
      <p>[9] F. Bianchi, B. Yu, J. Tagliabue, Bert goes shopping: Comparing distributional models for product representations, arXiv preprint arXiv:2012.09807 (2020).</p>
      <p>[10] G. de Souza Pereira Moreira, S. Rabhi, J. M. Lee, R. Ak, E. Oldridge, Transformers4rec: Bridging the gap between nlp and sequential/session-based recommendation, in: Fifteenth ACM Conference on Recommender Systems, 2021, pp. 143–153.</p>
      <p>[11] V. Ingale, S. Ellambotla, Literature review on performance evaluation of recommendation system with different dimensions of metrics, Available at SSRN 4140551 (2022).</p>
      <p>[12] M. Schedl, The lfm-1b dataset for music retrieval and recommendation, in: ICMR, ACM, 2016, pp. 103–110.</p>
      <p>[13] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: ICLR (Workshop Poster), 2013.</p>
      <p>[14] W. Wang, K. Chang, Y. Tang, Emotiongif-yankee: A sentiment classifier with robust model based ensemble methods, CoRR abs/2007.02259 (2020).</p>
      <p>[15] W. Wang, W. Peng, Team yao at factify 2022: Utilizing pre-trained models and co-attention networks for multi-modal fact verification (short paper), in: DE-FACTIFY@AAAI, volume 3199 of CEUR Workshop Proceedings, CEUR-WS.org, 2022.</p>
      <p>[16] W. Wang, Y. Tang, W. Du, W. Peng, Nycu_twd@ltedi-acl2022: Ensemble models with VADER and contrastive learning for detecting signs of depression from social media, in: LT-EDI, Association for Computational Linguistics, 2022, pp. 136–139.</p>
      <p>[17] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Inf. Process. Manag. 24 (1988) 513–523.</p>
      <p>[18] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. http://is.muni.cz/publication/884893/en.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , E. Sargin,
          <article-title>Deep neural networks for youtube recommendations</article-title>
          , in: RecSys, ACM,
          <year>2016</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <article-title>Measuring fairness in ranked outputs</article-title>
          , in: SSDBM, ACM,
          <year>2017</year>
          , pp.
          <volume>22</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          :
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Chia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tagliabue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ko</surname>
          </string-name>
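          ,
          <article-title>Beyond NDCG: behavioral testing of recommender systems with reclist</article-title>
          , CoRR abs/2111.09963 (
          <year>2021</year>
          ).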
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>