RecFormer: Personalized Temporal-Aware Transformer for Fair Music Recommendation

Wei-Yao Wang1, Wei-Wei Du1 and Wen-Chih Peng1
1 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Abstract
Recommendation systems have improved the characterization of user preferences by modeling their digital footprints and item content. However, another facet, model behavior, has attracted a great deal of attention in both academia and industry in recent years due to the increasing awareness of fairness. The shared task, A Rounded Evaluation of Recommender Systems (EvalRS @ CIKM 2022), was introduced to broadly measure multifaceted model predictions for music recommendation. To tackle the problem, we propose the RecFormer architecture with a personalized temporal-aware Transformer to model the interactions within user history in a single framework. Specifically, RecFormer adopts a masked language modeling task as the training objective, which enables the model to capture fine-grained track embeddings by reconstructing tracks. Meanwhile, it also integrates a temporal-aware self-attention mechanism into the Transformer architecture so that the model is able to consider time-variant information among different users. Moreover, we introduce linearized attention to reduce the quadratic computation and memory cost, since the limited time budget is one of the challenges in this task. Extensive experiments and analysis are conducted to demonstrate the effectiveness of our RecFormer compared with the official baseline, and we examine the contribution of each module in an ablation study. Our team, yao0510, won the seventh prize with a total score of 0.1964 in the EvalRS challenge, which illustrates that our model achieved competitive performance. The source code will be publicly available at https://github.com/wywyWang/RecFormer.

Keywords
Recommendation System, User Fairness, Transformer, Temporal-Aware
1. Introduction

In recent years, characterizing users with accurate interests has evolved due to the use of advanced recommendation systems (RSs). These RSs have been used to develop several real-world applications in industry, for example, Amazon ROSE [1]. While there has been significant progress in predicting accurate items for users of RSs, the awareness of model behavior has attracted a great deal of attention from both academic and industry researchers. As recommendation systems are built on top of users, data, and models, it is likely that a system will make unfair suggestions due to the biases of these components [2], which illustrates the increasing need to investigate model behavior.

To that end, A Rounded Evaluation of Recommender Systems, hosted by EvalRS @ CIKM 2022, was introduced to address both standard evaluation metrics and model behavior tests for music recommendation. Figure 1 illustrates an example of a music recommendation system. Given multiple past tracks from a set of target users, our recommendation system aims to predict a set of songs that are most likely to be listened to by the target users. On top of the standard retrieval metrics (i.e., hit-rate (HR), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG)), the organizers further employ multi-dimensional non-observational perspectives to evaluate model behaviors.

Figure 1: An example pipeline of music recommendation with our proposed RecFormer.

However, we observe that there are at least three challenges in addressing this shared task. 1) Validation strategy. The task uses bootstrapped nested cross-validation (illustrated at https://github.com/RecList/evalRS-CIKM-2022/blob/main/images/loop.jpg) that randomly samples a track from each user sequence as the test set, which means this task cannot be formulated as a conventional sequence classification task that uses the first N tracks to predict the (N+1)-th track, since the test item may not be the latest track the user listened to. 2) Time limitation. The training and inference time is required to be less than 22.5 minutes per fold, which is challenging when adopting deep learning techniques to learn fine-grained track representations if the space and computation complexity is high. 3) Time-variant events. The target users have their own habits of listening to music during the day; for example, students are likely to listen to music after school, while office workers are likely to listen during working hours. Koren [3], for instance, was the first approach to verify the effectiveness of modeling temporal effects in the Netflix competition. Therefore, directly converting user history into a sequence for recurrent-based approaches ignores the time-variant influence on listening to music tracks. Moreover, there is no existing evaluation metric that measures model behavior in the time domain. It is essential to take temporal information into account and to design a proper metric for the corresponding evaluation.

To address the aforementioned challenges, we propose RecFormer, a novel personalized temporal-aware Transformer for fair music recommendation, which consists of personalized user embeddings, a temporal-aware multi-head linearized attention in a modified Transformer [4], and a track classifier to predict the possible tracks. Specifically, the personalized user embeddings take rich user-related metadata into account to represent each user. To tackle the first challenge, we employ the masked language modeling task as our training objective, which randomly masks a proportion of all tracks in the input sequence. For the second and third challenges, we introduce a temporal-aware linearized attention, incorporating an attention bias with temporal information and replacing the standard softmax computation with a kernel computation. This not only models time-variant information in the attention scores but also reduces both the memory and computation complexity while preserving competitive performance. In addition, we propose a new metric, MRED_DOH, to evaluate the difference in performance across various listening times in a day, which reflects another critical but unexplored dimension of fairness.

In summary, our contributions are as follows: 1) We propose RecFormer, a novel personalized temporal-aware Transformer for fair music recommendation, adopting masked language modeling tasks as training objectives to learn fine-grained track representations. 2) To reduce the computation complexity and model time-variant information, a temporal-aware linearized attention is designed by replacing softmax with a kernel computation and integrating temporal embeddings into the attention scores. Furthermore, we propose a new metric (MRED_DOH) to reflect the difference in performance when predicting tracks in each hour of the day, which is also an essential but challenging dimension of fairness. 3) Our RecFormer outperforms the official baseline by at least 116% in terms of the total score. Moreover, extensive experiments were conducted to examine the contribution of each module in RecFormer.

CIKM'22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management
sf1638.cs05@nctu.edu.tw (W. Wang); wwdu.cs10@nycu.edu.tw (W. Du); wcpeng@cs.nycu.edu.tw (W. Peng)
https://wywywang.github.io/ (W. Wang); https://wwweiwei.github.io/ (W. Du)
ORCID: 0000-0002-6551-1720 (W. Wang); 0000-0002-0627-0314 (W. Du); 0000-0002-0172-7311 (W. Peng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related Work

Recommendation System. The recent progress of recommendation systems (RSs) has brought substantial revenue to industry, since the preferred items of the target audience can be identified precisely by collecting and analyzing their digital footprints. Early work on RSs typically employed collaborative filtering (CF) [5] and matrix factorization (MF) [6]. Recently, several sequential recommendation systems have modeled user behaviors in the temporal aspect, which was ignored in early RSs. For instance, Hidasi et al. [7] introduced GRU4Rec with sequential models and a ranking loss, taking previous user records into account to predict future preferences. However, one of the limitations in this task is the validation strategy, which does not guarantee that the test set is drawn from the latest timestamp; this hinders the above approaches from tackling the task effectively. Inspired by the robust pretraining tasks of BERT [8], we adopt masked language modeling tasks to randomly mask the user sequence and reconstruct it, which learns a fine-grained track representation as well as robustness to the time-invariant test set. This motivation was also used by [9], who proposed BERT4Rec, a bidirectional Transformer with masked language modeling tasks, and demonstrated its effectiveness on several sequential recommendation tasks.

Dataset Introduction. The dataset selected for this task is LFM-1b [10], which focuses on music recommendation on Last.fm. The dataset is composed of 120k users, 63k artists, 1.3M albums, 821k tracks, and 38M listening events, filtered with the pre-processing procedure introduced in [11]. Furthermore, it provides rich song (i.e., artist and album) and user (i.e., country, age, gender, listening preference) metadata for evaluating the multi-dimensional behaviors of models. In general, this dataset helps researchers measure not only quantitative performance but also non-standard metrics for fairness.
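To make the input format concrete, the sketch below groups LFM-1b-style listening events into per-user track sequences and randomly masks a proportion of tracks for reconstruction, in the spirit of the masked-modeling idea discussed above. The field names, mask rate, and mask token are illustrative assumptions, not the official EvalRS pipeline.

```python
import random

# Hypothetical listening events: (user_id, track_id, hour_of_day).
# Field names and values are illustrative, not the official LFM-1b schema.
events = [
    ("u1", "t1", 8), ("u1", "t2", 8), ("u1", "t3", 21),
    ("u2", "t2", 14), ("u2", "t4", 14),
]

MASK = "<mask>"  # special token that the track classifier must reconstruct

def build_sequences(events):
    """Group listening events into an ordered (track, hour) sequence per user."""
    seqs = {}
    for user, track, hour in events:
        seqs.setdefault(user, []).append((track, hour))
    return seqs

def mask_sequence(tracks, mask_rate=0.25, rng=random.Random(42)):
    """Randomly replace a proportion of tracks with the mask token;
    masked positions become reconstruction targets for training."""
    masked, targets = [], {}
    for i, track in enumerate(tracks):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = track
        else:
            masked.append(track)
    return masked, targets

seqs = build_sequences(events)
tracks_u1 = [track for track, _ in seqs["u1"]]
masked, targets = mask_sequence(tracks_u1)
```

The hours carried alongside each track are what the temporal-aware attention described next consumes as its bias signal.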
3. Methodology

Figure 2 illustrates an overview of our proposed RecFormer, parts of which are inspired by recent research on natural language understanding. Given a user history (sequence), we first apply random masks to the sequence, and then personalized user embeddings are generated based on the user metadata and the corresponding tracks. Afterwards, RecFormer, which replaces the self-attention mechanism with the proposed temporal-aware linearized attention in each layer, is introduced to model the masked user embeddings. Finally, a track classifier is used to reconstruct the masked tracks into the original tracks.

Figure 2: The architecture of our proposed RecFormer.

3.1. Personalized User Embeddings

As each user has multiple listened tracks and the corresponding user metadata, we incorporate each type of user metadata with each track to model each persona, inspired by [12]. Specifically, the input embedding for RecFormer at the i-th timestamp is constructed by adding the track embedding t_i, the positional embedding p_i, and the metadata embeddings (g, c, a), which are all projected with corresponding embedding layers to d-dimensional vectors:

E = (e_1, ⋯, e_L) = (t_1 + p_1 + g + c + a, ⋯, t_L + p_L + g + c + a),  (1)

where g denotes gender, c denotes country, a denotes age, and L is the max sequence length. The user age is discretized into 15 bins since it is a continuous variable.

3.2. RecFormer

RecFormer aims to capture the temporal order of a user history, which is hard to consider with traditional CF and MF approaches. To that end, we introduce RecFormer based on the Transformer encoder [4], which not only encodes all tracks in a sequence but also speeds up the training procedure through the parallel computation of the attention mechanism, compared with recurrent-based models. Formally, the personalized user embedding E follows the standard Transformer encoding steps with the proposed temporal-aware linearized-attention (TALA), residual connection and layer normalization (Norm), dropout, and a feed-forward network (FFN) as follows:

H̃ = Norm(E + TALA(E)), H = Norm(H̃ + FFN(H̃)).  (2)

The output dimension of H is d, and the inner dimension of the FFN is d_inner.

Temporal-Aware Linearized-Attention (TALA): To reduce the quadratic memory and computation complexity of the conventional self-attention mechanism, we replace the softmax with a kernel computation, which requires only linear computation [13]:

Q = E W^Q, K = E W^K, V = E W^V,  (3)

TALA(E) = (φ(Q) φ(K)^T) V,  (4)

where W^Q, W^K, W^V ∈ ℝ^{d×d}, and φ(·) = elu(·) + 1 is applied row-wise, as suggested in [13].

Since users listen to music tracks following their personal habits (e.g., at work or on a bus), it is essential to take time-variant information into account when recommending preferred tracks, e.g., [14, 3]. Therefore, in addition to the linearized attention, we also incorporate the listening time of each track as an attention bias, motivated by [15]. Formally, Equ. 4 is extended as:

TALA(E) = (φ(Q) φ(K + E_H)^T) V,  (5)

where E_H is the d-dimensional hour embedding (from 0 to 23, 24 categories in total). It is noted that, based on our experiments, we empirically add the hour embeddings to the key matrices.
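As an illustration of Equations 3–5, the following NumPy sketch implements the kernelized attention with the hour bias added to the keys. The shapes, weights, and hours are toy placeholders; the point is that grouping the products as φ(Q)(φ(K + E_H)^T V) avoids materializing the L×L attention matrix, which is where the linear complexity comes from. Note that Equation 4 as stated carries no softmax-style normalizer (unlike the full form in [13]), and the sketch follows the equation as written.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1, applied elementwise: x + 1 for x > 0, exp(x) otherwise
    return np.where(x > 0, x + 1.0, np.exp(x))

def tala(E, Wq, Wk, Wv, E_hour):
    """Temporal-aware linearized attention following Eqs. 3-5:
    hour embeddings are added to the keys as a temporal attention bias."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv           # Eq. 3
    phi_q = elu_plus_one(Q)
    phi_k = elu_plus_one(K + E_hour)           # temporal bias on the keys (Eq. 5)
    # Associativity: phi_q @ (phi_k.T @ V) costs O(L * d^2)
    # instead of the O(L^2 * d) of softmax attention.
    return phi_q @ (phi_k.T @ V)

rng = np.random.default_rng(0)
L, d = 6, 4                                    # toy sequence length / model dim
E = rng.normal(size=(L, d))                    # personalized user embeddings
hour_table = rng.normal(size=(24, d))          # one embedding per hour 0..23
hours = np.array([8, 8, 9, 21, 21, 22])        # listening hour of each track
E_hour = hour_table[hours]

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = tala(E, Wq, Wk, Wv, E_hour)
```

Because φ keeps the feature maps strictly positive, the product φ(K)^T V can be accumulated once and reused for every query row, which is what makes the cost linear in the sequence length.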
3.3. Track Classifier

After L layers of RecFormer encoding multi-hop information, we obtain the final output H^L for all items of the user sequence. The track classifier is employed to predict the masked tracks, as shown in Figure 2. Specifically, we apply a feed-forward layer to generate the i-th output:

Z_i = σ(H^L_i W^Z); ŷ_i = softmax(Z_i),  (6)

where W^Z ∈ ℝ^{d×T}, T is the total number of tracks, and σ is the activation function.

Training and Testing. As one of the challenges in this task is the validation strategy, the test set may not be sampled at the latest timestamp. To tackle this issue, we apply masked language modeling (MLM) as the training objective to learn a robust track representation. The goal of MLM is to reconstruct the masked tracks given a user sequence, which enables the model to learn the relations between tracks. Following [9], we use the final output H with the track classifier to predict the masked tracks, and the loss function is defined as follows:

𝕃 = − Σ_{i=1}^{|U|} y_i log(ŷ_i),  (7)

where U is the set of user sequences. In the inference phase, we empirically place the mask at the last timestamp of a sequence to predict the 100 tracks that the user is most likely to listen to. In addition, as we cannot fetch the timestamp of the test set, we set the predicted hour to the hour at which the user most often listens to music tracks to fit into TALA.

Table 1: Ablation study of our model. U: personalized user embeddings; T: temporal-aware computation in TALA. The total score is computed as ((1) + (2) + (3)) / 3, as Phase 2 requires a minimum hit-rate threshold.

                                      RecFormer (ours)    -U         -T         -U -T
Standard RSs metrics (1)                   0.0093        0.0071     0.0092     0.0072
Standard metrics on a per-group (2)       -0.0061       -0.0077    -0.0158    -0.0099
Behavioral tests (3)                      -0.0213       -0.1097    -0.0225    -0.1101
MRED_DOH (ours)                           -0.0047       -0.0030    -0.0033    -0.0030
Total Score                               -0.0060       -0.0368    -0.0096    -0.0376

4. Experiments and Analysis

4.1. Experimental Setup

The dimension d was set to 64, the inner dimension of the feed-forward layer was 256, and the number of heads was set to 1. The evaluation metrics cover different perspectives: standard RSs metrics (HR and MRR); standard metrics on a per-group or slice basis (gender balance, artist popularity, user country, song popularity, and user history); and behavioral tests (being less wrong, and latent diversity) [11]. The data are pre-processed by filling NaN values of user gender and country with n and UNKNOWN, respectively.

Proposed MRED_DOH Metric: Since one of the challenges we aim to address is the time-variant event, we propose a new metric, MRED_DOH, to evaluate the difference in performance across various listening times in a day, which reflects another critical but unexplored dimension of fairness. That is, MRED_DOH enables us to investigate whether a model biases its predictions toward users who listen in specific time slots. In other words, we operationalize this metric in the same spirit as MRED_Gender proposed in RecList [16]: the smaller the difference, the fairer the model with respect to potential temporal biases. Specifically, we use the hour at which each user most often listens to partition listeners into sub-groups (i.e., there are 24 sub-groups in total), which we denote as the user hour. Afterwards, we can evaluate the MRED score using the existing RecList with the user hour to measure the model performance at each hour of the day. We note that this is an aspect-driven metric, which can be adjusted by monitoring different temporal dimensions based on user needs. For example, the hour at which the user most often listens can easily be changed to the least active hours or the average active hours. Moreover, it can also be employed in sequential recommendation systems by changing the user hours to sequential positions.
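To make the proposed metric concrete, the sketch below computes one plausible MRED-style aggregation over hour-of-day sub-groups: the negative mean absolute deviation of each sub-group's hit-rate from the overall hit-rate. The exact aggregation performed inside RecList may differ, and the function name and inputs here are hypothetical.

```python
from collections import defaultdict

def mred_doh(hits, user_hours):
    """MRED-style score over hour-of-day sub-groups: the negative mean
    absolute difference between each sub-group's hit-rate and the overall
    hit-rate (0 = perfectly fair across hours)."""
    overall = sum(hits) / len(hits)
    groups = defaultdict(list)
    for hit, hour in zip(hits, user_hours):
        groups[hour].append(hit)
    diffs = [abs(sum(g) / len(g) - overall) for g in groups.values()]
    return -sum(diffs) / len(diffs)

# Toy example: 1 = the held-out track was hit, 0 = missed;
# user_hours holds each user's most frequent listening hour.
hits = [1, 0, 1, 1, 0, 0]
user_hours = [8, 8, 8, 21, 21, 21]
score = mred_doh(hits, user_hours)
```

A score of 0 would indicate identical hit-rates across all user hours; more negative values indicate larger temporal disparities, matching the "smaller difference = fairer" reading above.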
The dropout rate was 0.0, and the max sequence length (L) was set to 60 due to the time limit, which kept about 25% of tracks on average. The batch size was 100, the learning rate was set to 1e-3, the training epochs were set to 50, and the seeds were tested from 42 to 52. It is noted that the set of predicted tracks is based on the train set; that is, we hypothesize that our RecFormer only recommends tracks that have been listened to before. All the training and evaluation phases were conducted on a machine with an AMD Ryzen Threadripper 3960X 24-core processor, an Nvidia GeForce RTX 3090, and 252 GB of RAM.

4.2. Overall Performance Comparison

Ablation Study. To verify the contribution of each module in RecFormer, we conduct ablative experiments by removing the personalized user embeddings, the temporal-aware computation, and both. From Table 1, we can observe that removing any one module of RecFormer results in a performance drop in terms of the metrics adopted in this task, which testifies to the effective design of RecFormer. However, our RecFormer performs the worst in terms of our proposed MRED_DOH, which indicates that our method still fails to meet fairness in the temporal aspect. In addition, this result also indicates that considering temporal-awareness alone is not able to address temporal fairness, which will be investigated in our future research.

We note that several continuous variables (e.g., novelty_artist-related features) were also included in the personalized user embeddings with projections as in [17], but the training loss could not converge.

Official Score. Table 2 shows the performance in the formal phase. We also implemented BERT4Rec to compare the performance, which can be viewed as one of our variants. It can be observed that the MRED_DOH performance of BERT4Rec is the best, but its performance on the standard RSs metrics fails to meet the requirement (hit-rate > 0.015). One of the reasons is that BERT4Rec does not converge with the same hyper-parameters due to its computational complexity, which is alleviated by the linear attention in our RecFormer. Therefore, these fairness results cannot be directly compared with the official baseline and our RecFormer, since BERT4Rec cannot recommend plausible tracks. Our framework achieved a total score of 0.1964, which outperformed the official baseline by 116%, while some gap remains compared with the first prize. Despite this, our approach demonstrates that using MLM as the training objective can achieve competitive performance.

Table 2: Official performance of RecFormer. The score is normalized with the official baseline and the best score of Phase 1.

Rank       Model                 Score      Standard RSs metrics   Standard metrics on a per-group   Behavioral tests   MRED_DOH
7          RecFormer             0.7526     0.0098                 -0.0056                           -0.0011            -0.0047
-          BERT4Rec [9]          -100.0     0.0016                 -0.0030                           -0.2729            -0.0014
-          CBOWRecSysBaseline    -1.2122    0.0512                 -3.7194                           0.4527             -0.0034
Imp. (%)   -                     116        -                      -                                 -                  -

5. Conclusion

In this paper, we propose RecFormer, which incorporates personalized user embeddings and temporal-aware linearized attention to recommend accurate and fair tracks to users based on their personal listening habits for the EvalRS task. Furthermore, the linearized attention reduces both computation and memory complexity by making use of kernel computation. The ablation study with the proposed metric demonstrates the effectiveness and fairness of our proposed approach. From the leaderboard score, our method illustrates that using MLM for learning track representations can achieve competitive performance.

References

[1] C. Luo, V. Lakshman, A. Shrivastava, T. Cao, S. Nag, R. Goutam, H. Lu, Y. Song, B. Yin, ROSE: robust caches for amazon product search, in: WWW (Companion Volume), ACM, 2022, pp. 89–93.
[2] S. Mehta, Why is the fairness in recommender systems required?, 2022. URL: https://analyticsindiamag.com/why-is-the-fairness-in-recommender-systems-required/.
[3] Y. Koren, Collaborative filtering with temporal dynamics, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 447–456.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
[5] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: WWW, ACM, 2001, pp. 285–295.
[6] J. Chen, C. Wang, S. Zhou, Q. Shi, J. Chen, Y. Feng, C. Chen, Fast adaptively weighted matrix factorization for recommendation with implicit feedback, in: AAAI, AAAI Press, 2020, pp. 3470–3477.
[7] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, in: ICLR (Poster), 2016.
[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), Association for Computational Linguistics, 2019, pp. 4171–4186.
[9] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer, in: CIKM, ACM, 2019, pp. 1441–1450.
[10] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: ICMR, ACM, 2016, pp. 103–110.
[11] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. de Souza P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, CoRR abs/2207.05772 (2022).
[12] Y. Zheng, R. Zhang, M. Huang, X. Mao, A pre-training based personalized dialogue generation model with persona-sparse data, in: AAAI, AAAI Press, 2020, pp. 9693–9700.
[13] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret, Transformers are RNNs: fast autoregressive transformers with linear attention, in: ICML, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 5156–5165.
[14] J. Bao, Y. Zhang, Time-aware recommender system via continuous-time modeling, in: CIKM, 2021, pp. 2872–2876.
[15] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, in: NAACL-HLT (2), Association for Computational Linguistics, 2018, pp. 464–468.
[16] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: WWW (Companion Volume), ACM, 2022, pp. 99–104.
[17] W. Wang, H. Shuai, K. Chang, W. Peng, ShuttleNet: position-aware fusion of rally progress and player styles for stroke forecasting in badminton, in: AAAI, AAAI Press, 2022, pp. 4219–4227.