RecFormer: Personalized Temporal-Aware Transformer for Fair Music Recommendation

Wei-Yao Wang1, Wei-Wei Du1 and Wen-Chih Peng1
1 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan

Abstract
Recommendation systems have improved the characterization of user preferences by modeling their digital footprints and item content. However, another facet, model behavior, has attracted a great deal of attention in both academia and industry in recent years due to the increasing awareness of fairness. The shared task, A Rounded Evaluation of Recommender Systems (EvalRS @ CIKM 2022), was introduced to broadly measure multifaceted model predictions for music recommendation. To tackle the problem, we propose the RecFormer architecture with a personalized temporal-aware Transformer to model the interactions within user history in a single framework. Specifically, RecFormer adopts a masked language modeling task as the training objective, which enables the model to capture fine-grained track embeddings by reconstructing tracks. Meanwhile, it also integrates a temporal-aware self-attention mechanism into the Transformer architecture so that the model is able to consider time-variant information among different users. Moreover, we introduce linearized attention to reduce the quadratic computation and memory cost, since the limited time budget is one of the challenges in this task. Extensive experiments and analysis are conducted to demonstrate the effectiveness of our RecFormer compared with the official baseline, and we examine the contribution of each module in an ablation study. Our team, yao0510, won the seventh prize with a total score of 0.1964 in the EvalRS challenge, which illustrates that our model achieved competitive performance. The source code will be publicly available at https://github.com/wywyWang/RecFormer.

Keywords
Recommendation System, User Fairness, Transformer, Temporal-Aware
1. Introduction

In recent years, characterizing users with accurate interests has evolved due to the use of advanced recommendation systems (RSs). These RSs have been used to develop several real-world applications in industry, for example, Amazon ROSE [1]. While there has been significant progress in predicting accurate items for users of RSs, the awareness of model behavior has attracted a great deal of attention from both academic and industry researchers. As recommendation systems are built on top of users, data, and models, it is likely that a system will make unfair suggestions due to the biases of these components [2], which illustrates the increasing need to investigate model behavior.

To that end, A Rounded Evaluation of Recommender Systems, hosted by EvalRS @ CIKM 2022, was introduced to address both standard evaluation metrics and model behavior tests for music recommendation. Figure 1 illustrates an example of a music recommendation system. Given multiple past tracks from a set of target users, our recommendation system aims to predict a set of songs that are most likely to be listened to by the target users. On top of the standard retrieval metrics (i.e., hit-rate (HR), mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG)), the organizers further employ multi-dimensional non-observational perspectives to evaluate model behaviors.

Figure 1: An example pipeline of music recommendation with our proposed RecFormer.

However, we observe that there are at least three challenges in addressing this shared task. 1) Validation strategy. The task uses bootstrapped nested cross-validation (illustrated at https://github.com/RecList/evalRS-CIKM-2022/blob/main/images/loop.jpg) that randomly samples a track from each user sequence as the test set, which means this task cannot be formulated as a conventional sequence classification task that uses the first N tracks to predict the (N+1)-th track, since the test item may not be the latest track the user listened to. 2) Time limitation. The training and inference time is required to be less than 22.5 minutes per fold, which is challenging when adopting deep learning techniques to learn fine-grained track representations if the space and computation complexity is high. 3) Time-variant events. The target users have their own habits of listening to music during the day; for example, students are likely to listen to music after school, while office workers are likely to listen during working hours. Koren [3], for instance, was the first approach to verify the effectiveness of modeling temporal effects in the Netflix competition. Therefore, directly converting user history into a sequence for recurrent-based approaches ignores the time-variant influence on listening to music tracks. Moreover, there is no existing evaluation metric that measures model behavior in the time domain. It is essential to take temporal information into account and to design a proper metric for the corresponding evaluation.

To address the aforementioned challenges, we propose RecFormer, a novel personalized temporal-aware Transformer for fair music recommendation, which consists of personalized user embeddings, a temporal-aware multi-head linearized attention in a modified Transformer [4], and a track classifier to predict the possible tracks. Specifically, the personalized user embeddings take rich user-related metadata into account to represent each user. To tackle the first challenge, we employ the masked language modeling task as our training objective, which randomly masks a proportion of all tracks in the input sequence. For the second and third challenges, we introduce a temporal-aware linearized attention, incorporating an attention bias with temporal information and replacing the standard softmax computation with a kernel computation. This not only models time-variant information in the attention scores but also reduces both the memory and computation complexity while preserving competitive performance. In addition, we propose a new metric, MRED_DOH, to evaluate the difference in performance across various listening times in a day, which reflects another critical but unexplored dimension of fairness.

In summary, our contributions are as follows: 1) We propose RecFormer, a novel personalized temporal-aware Transformer for fair music recommendation, adopting masked language modeling tasks as training objectives to learn fine-grained track representations. 2) To reduce the computation complexity and model time-variant information, a temporal-aware linearized attention is designed by replacing softmax with a kernel computation and integrating temporal embeddings into the attention scores. Furthermore, we propose a new metric (MRED_DOH) to reflect the difference in performance when predicting tracks in each hour of the day, which is also an essential but challenging dimension of fairness. 3) Our RecFormer outperforms the official baseline by at least 116% in terms of the total score. Moreover, extensive experiments were conducted to examine the contribution of each module in RecFormer.

CIKM'22: Proceedings of the 31st ACM International Conference on Information and Knowledge Management
sf1638.cs05@nctu.edu.tw (W. Wang); wwdu.cs10@nycu.edu.tw (W. Du); wcpeng@cs.nycu.edu.tw (W. Peng)
https://wywywang.github.io/ (W. Wang); https://wwweiwei.github.io/ (W. Du)
ORCID: 0000-0002-6551-1720 (W. Wang); 0000-0002-0627-0314 (W. Du); 0000-0002-0172-7311 (W. Peng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related Work

Recommendation System. The recent progress of recommendation systems (RSs) has brought substantial revenue to industry, since the preferred items of the target audience can be identified precisely by collecting and analyzing their digital footprints. Early work on RSs typically employed collaborative filtering (CF) [5] and matrix factorization (MF) [6]. Recently, several sequential recommendation systems have modeled user behaviors in the temporal aspect, which was ignored in early RSs. For instance, Hidasi et al. [7] introduced GRU4Rec with sequential models and a ranking loss, taking previous user records into account to predict future preferences. However, one of the limitations in this task is the validation strategy, which does not guarantee that the test set is drawn from the latest timestamp; this hinders the above approaches from tackling the task effectively. Inspired by the robust pretraining tasks of BERT [8], we adopt masked language modeling tasks to randomly mask the user sequence and reconstruct it, which learns a fine-grained track representation as well as robustness to the time-invariant test set. This motivation was also used by [9], who proposed BERT4Rec, a bidirectional Transformer with masked language modeling tasks, and demonstrated its effectiveness on several sequential recommendation tasks.

Dataset Introduction. The dataset selected for this task is LFM-1b [10], which focuses on music recommendation on Last.fm. The dataset is composed of 120k users, 63k artists, 1.3M albums, 821k tracks, and 38M listening events, filtered with the pre-processing procedure introduced in [11]. Furthermore, it provides rich song (i.e., artist and album) and user (i.e., country, age, gender, listening preference) metadata for evaluating the multi-dimensional behaviors of models. In general, this dataset helps researchers measure not only quantitative performance but also non-standard metrics for fairness.
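To make the input format concrete, the sketch below groups LFM-1b-style listening events into per-user track sequences and randomly masks a proportion of tracks for reconstruction, in the spirit of the masked-modeling idea discussed above. The field names, mask rate, and mask token are illustrative assumptions, not the official EvalRS pipeline.

```python
import random

# Hypothetical listening events: (user_id, track_id, hour_of_day).
# Field names and values are illustrative, not the official LFM-1b schema.
events = [
    ("u1", "t1", 8), ("u1", "t2", 8), ("u1", "t3", 21),
    ("u2", "t2", 14), ("u2", "t4", 14),
]

MASK = "<mask>"  # special token that the track classifier must reconstruct

def build_sequences(events):
    """Group listening events into an ordered (track, hour) sequence per user."""
    seqs = {}
    for user, track, hour in events:
        seqs.setdefault(user, []).append((track, hour))
    return seqs

def mask_sequence(tracks, mask_rate=0.25, rng=random.Random(42)):
    """Randomly replace a proportion of tracks with the mask token;
    masked positions become reconstruction targets for training."""
    masked, targets = [], {}
    for i, track in enumerate(tracks):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = track
        else:
            masked.append(track)
    return masked, targets

seqs = build_sequences(events)
tracks_u1 = [track for track, _ in seqs["u1"]]
masked, targets = mask_sequence(tracks_u1)
```

The hours carried alongside each track are what the temporal-aware attention described next consumes as its bias signal.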
3. Methodology

Figure 2 illustrates an overview of our proposed RecFormer, parts of which are inspired by recent research on natural language understanding. Given a user history (sequence), we first apply random masks to the sequence, and then personalized user embeddings are generated based on the user metadata and the corresponding tracks. Afterwards, RecFormer, which replaces the self-attention mechanism with the proposed temporal-aware linearized attention in each layer, is introduced to model the masked user embeddings. Finally, a track classifier is used to reconstruct the masked tracks into the original tracks.

Figure 2: The architecture of our proposed RecFormer.

3.1. Personalized User Embeddings

As each user has multiple listened tracks and the corresponding user metadata, we incorporate each type of user metadata with each track to model each persona, inspired by [12]. Specifically, the input embedding for RecFormer at the i-th timestamp is constructed by adding the track embedding t_i, the positional embedding p_i, and the metadata embeddings (g, c, a), which are all projected with corresponding embedding layers to d-dimensional vectors:

E = (e_1, ⋯, e_L) = (t_1 + p_1 + g + c + a, ⋯, t_L + p_L + g + c + a),  (1)

where g denotes gender, c denotes country, a denotes age, and L is the max sequence length. The user age is discretized into 15 bins since it is a continuous variable.

3.2. RecFormer

RecFormer aims to capture the temporal order of a user history, which is hard to consider with traditional CF and MF approaches. To that end, we introduce RecFormer based on the Transformer encoder [4], which not only encodes all tracks in a sequence but also speeds up the training procedure through the parallel computation of the attention mechanism, compared with recurrent-based models. Formally, the personalized user embedding E follows the standard Transformer encoding steps with the proposed temporal-aware linearized-attention (TALA), residual connection and layer normalization (Norm), dropout, and a feed-forward network (FFN) as follows:

H̃ = Norm(E + TALA(E)), H = Norm(H̃ + FFN(H̃)).  (2)

The output dimension of H is d, and the inner dimension of the FFN is d_inner.

Temporal-Aware Linearized-Attention (TALA): To reduce the quadratic memory and computation complexity of the conventional self-attention mechanism, we replace the softmax with a kernel computation, which requires only linear computation [13]:

Q = E W^Q, K = E W^K, V = E W^V,  (3)

TALA(E) = (φ(Q) φ(K)^T) V,  (4)

where W^Q, W^K, W^V ∈ ℝ^{d×d}, and φ(·) = elu(·) + 1 is applied row-wise, as suggested in [13].

Since users listen to music tracks following their personal habits (e.g., at work or on a bus), it is essential to take time-variant information into account when recommending preferred tracks, e.g., [14, 3]. Therefore, in addition to the linearized attention, we also incorporate the listening time of each track as an attention bias, motivated by [15]. Formally, Equ. 4 is extended as:

TALA(E) = (φ(Q) φ(K + E_H)^T) V,  (5)

where E_H is the d-dimensional hour embedding (from 0 to 23, 24 categories in total). It is noted that, based on our experiments, we empirically add the hour embeddings to the key matrices.
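As an illustration of Equations 3–5, the following NumPy sketch implements the kernelized attention with the hour bias added to the keys. The shapes, weights, and hours are toy placeholders; the point is that grouping the products as φ(Q)(φ(K + E_H)^T V) avoids materializing the L×L attention matrix, which is where the linear complexity comes from. Note that Equation 4 as stated carries no softmax-style normalizer (unlike the full form in [13]), and the sketch follows the equation as written.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1, applied elementwise: x + 1 for x > 0, exp(x) otherwise
    return np.where(x > 0, x + 1.0, np.exp(x))

def tala(E, Wq, Wk, Wv, E_hour):
    """Temporal-aware linearized attention following Eqs. 3-5:
    hour embeddings are added to the keys as a temporal attention bias."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv           # Eq. 3
    phi_q = elu_plus_one(Q)
    phi_k = elu_plus_one(K + E_hour)           # temporal bias on the keys (Eq. 5)
    # Associativity: phi_q @ (phi_k.T @ V) costs O(L * d^2)
    # instead of the O(L^2 * d) of softmax attention.
    return phi_q @ (phi_k.T @ V)

rng = np.random.default_rng(0)
L, d = 6, 4                                    # toy sequence length / model dim
E = rng.normal(size=(L, d))                    # personalized user embeddings
hour_table = rng.normal(size=(24, d))          # one embedding per hour 0..23
hours = np.array([8, 8, 9, 21, 21, 22])        # listening hour of each track
E_hour = hour_table[hours]

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = tala(E, Wq, Wk, Wv, E_hour)
```

Because φ keeps the feature maps strictly positive, the product φ(K)^T V can be accumulated once and reused for every query row, which is what makes the cost linear in the sequence length.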
3.3. Track Classifier

After L layers of RecFormer encoding multi-hop information, we obtain the final output H^L for all items of the user sequence. The track classifier is employed to predict the masked tracks, as shown in Figure 2. Specifically, we apply a feed-forward layer to generate the i-th output:

Z_i = σ(H^L_i W^Z); ŷ_i = softmax(Z_i),  (6)

where W^Z ∈ ℝ^{d×T}, T is the total number of tracks, and σ is the activation function.

Training and Testing. As one of the challenges in this task is the validation strategy, the test set may not be sampled at the latest timestamp. To tackle this issue, we apply masked language modeling (MLM) as the training objective to learn a robust track representation. The goal of MLM is to reconstruct the masked tracks given a user sequence, which enables the model to learn the relations between tracks. Following [9], we use the final output H with the track classifier to predict the masked tracks, and the loss function is defined as follows:

𝕃 = − Σ_{i=1}^{|U|} y_i log(ŷ_i),  (7)

where U is the set of user sequences. In the inference phase, we empirically place the mask at the last timestamp of a sequence to predict the 100 tracks that the user is most likely to listen to. In addition, as we cannot fetch the timestamp of the test set, we set the predicted hour to the hour at which the user most often listens to music tracks to fit into TALA.

Table 1: Ablation study of our model. U: personalized user embeddings; T: temporal-aware computation in TALA. The total score is computed as ((1) + (2) + (3)) / 3, as Phase 2 requires a minimum hit-rate threshold.

                                      RecFormer (ours)    -U         -T         -U -T
Standard RSs metrics (1)                   0.0093        0.0071     0.0092     0.0072
Standard metrics on a per-group (2)       -0.0061       -0.0077    -0.0158    -0.0099
Behavioral tests (3)                      -0.0213       -0.1097    -0.0225    -0.1101
MRED_DOH (ours)                           -0.0047       -0.0030    -0.0033    -0.0030
Total Score                               -0.0060       -0.0368    -0.0096    -0.0376

4. Experiments and Analysis

4.1. Experimental Setup

The dimension d was set to 64, the inner dimension of the feed-forward layer was 256, and the number of heads was set to 1. The evaluation metrics cover different perspectives: standard RSs metrics (HR and MRR); standard metrics on a per-group or slice basis (gender balance, artist popularity, user country, song popularity, and user history); and behavioral tests (being less wrong, and latent diversity) [11]. The data are pre-processed by filling NaN values of user gender and country with n and UNKNOWN, respectively.

Proposed MRED_DOH Metric: Since one of the challenges we aim to address is the time-variant event, we propose a new metric, MRED_DOH, to evaluate the difference in performance across various listening times in a day, which reflects another critical but unexplored dimension of fairness. That is, MRED_DOH enables us to investigate whether a model biases its predictions toward users who listen in specific time slots. In other words, we operationalize this metric in the same spirit as MRED_Gender proposed in RecList [16]: the smaller the difference, the fairer the model with respect to potential temporal biases. Specifically, we use the hour at which each user most often listens to partition listeners into sub-groups (i.e., there are 24 sub-groups in total), which we denote as the user hour. Afterwards, we can evaluate the MRED score using the existing RecList with the user hour to measure the model performance at each hour of the day. We note that this is an aspect-driven metric, which can be adjusted by monitoring different temporal dimensions based on user needs. For example, the hour at which the user most often listens can easily be changed to the least active hours or the average active hours. Moreover, it can also be employed in sequential recommendation systems by changing the user hours to sequential positions.
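To make the proposed metric concrete, the sketch below computes one plausible MRED-style aggregation over hour-of-day sub-groups: the negative mean absolute deviation of each sub-group's hit-rate from the overall hit-rate. The exact aggregation performed inside RecList may differ, and the function name and inputs here are hypothetical.

```python
from collections import defaultdict

def mred_doh(hits, user_hours):
    """MRED-style score over hour-of-day sub-groups: the negative mean
    absolute difference between each sub-group's hit-rate and the overall
    hit-rate (0 = perfectly fair across hours)."""
    overall = sum(hits) / len(hits)
    groups = defaultdict(list)
    for hit, hour in zip(hits, user_hours):
        groups[hour].append(hit)
    diffs = [abs(sum(g) / len(g) - overall) for g in groups.values()]
    return -sum(diffs) / len(diffs)

# Toy example: 1 = the held-out track was hit, 0 = missed;
# user_hours holds each user's most frequent listening hour.
hits = [1, 0, 1, 1, 0, 0]
user_hours = [8, 8, 8, 21, 21, 21]
score = mred_doh(hits, user_hours)
```

A score of 0 would indicate identical hit-rates across all user hours; more negative values indicate larger temporal disparities, matching the "smaller difference = fairer" reading above.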
The dropout rate was 0.0, and the max sequence length (L) was set to 60 due to the time limit, which kept about 25% of tracks on average. The batch size was 100, the learning rate was set to 1e-3, the training epochs were set to 50, and the seeds were tested from 42 to 52. It is noted that the set of predicted tracks is based on the train set; that is, we hypothesize that our RecFormer only recommends tracks that have been listened to before. All the training and evaluation phases were conducted on a machine with an AMD Ryzen Threadripper 3960X 24-core processor, an Nvidia GeForce RTX 3090, and 252 GB of RAM.

4.2. Overall Performance Comparison

Ablation Study. To verify the contribution of each module in RecFormer, we conduct ablative experiments by removing the personalized user embeddings, the temporal-aware computation, and both. From Table 1, we can observe that removing any one module of RecFormer results in a performance drop in terms of the metrics adopted in this task, which testifies to the effective design of RecFormer. However, our RecFormer performs the worst in terms of our proposed MRED_DOH, which indicates that our method still fails to meet fairness in the temporal aspect. In addition, this result also indicates that considering temporal-awareness alone is not able to address temporal fairness, which will be investigated in our future research.

We note that several continuous variables (e.g., novelty_artist-related features) were also included in the personalized user embeddings with projections as in [17], but the training loss could not converge.

Official Score. Table 2 shows the performance in the formal phase. We also implemented BERT4Rec to compare the performance, which can be viewed as one of our variants. It can be observed that the MRED_DOH performance of BERT4Rec is the best, but its performance on the standard RSs metrics fails to meet the requirement (hit-rate > 0.015). One of the reasons is that BERT4Rec does not converge with the same hyper-parameters due to its computational complexity, which is alleviated by the linear attention in our RecFormer. Therefore, these fairness results cannot be directly compared with the official baseline and our RecFormer, since BERT4Rec cannot recommend plausible tracks. Our framework achieved a total score of 0.1964, which outperformed the official baseline by 116%, while some gap remains compared with the first prize. Despite this, our approach demonstrates that using MLM as the training objective can achieve competitive performance.

Table 2: Official performance of RecFormer. The score is normalized with the official baseline and the best score of Phase 1.

Rank       Model                 Score      Standard RSs metrics   Standard metrics on a per-group   Behavioral tests   MRED_DOH
7          RecFormer             0.7526     0.0098                 -0.0056                           -0.0011            -0.0047
-          BERT4Rec [9]          -100.0     0.0016                 -0.0030                           -0.2729            -0.0014
-          CBOWRecSysBaseline    -1.2122    0.0512                 -3.7194                           0.4527             -0.0034
Imp. (%)   -                     116        -                      -                                 -                  -

5. Conclusion

In this paper, we propose RecFormer, which incorporates personalized user embeddings and temporal-aware linearized attention to recommend accurate and fair tracks to users based on their personal listening habits for the EvalRS task. Furthermore, the linearized attention reduces both computation and memory complexity by making use of kernel computation. The ablation study with the proposed metric demonstrates the effectiveness and fairness of our proposed approach. From the leaderboard score, our method illustrates that using MLM for learning track representations can achieve competitive performance.

References

[1] C. Luo, V. Lakshman, A. Shrivastava, T. Cao, S. Nag, R. Goutam, H. Lu, Y. Song, B. Yin, ROSE: robust caches for amazon product search, in: WWW (Companion Volume), ACM, 2022, pp. 89–93.
[2] S. Mehta, Why is the fairness in recommender systems required?, 2022. URL: https://analyticsindiamag.com/why-is-the-fairness-in-recommender-systems-required/.
[3] Y. Koren, Collaborative filtering with temporal dynamics, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 447–456.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
[5] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: WWW, ACM, 2001, pp. 285–295.
[6] J. Chen, C. Wang, S. Zhou, Q. Shi, J. Chen, Y. Feng, C. Chen, Fast adaptively weighted matrix factorization for recommendation with implicit feedback, in: AAAI, AAAI Press, 2020, pp. 3470–3477.
[7] B. Hidasi, A. Karatzoglou, L. Baltrunas, D. Tikk, Session-based recommendations with recurrent neural networks, in: ICLR (Poster), 2016.
[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT (1), Association for Computational Linguistics, 2019, pp. 4171–4186.
[9] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer, in: CIKM, ACM, 2019, pp. 1441–1450.
[10] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: ICMR, ACM, 2016, pp. 103–110.
[11] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. de Souza P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, CoRR abs/2207.05772 (2022).
[12] Y. Zheng, R. Zhang, M. Huang, X. Mao, A pre-training based personalized dialogue generation model with persona-sparse data, in: AAAI, AAAI Press, 2020, pp. 9693–9700.
[13] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret, Transformers are RNNs: fast autoregressive transformers with linear attention, in: ICML, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 5156–5165.
[14] J. Bao, Y. Zhang, Time-aware recommender system via continuous-time modeling, in: CIKM, 2021, pp. 2872–2876.
[15] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, in: NAACL-HLT (2), Association for Computational Linguistics, 2018, pp. 464–468.
[16] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: WWW (Companion Volume), ACM, 2022, pp. 99–104.
[17] W. Wang, H. Shuai, K. Chang, W. Peng, ShuttleNet: position-aware fusion of rally progress and player styles for stroke forecasting in badminton, in: AAAI, AAAI Press, 2022, pp. 4219–4227.