Exploiting Global Behavior Contextual Correlation in Sequential Recommendation Augmentation

Qian Yu1, Xiangdong Wu1, Chen Yang1, Zihao Zhao1, Haoxin Liu2, Chaosheng Fan1, Changping Peng1, Zhangang Lin1, Jinghe Hu1 and Jingping Shao1
1 Marketing & Commercialization Center, JD.com
2 Tsinghua University

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
Contact: yuqian81@jd.com (Q. Yu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The recently proposed Sequential Recommendation Augmentation (SRA) paradigm has shown valuable potential in sequential recommendation, especially for handling the long-tail problem by extending short behavior sequences. However, self-supervised SRA adopts autoregressive learning with a fixed forward or backward direction, which cannot make full use of the contextual correlation information in the training behavior sequences. Due to this difference in direction, a discrepancy problem exists between the two training stages of SRA, i.e., pretraining and finetuning. To overcome the restriction of a fixed sequential learning direction, we propose to equip SRA with permutation autoregressive learning, which extracts global contextual correlation information from the behavior sequences in both directions. The adapted SRA method is implemented with two-stream self-attention. Empirical evaluations on multiple sequential recommendation benchmark datasets demonstrate the effectiveness of our proposed model, and the augmented data significantly accelerates convergence.

Keywords: Sequential Recommendation, Data Augmentation, Permutation Autoregressive Learning

1. Introduction

Sequential recommendation aims to find behavior patterns or item transitions in user behavior sequences. Various architectures have been developed, including Markov chains [1], RNNs [2], attention-based sequence models [3], and graph models [4].

Data sparsity severely degrades the performance of sequential recommendation. Data augmentation is a straightforward solution for handling short behavior sequences in sequential recommendation [5]. There are mainly two kinds of data augmentation methods, namely heuristic augmentation [6] and generative augmentation [7]. Recently, Sequential Recommendation Augmentation (SRA) was proposed as an augmentation paradigm for sequential recommendation [7]. Consisting of two training stages, namely pretraining and finetuning, SRA is a verified effective solution for handling short sequences in sequential recommendation, i.e., the long-tail problem.

However, the current learning procedure cannot completely extract the contextual correlation in the given sequential training instances. In either learning stage, an item is predicted given only the subsequence located on a single side of it. Specifically, SRA pretrains the sequence model with reversed training sequences in order to generate pseudo-prior items, while the finetuning stage adopts the normal forward autoregressive objective. Therefore, the trained model is never aware of the bidirectional context behaviors around the current position; in other words, the learning of the framework is insufficient. Besides, the two stages update the same set of parameters but with different learning directions. Similar discrepancy problems are commonly seen in this kind of pretrain-finetune method, and they remain a constraint on further performance improvement.

To address the abovementioned problems, we propose to exploit global contextual correlation information with Permutation Autoregressive Learning (PAL) for SRA. Specifically, we unify the learning objectives with a permutation language model objective and implement it for sequential recommendation with a two-stream self-attention mechanism. PAL lets the model learn from different permutations of the input in order to exploit global contextual information without restrictions on the learning direction. For the inference stage, we adopt beam search to generate more suitable subsequences as the augmented data. A recent method named BiCAT [8] comes with a similar motivation, implemented via an additional loss regularization, but it is not designed for extracting contextual correlation information around the predicted position; we compare against it empirically.

Our contributions can be summarized as follows: (a) Global contextual correlation information is explored in Sequential Recommendation Augmentation (SRA). (b) Equipped with Permutation Autoregressive Learning and a beam search method, an adapted SRA framework is designed and evaluated. (c) The proposed framework outperforms the state-of-the-art methods for sequential recommendation augmentation without extra information or heuristic rules.

Figure 1: Stages in SRA. (a) Pretraining with reversed sequence. (b) Pseudo-prior item augmentation. (c) Finetuning.
2. Sequential Recommendation Augmentation

The sequential recommendation task can be regarded as next-item prediction given the historical behavior sequence. We denote the user set as 𝒰 and the item set as 𝒳. The interaction behavior of a given user u ∈ 𝒰 is denoted as S^u = {x^u_1, x^u_2, ..., x^u_n}. The sequential recommendation task can be formulated as:

    x^u_{n+1} = \arg\max_x \; p(x \mid S^u)    (1)

which means finding the next item x^u_{n+1} with the largest probability given the user behavior sequence S^u.

Recently, a Sequential Recommendation Augmentation (SRA) paradigm was proposed [7], and the basic SRA method is known as ASReP. As illustrated in Fig. 1, ASReP utilizes reverse pretraining for data augmentation. We take the Transformer as an example backbone for describing this learning paradigm. The key component of the Transformer, i.e., multi-head self-attention, is constructed from linear transformations and scaled dot-product attention [9]. There are two stages in the training procedure updating the same set of model parameters θ, namely reverse pretraining and left-to-right finetuning. The reverse pretraining intends to learn inverse sequence generation via the autoregressive learning objective:

    \max_\theta \; p_\theta(x^u_i \mid x^u_{i+1}, x^u_{i+2}, ..., x^u_n)    (2)

With the pretrained model, pseudo-prior items can be recursively generated for short sequences, in order to alleviate the data sparsity in recommendation and further improve the quality of the whole training set. The pretrained model can then be finetuned for next-item prediction with the forward autoregressive learning objective:

    \max_\theta \; p_\theta(x^u_i \mid x^u_1, x^u_2, ..., x^u_{i-1})    (3)

For details of the SRA paradigm, please refer to [7].
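To make the two-stage paradigm concrete, the following is a minimal PyTorch sketch of reverse pretraining (Eqn 2) and greedy pseudo-prior generation. It is our illustration rather than the ASReP authors' code; the `model` interface (item-id tensors in, next-item logits of shape [batch, length, |𝒳|] out) and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_pretrain_step(model, batch, optimizer):
    """One pretraining step on reversed sequences (Eqn 2)."""
    rev = torch.flip(batch, dims=[1])            # [x_n, ..., x_2, x_1]
    inputs, targets = rev[:, :-1], rev[:, 1:]    # shifted next-item targets
    logits = model(inputs)                       # assumed: [B, T-1, |items|]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def augment_short_sequence(model, seq, num_items=15):
    """Greedily generate pseudo-prior items for one short sequence.

    The model reads the sequence in reverse, so each prediction is an item
    *preceding* the currently earliest one; the generated run is reversed
    back and prepended to the original sequence.
    """
    rev = list(reversed(seq))
    for _ in range(num_items):
        logits = model(torch.tensor([rev]))[0, -1]
        rev.append(int(logits.argmax()))
    return list(reversed(rev[len(seq):])) + list(seq)
```

PAL (Section 3) keeps this overall loop but replaces the pretraining objective with the permutation objective of Eqn 4, and PAL++ replaces the greedy generation with beam search.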
3. Methodology

3.1. Permutation Autoregressive Learning

Our goal is to help the SRA framework make use of the global contextual correlation in the behavior sequences. The idea of "mask and reconstruct" is a common way to help a sequence model learn from contextual information at arbitrary positions, but incorporating a [MASK] token into behavior sequences would introduce an even more severe discrepancy problem, as in BERT [10], especially considering that the trained model is used to recursively generate behavior sequences. Our solution for exploiting context information is to adopt the permutation modeling objective [11], which gathers information from the bidirectional context while retaining the autoregressive learning paradigm.

3.1.1. Permutations with Original Position Encoding

In order to exploit item correlation in the bidirectional context, we propose to train the sequence model with different permutations of each training sequence in an autoregressive way.

Assume the length of the behavior sequence is T, and denote the set of all possible index permutations as Z^T. For example, if T = 4, the original permutation is [1, 2, 3, 4], and the number of permutations in Z^T is T! = 24. For each permutation z in Z^T, z_{<t} stands for the indices of all elements before the t-th element, and z_t is the index of the current element. As in autoregressive learning, we try to predict the item at z_t given those at z_{<t}. In this way, the learning objective is rewritten according to the permutation z instead of the original order of the sequence x:

    \max_\theta \; \mathbb{E}_{z \sim Z^T} \log p_\theta(x_{z_t} \mid x_{z_{<t}})    (4)

where x_{z_{<t}} stands for the items of x whose indices are in z_{<t}. This objective conditions the probability of an item on all possible permutations of the other items in an autoregressive way, as opposed to only those on the left or right side of the target item as in existing methods for sequential recommendation augmentation. It should be emphasized that only the factorization order, namely which elements are used for prediction, is changed, while the position of each item in the original sequence is retained.

The above learning objective helps each position learn information from the bidirectional context, but it raises a new issue about position information. When predicting given several known items, the predicted position or index is not fixed as in original autoregressive learning. We therefore need to learn a target-aware representation that encodes the position at which the currently predicted item is located. Accordingly, p_θ(x_{z_t} | x_{z_{<t}}) is formulated as:

    p_\theta(x_{z_t} \mid x_{z_{<t}}) = \frac{\exp[e(x_{z_t})^\top \tilde{h}_\theta(x_{z_{<t}}, z_t)]}{\sum_{x^*} \exp[e(x^*)^\top \tilde{h}_\theta(x_{z_{<t}}, z_t)]}    (5)

where \tilde{h}_\theta(x_{z_{<t}}, z_t) is the learned target-aware representation for the item at index z_t.
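The objective of Eqn 4 is typically realized by sampling factorization orders and masking attention accordingly. The sketch below, under the same assumed PyTorch setting as before, shows one way to turn a sampled permutation into a visibility mask; `sample_permutation_mask` and its arguments are illustrative names, not the paper's API.

```python
import torch

def sample_permutation_mask(T: int, num_targets: int = 1):
    """Sample a factorization order z and its z_{<t} visibility mask.

    mask[i, j] = True iff original position i may attend to position j,
    i.e. j precedes i in z. Items keep their original positional encodings;
    only the prediction order changes.
    """
    z = torch.randperm(T)                         # random factorization order
    rank = torch.empty(T, dtype=torch.long)
    rank[z] = torch.arange(T)                     # rank[i] = place of index i in z
    mask = rank.unsqueeze(1) > rank.unsqueeze(0)  # i sees j iff rank[j] < rank[i]
    targets = z[-num_targets:]                    # predict only the last items of z
    return z, mask, targets
```

For the content stream introduced next (Section 3.1.2), the diagonal of this mask is additionally unmasked so each position may see its own content (z_{≤t}); following Algorithm 1, only the last `num_targets` indices of z are used as prediction targets for efficiency.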
3.1.2. Two-stream Attention for Contextual Representation

As mentioned above, the self-attention module in the sequence model needs to be modified to obtain target-aware representations. The two-stream attention structure was used in language modeling to provide target position information without leaking the content information of the target.

Specifically, two separate streams of attention vectors are maintained to store content information and position information. For each position z_t in the factorization order z, we keep updating the intermediate vectors h_{z_t} and \tilde{h}_{z_t}, representing the content stream and the target stream respectively. Each stream is learned by a designated attention mechanism. The detailed formulations of the two-stream attention structure are as follows.

The first is the content-stream representation, which is exactly the same as the hidden state in standard self-attention. The corresponding attention is named content-stream attention:

    h^{(m)}_{z_t} \leftarrow \mathrm{Attention}(Q = h^{(m-1)}_{z_t},\; KV = h^{(m-1)}_{z_{\le t}};\; \theta)    (6)

where h^{(m)}_{z_t} is the output of the m-th Transformer block. The second is the target representation, which contains only the position information of the target in order to avoid leaking content information. The target-stream attention is:

    \tilde{h}^{(m)}_{z_t} \leftarrow \mathrm{Attention}(Q = \tilde{h}^{(m-1)}_{z_t},\; KV = h^{(m-1)}_{z_{<t}};\; \theta)    (7)

The content representation h^{(0)}_{z_t} is initialized with the item embedding e(x_{z_t}) plus the positional encoding, as in the normal Transformer, while every target representation \tilde{h}^{(0)}_{z_t} is initialized with a shared trainable vector w. The output \tilde{h}_{z_t} of the last Transformer layer is used as \tilde{h}_\theta(x_{z_{<t}}, z_t) in Eqn 5 for prediction. In this way, equipped with two-stream attention, we can force each position in the sequence to learn bidirectional information while maintaining the normal behavior order.

Figure 2: Different Forms of Attention. (a) Content-stream attention. (b) Target-stream attention.
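A compact, single-head sketch of one two-stream block may help; it omits the projection matrices, multiple heads, residual connections, and feed-forward sublayers of a real Transformer block and only illustrates how the two masks of Eqns 6 and 7 differ. All names here are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, kv, mask):
    """Single-head scaled dot-product attention; mask[i, j]=True iff i sees j."""
    scores = q @ kv.transpose(-2, -1) / kv.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, -1e9)  # finite fill keeps empty rows stable
    return F.softmax(scores, dim=-1) @ kv

def two_stream_block(h, g, rank):
    """One block of Eqns 6-7. h: content stream [T, d]; g: target stream [T, d];
    rank[i] = place of original position i in the factorization order z."""
    earlier = rank.unsqueeze(1) > rank.unsqueeze(0)                   # z_{<t}
    see_self = torch.eye(h.size(0), dtype=torch.bool)
    h_new = masked_attention(h, h, earlier | see_self)  # Eqn 6: z_{<=t}, sees itself
    g_new = masked_attention(g, h, earlier)             # Eqn 7: never sees own content
    return h_new, g_new
```

In use, h would be initialized from item embeddings plus positional encodings and g from the shared trainable vector w broadcast over all positions, matching the initialization described above.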
3.1.3. Optimization of Sequence Augmentation

We adopt beam search to obtain better subsequences as the pseudo-prior items. Instead of recursively predicting the next item in a greedy way, beam search maintains a buffer of candidate subsequences and selects the best one with the largest joint probability. The beam width is denoted as k. More details can be found in [12].

3.2. PAL Algorithm for SRA

Considering the computational complexity of permutation autoregressive learning, we need to sample from the set of permutations and predict only part of the sequence. For each sampled permutation, we train the model by maximizing the probability of its last item. The detailed learning procedure is described in Algorithm 1.

Algorithm 1: Permutation Autoregressive Learning for SRA
 1: Input: A set of behavior sequences {S^u}
 2: Output: A sequential recommendation model
 3: procedure PAL({S^u})
 4:   for each epoch do
 5:     for each instance in batch do
 6:       Sample n permutations with length L.
 7:       Pretrain with Eqn 4.
 8:   Save the resulting pretrained model M_0 for generation.
 9:   Select behavior sequences shorter than m.
10:   Generate the pseudo-prior items with beam width k.
11:   for each epoch do
12:     for each batch do
13:       Finetune M_0 with Eqn 3.
14:   Save the resulting finetuned model for sequential recommendation.
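For the generation step in line 10, the greedy loop from the earlier sketch can be swapped for beam search. The following is a minimal sketch under the same assumed `model` interface; it keeps k candidate reversed prefixes and returns the pseudo-prior items with the largest joint log-probability.

```python
import torch

@torch.no_grad()
def beam_search_priors(model, seq, num_items=15, k=5):
    """Beam-search pseudo-prior generation (Section 3.1.3), beam width k."""
    # Each beam is (reversed prefix, cumulative log-probability).
    beams = [(list(reversed(seq)), 0.0)]
    for _ in range(num_items):
        candidates = []
        for rev, score in beams:
            log_probs = model(torch.tensor([rev]))[0, -1].log_softmax(-1)
            top = log_probs.topk(k)                  # expand each beam k ways
            for lp, item in zip(top.values, top.indices):
                candidates.append((rev + [int(item)], score + float(lp)))
        # Keep the k candidates with the largest joint probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best_rev, _ = beams[0]
    return list(reversed(best_rev[len(seq):]))       # pseudo-prior items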
4. Experiments

4.1. Datasets and Baseline Models

Following the settings in [7] and [8], we adopt 6 datasets collected from Amazon.com (http://jmcauley.ucsd.edu/data/amazon/). For behavior sequence construction, we regard the presence of a review as an interaction between a user and an item, and order behaviors by timestamp. Following the preprocessing in [7], we use the last item in each sequence for testing. The statistics of the datasets are shown in Table 1.

Table 1: Statistics of the Datasets.

             Beauty   Phones   Sports   Tools    Baby     Office
#user        22363    27879    35598    16638    19445    4905
#item        12101    10429    18357    10217    7050     2420
#instance    198502   138681   296337   134476   160792   53258
avg. length  6.88     4.97     6.32     6.08     6.27     8.86

We compare our proposed method with the following methods, including the state-of-the-art BiCAT [8] for sequential recommendation augmentation. SASRec [13] utilizes a Transformer to extract correlations from the training sequences and predict the next item. BERT4Rec [14] adapts the BERT training method to learn a Transformer for sequential recommendation. ASReP [7] reversely pretrains the Transformer to generate pseudo-prior items for short sequences and then finetunes it for recommendation. BiCAT [8] is the latest model for sequential recommendation augmentation; it incorporates an additional objective in pretraining. PAL is our proposed learning method, and PAL++ equips PAL with beam search.

4.2. Implementation Details

We select the Transformer as the backbone to verify our SRA solution. The number of blocks is fixed to 2. The hidden size is selected from {32, 64, 128}, and the number of attention heads from {2, 4}. The learning rate is fixed at 0.001, since results with other settings are similar, and the dropout rate is fixed to 0.5. The short-sequence length threshold m is set to 18, and each short sequence is augmented with 15 pseudo-prior items. The number of sampled permutations (n in Algorithm 1) is selected from {2, 4, 6} via model selection. The number of epochs is fixed to 200, which is sufficient for all models to converge. We conduct model selection via grid search.

For each behavior sequence, we randomly sample 100 negative items to rank together with the last (ground-truth) item. Recall@n, NDCG@n, and Mean Reciprocal Rank (MRR) are employed as evaluation metrics, with n in {5, 10}.
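As a reference for this protocol, a minimal sketch of the per-user metric computation follows; `score` stands in for the trained model's scoring function and is an assumed interface, and the per-user values are averaged over all test users.

```python
import math
import random

def rank_of_ground_truth(score, gt_item, all_items, num_neg=100):
    """1-based rank of the ground-truth item among itself + sampled negatives."""
    negatives = random.sample([i for i in all_items if i != gt_item], num_neg)
    ranked = sorted(negatives + [gt_item], key=score, reverse=True)
    return ranked.index(gt_item) + 1

def metrics_from_rank(rank, n=10):
    recall_at_n = 1.0 if rank <= n else 0.0                       # Recall@n
    ndcg_at_n = 1.0 / math.log2(rank + 1) if rank <= n else 0.0   # NDCG@n, 1 relevant item
    mrr = 1.0 / rank                                              # reciprocal-rank term
    return recall_at_n, ndcg_at_n, mrr
```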
4.3. Performance of PAL

We perform sequential recommendation on all 6 datasets with the above-mentioned baseline models to demonstrate the effectiveness of PAL. Previous SRA work has shown the advantage of self-attention sequence models for recommendation, so we use SASRec and BERT4Rec as the two baselines without augmentation. The results are presented in Table 2. All the SRA methods achieve better performance than the non-augmented baselines, which verifies the effectiveness of augmentation. Compared with the strongest sequential recommendation augmentation baseline, BiCAT, the proposed PAL provides around 2% to 5% improvement on NDCG@10, which is significant on these sparse datasets. The beam search method (PAL++) consistently shows effectiveness on all datasets, and the further performance improvements on "Tools and Home Improvement" and "Baby" are more pronounced than on the other datasets. The explanation for this difference is that behavior diversity varies across datasets.

Table 2: Performance of Different Methods on Sequential Recommendation. Relative changes are based on ASReP.

Dataset  Model     R@5     R@10    NDCG@5           NDCG@10          MRR
Beauty   SASRec    0.3849  0.4863  0.2884           0.3212           0.2870
         BERT4Rec  0.4243  0.5371  0.3075           0.3598           0.3021
         ASReP     0.4583  0.5743  0.3465           0.4042           0.3540
         BiCAT     0.4901  0.5892  0.3704 (+6.8%)   0.4289 (+6.1%)   0.3712 (+4.8%)
         PAL       0.4934  0.6048  0.3873 (+11.7%)  0.4400 (+8.8%)   0.3803 (+7.4%)
         PAL++     0.4936  0.6036  0.3879 (+11.9%)  0.4415 (+9.2%)   0.3821 (+7.9%)
Phones   SASRec    0.3517  0.4706  0.2475           0.2859           0.2470
         BERT4Rec  0.3732  0.4942  0.2687           0.3006           0.2684
         ASReP     0.5489  0.6758  0.4107           0.4518           0.3946
         BiCAT     0.5663  0.7032  0.4274 (+4.0%)   0.4729 (+4.6%)   0.3990 (+1.1%)
         PAL       0.5736  0.7178  0.4432 (+7.9%)   0.4798 (+6.2%)   0.4100 (+3.9%)
         PAL++     0.5745  0.7239  0.4436 (+8.0%)   0.4809 (+6.4%)   0.4113 (+4.2%)
Sports   SASRec    0.3847  0.5051  0.2732           0.3122           0.2699
         BERT4Rec  0.4136  0.5325  0.3014           0.3561           0.2988
         ASReP     0.4734  0.6011  0.3470           0.3884           0.3370
         BiCAT     0.4842  0.6246  0.3649 (+5.1%)   0.4003 (+3.1%)   0.3562 (+5.7%)
         PAL       0.4936  0.6385  0.3784 (+9.0%)   0.4112 (+5.9%)   0.3712 (+10.1%)
         PAL++     0.4940  0.6398  0.3796 (+9.4%)   0.4174 (+7.5%)   0.3730 (+10.7%)
Tools    SASRec    0.2853  0.3903  0.1987           0.2325           0.2037
         BERT4Rec  0.3613  0.5600  0.3190           0.3574           0.3011
         ASReP     0.4133  0.5347  0.3014           0.3406           0.2976
         BiCAT     0.4287  0.5509  0.3279 (+8.8%)   0.3571 (+4.8%)   0.3100 (+4.2%)
         PAL       0.4327  0.5624  0.3404 (+12.9%)  0.3767 (+10.6%)  0.3231 (+8.6%)
         PAL++     0.4374  0.5681  0.3421 (+13.5%)  0.3805 (+11.7%)  0.3273 (+10.0%)
Baby     SASRec    0.3076  0.4358  0.2094           0.2509           0.2144
         BERT4Rec  0.3295  0.4701  0.2212           0.2758           0.2338
         ASReP     0.3581  0.4885  0.2499           0.2920           0.2508
         BiCAT     0.3682  0.4972  0.2603 (+4.2%)   0.3007 (+3.0%)   0.2587 (+3.1%)
         PAL       0.3759  0.5123  0.2741 (+9.7%)   0.3178 (+8.8%)   0.2704 (+7.8%)
         PAL++     0.3785  0.5123  0.2804 (+12.2%)  0.3200 (+9.6%)   0.2724 (+8.6%)
Office   SASRec    0.4053  0.5098  0.2994           0.3335           0.2947
         BERT4Rec  0.4400  0.5682  0.3149           0.3589           0.3024
         ASReP     0.4689  0.6101  0.3303           0.3764           0.3186
         BiCAT     0.4801  0.6221  0.3462 (+4.8%)   0.3894 (+3.5%)   0.3326 (+4.4%)
         PAL       0.4982  0.6353  0.3572 (+8.1%)   0.3997 (+6.2%)   0.3486 (+9.4%)
         PAL++     0.4994  0.6363  0.3602 (+9.1%)   0.4011 (+6.6%)   0.3497 (+9.8%)

4.3.1. Effectiveness of PAL for Short Sequences

Performance improvement on short behavior sequences is critical for an augmentation paradigm. To further analyze the advantages of PAL for short sequences, we reconstruct the test set with all behavior sequences shorter than 3 and evaluate all the baseline methods as well as PAL and PAL++. The results are presented in Figure 3. PAL and PAL++ significantly outperform the other sequential recommendation augmentation methods on all datasets, which illustrates that the proposed learning method incorporates more contextual correlation information into the short-sequence augmentation.

Figure 3: Performance on Short Sequence Instances.

4.4. Analysis on Backbone Model

The default backbone of the SRA methods (ASReP, BiCAT, PAL, PAL++) in Section 4.3 is the basic Transformer, the same as in SASRec. The SRA methods can also be applied to other Transformer-based SR methods, such as SASRec, BERT4Rec and TiSASRec. For TiSASRec, we ignore the temporal information in the pretraining and sequence generation stages, and assign the smallest timestamp of the original sequence to the generated items. We report part of the performance comparison (NDCG@5) on the "Beauty" dataset in Table 3. According to the results, the performance of all backbone models improves when equipped with SRA methods, and the proposed PAL / PAL++ achieve the best results. Note that TiSASRec is outperformed by SASRec because our current augmentation methods do not incorporate temporal information; this is future work for SRA.

Table 3: NDCG@5 with Different Backbone Models.

Backbone   SASRec   BERT4Rec   TiSASRec
Base       0.2884   0.3075     0.3076
ASReP      0.3465   0.3562     0.3427
BiCAT      0.3704   0.3746     0.3625
PAL        0.3873   0.3886     0.3771
PAL++      0.3879   0.3886     0.3791

4.5. Analysis on Convergence Rate

One interesting finding is that the pseudo sequences generated by PAL significantly improve the convergence rate of the finetuning stage in sequential recommendation augmentation. We depict the loss value during finetuning in Figure 4, where PAL converges to a stable loss value earlier than ASReP. Similar results are found on the other datasets. Due to the permutation learning objective, PAL has an advantage in both the generated data and the pretrained model, which leads to the faster convergence in the final finetuning stage.

Figure 4: Illustration of Convergence Rate in Finetuning on (a) Beauty, (b) Cell Phones, (c) Sports. The x-axis is the epoch, and the y-axis is the loss value. The grey line is ASReP; the blue line is PAL.

References

[1] C. Cai, R. He, J. McAuley, SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation, in: IJCAI, 2017, pp. 1476-1482.
[2] K. Song, M. Ji, S. Park, I.-C. Moon, Hierarchical context enabled recurrent neural network for recommendation, in: AAAI, volume 33, 2019, pp. 4983-4991.
[3] C. Xu, J. Feng, P. Zhao, F. Zhuang, D. Wang, Y. Liu, V. S. Sheng, Long- and short-term self-attention network for sequential recommendation, Neurocomputing 423 (2021) 580-589.
[4] J. Wu, X. Wang, F. Feng, X. He, L. Chen, J. Lian, X. Xie, Self-supervised graph learning for recommendation, in: SIGIR, 2021, pp. 726-735.
[5] M. Wang, P. Ren, L. Mei, Z. Chen, J. Ma, M. de Rijke, A collaborative session-based recommendation approach with parallel memory modules, in: SIGIR, 2019, pp. 345-354.
[6] Z. Liu, Y. Chen, J. Li, P. S. Yu, J. McAuley, C. Xiong, Contrastive self-supervised sequential recommendation with robust augmentation, arXiv preprint arXiv:2108.06479 (2021).
[7] Z. Liu, Z. Fan, Y. Wang, P. S. Yu, Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer, in: SIGIR, 2021, pp. 1608-1612. URL: https://doi.org/10.1145/3404835.3463036. doi:10.1145/3404835.3463036.
[8] J. Jiang, Y. Luo, J. B. Kim, K. Zhang, S. Kim, Sequential recommendation with bidirectional chronological augmentation of transformer, arXiv preprint arXiv:2112.06460 (2021).
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 30 (2017).
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, NeurIPS 32 (2019).
[12] S. Wiseman, A. M. Rush, Sequence-to-sequence learning as beam-search optimization, in: EMNLP, 2016, pp. 1296-1306.
[13] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: ICDM, IEEE, 2018, pp. 197-206.
[14] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: CIKM, 2019, pp. 1441-1450.