<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Global Behavior Contextual Correlation in Sequential Recommendation Augmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qian Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangdong Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zihao Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoxin Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chaosheng Fan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changping Peng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhangang Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinghe Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingping Shao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Marketing &amp; Commercialization Center</institution>
          ,
          <addr-line>JD.com</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tsinghua University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The recently proposed Sequential Recommendation Augmentation (SRA) paradigm has shown valuable potential in sequential recommendation, especially for handling the long-tail problem by extending short behavior sequences. However, self-supervised SRA adopts autoregressive learning with a fixed forward or backward direction, which cannot make full use of the contextual correlation information in the training behavior sequences. Due to this direction difference, a discrepancy problem exists between the two training stages of SRA, i.e., pretraining and finetuning. In order to overcome the restriction of a specific sequential learning direction, we propose to equip SRA with permutation autoregressive learning to extract global contextual correlation information from the behavior sequences in both directions. The adapted SRA method is implemented with two-stream self-attention. Empirical evaluations on multiple sequential recommendation benchmark datasets demonstrate the effectiveness of our proposed model, and the augmented data can significantly improve the convergence rate of finetuning.</p>
      </abstract>
      <kwd-group>
        <kwd>Sequential Recommendation</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Permutation Autoregressive Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this work, we propose to exploit contextual
information in SRA without restrictions on the learning
direction. For the inference stage, we adopt
beam search to generate a more suitable subsequence
as the augmented data. A recent revision named BiCAT
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] comes with a similar motivation and is implemented via
an additional loss regularization, but it is not designed
for extracting contextual correlation information around
the predicted position, and we compare against it empirically.
      </p>
      <p>Our contributions can be summarized as follows:
(a) Global contextual correlation information is explored
in Sequential Recommendation Augmentation (SRA). (b)
Equipped with permutation autoregressive learning and
a beam search method, an adapted SRA framework is
designed and evaluated. (c) The proposed framework
outperforms the state-of-the-art methods for sequential
recommendation augmentation without extra information
or heuristic rules.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Sequential Recommendation</title>
    </sec>
    <sec id="sec-3">
      <title>Augmentation</title>
      <p>The sequential recommendation task can be regarded as
next-item prediction given the historical behavior
sequence. We denote the user set as U and the item set
as V. The interaction behavior of a given user u ∈ U
is denoted as S^u = {v_1, v_2, ..., v_t}. The sequential
recommendation task can be formulated as:</p>
      <p>v_{t+1} = \arg\max_{v \in V} P(v \mid S^u) \quad (1)</p>
      <p>which means finding the next item v_{t+1} with the largest
probability given the user behavior sequence S^u.</p>
      <p>
        Recently, a Sequential Recommendation Augmentation (SRA) paradigm was proposed [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the basic SRA method is also known as ASReP. As illustrated in Fig 1,
ASReP utilizes reverse pretraining for data augmentation. We take the Transformer
as an example backbone for describing this learning paradigm; its key component,
multi-head self-attention, is constructed with linear transformations and the
scaled dot-product attention [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. There are two stages in the training procedure, both updating the same set
of model parameters θ, namely reverse pretraining and left-to-right finetuning.
The reverse pretraining intends to learn the inverse sequence generation via the
autoregressive learning objective:
      </p>
      <p>\max_{\theta} \; P_{\theta}(v_i \mid v_{i+1}, v_{i+2}, \ldots, v_t) \quad (2)</p>
      <p>With the pretrained model, pseudo-prior items can be
recursively generated for short sequences, in order to
alleviate the data sparsity in recommendation and
further improve the quality of the whole training set. The
pretrained model can be further finetuned for next-item
prediction, and the learning objective is the forward
autoregressive learning objective:</p>
      <p>\max_{\theta} \; P_{\theta}(v_i \mid v_1, v_2, \ldots, v_{i-1}) \quad (3)</p>
      <p>
        For details of the SRA paradigm, please refer to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
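      <p>As a minimal illustration of Eqns 2 and 3 and of the pseudo-prior generation step, the Python sketch below builds reverse-pretraining and forward-finetuning training pairs from one behavior sequence and shows a greedy recursive prepending of pseudo-prior items. The scoring function p_prev is a hypothetical stand-in for the reversely pretrained model, not the authors' code.</p>
      <preformat>
# Sketch only: `p_prev(item, reversed_context)` is a hypothetical function returning
# P_theta(item | context) under the reversely pretrained model.

def reverse_pretraining_pairs(seq):
    """Training pairs for Eqn 2: predict v_i from the items that come after it."""
    return [(list(reversed(seq[i + 1:])), seq[i]) for i in range(len(seq) - 1)]

def forward_finetuning_pairs(seq):
    """Training pairs for Eqn 3: predict v_i from the items that come before it."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

def augment_with_pseudo_priors(seq, item_set, p_prev, num_items):
    """Greedy version of the augmentation: recursively prepend pseudo-prior items."""
    augmented = list(seq)
    for _ in range(num_items):
        reversed_context = augmented[::-1]         # the reverse model reads the sequence backwards
        best = max(item_set, key=lambda v: p_prev(v, reversed_context))
        augmented.insert(0, best)                  # prepend the generated pseudo-prior item
    return augmented
      </preformat>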
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <sec id="sec-4-1">
        <title>3.1. Permutation Autoregressive Learning</title>
        <p>
          Our intention now is to help the SRA framework make
use of the global contextual correlation in the behavior
sequences. The idea of "mask and reconstruct" is a
commonly used method for helping a sequence model
learn from contextual information at arbitrary positions,
but the incorporation of a [MASK] token in the behavior
sequence would bring a more severe discrepancy problem,
as in BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], especially considering that the trained
model will be used for recursively generating behavior
sequences. Our solution to exploiting context information is
adopting the permutation modeling objective [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which
gathers information from the bidirectional context while
remaining within the autoregressive learning paradigm.
        </p>
        <sec id="sec-4-1-1">
          <title>3.1.1. Permutations with Original Position</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Encoding</title>
          <p>In order to exploit bidirectional item correlation in
bidirectional context, we propose to train the sequence model
with diferent permutations of each training sequence in
(1)
maining the autoregressive learning paradigm.
where &lt; stand for the items in x whose index is in
&lt;. This new objective calculates the probability of an
item conditioned on all possible permutations of items
in an autoregressive way, as opposed to just those to the
(3) left side or right side of the target item in the existing
is  ! = 24. For each permutation z in  , &lt; stands
for indices of all the element before -th element , and
 is for the current elements. Similar to autoregressive
learning, we try to predict  given &lt;. In this way, the
learning objective is rewritten according to the
permutation z instead of the original order of original sequence
x as follows.</p>
          <p>max Ez∼  log  ( |&lt; )

(4)
Algorithm 1 Permutation Autoregressive Learning for
SRA
1: Input: A set of behavior sequence {}
2: Output: A sequential recommendation model
3: procedure PAL({})
4: for each epoch do
5: for each instance in batch do
6: Sample  permutations with length .
7: Pretrain with Eqn 4.
8: Save the result pretrained model ℳ0 for generation.
9: Select behavior sequences shorter than .
10: Generate the pseudo-prior items with beam width .
11: for each epoch do
12: for each batch do
13: Finetune ℳ0 with Eqn 3.
14: Save the result finetuned model for sequential
recommendation.
dataset
#user
#item
#instance
avg. length</p>
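          <p>To make Eqn 4 and the original-position-encoding point concrete, the following sketch (our illustration with hypothetical data structures, not the paper's code) samples several factorization orders for one sequence; only the prediction order changes, while every item keeps its original position index.</p>
          <preformat>
import random

def sample_permutation_targets(seq, num_perms, seed=0):
    """For each sampled factorization order z, yield one training example per step t:
    (visible items with their original positions, target position, target item)."""
    rng = random.Random(seed)
    T = len(seq)
    examples = []
    for _ in range(num_perms):
        z = list(range(T))
        rng.shuffle(z)                       # a permutation of the index sequence [0, ..., T-1]
        for t in range(T):
            target_pos = z[t]                # index of the item to predict at step t
            visible = sorted(z[:t])          # indices factorized before step t; positions are kept
            context = [(pos, seq[pos]) for pos in visible]
            examples.append((context, target_pos, seq[target_pos]))
    return examples

# For a sequence of length T = 4 there are 4! = 24 possible orders;
# sampling num_perms of them approximates the expectation in Eqn 4.
          </preformat>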
          <p>The above learning objective can help each position learn information from the
bidirectional context, but it brings a new issue about position information: when
predicting given several known items, the predicted position or index is not fixed as
in the original autoregressive learning. So we need to learn a target-aware
representation which can tell the position where the currently predicted item is
located. Therefore, P_{\theta}(x_{z_t} \mid x_{z_{&lt;t}}) is formulated as:</p>
          <p>P_{\theta}(x_{z_t} \mid x_{z_{&lt;t}}) = \frac{\exp[e(x_{z_t})^{\top} \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t)]}{\sum_{x^{*}} \exp[e(x^{*})^{\top} \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t)]} \quad (5)</p>
          <p>where \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t) is the learned target-aware
representation for the item at the z_t-th index.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Two-stream Attention for Contextual Representation</title>
          <p>As aforementioned, the self-attention module in the sequence model needs to be
modified to obtain target-aware representations. The two-stream attention structure
was used in language modeling to provide target position information without leaking
the content information of the target.</p>
          <p>Specifically, two separate streams of attention vectors are maintained to store
content information and position information. For each position z_t in the
factorization z, we keep updating the intermediate vectors h_{z_t} and h̃_{z_t},
representing the content stream and the target stream respectively. Each stream is
learned by a designated attention mechanism. The detailed formulations of the
two-stream attention structure are described as follows.</p>
          <p>The first one is the content-stream representation, which is exactly the same
as the hidden state in standard self-attention. The corresponding attention is named
content-stream attention:</p>
          <p>h_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h_{z_t}^{(m-1)}, \; KV = h_{z_{\le t}}^{(m-1)}; \; \theta) \quad (6)</p>
          <p>where h^{(m)} is the output of the m-th Transformer block. The second
representation is the target representation, which contains only the position
information of the target in order to avoid content information leaking. The
target-stream attention is:</p>
          <p>h̃_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h̃_{z_t}^{(m-1)}, \; KV = h_{z_{&lt;t}}^{(m-1)}; \; \theta) \quad (7)</p>
          <p>The content representation h^{(0)} is initialized by the item embedding e(x_i)
added with the positional encoding as in the normal Transformer, and all the target
representations h̃^{(0)} are initialized by an identical trainable vector w. The output
h̃ of the last Transformer layer is used as \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t) in
Eqn 5 for prediction. In this way, equipped with this two-stream attention, we can
force each position in the sequence to learn bidirectional information while
maintaining the original behavior order.</p>
          <p>[Figure: (a) Content-stream Attention; (b) Target-stream Attention.]</p>
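          <p>To make the two streams concrete, the sketch below is an illustrative NumPy construction (our own, following the XLNet-style formulation rather than the authors' implementation) of the two boolean attention masks induced by a factorization order z: the content stream at a position may attend to every position predicted no later than itself (Eqn 6), while the target stream may attend only to strictly earlier positions (Eqn 7), so the content of the target never leaks into its own query.</p>
          <preformat>
import numpy as np

def two_stream_masks(z):
    """Given a factorization order z (a permutation of range(T)), return two boolean
    masks of shape (T, T); mask[i, j] == True means position i may attend to position j."""
    T = len(z)
    rank = np.empty(T, dtype=int)
    rank[np.asarray(z)] = np.arange(T)       # rank[pos] = step at which `pos` is predicted
    # Content stream (Eqn 6): a position attends to every position predicted no later than itself.
    content_mask = rank[None, :] &lt;= rank[:, None]
    # Target stream (Eqn 7): a position attends only to positions predicted strictly earlier,
    # so the target's own content never reaches its query (which is initialized with w).
    target_mask = rank[None, :] &lt; rank[:, None]
    return content_mask, target_mask
          </preformat>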
        </sec>
        <sec id="sec-4-1-3">
          <title>3.1.2. Two-stream Attention for Contextual</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Representation</title>
          <p>
            We propose beam-search for obtain the optimal sequence
as the pseudo-prior items. Instead of recursively predict
As aforementioned, the self-attention module in the se- the next item in a greedy way, Beam Search method
quence model need to be modified for obtaining target- maintains a bufer of candidate subsequences and selects
aware representations. The two-stream attention struc- the best one with the largest joint probability. Beam
ture was used in language modeling to provide target width value is denoted as . More details can be found
position information without leaking the content infor- in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
mation of the target.
          </p>
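          <p>A minimal beam search sketch for this step is given below; log_prob_prev is a hypothetical interface returning the reversely pretrained model's log-probability of an item preceding the given (reversed) context, and is not part of the paper.</p>
          <preformat>
import heapq

def beam_search_pseudo_priors(seq, item_set, log_prob_prev, num_items, beam_width):
    """Prepend `num_items` pseudo-prior items to `seq`, keeping the `beam_width`
    candidate sequences with the largest joint log-probability at every step."""
    beams = [(0.0, list(seq))]                    # (joint log-prob, augmented sequence)
    for _ in range(num_items):
        candidates = []
        for score, aug in beams:
            reversed_context = aug[::-1]          # the reverse model reads the sequence backwards
            # A real implementation would only expand the model's top-scoring items
            # instead of scoring the whole item set.
            for item in item_set:
                candidates.append((score + log_prob_prev(item, reversed_context), [item] + aug))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]      # best augmented sequence under joint probability
          </preformat>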
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. PAL Algorithm for SRA</title>
        <p>Considering the computational complexity of permutation autoregressive learning,
we need to sample from the set of permutations and predict part of the sequence. For
each sampled permutation, we train the model via maximizing the probability of the
last item. The detailed learning procedure is described in Algorithm 1.</p>
        <p>Algorithm 1: Permutation Autoregressive Learning for SRA
1: Input: a set of behavior sequences {S^u}
2: Output: a sequential recommendation model
3: procedure PAL({S^u})
4:   for each epoch do
5:     for each instance in the batch do
6:       Sample K permutations with length T.
7:       Pretrain with Eqn 4.
8:   Save the resulting pretrained model M_0 for generation.
9:   Select behavior sequences shorter than L.
10:  Generate the pseudo-prior items with beam width b.
11:  for each epoch do
12:    for each batch do
13:      Finetune M_0 with Eqn 3.
14:  Save the resulting finetuned model for sequential recommendation.</p>
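        <p>Putting Algorithm 1 together, a compact driver consistent with it might look like the sketch below; it reuses the hypothetical helpers from the earlier sketches (sample_permutation_targets, beam_search_pseudo_priors, forward_finetuning_pairs), and model.pretrain_step, model.finetune_step, and model.log_prob_prev are stand-ins for whatever Transformer implementation is used.</p>
        <preformat>
def pal_training(sequences, model, item_set, num_perms, length_threshold,
                 num_pseudo_items, beam_width, pretrain_epochs, finetune_epochs):
    """PAL for SRA (Algorithm 1): permutation-autoregressive pretraining,
    beam-search augmentation of short sequences, then left-to-right finetuning."""
    # Stage 1: pretraining with sampled permutations (Eqn 4).
    for _ in range(pretrain_epochs):
        for seq in sequences:
            for context, target_pos, target_item in sample_permutation_targets(seq, num_perms):
                # Following Sec. 3.2, only the last step of each sampled permutation is used,
                # i.e., one item is predicted from all the remaining items.
                if len(context) == len(seq) - 1:
                    model.pretrain_step(context, target_pos, target_item)
    # Stage 2: extend each short sequence with pseudo-prior items via beam search.
    augmented = [
        beam_search_pseudo_priors(seq, item_set, model.log_prob_prev,
                                  num_pseudo_items, beam_width)
        if len(seq) &lt; length_threshold else list(seq)
        for seq in sequences
    ]
    # Stage 3: finetuning on the augmented data with the forward objective (Eqn 3).
    for _ in range(finetune_epochs):
        for seq in augmented:
            for prefix, target in forward_finetuning_pairs(seq):
                model.finetune_step(prefix, target)
    return model
        </preformat>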
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Datasets and Baseline Models</title>
        <p>
          Following the settings in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], six datasets (Beauty, Cell Phones, Sports, Tools, Baby, and Office)
collected from the Amazon review data
(http://jmcauley.ucsd.edu/data/amazon/) are adopted. For the behavior sequence
construction, we regard the presence of a review as an interaction between a user and
an item, and construct the behavior sequence according to the timestamp. Following
the preprocessing in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we use the last item in each sequence for testing. The statistics of the
datasets (#user, #item, #instance, and average sequence length) are shown in Table 1.
        </p>
        <p>
          We compare our proposed method with the following
methods, including the state-of-the-art BiCAT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] method
in sequential recommendation augmentation. SASRec
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] utilizes the Transformer to extract the correlation
from the training sequences and predict the next item.
BERT4Rec [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] exploits the training method of BERT
to learn a Transformer for SR. ASReP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] reversely
pretrains the Transformer to generate pseudo-prior items for
short sequences and then finetunes the Transformer for
SR. BiCAT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is the latest model for sequential
recommendation augmentation; it incorporates an additional
objective in pretraining. PAL is our proposed learning
method. PAL++ equips PAL with beam search.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Implementation Details</title>
        <p>We select the Transformer as the backbone to verify our
SRA solution. The block number is fixed to 2. The hidden
size is selected from {32, 64, 128}. The head number in
attention is selected from {2, 4}. The learning rate is fixed
at 0.001 since the results are similar with other settings.
The dropout rate is fixed to 0.5. The short sequence
length threshold L is set to 18, and each short sequence
is augmented with 15 pseudo-prior items. The number
of sampled permutations (K in Algorithm 1) is selected
from {2, 4, 6} with model selection. The epoch number
is fixed to 200, which is sufficient for all the models to
converge. We conduct model selection via grid search.
For each behavior sequence, we randomly sample 100
negative items for ranking with the last item, which is the
ground truth. Recall@n, NDCG@n, and Mean Reciprocal
Rank (MRR) are employed as the evaluation metrics,
and n is selected from {5, 10}.</p>
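        <p>For clarity, the sketch below illustrates the sampled-candidate evaluation protocol described above (our own illustration, not the authors' evaluation script): the ground-truth last item is ranked against 100 sampled negatives, and Recall@n, NDCG@n, and MRR are computed from that rank.</p>
        <preformat>
import math

def sampled_metrics(rank, ns=(5, 10)):
    """Metrics for one test sequence, given the 1-based rank of the ground-truth item
    among itself plus 100 sampled negative items (101 candidates in total)."""
    metrics = {"MRR": 1.0 / rank}
    for n in ns:
        hit = rank &lt;= n
        metrics[f"Recall@{n}"] = 1.0 if hit else 0.0
        metrics[f"NDCG@{n}"] = 1.0 / math.log2(rank + 1) if hit else 0.0
    return metrics

# Example: if the ground-truth item is ranked 3rd among the 101 candidates,
# Recall@5 = 1, NDCG@5 = 1 / log2(4) = 0.5, and MRR = 1/3.
print(sampled_metrics(3))
        </preformat>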
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Performance of PAL</title>
        <p>We perform sequential recommendation on all
6 datasets with the above-mentioned baseline models
to demonstrate the effectiveness of PAL. Previous SRA
work has shown the advantage of the self-attention sequence
model for recommendation, so we use SASRec and
BERT4Rec as the two baselines without augmentation.</p>
        <p>Table 2 (excerpt): results on the Beauty dataset; relative improvements are over ASReP.
Model      Recall@5   Recall@10   NDCG@5            NDCG@10           MRR
SASRec     0.3849     0.4863      0.2884            0.3212            0.2870
BERT4Rec   0.4243     0.5371      0.3075            0.3598            0.3021
ASReP      0.4583     0.5743      0.3465            0.4042            0.3540
BiCAT      0.4901     0.5892      0.3704 (+6.8%)    0.4289 (+6.1%)    0.3712 (+4.8%)
PAL        0.4934     0.6048      0.3873 (+11.7%)   0.4400 (+8.8%)    0.3803 (+7.4%)
PAL++      0.4936     0.6036      0.3879 (+11.9%)   0.4415 (+9.2%)    0.3821 (+7.9%)</p>
        <p>The performance results are presented in Table 2. All the
SRA methods achieve better performance than the
others, which verifies the effectiveness of augmentation.
Compared with the strongest sequential
recommendation augmentation baseline BiCAT, the proposed PAL
can provide around 2% to 5% improvement on NDCG@10,
which is significant on these sparse datasets. The beam
search method (PAL++) consistently shows effectiveness
on all the datasets, and the further performance
improvements on "Tools and Home Improvement" and "Baby"
are more significant than on other datasets. The
explanation for this improvement difference is that the behavior
diversity varies across datasets.</p>
        <sec id="sec-5-3-1">
          <title>4.3.1. Effectiveness of PAL for Short Sequences</title>
          <p>Performance improvement on short behavior sequences is
critical for an augmentation paradigm. To further analyze
the advantages of PAL for short sequences, we reconstruct
the test set with all the behavior sequences shorter than 3
and evaluate all the baseline methods as well as our PAL and
PAL++. The results are presented in Figure 3.
We can find that the PAL and PAL++ methods
significantly outperform the other sequential
recommendation augmentation methods on all the datasets.
This result illustrates that the proposed learning method
can incorporate more contextual correlation information
into the short sequence augmentation.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Analysis on Backbone Model</title>
        <p>The default backbone model of the SRA methods (i.e.,
ASReP, BiCAT, PAL, and PAL++) in Section 4.3 is the basic
Transformer, the same as in SASRec. All the SRA
methods can also be applied to other Transformer-based
SR methods, such as SASRec, BERT4Rec and TiSASRec.
For TiSASRec, we ignore the temporal information in the
pretraining and sequence generating stages, and assign
the smallest timestamp in the original sequence to the
generated items. Here we report part of the performance
comparison (NDCG@5) on the "Beauty" dataset in Table 3.
According to the results, equipped with SRA methods,
the performance of all the backbone models is improved,
and the proposed PAL / PAL++ achieve the best results.
Please note that TiSASRec is outperformed by SASRec because
our current augmentation methods have not incorporated
the temporal information, which is a future direction for
SRA.</p>
        <p>Table 3: NDCG@5 on the Beauty dataset with different backbone models.
Backbone    Base     ASReP    BiCAT    PAL      PAL++
SASRec      0.2884   0.3465   0.3704   0.3873   0.3879
BERT4Rec    0.3075   0.3562   0.3746   0.3886   0.3886
TiSASRec    0.3076   0.3427   0.3625   0.3771   0.3791</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. Analysis on Convergence Rate</title>
        <p>One interesting finding is that the pseudo sequences
generated by PAL can significantly improve the convergence
rate of the finetuning stage in sequential
recommendation augmentation. We depict the loss value during the
finetuning stage in Fig 4, where we can observe that
the PAL method converges earlier than ASReP to
achieve a stable loss value. Similar results can be found
on the other datasets. Due to the permutation learning
objective, PAL has an advantage in the generated data and
the pretrained model, which leads to the improvement of the
convergence rate in the final finetuning stage.</p>
        <p>Figure 4: Illustration of convergence rate in finetuning on (a) Beauty, (b) Cell Phones, and (c) Sports.
The x-axis is the epoch, and the y-axis is the loss value. The grey
line is for the ASReP method, and the blue one is for PAL.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. McAuley</surname>
          </string-name>
          ,
          <article-title>Spmc: socially-aware personalized markov chains for sparse sequential recommendation</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1476</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          , I.-C. Moon,
          <article-title>Hierarchical context enabled recurrent neural network for recommendation</article-title>
          ,
          <source>in: AAAI</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>4983</fpage>
          -
          <lpage>4991</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. S.</surname>
          </string-name>
          <article-title>Sheng, Long-and short-term self-attention network for sequential recommendation</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>423</volume>
          (
          <year>2021</year>
          )
          <fpage>580</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Self-supervised graph learning for recommendation</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>726</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , J. Ma, M. de Rijke,
          <article-title>A collaborative session-based recommendation approach with parallel memory modules</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Contrastive self-supervised sequential recommendation with robust augmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2108.06479</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer</article-title>
          , in: SIGIR '21, Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>1608</fpage>
          -
          <lpage>1612</lpage>
          . doi: 10.1145/3404835.3463036.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Kim,
          <article-title>Sequential recommendation with bidirectional chronological augmentation of transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2112.06460</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>NeurIPS</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: NAACL-HLT (1)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          ,
          <source>NeurIPS</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiseman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Sequence-to-sequence learning as beam-search optimization</article-title>
          , in: EMNLP,
          <year>2016</year>
          , pp.
          <fpage>1296</fpage>
          -
          <lpage>1306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Self-attentive sequential recommendation</article-title>
          , in: ICDM, IEEE,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ou</surname>
          </string-name>
          , P. Jiang,
          <article-title>Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1450</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>