Exploiting Global Behavior Contextual Correlation in Sequential Recommendation Augmentation

Qian Yu1, Xiangdong Wu1, Chen Yang1, Zihao Zhao1, Haoxin Liu2, Chaosheng Fan1, Changping Peng1, Zhangang Lin1, Jinghe Hu1 and Jingping Shao1
1 Marketing & Commercialization Center, JD.com
2 Tsinghua University

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
Contact: yuqian81@jd.com (Q. Yu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The recently proposed Sequential Recommendation Augmentation (SRA) paradigm has shown valuable potential in sequential recommendation, especially for handling the long-tail problem by extending short behavior sequences. However, self-supervised SRA adopts autoregressive learning with a fixed forward or backward direction, which cannot make full use of the contextual correlation information in the training behavior sequences. Due to this difference in direction, a discrepancy problem exists between the two training stages of SRA, i.e., pretraining and finetuning. To overcome the restriction of a fixed sequential learning direction, we propose to equip SRA with permutation autoregressive learning, which extracts global contextual correlation information from the behavior sequences in both directions. The adapted SRA method is implemented with two-stream self-attention. Empirical evaluations on multiple sequential recommendation benchmark datasets demonstrate the effectiveness of our proposed model, and the augmented data significantly accelerates convergence.

Keywords: Sequential Recommendation, Data Augmentation, Permutation Autoregressive Learning

1. Introduction

Sequential recommendation aims to find behavior patterns or item transitions in user behavior sequences. Various architectures have been developed, including Markov chains [1], RNNs [2], attention-based sequence models [3], and graph models [4].

Data sparsity severely degrades the performance of sequential recommendation. Data augmentation is a straightforward solution for handling short behavior sequences in sequential recommendation [5]. There are mainly two kinds of data augmentation methods, namely heuristic augmentation [6] and generative augmentation [7]. Recently, Sequential Recommendation Augmentation (SRA) was proposed as an augmentation paradigm for sequential recommendation [7]. Consisting of two training stages, namely pretraining and finetuning, SRA is a verified effective solution for handling short sequences in sequential recommendation, i.e., the long-tail problem.

However, the current learning procedure cannot completely extract the contextual correlation in the given sequential training instances. In either learning stage, an item is predicted given only the subsequence located on a single side of it. Specifically, SRA pretrains the sequence model with reversed training sequences in order to generate pseudo-prior items, while the finetuning stage adopts the normal forward autoregressive objective. Therefore, the trained model is never aware of the bidirectional context behaviors around the current position; in other words, the learning of the framework is insufficient. Besides, the two stages update the same set of parameters but with different learning directions. Similar discrepancy problems are commonly seen in this kind of pretrain-finetune method, and they remain a constraint on further performance improvement.

To address the abovementioned problems, we propose to exploit global contextual correlation information with Permutation Autoregressive Learning (PAL) for SRA. Specifically, we unify the learning objectives with a permutation language model objective and implement it for sequential recommendation with a two-stream self-attention mechanism. PAL lets the model learn from different permutations of the input in order to exploit global contextual information without restrictions on the learning direction. For the inference stage, we adopt beam search to generate more suitable subsequences as the augmented data. A recent method named BiCAT [8] comes with a similar motivation, implemented via an additional loss regularization, but it is not designed for extracting contextual correlation information around the predicted position; we compare against it empirically.

Our contributions can be summarized as follows: (a) Global contextual correlation information is explored in Sequential Recommendation Augmentation (SRA). (b) Equipped with Permutation Autoregressive Learning and a beam search method, an adapted SRA framework is designed and evaluated. (c) The proposed framework outperforms the state-of-the-art methods for sequential recommendation augmentation without extra information or heuristic rules.

Figure 1: Stages in SRA. (a) Pretraining with reversed sequence. (b) Pseudo-prior item augmentation. (c) Finetuning.
2. Sequential Recommendation Augmentation

The sequential recommendation task can be regarded as next-item prediction given the historical behavior sequence. We denote the user set as 𝒰 and the item set as 𝒳. The interaction behavior of a given user u ∈ 𝒰 is denoted as S^u = {x^u_1, x^u_2, ..., x^u_n}. The sequential recommendation task can be formulated as:

    x^u_{n+1} = \arg\max_x \; p(x \mid S^u)    (1)

which means finding the next item x^u_{n+1} with the largest probability given the user behavior sequence S^u.

Recently, a Sequential Recommendation Augmentation (SRA) paradigm was proposed [7], and the basic SRA method is known as ASReP. As illustrated in Fig. 1, ASReP utilizes reverse pretraining for data augmentation. We take the Transformer as an example backbone for describing this learning paradigm. The key component of the Transformer, i.e., multi-head self-attention, is constructed from linear transformations and scaled dot-product attention [9]. There are two stages in the training procedure updating the same set of model parameters θ, namely reverse pretraining and left-to-right finetuning. The reverse pretraining intends to learn inverse sequence generation via the autoregressive learning objective:

    \max_\theta \; p_\theta(x^u_i \mid x^u_{i+1}, x^u_{i+2}, ..., x^u_n)    (2)

With the pretrained model, pseudo-prior items can be recursively generated for short sequences, in order to alleviate the data sparsity in recommendation and further improve the quality of the whole training set. The pretrained model can then be finetuned for next-item prediction with the forward autoregressive learning objective:

    \max_\theta \; p_\theta(x^u_i \mid x^u_1, x^u_2, ..., x^u_{i-1})    (3)

For details of the SRA paradigm, please refer to [7].
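To make the two-stage paradigm concrete, the following is a minimal PyTorch sketch of reverse pretraining (Eqn 2) and greedy pseudo-prior generation. It is our illustration rather than the ASReP authors' code; the `model` interface (item-id tensors in, next-item logits of shape [batch, length, |𝒳|] out) and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_pretrain_step(model, batch, optimizer):
    """One pretraining step on reversed sequences (Eqn 2)."""
    rev = torch.flip(batch, dims=[1])            # [x_n, ..., x_2, x_1]
    inputs, targets = rev[:, :-1], rev[:, 1:]    # shifted next-item targets
    logits = model(inputs)                       # assumed: [B, T-1, |items|]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def augment_short_sequence(model, seq, num_items=15):
    """Greedily generate pseudo-prior items for one short sequence.

    The model reads the sequence in reverse, so each prediction is an item
    *preceding* the currently earliest one; the generated run is reversed
    back and prepended to the original sequence.
    """
    rev = list(reversed(seq))
    for _ in range(num_items):
        logits = model(torch.tensor([rev]))[0, -1]
        rev.append(int(logits.argmax()))
    return list(reversed(rev[len(seq):])) + list(seq)
```

PAL (Section 3) keeps this overall loop but replaces the pretraining objective with the permutation objective of Eqn 4, and PAL++ replaces the greedy generation with beam search.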
3. Methodology

3.1. Permutation Autoregressive Learning

Our goal is to help the SRA framework make use of the global contextual correlation in the behavior sequences. The idea of "mask and reconstruct" is a common way to help a sequence model learn from contextual information at arbitrary positions, but incorporating a [MASK] token into behavior sequences would introduce an even more severe discrepancy problem, as in BERT [10], especially considering that the trained model is used to recursively generate behavior sequences. Our solution for exploiting context information is to adopt the permutation modeling objective [11], which gathers information from the bidirectional context while retaining the autoregressive learning paradigm.

3.1.1. Permutations with Original Position Encoding

In order to exploit item correlation in the bidirectional context, we propose to train the sequence model with different permutations of each training sequence in an autoregressive way.

Assume the length of the behavior sequence is T, and denote the set of all possible index permutations as Z^T. For example, if T = 4, the original permutation is [1, 2, 3, 4], and the number of permutations in Z^T is T! = 24. For each permutation z in Z^T, z_{<t} stands for the indices of all elements before the t-th element, and z_t is the index of the current element. As in autoregressive learning, we try to predict the item at z_t given those at z_{<t}. In this way, the learning objective is rewritten according to the permutation z instead of the original order of the sequence x:

    \max_\theta \; \mathbb{E}_{z \sim Z^T} \log p_\theta(x_{z_t} \mid x_{z_{<t}})    (4)

where x_{z_{<t}} stands for the items of x whose indices are in z_{<t}. This objective conditions the probability of an item on all possible permutations of the other items in an autoregressive way, as opposed to only those on the left or right side of the target item as in existing methods for sequential recommendation augmentation. It should be emphasized that only the factorization order, namely which elements are used for prediction, is changed, while the position of each item in the original sequence is retained.

The above learning objective helps each position learn information from the bidirectional context, but it raises a new issue about position information. When predicting given several known items, the predicted position or index is not fixed as in original autoregressive learning. We therefore need to learn a target-aware representation that encodes the position at which the currently predicted item is located. Accordingly, p_θ(x_{z_t} | x_{z_{<t}}) is formulated as:

    p_\theta(x_{z_t} \mid x_{z_{<t}}) = \frac{\exp[e(x_{z_t})^\top \tilde{h}_\theta(x_{z_{<t}}, z_t)]}{\sum_{x^*} \exp[e(x^*)^\top \tilde{h}_\theta(x_{z_{<t}}, z_t)]}    (5)

where \tilde{h}_\theta(x_{z_{<t}}, z_t) is the learned target-aware representation for the item at index z_t.
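The objective of Eqn 4 is typically realized by sampling factorization orders and masking attention accordingly. The sketch below, under the same assumed PyTorch setting as before, shows one way to turn a sampled permutation into a visibility mask; `sample_permutation_mask` and its arguments are illustrative names, not the paper's API.

```python
import torch

def sample_permutation_mask(T: int, num_targets: int = 1):
    """Sample a factorization order z and its z_{<t} visibility mask.

    mask[i, j] = True iff original position i may attend to position j,
    i.e. j precedes i in z. Items keep their original positional encodings;
    only the prediction order changes.
    """
    z = torch.randperm(T)                         # random factorization order
    rank = torch.empty(T, dtype=torch.long)
    rank[z] = torch.arange(T)                     # rank[i] = place of index i in z
    mask = rank.unsqueeze(1) > rank.unsqueeze(0)  # i sees j iff rank[j] < rank[i]
    targets = z[-num_targets:]                    # predict only the last items of z
    return z, mask, targets
```

For the content stream introduced next (Section 3.1.2), the diagonal of this mask is additionally unmasked so each position may see its own content (z_{≤t}); following Algorithm 1, only the last `num_targets` indices of z are used as prediction targets for efficiency.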
3.1.2. Two-stream Attention for Contextual Representation

As mentioned above, the self-attention module in the sequence model needs to be modified to obtain target-aware representations. The two-stream attention structure was used in language modeling to provide target position information without leaking the content information of the target.

Specifically, two separate streams of attention vectors are maintained to store content information and position information. For each position z_t in the factorization order z, we keep updating the intermediate vectors h_{z_t} and \tilde{h}_{z_t}, representing the content stream and the target stream respectively. Each stream is learned by a designated attention mechanism. The detailed formulations of the two-stream attention structure are as follows.

The first is the content-stream representation, which is exactly the same as the hidden state in standard self-attention. The corresponding attention is named content-stream attention:

    h^{(m)}_{z_t} \leftarrow \mathrm{Attention}(Q = h^{(m-1)}_{z_t},\; KV = h^{(m-1)}_{z_{\le t}};\; \theta)    (6)

where h^{(m)}_{z_t} is the output of the m-th Transformer block. The second is the target representation, which contains only the position information of the target in order to avoid leaking content information. The target-stream attention is:

    \tilde{h}^{(m)}_{z_t} \leftarrow \mathrm{Attention}(Q = \tilde{h}^{(m-1)}_{z_t},\; KV = h^{(m-1)}_{z_{<t}};\; \theta)    (7)

The content representation h^{(0)}_{z_t} is initialized with the item embedding e(x_{z_t}) plus the positional encoding, as in the normal Transformer, while every target representation \tilde{h}^{(0)}_{z_t} is initialized with a shared trainable vector w. The output \tilde{h}_{z_t} of the last Transformer layer is used as \tilde{h}_\theta(x_{z_{<t}}, z_t) in Eqn 5 for prediction. In this way, equipped with two-stream attention, we can force each position in the sequence to learn bidirectional information while maintaining the normal behavior order.

Figure 2: Different Forms of Attention. (a) Content-stream attention. (b) Target-stream attention.
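A compact, single-head sketch of one two-stream block may help; it omits the projection matrices, multiple heads, residual connections, and feed-forward sublayers of a real Transformer block and only illustrates how the two masks of Eqns 6 and 7 differ. All names here are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, kv, mask):
    """Single-head scaled dot-product attention; mask[i, j]=True iff i sees j."""
    scores = q @ kv.transpose(-2, -1) / kv.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, -1e9)  # finite fill keeps empty rows stable
    return F.softmax(scores, dim=-1) @ kv

def two_stream_block(h, g, rank):
    """One block of Eqns 6-7. h: content stream [T, d]; g: target stream [T, d];
    rank[i] = place of original position i in the factorization order z."""
    earlier = rank.unsqueeze(1) > rank.unsqueeze(0)                   # z_{<t}
    see_self = torch.eye(h.size(0), dtype=torch.bool)
    h_new = masked_attention(h, h, earlier | see_self)  # Eqn 6: z_{<=t}, sees itself
    g_new = masked_attention(g, h, earlier)             # Eqn 7: never sees own content
    return h_new, g_new
```

In use, h would be initialized from item embeddings plus positional encodings and g from the shared trainable vector w broadcast over all positions, matching the initialization described above.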
3.1.3. Optimization of Sequence Augmentation

We adopt beam search to obtain better subsequences as the pseudo-prior items. Instead of recursively predicting the next item in a greedy way, beam search maintains a buffer of candidate subsequences and selects the best one with the largest joint probability. The beam width is denoted as k. More details can be found in [12].

3.2. PAL Algorithm for SRA

Considering the computational complexity of permutation autoregressive learning, we need to sample from the set of permutations and predict only part of the sequence. For each sampled permutation, we train the model by maximizing the probability of its last item. The detailed learning procedure is described in Algorithm 1.

Algorithm 1: Permutation Autoregressive Learning for SRA
 1: Input: A set of behavior sequences {S^u}
 2: Output: A sequential recommendation model
 3: procedure PAL({S^u})
 4:   for each epoch do
 5:     for each instance in batch do
 6:       Sample n permutations with length L.
 7:       Pretrain with Eqn 4.
 8:   Save the resulting pretrained model M_0 for generation.
 9:   Select behavior sequences shorter than m.
10:   Generate the pseudo-prior items with beam width k.
11:   for each epoch do
12:     for each batch do
13:       Finetune M_0 with Eqn 3.
14:   Save the resulting finetuned model for sequential recommendation.
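For the generation step in line 10, the greedy loop from the earlier sketch can be swapped for beam search. The following is a minimal sketch under the same assumed `model` interface; it keeps k candidate reversed prefixes and returns the pseudo-prior items with the largest joint log-probability.

```python
import torch

@torch.no_grad()
def beam_search_priors(model, seq, num_items=15, k=5):
    """Beam-search pseudo-prior generation (Section 3.1.3), beam width k."""
    # Each beam is (reversed prefix, cumulative log-probability).
    beams = [(list(reversed(seq)), 0.0)]
    for _ in range(num_items):
        candidates = []
        for rev, score in beams:
            log_probs = model(torch.tensor([rev]))[0, -1].log_softmax(-1)
            top = log_probs.topk(k)                  # expand each beam k ways
            for lp, item in zip(top.values, top.indices):
                candidates.append((rev + [int(item)], score + float(lp)))
        # Keep the k candidates with the largest joint probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best_rev, _ = beams[0]
    return list(reversed(best_rev[len(seq):]))       # pseudo-prior items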
4. Experiments

4.1. Datasets and Baseline Models

Following the settings in [7] and [8], we adopt 6 datasets collected from Amazon.com (http://jmcauley.ucsd.edu/data/amazon/). For behavior sequence construction, we regard the presence of a review as an interaction between a user and an item, and order behaviors by timestamp. Following the preprocessing in [7], we use the last item in each sequence for testing. The statistics of the datasets are shown in Table 1.

Table 1: Statistics of the Datasets.

             Beauty   Phones   Sports   Tools    Baby     Office
#user        22363    27879    35598    16638    19445    4905
#item        12101    10429    18357    10217    7050     2420
#instance    198502   138681   296337   134476   160792   53258
avg. length  6.88     4.97     6.32     6.08     6.27     8.86

We compare our proposed method with the following methods, including the state-of-the-art BiCAT [8] for sequential recommendation augmentation. SASRec [13] utilizes a Transformer to extract correlations from the training sequences and predict the next item. BERT4Rec [14] adapts the BERT training method to learn a Transformer for sequential recommendation. ASReP [7] reversely pretrains the Transformer to generate pseudo-prior items for short sequences and then finetunes it for recommendation. BiCAT [8] is the latest model for sequential recommendation augmentation; it incorporates an additional objective in pretraining. PAL is our proposed learning method, and PAL++ equips PAL with beam search.

4.2. Implementation Details

We select the Transformer as the backbone to verify our SRA solution. The number of blocks is fixed to 2. The hidden size is selected from {32, 64, 128}, and the number of attention heads from {2, 4}. The learning rate is fixed at 0.001, since results with other settings are similar, and the dropout rate is fixed to 0.5. The short-sequence length threshold m is set to 18, and each short sequence is augmented with 15 pseudo-prior items. The number of sampled permutations (n in Algorithm 1) is selected from {2, 4, 6} via model selection. The number of epochs is fixed to 200, which is sufficient for all models to converge. We conduct model selection via grid search.

For each behavior sequence, we randomly sample 100 negative items to rank together with the last (ground-truth) item. Recall@n, NDCG@n, and Mean Reciprocal Rank (MRR) are employed as evaluation metrics, with n in {5, 10}.
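As a reference for this protocol, a minimal sketch of the per-user metric computation follows; `score` stands in for the trained model's scoring function and is an assumed interface, and the per-user values are averaged over all test users.

```python
import math
import random

def rank_of_ground_truth(score, gt_item, all_items, num_neg=100):
    """1-based rank of the ground-truth item among itself + sampled negatives."""
    negatives = random.sample([i for i in all_items if i != gt_item], num_neg)
    ranked = sorted(negatives + [gt_item], key=score, reverse=True)
    return ranked.index(gt_item) + 1

def metrics_from_rank(rank, n=10):
    recall_at_n = 1.0 if rank <= n else 0.0                       # Recall@n
    ndcg_at_n = 1.0 / math.log2(rank + 1) if rank <= n else 0.0   # NDCG@n, 1 relevant item
    mrr = 1.0 / rank                                              # reciprocal-rank term
    return recall_at_n, ndcg_at_n, mrr
```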
4.3. Performance of PAL

We perform sequential recommendation on all 6 datasets with the above-mentioned baseline models to demonstrate the effectiveness of PAL. Previous SRA work has shown the advantage of self-attention sequence models for recommendation, so we use SASRec and BERT4Rec as the two baselines without augmentation. The results are presented in Table 2. All the SRA methods achieve better performance than the non-augmented baselines, which verifies the effectiveness of augmentation. Compared with the strongest sequential recommendation augmentation baseline, BiCAT, the proposed PAL provides around 2% to 5% improvement on NDCG@10, which is significant on these sparse datasets. The beam search method (PAL++) consistently shows effectiveness on all datasets, and the further performance improvements on "Tools and Home Improvement" and "Baby" are more pronounced than on the other datasets. The explanation for this difference is that behavior diversity varies across datasets.

Table 2: Performance of Different Methods on Sequential Recommendation. Relative changes are based on ASReP.

Dataset  Model     R@5     R@10    NDCG@5           NDCG@10          MRR
Beauty   SASRec    0.3849  0.4863  0.2884           0.3212           0.2870
         BERT4Rec  0.4243  0.5371  0.3075           0.3598           0.3021
         ASReP     0.4583  0.5743  0.3465           0.4042           0.3540
         BiCAT     0.4901  0.5892  0.3704 (+6.8%)   0.4289 (+6.1%)   0.3712 (+4.8%)
         PAL       0.4934  0.6048  0.3873 (+11.7%)  0.4400 (+8.8%)   0.3803 (+7.4%)
         PAL++     0.4936  0.6036  0.3879 (+11.9%)  0.4415 (+9.2%)   0.3821 (+7.9%)
Phones   SASRec    0.3517  0.4706  0.2475           0.2859           0.2470
         BERT4Rec  0.3732  0.4942  0.2687           0.3006           0.2684
         ASReP     0.5489  0.6758  0.4107           0.4518           0.3946
         BiCAT     0.5663  0.7032  0.4274 (+4.0%)   0.4729 (+4.6%)   0.3990 (+1.1%)
         PAL       0.5736  0.7178  0.4432 (+7.9%)   0.4798 (+6.2%)   0.4100 (+3.9%)
         PAL++     0.5745  0.7239  0.4436 (+8.0%)   0.4809 (+6.4%)   0.4113 (+4.2%)
Sports   SASRec    0.3847  0.5051  0.2732           0.3122           0.2699
         BERT4Rec  0.4136  0.5325  0.3014           0.3561           0.2988
         ASReP     0.4734  0.6011  0.3470           0.3884           0.3370
         BiCAT     0.4842  0.6246  0.3649 (+5.1%)   0.4003 (+3.1%)   0.3562 (+5.7%)
         PAL       0.4936  0.6385  0.3784 (+9.0%)   0.4112 (+5.9%)   0.3712 (+10.1%)
         PAL++     0.4940  0.6398  0.3796 (+9.4%)   0.4174 (+7.5%)   0.3730 (+10.7%)
Tools    SASRec    0.2853  0.3903  0.1987           0.2325           0.2037
         BERT4Rec  0.3613  0.5600  0.3190           0.3574           0.3011
         ASReP     0.4133  0.5347  0.3014           0.3406           0.2976
         BiCAT     0.4287  0.5509  0.3279 (+8.8%)   0.3571 (+4.8%)   0.3100 (+4.2%)
         PAL       0.4327  0.5624  0.3404 (+12.9%)  0.3767 (+10.6%)  0.3231 (+8.6%)
         PAL++     0.4374  0.5681  0.3421 (+13.5%)  0.3805 (+11.7%)  0.3273 (+10.0%)
Baby     SASRec    0.3076  0.4358  0.2094           0.2509           0.2144
         BERT4Rec  0.3295  0.4701  0.2212           0.2758           0.2338
         ASReP     0.3581  0.4885  0.2499           0.2920           0.2508
         BiCAT     0.3682  0.4972  0.2603 (+4.2%)   0.3007 (+3.0%)   0.2587 (+3.1%)
         PAL       0.3759  0.5123  0.2741 (+9.7%)   0.3178 (+8.8%)   0.2704 (+7.8%)
         PAL++     0.3785  0.5123  0.2804 (+12.2%)  0.3200 (+9.6%)   0.2724 (+8.6%)
Office   SASRec    0.4053  0.5098  0.2994           0.3335           0.2947
         BERT4Rec  0.4400  0.5682  0.3149           0.3589           0.3024
         ASReP     0.4689  0.6101  0.3303           0.3764           0.3186
         BiCAT     0.4801  0.6221  0.3462 (+4.8%)   0.3894 (+3.5%)   0.3326 (+4.4%)
         PAL       0.4982  0.6353  0.3572 (+8.1%)   0.3997 (+6.2%)   0.3486 (+9.4%)
         PAL++     0.4994  0.6363  0.3602 (+9.1%)   0.4011 (+6.6%)   0.3497 (+9.8%)

4.3.1. Effectiveness of PAL for Short Sequences

Performance improvement on short behavior sequences is critical for an augmentation paradigm. To further analyze the advantages of PAL for short sequences, we reconstruct the test set with all behavior sequences shorter than 3 and evaluate all the baseline methods as well as PAL and PAL++. The results are presented in Figure 3. PAL and PAL++ significantly outperform the other sequential recommendation augmentation methods on all datasets, which illustrates that the proposed learning method incorporates more contextual correlation information into the short-sequence augmentation.

Figure 3: Performance on Short Sequence Instances.

4.4. Analysis on Backbone Model

The default backbone of the SRA methods (ASReP, BiCAT, PAL, PAL++) in Section 4.3 is the basic Transformer, the same as in SASRec. The SRA methods can also be applied to other Transformer-based SR methods, such as SASRec, BERT4Rec and TiSASRec. For TiSASRec, we ignore the temporal information in the pretraining and sequence generation stages, and assign the smallest timestamp of the original sequence to the generated items. We report part of the performance comparison (NDCG@5) on the "Beauty" dataset in Table 3. According to the results, the performance of all backbone models improves when equipped with SRA methods, and the proposed PAL / PAL++ achieve the best results. Note that TiSASRec is outperformed by SASRec because our current augmentation methods do not incorporate temporal information; this is future work for SRA.

Table 3: NDCG@5 with Different Backbone Models.

Backbone   SASRec   BERT4Rec   TiSASRec
Base       0.2884   0.3075     0.3076
ASReP      0.3465   0.3562     0.3427
BiCAT      0.3704   0.3746     0.3625
PAL        0.3873   0.3886     0.3771
PAL++      0.3879   0.3886     0.3791

4.5. Analysis on Convergence Rate

One interesting finding is that the pseudo sequences generated by PAL significantly improve the convergence rate of the finetuning stage in sequential recommendation augmentation. We depict the loss value during finetuning in Figure 4, where PAL converges to a stable loss value earlier than ASReP. Similar results are found on the other datasets. Due to the permutation learning objective, PAL has an advantage in both the generated data and the pretrained model, which leads to the faster convergence in the final finetuning stage.

Figure 4: Illustration of Convergence Rate in Finetuning on (a) Beauty, (b) Cell Phones, (c) Sports. The x-axis is the epoch, and the y-axis is the loss value. The grey line is ASReP; the blue line is PAL.

References

[1] C. Cai, R. He, J. McAuley, SPMC: Socially-aware personalized Markov chains for sparse sequential recommendation, in: IJCAI, 2017, pp. 1476-1482.
[2] K. Song, M. Ji, S. Park, I.-C. Moon, Hierarchical context enabled recurrent neural network for recommendation, in: AAAI, volume 33, 2019, pp. 4983-4991.
[3] C. Xu, J. Feng, P. Zhao, F. Zhuang, D. Wang, Y. Liu, V. S. Sheng, Long- and short-term self-attention network for sequential recommendation, Neurocomputing 423 (2021) 580-589.
[4] J. Wu, X. Wang, F. Feng, X. He, L. Chen, J. Lian, X. Xie, Self-supervised graph learning for recommendation, in: SIGIR, 2021, pp. 726-735.
[5] M. Wang, P. Ren, L. Mei, Z. Chen, J. Ma, M. de Rijke, A collaborative session-based recommendation approach with parallel memory modules, in: SIGIR, 2019, pp. 345-354.
[6] Z. Liu, Y. Chen, J. Li, P. S. Yu, J. McAuley, C. Xiong, Contrastive self-supervised sequential recommendation with robust augmentation, arXiv preprint arXiv:2108.06479 (2021).
[7] Z. Liu, Z. Fan, Y. Wang, P. S. Yu, Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer, in: SIGIR, 2021, pp. 1608-1612. URL: https://doi.org/10.1145/3404835.3463036. doi:10.1145/3404835.3463036.
[8] J. Jiang, Y. Luo, J. B. Kim, K. Zhang, S. Kim, Sequential recommendation with bidirectional chronological augmentation of transformer, arXiv preprint arXiv:2112.06460 (2021).
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 30 (2017).
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, NeurIPS 32 (2019).
[12] S. Wiseman, A. M. Rush, Sequence-to-sequence learning as beam-search optimization, in: EMNLP, 2016, pp. 1296-1306.
[13] W.-C. Kang, J. McAuley, Self-attentive sequential recommendation, in: ICDM, IEEE, 2018, pp. 197-206.
[14] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang, BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer, in: CIKM, 2019, pp. 1441-1450.