<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploiting Global Behavior Contextual Correlation in Sequential Recommendation Augmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qian Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangdong Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zihao Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoxin Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chaosheng Fan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changping Peng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhangang Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinghe Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jingping Shao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Marketing &amp; Commercialization Center</institution>
          ,
          <addr-line>JD.com</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tsinghua University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The recently proposed Sequential Recommendation Augmentation (SRA) paradigm has shown valuable potential in sequential recommendation, especially for handling the long-tail problem by extending short behavior sequences. However, self-supervised SRA adopts autoregressive learning with a fixed forward or backward direction, which cannot make full use of the contextual correlation information in the training behavior sequences. Due to this direction difference, a discrepancy problem exists between the two training stages of SRA, i.e., pretraining and finetuning. In order to overcome the restriction of a specific sequential learning direction, we propose to equip SRA with permutation autoregressive learning to extract global contextual correlation information from the behavior sequences in both directions. The adapted SRA method is implemented with two-stream self-attention. Empirical evaluations on multiple sequential recommendation benchmark datasets demonstrate the effectiveness of our proposed model, and the augmented data can significantly improve the convergence rate of finetuning.</p>
      </abstract>
      <kwd-group>
        <kwd>Sequential Recommendation</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Permutation Autoregressive Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this work, we propose to exploit contextual
information in SRA without restrictions on the learning
direction. For the inference stage, we adopt
beam search to generate a more suitable subsequence
as the augmented data. A recent revision named BiCAT
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] comes with a similar motivation and is implemented via
an additional loss regularization, but it is not designed
for extracting contextual correlation information around
the predicted position, and we compare against it empirically.
      </p>
      <p>Our contributions can be summarized as follows:
(a) Global contextual correlation information is explored
in Sequential Recommendation Augmentation (SRA). (b)
Equipped with permutation autoregressive learning and
a beam search method, an adapted SRA framework is
designed and evaluated. (c) The proposed framework
outperforms the state-of-the-art methods for sequential
recommendation augmentation without extra information
or heuristic rules.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Sequential Recommendation</title>
    </sec>
    <sec id="sec-3">
      <title>Augmentation</title>
      <p>The sequential recommendation task can be regarded as
next-item prediction given the historical behavior
sequence. We denote the user set as U and the item set
as V. The interaction behavior of a given user u ∈ U
is denoted as S^u = {v_1, v_2, ..., v_t}. The sequential
recommendation task can be formulated as:</p>
      <p>v_{t+1} = \arg\max_{v \in V} P(v \mid S^u) \quad (1)</p>
      <p>which means finding the next item v_{t+1} with the largest
probability given the user behavior sequence S^u.</p>
      <p>
        Recently, a Sequential Recommendation Augmentation (SRA) paradigm was proposed [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the basic SRA method is also known as ASReP. As illustrated in Fig 1,
ASReP utilizes reverse pretraining for data augmentation. We take the Transformer
as an example backbone for describing this learning paradigm; its key component,
multi-head self-attention, is constructed with linear transformations and the
scaled dot-product attention [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. There are two stages in the training procedure, both updating the same set
of model parameters θ, namely reverse pretraining and left-to-right finetuning.
The reverse pretraining intends to learn the inverse sequence generation via the
autoregressive learning objective:
      </p>
      <p>\max_{\theta} \; P_{\theta}(v_i \mid v_{i+1}, v_{i+2}, \ldots, v_t) \quad (2)</p>
      <p>With the pretrained model, pseudo-prior items can be
recursively generated for short sequences, in order to
alleviate the data sparsity in recommendation and
further improve the quality of the whole training set. The
pretrained model can be further finetuned for next-item
prediction, and the learning objective is the forward
autoregressive learning objective:</p>
      <p>\max_{\theta} \; P_{\theta}(v_i \mid v_1, v_2, \ldots, v_{i-1}) \quad (3)</p>
      <p>
        For details of the SRA paradigm, please refer to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
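      <p>As a minimal illustration of Eqns 2 and 3 and of the pseudo-prior generation step, the Python sketch below builds reverse-pretraining and forward-finetuning training pairs from one behavior sequence and shows a greedy recursive prepending of pseudo-prior items. The scoring function p_prev is a hypothetical stand-in for the reversely pretrained model, not the authors' code.</p>
      <preformat>
# Sketch only: `p_prev(item, reversed_context)` is a hypothetical function returning
# P_theta(item | context) under the reversely pretrained model.

def reverse_pretraining_pairs(seq):
    """Training pairs for Eqn 2: predict v_i from the items that come after it."""
    return [(list(reversed(seq[i + 1:])), seq[i]) for i in range(len(seq) - 1)]

def forward_finetuning_pairs(seq):
    """Training pairs for Eqn 3: predict v_i from the items that come before it."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

def augment_with_pseudo_priors(seq, item_set, p_prev, num_items):
    """Greedy version of the augmentation: recursively prepend pseudo-prior items."""
    augmented = list(seq)
    for _ in range(num_items):
        reversed_context = augmented[::-1]         # the reverse model reads the sequence backwards
        best = max(item_set, key=lambda v: p_prev(v, reversed_context))
        augmented.insert(0, best)                  # prepend the generated pseudo-prior item
    return augmented
      </preformat>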
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <sec id="sec-4-1">
        <title>3.1. Permutation Autoregressive Learning</title>
        <p>
          Our intention now is to help the SRA framework make
use of the global contextual correlation in the behavior
sequences. The idea of "mask and reconstruct" is a
commonly used method for helping a sequence model
learn from contextual information at arbitrary positions,
but the incorporation of a [MASK] token in the behavior
sequence would bring a more severe discrepancy problem,
as in BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], especially considering that the trained
model will be used for recursively generating behavior
sequences. Our solution to exploiting context information is
adopting the permutation modeling objective [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which
gathers information from the bidirectional context while
remaining within the autoregressive learning paradigm.
        </p>
        <sec id="sec-4-1-1">
          <title>3.1.1. Permutations with Original Position</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Encoding</title>
          <p>In order to exploit bidirectional item correlation in
bidirectional context, we propose to train the sequence model
with diferent permutations of each training sequence in
(1)
maining the autoregressive learning paradigm.
where &lt; stand for the items in x whose index is in
&lt;. This new objective calculates the probability of an
item conditioned on all possible permutations of items
in an autoregressive way, as opposed to just those to the
(3) left side or right side of the target item in the existing
is  ! = 24. For each permutation z in  , &lt; stands
for indices of all the element before -th element , and
 is for the current elements. Similar to autoregressive
learning, we try to predict  given &lt;. In this way, the
learning objective is rewritten according to the
permutation z instead of the original order of original sequence
x as follows.</p>
          <p>max Ez∼  log  ( |&lt; )

(4)
Algorithm 1 Permutation Autoregressive Learning for
SRA
1: Input: A set of behavior sequence {}
2: Output: A sequential recommendation model
3: procedure PAL({})
4: for each epoch do
5: for each instance in batch do
6: Sample  permutations with length .
7: Pretrain with Eqn 4.
8: Save the result pretrained model ℳ0 for generation.
9: Select behavior sequences shorter than .
10: Generate the pseudo-prior items with beam width .
11: for each epoch do
12: for each batch do
13: Finetune ℳ0 with Eqn 3.
14: Save the result finetuned model for sequential
recommendation.
dataset
#user
#item
#instance
avg. length</p>
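          <p>To make Eqn 4 and the original-position-encoding point concrete, the following sketch (our illustration with hypothetical data structures, not the paper's code) samples several factorization orders for one sequence; only the prediction order changes, while every item keeps its original position index.</p>
          <preformat>
import random

def sample_permutation_targets(seq, num_perms, seed=0):
    """For each sampled factorization order z, yield one training example per step t:
    (visible items with their original positions, target position, target item)."""
    rng = random.Random(seed)
    T = len(seq)
    examples = []
    for _ in range(num_perms):
        z = list(range(T))
        rng.shuffle(z)                       # a permutation of the index sequence [0, ..., T-1]
        for t in range(T):
            target_pos = z[t]                # index of the item to predict at step t
            visible = sorted(z[:t])          # indices factorized before step t; positions are kept
            context = [(pos, seq[pos]) for pos in visible]
            examples.append((context, target_pos, seq[target_pos]))
    return examples

# For a sequence of length T = 4 there are 4! = 24 possible orders;
# sampling num_perms of them approximates the expectation in Eqn 4.
          </preformat>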
          <p>The above learning objective can help each position learn information from the
bidirectional context, but it brings a new issue about position information: when
predicting given several known items, the predicted position or index is not fixed as
in the original autoregressive learning. So we need to learn a target-aware
representation which can tell the position where the currently predicted item is
located. Therefore, P_{\theta}(x_{z_t} \mid x_{z_{&lt;t}}) is formulated as:</p>
          <p>P_{\theta}(x_{z_t} \mid x_{z_{&lt;t}}) = \frac{\exp[e(x_{z_t})^{\top} \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t)]}{\sum_{x^{*}} \exp[e(x^{*})^{\top} \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t)]} \quad (5)</p>
          <p>where \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t) is the learned target-aware
representation for the item at the z_t-th index.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>3.1.2. Two-stream Attention for Contextual Representation</title>
          <p>As aforementioned, the self-attention module in the sequence model needs to be
modified to obtain target-aware representations. The two-stream attention structure
was used in language modeling to provide target position information without leaking
the content information of the target.</p>
          <p>Specifically, two separate streams of attention vectors are maintained to store
content information and position information. For each position z_t in the
factorization z, we keep updating the intermediate vectors h_{z_t} and h̃_{z_t},
representing the content stream and the target stream respectively. Each stream is
learned by a designated attention mechanism. The detailed formulations of the
two-stream attention structure are described as follows.</p>
          <p>The first one is the content-stream representation, which is exactly the same
as the hidden state in standard self-attention. The corresponding attention is named
content-stream attention:</p>
          <p>h_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h_{z_t}^{(m-1)}, \; KV = h_{z_{\le t}}^{(m-1)}; \; \theta) \quad (6)</p>
          <p>where h^{(m)} is the output of the m-th Transformer block. The second
representation is the target representation, which contains only the position
information of the target in order to avoid content information leaking. The
target-stream attention is:</p>
          <p>h̃_{z_t}^{(m)} \leftarrow \mathrm{Attention}(Q = h̃_{z_t}^{(m-1)}, \; KV = h_{z_{&lt;t}}^{(m-1)}; \; \theta) \quad (7)</p>
          <p>The content representation h^{(0)} is initialized by the item embedding e(x_i)
added with the positional encoding as in the normal Transformer, and all the target
representations h̃^{(0)} are initialized by an identical trainable vector w. The output
h̃ of the last Transformer layer is used as \tilde{h}_{\theta}(x_{z_{&lt;t}}, z_t) in
Eqn 5 for prediction. In this way, equipped with this two-stream attention, we can
force each position in the sequence to learn bidirectional information while
maintaining the original behavior order.</p>
          <p>[Figure: (a) Content-stream Attention; (b) Target-stream Attention.]</p>
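          <p>To make the two streams concrete, the sketch below is an illustrative NumPy construction (our own, following the XLNet-style formulation rather than the authors' implementation) of the two boolean attention masks induced by a factorization order z: the content stream at a position may attend to every position predicted no later than itself (Eqn 6), while the target stream may attend only to strictly earlier positions (Eqn 7), so the content of the target never leaks into its own query.</p>
          <preformat>
import numpy as np

def two_stream_masks(z):
    """Given a factorization order z (a permutation of range(T)), return two boolean
    masks of shape (T, T); mask[i, j] == True means position i may attend to position j."""
    T = len(z)
    rank = np.empty(T, dtype=int)
    rank[np.asarray(z)] = np.arange(T)       # rank[pos] = step at which `pos` is predicted
    # Content stream (Eqn 6): a position attends to every position predicted no later than itself.
    content_mask = rank[None, :] &lt;= rank[:, None]
    # Target stream (Eqn 7): a position attends only to positions predicted strictly earlier,
    # so the target's own content never reaches its query (which is initialized with w).
    target_mask = rank[None, :] &lt; rank[:, None]
    return content_mask, target_mask
          </preformat>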
        </sec>
        <sec id="sec-4-1-3">
          <title>3.1.2. Two-stream Attention for Contextual</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Representation</title>
          <p>
            We propose beam-search for obtain the optimal sequence
as the pseudo-prior items. Instead of recursively predict
As aforementioned, the self-attention module in the se- the next item in a greedy way, Beam Search method
quence model need to be modified for obtaining target- maintains a bufer of candidate subsequences and selects
aware representations. The two-stream attention struc- the best one with the largest joint probability. Beam
ture was used in language modeling to provide target width value is denoted as . More details can be found
position information without leaking the content infor- in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
mation of the target.
          </p>
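          <p>A minimal beam search sketch for this step is given below; log_prob_prev is a hypothetical interface returning the reversely pretrained model's log-probability of an item preceding the given (reversed) context, and is not part of the paper.</p>
          <preformat>
import heapq

def beam_search_pseudo_priors(seq, item_set, log_prob_prev, num_items, beam_width):
    """Prepend `num_items` pseudo-prior items to `seq`, keeping the `beam_width`
    candidate sequences with the largest joint log-probability at every step."""
    beams = [(0.0, list(seq))]                    # (joint log-prob, augmented sequence)
    for _ in range(num_items):
        candidates = []
        for score, aug in beams:
            reversed_context = aug[::-1]          # the reverse model reads the sequence backwards
            # A real implementation would only expand the model's top-scoring items
            # instead of scoring the whole item set.
            for item in item_set:
                candidates.append((score + log_prob_prev(item, reversed_context), [item] + aug))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]      # best augmented sequence under joint probability
          </preformat>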
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. PAL Algorithm for SRA</title>
        <p>Considering the computational complexity of permutation autoregressive learning,
we need to sample from the set of permutations and predict part of the sequence. For
each sampled permutation, we train the model via maximizing the probability of the
last item. The detailed learning procedure is described in Algorithm 1.</p>
        <p>Algorithm 1: Permutation Autoregressive Learning for SRA
1: Input: a set of behavior sequences {S^u}
2: Output: a sequential recommendation model
3: procedure PAL({S^u})
4:   for each epoch do
5:     for each instance in the batch do
6:       Sample K permutations with length T.
7:       Pretrain with Eqn 4.
8:   Save the resulting pretrained model M_0 for generation.
9:   Select behavior sequences shorter than L.
10:  Generate the pseudo-prior items with beam width b.
11:  for each epoch do
12:    for each batch do
13:      Finetune M_0 with Eqn 3.
14:  Save the resulting finetuned model for sequential recommendation.</p>
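        <p>Putting Algorithm 1 together, a compact driver consistent with it might look like the sketch below; it reuses the hypothetical helpers from the earlier sketches (sample_permutation_targets, beam_search_pseudo_priors, forward_finetuning_pairs), and model.pretrain_step, model.finetune_step, and model.log_prob_prev are stand-ins for whatever Transformer implementation is used.</p>
        <preformat>
def pal_training(sequences, model, item_set, num_perms, length_threshold,
                 num_pseudo_items, beam_width, pretrain_epochs, finetune_epochs):
    """PAL for SRA (Algorithm 1): permutation-autoregressive pretraining,
    beam-search augmentation of short sequences, then left-to-right finetuning."""
    # Stage 1: pretraining with sampled permutations (Eqn 4).
    for _ in range(pretrain_epochs):
        for seq in sequences:
            for context, target_pos, target_item in sample_permutation_targets(seq, num_perms):
                # Following Sec. 3.2, only the last step of each sampled permutation is used,
                # i.e., one item is predicted from all the remaining items.
                if len(context) == len(seq) - 1:
                    model.pretrain_step(context, target_pos, target_item)
    # Stage 2: extend each short sequence with pseudo-prior items via beam search.
    augmented = [
        beam_search_pseudo_priors(seq, item_set, model.log_prob_prev,
                                  num_pseudo_items, beam_width)
        if len(seq) &lt; length_threshold else list(seq)
        for seq in sequences
    ]
    # Stage 3: finetuning on the augmented data with the forward objective (Eqn 3).
    for _ in range(finetune_epochs):
        for seq in augmented:
            for prefix, target in forward_finetuning_pairs(seq):
                model.finetune_step(prefix, target)
    return model
        </preformat>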
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Datasets and Baseline Models</title>
        <p>
          Following the settings in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], six datasets (Beauty, Cell Phones, Sports, Tools, Baby, and Office)
collected from the Amazon review data
(http://jmcauley.ucsd.edu/data/amazon/) are adopted. For the behavior sequence
construction, we regard the presence of a review as an interaction between a user and
an item, and construct the behavior sequence according to the timestamp. Following
the preprocessing in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we use the last item in each sequence for testing. The statistics of the
datasets (#user, #item, #instance, and average sequence length) are shown in Table 1.
        </p>
        <p>
          We compare our proposed method with the following
methods, including the state-of-the-art BiCAT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] method
in sequential recommendation augmentation. SASRec
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] utilizes the Transformer to extract the correlation
from the training sequences and predict the next item.
BERT4Rec [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] exploits the training method of BERT
to learn a Transformer for SR. ASReP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] reversely
pretrains the Transformer to generate pseudo-prior items for
short sequences and then finetunes the Transformer for
SR. BiCAT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is the latest model for sequential
recommendation augmentation; it incorporates an additional
objective in pretraining. PAL is our proposed learning
method. PAL++ equips PAL with beam search.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Implementation Details</title>
        <p>We select the Transformer as the backbone to verify our
SRA solution. The block number is fixed to 2. The hidden
size is selected from {32, 64, 128}. The head number in
attention is selected from {2, 4}. The learning rate is fixed
at 0.001 since the results are similar with other settings.
The dropout rate is fixed to 0.5. The short sequence
length threshold L is set to 18, and each short sequence
is augmented with 15 pseudo-prior items. The number
of sampled permutations (K in Algorithm 1) is selected
from {2, 4, 6} with model selection. The epoch number
is fixed to 200, which is sufficient for all the models to
converge. We conduct model selection via grid search.
For each behavior sequence, we randomly sample 100
negative items for ranking with the last item, which is the
ground truth. Recall@n, NDCG@n, and Mean Reciprocal
Rank (MRR) are employed as the evaluation metrics,
and n is selected from {5, 10}.</p>
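        <p>For clarity, the sketch below illustrates the sampled-candidate evaluation protocol described above (our own illustration, not the authors' evaluation script): the ground-truth last item is ranked against 100 sampled negatives, and Recall@n, NDCG@n, and MRR are computed from that rank.</p>
        <preformat>
import math

def sampled_metrics(rank, ns=(5, 10)):
    """Metrics for one test sequence, given the 1-based rank of the ground-truth item
    among itself plus 100 sampled negative items (101 candidates in total)."""
    metrics = {"MRR": 1.0 / rank}
    for n in ns:
        hit = rank &lt;= n
        metrics[f"Recall@{n}"] = 1.0 if hit else 0.0
        metrics[f"NDCG@{n}"] = 1.0 / math.log2(rank + 1) if hit else 0.0
    return metrics

# Example: if the ground-truth item is ranked 3rd among the 101 candidates,
# Recall@5 = 1, NDCG@5 = 1 / log2(4) = 0.5, and MRR = 1/3.
print(sampled_metrics(3))
        </preformat>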
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Performance of PAL</title>
        <p>We perform sequential recommendation on all
6 datasets with the above-mentioned baseline models
to demonstrate the effectiveness of PAL. Previous SRA
work has shown the advantage of the self-attention sequence
model for recommendation, so we use SASRec and
BERT4Rec as the two baselines without augmentation.</p>
        <p>Table 2 (excerpt): results on the Beauty dataset; relative improvements are over ASReP.
Model      Recall@5   Recall@10   NDCG@5            NDCG@10           MRR
SASRec     0.3849     0.4863      0.2884            0.3212            0.2870
BERT4Rec   0.4243     0.5371      0.3075            0.3598            0.3021
ASReP      0.4583     0.5743      0.3465            0.4042            0.3540
BiCAT      0.4901     0.5892      0.3704 (+6.8%)    0.4289 (+6.1%)    0.3712 (+4.8%)
PAL        0.4934     0.6048      0.3873 (+11.7%)   0.4400 (+8.8%)    0.3803 (+7.4%)
PAL++      0.4936     0.6036      0.3879 (+11.9%)   0.4415 (+9.2%)    0.3821 (+7.9%)</p>
        <p>The performance results are presented in Table 2. All the
SRA methods achieve better performance than the
others, which verifies the effectiveness of augmentation.
Compared with the strongest sequential
recommendation augmentation baseline BiCAT, the proposed PAL
can provide around 2% to 5% improvement on NDCG@10,
which is significant on these sparse datasets. The beam
search method (PAL++) consistently shows effectiveness
on all the datasets, and the further performance
improvements on "Tools and Home Improvement" and "Baby"
are more significant than on other datasets. The
explanation for this improvement difference is that the behavior
diversity varies across datasets.</p>
        <sec id="sec-5-3-1">
          <title>4.3.1. Effectiveness of PAL for Short Sequences</title>
          <p>Performance improvement on short behavior sequences is
critical for an augmentation paradigm. To further analyze
the advantages of PAL for short sequences, we reconstruct
the test set with all the behavior sequences shorter than 3
and evaluate all the baseline methods as well as our PAL and
PAL++. The results are presented in Figure 3.
We can find that the PAL and PAL++ methods
significantly outperform the other sequential
recommendation augmentation methods on all the datasets.
This result illustrates that the proposed learning method
can incorporate more contextual correlation information
into the short sequence augmentation.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Analysis on Backbone Model</title>
        <p>The default backbone model of the SRA methods (i.e.,
ASReP, BiCAT, PAL, and PAL++) in Section 4.3 is the basic
Transformer, the same as in SASRec. All the SRA
methods can also be applied to other Transformer-based
SR methods, such as SASRec, BERT4Rec and TiSASRec.
For TiSASRec, we ignore the temporal information in the
pretraining and sequence generating stages, and assign
the smallest timestamp in the original sequence to the
generated items. Here we report part of the performance
comparison (NDCG@5) on the "Beauty" dataset in Table 3.
According to the results, equipped with SRA methods,
the performance of all the backbone models is improved,
and the proposed PAL / PAL++ achieve the best results.
Please note that TiSASRec is outperformed by SASRec because
our current augmentation methods have not incorporated
the temporal information, which is a future direction for
SRA.</p>
        <p>Table 3: NDCG@5 on the Beauty dataset with different backbone models.
Backbone    Base     ASReP    BiCAT    PAL      PAL++
SASRec      0.2884   0.3465   0.3704   0.3873   0.3879
BERT4Rec    0.3075   0.3562   0.3746   0.3886   0.3886
TiSASRec    0.3076   0.3427   0.3625   0.3771   0.3791</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. Analysis on Convergence Rate</title>
        <p>One interesting finding is that the pseudo sequences
generated by PAL can significantly improve the convergence
rate of the finetuning stage in sequential
recommendation augmentation. We depict the loss value during the
finetuning stage in Fig 4, where we can observe that
the PAL method converges earlier than ASReP to
achieve a stable loss value. Similar results can be found
on the other datasets. Due to the permutation learning
objective, PAL has an advantage in the generated data and
the pretrained model, which leads to the improvement of the
convergence rate in the final finetuning stage.</p>
        <p>Figure 4: Illustration of convergence rate in finetuning on (a) Beauty, (b) Cell Phones, and (c) Sports.
The x-axis is the epoch, and the y-axis is the loss value. The grey
line is for the ASReP method, and the blue one is for PAL.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. McAuley</surname>
          </string-name>
          ,
          <article-title>Spmc: socially-aware personalized markov chains for sparse sequential recommendation</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1476</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          , I.-C. Moon,
          <article-title>Hierarchical context enabled recurrent neural network for recommendation</article-title>
          ,
          <source>in: AAAI</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>4983</fpage>
          -
          <lpage>4991</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. S.</surname>
          </string-name>
          <article-title>Sheng, Long-and short-term self-attention network for sequential recommendation</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>423</volume>
          (
          <year>2021</year>
          )
          <fpage>580</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Self-supervised graph learning for recommendation</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>726</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , J. Ma, M. de Rijke,
          <article-title>A collaborative session-based recommendation approach with parallel memory modules</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Contrastive self-supervised sequential recommendation with robust augmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2108.06479</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer</article-title>
          , in: SIGIR '21, Association for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>1608</fpage>
          -
          <lpage>1612</lpage>
          . doi: 10.1145/3404835.3463036.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Kim,
          <article-title>Sequential recommendation with bidirectional chronological augmentation of transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2112.06460</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>NeurIPS</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: NAACL-HLT (1)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          ,
          <source>NeurIPS</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiseman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Sequence-to-sequence learning as beam-search optimization</article-title>
          , in: EMNLP,
          <year>2016</year>
          , pp.
          <fpage>1296</fpage>
          -
          <lpage>1306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Self-attentive sequential recommendation</article-title>
          , in: ICDM, IEEE,
          <year>2018</year>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ou</surname>
          </string-name>
          , P. Jiang,
          <article-title>Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1450</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>