<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scaling Generative Pre-training for User Ad Activity Sequences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sharad Chitlangia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishna Reddy Kesari</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajat Agarwal</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>User activity sequence modeling has significantly improved performance across a range of tasks in advertising, spanning supervised learning tasks like ad response prediction to unsupervised tasks like robot and ad fraud detection. Self-supervised learning using autoregressive generative models has garnered interest due to performance improvements on time series and natural language data. In this paper, we present a scalable autoregressive generative pre-training framework to model user ad activity sequences and inspect its scaling properties with respect to model size, dataset size and compute. We show that test loss on the pre-training task follows power-law scaling with respect to model size, with larger models being more data and compute efficient than smaller models. We also demonstrate that improvement in pre-training test loss translates into better downstream task performance by benchmarking the models on conversion prediction and robot detection tasks in advertising.</p>
      </abstract>
      <kwd-group>
        <kwd>generative pre-training</kwd>
        <kwd>self-supervised learning</kwd>
        <kwd>scaling</kwd>
        <kwd>invalid traffic</kwd>
        <kwd>robot detection</kwd>
        <kwd>digital advertising</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Advances in deep learning have driven rapid adoption of sequence models applied to user behavioral data for advertising use cases spanning personalization, ad response prediction, bidding, and robot and fraud detection. Deep sequence models reduce reliance on manual feature engineering while utilizing fine-grained event-level information about the users' activity, leading to improved performance across a wide range of tasks. For tasks like ad response prediction, where labeled data is available at scale, typical approaches use supervised learning to train deep sequence models [<xref ref-type="bibr" rid="ref4">4</xref>]. However, in domains like ad fraud detection, obtaining accurate labels at scale is implausible and error prone due to the unavailability of high-coverage ground truth, and attempts to create pseudo labels are fraught with risks of introducing bias. In such scenarios, learning self-supervised user representations is a natural choice. Recent advances have shown that self-supervised pre-training of sequence models not only improves performance on tasks with low labeled-data volumes but also enhances performance over traditional supervised learning on large labeled datasets.
      </p>
      <p>Generative models, which aim to model the input data distribution <inline-formula><tex-math>p(x)</tex-math></inline-formula>, have been at the forefront in demonstrating the effectiveness of self-supervised learning. The key idea in self-supervised learning is to construct a proxy task on unlabeled data available at scale, training on which enables the model to learn robust task-agnostic embeddings capturing important characteristics and features of the dataset. Autoregressive models, a class of generative models that perform maximum likelihood estimation by defining an ordering over the input, are a natural fit for language and time series data and have yielded state-of-the-art results by training highly parallelizable deep sequence model architectures like Transformers [<xref ref-type="bibr" rid="ref1">1</xref>] on the next-token prediction objective. This has motivated exploration of learning user embeddings using next event prediction on their ad activity sequences as the self-supervised pre-training objective [<xref ref-type="bibr" rid="ref12 ref7">7, 12</xref>].</p>
      <p>An interesting property of generative pre-training of Transformers is their enhanced performance with growing model size, data size and compute. Analysis of these scaling properties has garnered interest in the research community, with the primary focus so far being on natural language and computer vision data [16, 17]. In this work, we investigate the scaling properties for autoregressive pre-training of user activity sequence models in advertising. Rather than generalizing scaling laws in natural language processing to advertising, we believe user activity sequence models merit an independent scaling analysis, since they differ from text-based models in three significant ways. First, instead of a homogeneous time series of text tokens, a user activity sequence is a multi-dimensional time series where each event in the sequence can be described using a variety of feature types typically seen in advertising, spanning discrete, high-cardinality, real-valued and natural language types. Second, data size in advertising is upper bounded by the number of users interacting with the ad program. This is in contrast with scaling of text-based models where, while increasing model size, dataset size is considered to be unbounded as one could crawl more webpages to get additional data. Finally, since traffic patterns and user behavior in advertising are continuously evolving, advertising models need to be retrained continuously, making training cost and time a critical factor in deciding the scaling strategy for deployed settings.</p>
      <p>The paper is structured as follows: Section 2 describes the related work, and Section 3 outlines a scalable autoregressive generative pre-training framework for user activity sequences. In Section 4, we analyze different scaling properties of the model with respect to model size, dataset size and compute. In Section 5, we present how improvement in test loss during pre-training translates to downstream performance on a supervised task of conversion prediction and an unsupervised task of bot detection in advertising. We discuss the key learnings and contrast them with scaling properties in other data domains in Section 6 and conclude in Section 7.</p>
      <p>KDD 2023 Workshop on Artificial Intelligence for Computational Advertising (AdKDD), August 7, 2023, Long Beach, CA. Contact: chitshar@amazon.com (S. Chitlangia); kkesari@amazon.com (K. R. Kesari); agrajat@amazon.com (R. Agarwal). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <sec>
        <title>2. Related Work</title>
        <p>Generative models aim to learn the input data distribution <inline-formula><tex-math>p(x)</tex-math></inline-formula>, and help either estimate the probability of a given data point or sample a data point from the input distribution [31, 32, 33]. In this paper, we are primarily concerned with autoregressive deep generative models. The autoregressive formulation factorizes learning the distribution <inline-formula><tex-math>p(x)</tex-math></inline-formula> as the product of conditional probabilities of the current value given all previous values in a pre-defined ordering. This framework has been applied successfully across various domains such as image synthesis (pixel-RNN [22], CPCv2 [26]), audio synthesis (Wavenet [25]) and text (GPT [23, 24]). Previous work has also applied a similar autoregressive framework to self-supervised modeling of user activity sequences [7, 2, 3, 12], with benefits across various downstream tasks.</p>
        <p>Power-law based scaling properties for generative pre-training on text datasets using Transformer models were studied in [16]. These properties have been successfully used to create large language models such as GPT-3 [27], GPT-4 [28], PaLM [29], LLaMA [30], etc. Works to establish scaling laws for other data domains such as vision followed in quick succession [17, 18, 19]. More recently, there has been a line of work suggesting that these laws might be less universal than earlier suggested [18]. In addition, methods have been proposed to make the scaling exponential for certain tasks, by either pruning the data effectively [21] or pre-training with a different objective [20].</p>
        <p>Self-supervised learning has emerged as an important technique in the domains of recommender systems [<xref ref-type="bibr" rid="ref10">10</xref>], advertising [<xref ref-type="bibr" rid="ref11">11</xref>] and fraud detection [<xref ref-type="bibr" rid="ref12">12</xref>]. Particularly, in detection of fraud [<xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>], where we typically observe a lack of precise labels, pre-training representations help avoid label bias. There has been very little work towards exploring scaling behavior in these domains. [<xref ref-type="bibr" rid="ref14">14</xref>] studied scaling behavior of DLRM-style recommender system models across parameters, compute and data to show that, unlike text data, model scaling does not contribute as much to performance improvements in recommender systems. Previous works on scaling laws in advertising use CLIP-based models and assume data access across multiple domains [15]. To the best of our knowledge, our work is the first to study scaling behavior of Transformers built on user activity sequences alone that uses vanilla autoregressive pre-training and also includes an evaluation of large models on downstream tasks relevant in advertising and fraud detection.</p>
      </sec>
      <sec>
        <title>3. Modeling Framework</title>
      </sec>
      <sec id="sec-1-1">
        <title>3.1. Constructing Input Sequences</title>
        <p>We order ad events (clicks) from the user based on timestamps to construct the activity sequence. Each event in the sequence is described using multiple features, creating a multi-dimensional time series of the user's ad activity. To handle multiple feature types describing the ad event, we encode each feature using an embedding function which is learnt in an end-to-end manner with the model training objective. Real-valued features are converted to categorical ones using bucketing to tackle the large range. Formally, let <inline-formula><tex-math>S</tex-math></inline-formula> be the n-length sequence of events for a user entity ordered in time, where <inline-formula><tex-math>s_i</tex-math></inline-formula> indicates the event in position <inline-formula><tex-math>i</tex-math></inline-formula> in <inline-formula><tex-math>S</tex-math></inline-formula>. Let <inline-formula><tex-math>[f_1, f_2, \ldots, f_k]</tex-math></inline-formula> be the feature set used for the event description and let <inline-formula><tex-math>[E_1, E_2, \ldots, E_k]</tex-math></inline-formula> be the embedding functions for the corresponding features.
          <disp-formula><label>(1)</label><tex-math>S = [s_1, s_2, \ldots, s_i, \ldots, s_n]</tex-math></disp-formula>
          <disp-formula><label>(2)</label><tex-math>s_i = [f_1(i); f_2(i); \ldots; f_k(i)]</tex-math></disp-formula>
        </p>
        <p>The descriptor of each ad event is a concatenation of the associated feature embeddings. <inline-formula><tex-math>c_i</tex-math></inline-formula> represents the concatenation of these embeddings for the event at position <inline-formula><tex-math>i</tex-math></inline-formula>.
          <disp-formula><label>(3)</label><tex-math>c_i = \mathrm{concat}\big(E_1(f_1(i)), E_2(f_2(i)), \ldots, E_k(f_k(i))\big)</tex-math></disp-formula>
        </p>
      </sec>
      <sec>
        <title>3.2. Training Objective</title>
        <p>The time series of events S, represented by their concatenated feature representations C, is provided as input to an off-the-shelf autoregressive deep sequence model. The output representation at the last time step is taken as the output representation of the sequence.
          <disp-formula><label>(4)</label><tex-math>R = \mathrm{Model}(C)</tex-math></disp-formula>
          where R is the output representation (embedding) obtained at the final time step of the model.
        </p>
        <p>The model parameters along with the embedding matrices are trained using next event prediction as the self-supervised objective. At each time step, the model predicts the probability of the next event given only the history, making the autoregressive property a necessary condition for the choice of the deep sequence model. We use the Transformer decoder block as the autoregressive model. The goal is to maximize the following likelihood,
          <disp-formula><label>(5)</label><tex-math>\mathcal{L}(\theta) = \sum_{u} \sum_{i=1}^{n-1} \log P(s_{i+1} \mid s_1, \ldots, s_i; \theta)</tex-math></disp-formula>
          where for each user entity <inline-formula><tex-math>u</tex-math></inline-formula>, <inline-formula><tex-math>P(s_{i+1} \mid s_1, \ldots, s_i)</tex-math></inline-formula> is the output probability of the next event at each time step and <inline-formula><tex-math>\theta</tex-math></inline-formula> corresponds to the model and embedding matrix parameters. Assuming each feature of the predicted event to be independent given the history,
          <disp-formula><label>(6)</label><tex-math>P(s_{i+1} \mid s_1, \ldots, s_i) = \prod_{j=1}^{k} P(f_j(i+1) \mid s_1, \ldots, s_i)</tex-math></disp-formula>
        </p>
        <p>For the probability terms corresponding to low-cardinality and bucketed real-valued feature inputs, the full softmax can be computed without any computational bottleneck, and the cross entropy of the predicted distribution with the next event feature is used in the loss function.
          <disp-formula><label>(7)</label><tex-math>\mathcal{L}^{\mathrm{CE}}_{j}(i) = H\big(p_j(i+1), \hat{p}_j(i)\big)</tex-math></disp-formula>
          where <inline-formula><tex-math>H(p, q)</tex-math></inline-formula> is the cross entropy between probability distributions <inline-formula><tex-math>p</tex-math></inline-formula> and <inline-formula><tex-math>q</tex-math></inline-formula>, <inline-formula><tex-math>p_j(i+1)</tex-math></inline-formula> is the ground truth probability distribution for the <inline-formula><tex-math>j</tex-math></inline-formula>-th feature of the next event and <inline-formula><tex-math>\hat{p}_j(i)</tex-math></inline-formula> is its predicted output probability distribution from the softmax function at time step <inline-formula><tex-math>i</tex-math></inline-formula>.
        </p>
        <p>To avoid the computational bottleneck in the case of high-cardinality and natural language features, contrastive predictive coding [<xref ref-type="bibr" rid="ref5">5</xref>] is used, which classifies the ground truth feature value of the next time step against a set of randomly chosen negative examples directly in the embedding space. The dot product between the predicted embedding and the target embedding (ground truth or negative samples) represents the logits, using which the cross entropy is computed.
          <disp-formula><label>(8)</label><tex-math>\mathcal{L}^{\mathrm{CPC}}_{j}(i) = -\log P\big(E_j(f_j(i+1)) \mid \hat{e}_{i,j}, \{s_{\mathrm{neg}}\}\big) = -\log \frac{\exp\big(E_j(f_j(i+1))^{\top} \hat{e}_{i,j}\big)}{\exp\big(E_j(f_j(i+1))^{\top} \hat{e}_{i,j}\big) + \sum_{\mathrm{neg}} \exp\big(E_j(f_j(\mathrm{neg}))^{\top} \hat{e}_{i,j}\big)}</tex-math></disp-formula>
          where <inline-formula><tex-math>\hat{e}_{i,j}</tex-math></inline-formula> is the prediction for the next time step embedding for the high-cardinality or natural language feature <inline-formula><tex-math>j</tex-math></inline-formula> and <inline-formula><tex-math>\{s_{\mathrm{neg}}\}</tex-math></inline-formula> is the set of events that form the negative samples.
        </p>
        <p>The final loss function now consists of two parts: exact cross entropy loss for low-cardinality features and contrastive loss for high-cardinality and natural language features. Let <inline-formula><tex-math>I_j</tex-math></inline-formula> be an indicator variable that takes the value 1 if feature <inline-formula><tex-math>j</tex-math></inline-formula> is a high-cardinality or natural language feature and 0 otherwise. The self-supervised loss function hence becomes:
          <disp-formula><label>(9)</label><tex-math>\mathcal{L}_{\mathrm{SSL}} = \frac{1}{n-1} \sum_{i=1}^{n-1} \sum_{j=1}^{k} \Big( (1 - I_j)\, \mathcal{L}^{\mathrm{CE}}_{j}(i) + I_j\, \mathcal{L}^{\mathrm{CPC}}_{j}(i) \Big)</tex-math></disp-formula>
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>3.3. Data and Hyperparameters</title>
        <sec id="sec-1-2-1">
          <p>The dataset consists of user ad click sequences aggregated over a pre-defined time window for a large-scale advertising program. Only sequences above a minimum length are considered, and the maximum sequence length is capped by keeping only the most recent events. We split users into train, validation and test sets in an 80:10:10 ratio. The models are trained with TensorFlow in a distributed multi-machine setup with NVIDIA V100 GPUs using synchronous weight updates. The loss is optimized using the AdamW [35] optimizer with <inline-formula><tex-math>\beta_1 = 0.9</tex-math></inline-formula> and <inline-formula><tex-math>\beta_2 = 0.95</tex-math></inline-formula>. We clip the global norm of the gradients at 1.0. Decoupled weight decay with a rate of 0.1 is applied. Unless otherwise mentioned, we use a fixed learning rate of 1e-4 after an initial warmup schedule that steadily increases the learning rate from 0 to 1e-4 over the first epoch.</p>
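          <p>To make the two-part objective of Section 3.2 concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that per-feature logits and embeddings are already available as Python lists; the function names (cross_entropy_loss, contrastive_loss, self_supervised_loss) and the input layout are hypothetical choices made only for illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits, target_index):
    """Exact softmax cross entropy, as used for low-cardinality features."""
    return -math.log(softmax(logits)[target_index])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(predicted, positive, negatives):
    """Contrastive loss for one high-cardinality feature.

    Logits are dot products between the predicted embedding and the target
    embeddings; the ground-truth embedding is placed at index 0 and is
    classified against the negative samples.
    """
    logits = [dot(positive, predicted)] + [dot(neg, predicted) for neg in negatives]
    return cross_entropy_loss(logits, 0)


def self_supervised_loss(per_step_losses, is_high_cardinality):
    """Combine per-feature losses with the indicator, averaged over time steps.

    per_step_losses[i][j] is a (ce_loss, contrastive_loss) pair for feature j
    at time step i (hypothetical layout); is_high_cardinality[j] is 0 or 1.
    """
    total = 0.0
    for step in per_step_losses:
        for (ce, cpc), flag in zip(step, is_high_cardinality):
            total += (1 - flag) * ce + flag * cpc
    return total / len(per_step_losses)
```
]]></preformat>
          <p>With a single negative whose embedding equals the positive one, contrastive_loss returns log 2, the same value cross_entropy_loss gives for two equal logits, which is a quick sanity check that both branches share the same cross-entropy core.</p>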
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Scaling Analysis</title>
      <sec>
        <title>4.1. Model Size</title>
        <p>We scale the model size in terms of the number of non-embedding trainable parameters in the Transformer by increasing the number of layers, the latent state dimension and the number of heads. We vary the number of non-embedding parameters over 4 orders of magnitude and train each model until convergence on the entire training dataset, which is the upper bound of the available data. Table 1 shows the different model configurations and their test loss at convergence. We note that the performance varies only weakly with the individual layer hyperparameters but strongly with the overall non-embedding parameter count, as also shown in [16].</p>
        <p>The test loss at convergence follows a power-law relationship with the number of non-embedding parameters at constant dataset size, as shown in Figure 1. We extrapolate the power-law trend observed between 50k parameters and 25M parameters to 85M parameters and highlight that the estimated test loss of 7.429 closely matches the experimental value of 7.425. This implies that even at 85M parameters, we are not bottlenecked by the dataset size, indicating that a billion-scale parameter model is unlikely to overfit due to a data bottleneck. However, we acknowledge that this trend must eventually saturate, beyond which it would not be useful to further increase model size under the current training framework, as we are upper bounded in terms of organically available data.</p>
        <p>[Figure 1: test loss at convergence versus number of non-embedding parameters.]</p>
      </sec>
      <sec id="sec-2-1">
        <title>4.2. Data Size</title>
        <sec id="sec-2-1-1">
          <p>We analyze the impact of data scaling by considering different train dataset sizes, created by taking 0.1%, 1%, 25% and 100% of the available user sequence data. The learning rate is kept constant and we scale the number of GPUs with increased model size to keep the global batch size constant. We train three model sizes, with 410K, 3.1M and 12.6M trainable parameters, until convergence on varying dataset sizes and plot their test loss in Figure 2.</p>
          <p>We obtain three key insights from Figure 2. First, we observe that larger models are more data efficient. That is, larger models require a smaller dataset to achieve a fixed test loss. Second, smaller models benefit more from increasing dataset size when compared to larger models. Finally, for a fixed model size, increasing dataset size shows diminishing returns in terms of test loss improvement, suggesting that to maximize performance, model size and dataset size must be scaled in tandem. However, in the practical setting of activity sequence models, where the dataset size is upper bounded, it would still be useful to train the largest possible model to maximize performance within the bounds suggested in Section 4.1. As larger model training requires significantly more compute, we explore the compute allocation strategy in the next section.</p>
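          <p>The power-law extrapolation described in Section 4.1 can be sketched as a straight-line fit in log-log space. The snippet below is a minimal illustration under the assumption of a pure power law of the form loss = a * N^(-alpha), with no irreducible-loss term; the paper does not state the exact functional form it fits. The function names and the data points are hypothetical and synthetic, not the paper's measurements.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def fit_power_law(param_counts, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(N).

    Returns (a, alpha) for the assumed model loss = a * N ** (-alpha).
    """
    xs = [math.log(n) for n in param_counts]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, alpha, param_count):
    """Evaluate the fitted power law at a given parameter count."""
    return a * param_count ** (-alpha)


# Synthetic points generated from a known power law (illustrative only).
sizes = [5e4, 4e5, 3e6, 2.5e7]
synthetic_losses = [10.0 * n ** (-0.02) for n in sizes]

a, alpha = fit_power_law(sizes, synthetic_losses)
extrapolated = predict_loss(a, alpha, 8.5e7)  # extrapolate beyond the fitted range
```
]]></preformat>
          <p>Comparing an extrapolated value of this kind against the measured loss of a held-out larger model is the style of check reported in Section 4.1 for the 85M-parameter model.</p>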
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.3. Compute</title>
        <p>In industry settings, training of models is bounded by
monetary constraints. We use wall-clock GPU-hours on a homogeneous GPU setup (NVIDIA V100s) as the measure of compute, rather than the standard PetaFLOP-days, because the monetary cost incurred to train a model in a standard cloud setup is a function of GPU wall-clock time usage and not GPU utilization. In this section, we explore, for a fixed budget (monetary value or equivalent GPU-hours), the efficient scaling strategy for model size and data parallelism (global batch size) to achieve the lowest possible test loss. Scaling up model size at a fixed global batch size would require more GPUs to run in parallel, reducing the number of serial gradient update steps that can be performed in a fixed GPU-hour budget.</p>
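        <p>The trade-off between parallel GPUs and serial gradient steps under a fixed GPU-hour budget can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that the time per step does not change with the number of GPUs (data parallelism with a fixed per-GPU batch size); the function names and all numbers are hypothetical and are not taken from the paper's tables.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def serial_steps(gpu_hour_budget, num_gpus, seconds_per_step):
    """Number of serial gradient steps that fit in a GPU-hour budget.

    All GPUs run in parallel, so the wall-clock time available is
    gpu_hour_budget / num_gpus hours.
    """
    wall_clock_seconds = gpu_hour_budget / num_gpus * 3600.0
    return int(wall_clock_seconds // seconds_per_step)


def samples_processed(gpu_hour_budget, num_gpus, per_gpu_batch, seconds_per_step):
    """Total training samples seen: serial steps times the global batch size."""
    steps = serial_steps(gpu_hour_budget, num_gpus, seconds_per_step)
    return steps * num_gpus * per_gpu_batch


# Same 16 GPU-hour budget, two hypothetical configurations.
steps_8_gpus = serial_steps(16, 8, 2.0)   # larger global batch, fewer serial steps
steps_4_gpus = serial_steps(16, 4, 2.0)   # smaller global batch, more serial steps
```
]]></preformat>
        <p>Under these assumptions both configurations process the same number of samples, but the 4-GPU run takes twice as many serial gradient steps, which is the quantity the rest of this section trades against global batch size.</p>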
        <p>Alternatively, one could reduce global batch size and
number of parallel GPUs for a model and increase the number of serial steps. We analyze this trade-off by fixing the number of GPU-hours and varying the configuration across different model sizes and global batch sizes in a way that GPU utilization stays maximized. We plot the test loss for different model sizes at maximum GPU utilization in Figure 3 for the configurations detailed in Table 2. We note that the learning rate is scaled proportionately with the global batch size [36].</p>
        <p>We draw two key insights from Figure 3. First, for a fixed number of parallel GPUs and wall-clock time, larger models reach a lower test loss, indicating that sample efficiency (1/(number of serial gradient updates × global batch size)) increases with model size for a target test loss. Second, for all model sizes, increasing the number of serial gradient updates is more effective than increasing batch size. Hence, the efficient scaling strategy would suggest scaling up model size while lowering the global batch size for a given compute budget. However, reducing batch size to extremely low numbers would make the gradient updates noisier. We empirically demonstrate in Table 3 that lowering batch size below a minimum threshold <inline-formula><tex-math>B_{\min}</tex-math></inline-formula> for a larger model leads to worse performance than a smaller model at fixed compute.</p>
        <sec id="sec-2-2-1">
          <p>While scaling up model size at <inline-formula><tex-math>B_{\min}</tex-math></inline-formula> ensures efficient allocation of compute between data parallelism and serial steps, a larger model at <inline-formula><tex-math>B_{\min}</tex-math></inline-formula> requires a certain wall-clock time before its loss outperforms smaller models, due to more serial gradient steps in smaller models early on in the training. We define <inline-formula><tex-math>T_{\min}</tex-math></inline-formula> as the minimum wall-clock time required for a model with batch size <inline-formula><tex-math>B_{\min}</tex-math></inline-formula> to outperform all smaller models trained at their individual <inline-formula><tex-math>B_{\min}</tex-math></inline-formula> configuration for the same wall-clock time. We empirically demonstrate the existence of <inline-formula><tex-math>T_{\min}</tex-math></inline-formula> in Figure 4, where the larger 25M parameter model at batch size 16k eventually achieves a lower test loss than the smaller 410k parameter model with a more compute-efficient configuration of batch size 5k, where both batch sizes are greater than <inline-formula><tex-math>B_{\min}</tex-math></inline-formula>. Further extending to a fixed compute budget, Table 4 demonstrates that a larger 25M parameter model with batch size 8K achieves a lower test loss at the end of 30 minutes compared to a smaller 410K parameter model with batch size 7K at the end of 120 minutes on the wall clock, indicating that the <inline-formula><tex-math>T_{\min}</tex-math></inline-formula> for the 25M parameter model lies within the regime of the allocated compute budget even when data parallelism for the larger model was set at a more inefficient configuration than the smaller model.</p>
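          <p>One way to read the minimum wall-clock time off two measured loss curves is to look for the earliest checkpoint from which the larger model stays ahead. The sketch below is a simplified, pairwise illustration (the definition in the text is over all smaller models); the function name crossover_time and the loss values are hypothetical and are not the paper's measurements.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def crossover_time(times, small_model_loss, large_model_loss):
    """Earliest time from which the larger model's loss stays strictly below
    the smaller model's loss at every later checkpoint.

    Returns None if the larger model is not ahead at the final checkpoint.
    All three lists are sampled at the same wall-clock times.
    """
    crossover = None
    for t, small, large in zip(times, small_model_loss, large_model_loss):
        if large < small:
            if crossover is None:
                crossover = t
        else:
            # The larger model fell behind again, so reset the candidate.
            crossover = None
    return crossover


# Hypothetical test-loss curves sampled every 10 minutes of wall-clock time.
minutes = [10, 20, 30, 40]
small_curve = [8.0, 7.8, 7.7, 7.65]
large_curve = [8.5, 7.9, 7.6, 7.5]
estimate = crossover_time(minutes, small_curve, large_curve)
```
]]></preformat>
          <p>If the estimate exceeds the wall-clock time affordable under the budget, the smaller model would be the better choice in this simplified picture.</p>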
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Downstream Task Evaluation</title>
      <sec id="sec-3-1">
        <p>We evaluate the performance of the learnt user representations on two downstream tasks: the first, where accurate labels are available for training a classifier, and the second, where no task-specific fine-tuning is possible due to lack of labels.</p>
        <sec id="sec-3-1-1">
          <title>5.1. Linear Separability in Classification</title>
          <p>In this experiment, we benchmark the user embeddings
on the user conversion prediction task based on linear
separability. We train a linear binary classifier on the
learnt user embeddings (output of the last timestep in the
sequence) to predict if the user converts, and evaluate the
efficacy based on AUC-ROC. Higher AUC-ROC implies
that the embeddings have better linear separation with
respect to the downstream conversion label.</p>
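          <p>For reference, AUC-ROC can be computed directly from classifier scores as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below is a minimal pure-Python version of that definition; it is quadratic in the number of examples and is meant only as an illustration, not as the evaluation code used in the paper. The labels and scores in the example are made up.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 conversion labels.
    scores: classifier scores, e.g. outputs of a linear probe on embeddings.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four positive/negative pairs are ranked correctly.
example_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```
]]></preformat>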
        </sec>
        <sec id="sec-3-1-2">
          <title>5.2. Click bot detection</title>
          <p>Due to the absence of accurate ground truth labels, supervised techniques fall short in bot detection scenarios. While labeling individual samples accurately may not be possible, multiple domain-knowledge based heuristics can be applied to reliably evaluate if a given group of users is robotic. Hence, we cluster the self-supervised user embeddings using k-means, and clusters of users are marked as robotic based on these heuristics.</p>
          <p>We calibrate the heuristics to achieve a fixed False Positive Rate (FPR), which refers to the fraction of genuine human traffic flagged as robotic by the algorithm. Since we do not have ground truth labels, FPR is approximated by using converting users as a proxy for the distribution of human labels. The fraction of converting clicks that were marked as robotic is computed as FPR. We also define Invalidation Rate (IVR) as the fraction of total ad clicks flagged as robotic by the algorithm at the program level. For a fixed operating point FPR, the model with higher IVR indicates better robotic recall.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>5.3. Results</title>
        <p>We consider embeddings from the models described in Section 4.1, where we scale the non-embedding parameter count over 4 orders of magnitude on the entire training data and train until convergence. Table 5 shows the downstream performance of the models on the conversion prediction and the robot detection tasks.</p>
        <p>Unsurprisingly, the lower test loss of the larger models translates to better downstream performance for both the supervised task of conversion prediction and the unsupervised task of robot detection. We note that scaling patterns on downstream tasks do not necessarily follow the power law, making it challenging to predict the potential performance gains from a larger model a priori.</p>
        <p>Figure 5 shows the relative count of bot accounts flagged by individual models, split into different click sequence length buckets. It is evident that the larger models are highly effective in identifying bot activity, with low click bucket bot detection improving by 42% and medium click bot detection improving by 20% across the model sizes considered. This indicates that larger models are able to learn better representations for smaller sequence lengths and help disambiguate more sophisticated bot patterns with limited data.</p>
      </sec>
      <sec id="sec-3-3">
        <title>6. Discussion</title>
        <p>We show that the test loss of activity sequence models trained using generative pre-training follows a power-law relationship with model size at constant dataset size, similar to observations made in the text, image and audio domains [16, 17, 34]. Unlike the text and image domains, where increasing dataset size is relatively easy by gathering data from the web, user activity sequence datasets have a hard upper bound on dataset size, governed by the number of users interacting with the ad program. Thus, increasing model sizes would eventually lead to overfitting, saturating the power-law curve. However, our data scaling experiments show that present model sizes do not show saturating behavior even on 1% dataset size, indicating that there is significant room for model scaling at our current dataset size. We also show that larger models are more data efficient, achieving a lower test loss at fixed dataset size, consistent with the trends observed in the text and image domains [16, 17], with the key distinction that smaller models benefit more from increased data in the activity sequence domain.</p>
        <p>As monetary constraints are a key consideration in
compute scaling in most industrial settings, we presented a strategy to allot fixed GPU-hours across model size and global batch size. In contrast to observations in natural language models [16], we observe that scaling serial gradient update steps is more effective than scaling batch size, as long as the batch size is above the minimum batch size threshold <inline-formula><tex-math>B_{\min}</tex-math></inline-formula>. Compute-efficient training of activity sequence models involves limiting the number of GPUs such that a global batch size of <inline-formula><tex-math>B_{\min}</tex-math></inline-formula> is achieved, and picking a model size such that training is performed for at least the minimum wall-clock time <inline-formula><tex-math>T_{\min}</tex-math></inline-formula>. Thus, compute-efficient training stops far short of convergence, as has also been highlighted for natural language and computer vision models. While larger models have been shown to be sample efficient [16, 17, 34], we show that the same translates to activity sequence models, even under an additional constraint of fixed GPU-hours.</p>
        <p>Finally, we show performance on downstream tasks of
bot detection and conversion prediction improves with
generative pre-training of larger model sizes. While we
obtain performance gains, they do not follow a power law
relationship, making it difficult to predict performance
gains on business tasks with model size scaling. This
observation is also consistent with findings in the text
domain where just scaling model size has shown
significant improvements in downstream task performance
[28] that may not always follow the power law.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusion and Future Work</title>
      <sec id="sec-4-1">
        <p>We presented model-, data- and compute-based scaling properties for generative pre-training of user activity sequence Transformer models and demonstrated how scaling translates to better next event prediction efficacy, which in turn leads to better downstream performance on advertising tasks.</p>
        <p>
          In future work we plan to study scaling properties with respect to activity sequence lengths, by using longer time windows as a mechanism to scale the current bounded dataset size. We will also experiment with more efficient training strategies that help improve over the current power law, while reducing training costs. Finally, with recent work on joint representation learning of time-varying sequence data and fixed tabular data using masked language modeling [<xref ref-type="bibr" rid="ref12">12</xref>], we will attempt to study if scaling properties from this work also generalize to other pre-training objectives.
        </p>
        <p>Bhargav Bhushanam and Adnan Aziz. "Understanding Scaling Laws for Recommendation Models." ArXiv abs/2208.08489 (2022): n. pag.</p>
        <p>[15] Shin, Kyuyong, Hanock Kwak, KyungHyun Kim, Su Young Kim and Max Nihlén Ramstrom. "Scaling Law for Recommendation Models: Towards General-purpose User Representations." ArXiv abs/2111.11294 (2021): n. pag.</p>
        <p>[16] Kaplan, Jared, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeff Wu and Dario Amodei. "Scaling Laws for Neural Language Models." ArXiv abs/2001.08361 (2020): n. pag.</p>
        <p>[17] Zhai, Xiaohua, Alexander Kolesnikov, Neil Houlsby and Lucas Beyer. "Scaling Vision Transformers." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 1204-1213.</p>
        <p>[18] Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and L. Sifre. "Training Compute-Optimal Large Language Models." ArXiv abs/2203.15556 (2022): n. pag.</p>
        <p>[19] Henighan, Tom, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun et al. "Scaling laws for autoregressive generative modeling." arXiv preprint arXiv:2010.14701 (2020).</p>
        <p>[20] Tay, Yi, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung et al. "Ul2: Unifying language learning paradigms." In The Eleventh International Conference on Learning Representations. 2022.</p>
        <p>[21] Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. "Beyond neural scaling laws: beating power law scaling via data pruning." Advances in Neural Information Processing Systems 35 (2022): 19523-19536.</p>
        <p>[22] Van Den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." In International conference on machine learning, pp. 1747-1756. PMLR, 2016.</p>
        <p>[23] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language understanding by generative pre-training." (2018).</p>
        <p>biah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020): 1877-1901.</p>
        <p>[28] OpenAI. "GPT-4 Technical Report." ArXiv abs/2303.08774 (2023): n. pag.</p>
        <p>[29] Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov and Noah Fiedel. "PaLM: Scaling Language Modeling with Pathways." ArXiv abs/2204.02311 (2022): n. pag.</p>
        <p>[30] Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave and Guillaume Lample. "LLaMA: Open and Efficient Foundation Language Models." ArXiv abs/2302.13971 (2023): n. pag.</p>
        <p>[31] Creswell, Antonia, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. "Generative adversarial networks: An overview." IEEE signal processing magazine 35, no. 1 (2018): 53-65.</p>
        <p>[32] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).</p>
        <p>[33] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.</p>
        <p>[34] Pu, J., Yang, Y., Li, R., Elibol, O., Droppo, J. (2021) Scaling Effect of Self-Supervised Speech Models. Proc. Interspeech 2021, 1084-1088, doi: 10.21437/Interspeech.2021-1935</p>
        <p>[24] Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are
unsupervised multitask learners." OpenAI blog 1, no. 8 [35] Loshchilov, Ilya, and Frank Hutter. "Decoupled weight
(2019): 9. decay regularization." arXiv preprint arXiv:1711.05101
[25] Oord, Aaron van den, Sander Dieleman, Heiga Zen, (2017).</p>
        <p>Karen Simonyan, Oriol Vinyals, Alex Graves, Nal [36] Goyal, Priya, Piotr Dollár, Ross Girshick, Pieter
NoordKalchbrenner, Andrew Senior, and Koray Kavukcuoglu. huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,
"Wavenet: A generative model for raw audio." arXiv Yangqing Jia, and Kaiming He. "Accurate, large
minipreprint arXiv:1609.03499 (2016). batch sgd: Training imagenet in 1 hour." arXiv preprint
[26] Henaf, Olivier. "Data-eficient image recognition with arXiv:1706.02677 (2017).</p>
        <p>contrastive predictive coding." In International
conference on machine learning, pp. 4182-4192. PMLR, 2020.
[27] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie
Sub</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Vaswani</surname>
            , Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
            <given-names>Aidan N.</given-names>
          </string-name>
          <string-name>
            <surname>Gomez</surname>
            , Łukasz Kaiser, and
            <given-names>Illia</given-names>
          </string-name>
          <string-name>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Agarwal</surname>
            , Rajat, Shailendra Agarwal, Agniva Som, and
            <given-names>Hemant</given-names>
          </string-name>
          <string-name>
            <surname>Kowshik</surname>
          </string-name>
          .
          <article-title>Using Customer Ad Click Sequences to Identify Invalid Traffic in Sponsored Products</article-title>
          .
          <source>In Amazon Machine Learning Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Agarwal</surname>
            , Rajat, Agniva Som, Arvind Srinivasan, Jerin Francis, Anand Muralidhar, and
            <given-names>Hemant</given-names>
          </string-name>
          <string-name>
            <surname>Kowshik</surname>
          </string-name>
          .
          <article-title>Self-supervised Representation Learning for User Ad Activity Sequences</article-title>
          .
          <source>In Amazon Machine Learning Conference</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Gligorijevic</surname>
            , Djordje,
            <given-names>Jelena</given-names>
          </string-name>
          <string-name>
            <surname>Gligorijevic</surname>
            , and
            <given-names>Aaron</given-names>
          </string-name>
          <string-name>
            <surname>Flores</surname>
          </string-name>
          .
          <article-title>Time-Aware Prospective Modeling of Users for Online Display Advertising</article-title>
          .
          <source>arXiv preprint arXiv:1911.05100</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Oord</surname>
          </string-name>
          , Aaron van den,
          <string-name>
            <given-names>Yazhe</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          .
          <article-title>Representation learning with contrastive predictive coding</article-title>
          .
          <source>arXiv preprint arXiv:1807.03748</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>He</surname>
            , Kaiming, Xinlei Chen, Saining Xie,
            <given-names>Yanghao</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Piotr</given-names>
          </string-name>
          <string-name>
            <surname>Dollár</surname>
            , and
            <given-names>Ross</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          .
          <article-title>Masked autoencoders are scalable vision learners</article-title>
          .
          <source>arXiv preprint arXiv:2111.06377</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Yiping.</given-names>
          </string-name>
          <article-title>On the Effectiveness of Self-supervised Pretraining for Modeling User Behavior Sequences</article-title>
          . In AdKDD,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Naumov</surname>
          </string-name>
          , Maxim, Dheevatsa Mudigere,
          <string-name>
            <given-names>Hao-Jun Michael</given-names>
            <surname>Shi</surname>
          </string-name>
          , Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang et al.
          <article-title>Deep learning recommendation model for personalization and recommendation systems</article-title>
          .
          <source>arXiv preprint arXiv:1906.00091</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Abadi</surname>
            , Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis,
            <given-names>Jeffrey</given-names>
          </string-name>
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>Matthieu</given-names>
          </string-name>
          <string-name>
            <surname>Devin</surname>
          </string-name>
          et al.
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          .
          <source>In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)</source>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Yao</surname>
          </string-name>
          , Tiansheng, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon, Lichan Hong et al.
          <article-title>Self-supervised learning for large-scale item recommendations</article-title>
          .
          <source>In Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management</source>
          , pp.
          <fpage>4321</fpage>
          -
          <lpage>4330</lpage>
          .
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Guo</surname>
            , Wei, Can Zhang, Zhicheng He, Jiarui Qin, Huifeng Guo, Bo Chen, Ruiming Tang, Xiuqiang He and
            <given-names>Rui</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          . “
          <article-title>MISS: Multi-Interest Self-Supervised Learning Framework for Click-Through Rate Prediction</article-title>
          .”
          <source>2022 IEEE 38th International Conference on Data Engineering (ICDE)</source>
          (
          <year>2021</year>
          ):
          <fpage>727</fpage>
          -
          <lpage>740</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Agarwal</surname>
            , Rajat, Anand Muralidhar, Agniva Som and
            <given-names>Hemant</given-names>
          </string-name>
          <string-name>
            <surname>Kowshik</surname>
          </string-name>
          . “
          <article-title>Self-supervised Representation Learning Across Sequential and Tabular Features Using Transformers</article-title>
          .” (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Chitlangia</surname>
            , Sharad, Anand Muralidhar and
            <given-names>Rajat</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
          </string-name>
          . “
          <article-title>Self Supervised Pre-training for Large Scale Tabular Data</article-title>
          .” (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ardalani</surname>
          </string-name>
          , Newsha,
          <string-name>
            <given-names>Carole-Jean</given-names>
            <surname>Wu</surname>
          </string-name>
          , Zeliang Chen,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>