
Scaling Generative Pre-training for User Ad Activity Sequences

Sharad Chitlangia

Krishna Reddy Kesari

Rajat Agarwal

User activity sequence modeling has significantly improved performance across a range of tasks in advertising, from supervised learning tasks like ad response prediction to unsupervised tasks like robot and ad fraud detection. Self-supervised learning using autoregressive generative models has garnered interest due to performance improvements on time series and natural language data. In this paper, we present a scalable autoregressive generative pre-training framework to model user ad activity sequences and inspect its scaling properties with respect to model size, dataset size and compute. We show that test loss on the pre-training task follows power-law scaling with respect to model size, with larger models being more data and compute efficient than smaller models. We also demonstrate that improvement in pre-training test loss translates into better downstream task performance by benchmarking the models on conversion prediction and robot detection tasks in advertising.

Keywords: generative pre-training, self-supervised learning, scaling, invalid traffic, robot detection, digital advertising

KDD 2023 Workshop on Artificial Intelligence for Computational Advertising (AdKDD), August 7, 2023, Long Beach, CA. Contact: chitshar@amazon.com (S. Chitlangia); kkesari@amazon.com (K. R. Kesari); agrajat@amazon.com (R. Agarwal). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Advances in deep learning have driven a rapid adoption of sequence models applied to user behavioral data for advertising use cases spanning personalization, ad response prediction, bidding, and robot and fraud detection. Deep sequence models reduce reliance on manual feature engineering while utilizing fine-grained event-level information about the users' activity, leading to improved performance across a wide range of tasks. For tasks like ad response prediction, where labeled data is available at scale, typical approaches use supervised learning to train deep sequence models [4]. However, in domains like ad fraud detection, obtaining accurate labels at scale is impractical and error prone due to unavailability of high-coverage ground truth, and attempts to create pseudo labels are fraught with risks of introducing bias. In such scenarios, learning self-supervised user representations is a natural choice. Recent advances have shown that self-supervised pre-training of sequence models not only improves performance on tasks with low labeled data volumes but also enhances performance over traditional supervised learning on large labeled datasets.

Generative models, which aim to model the input data distribution $p(x)$, have been at the forefront in demonstrating the effectiveness of self-supervised learning. The key idea in self-supervised learning is to construct a proxy task on unlabeled data available at scale, training on which enables the model to learn robust task-agnostic embeddings capturing important characteristics and features about the dataset. Autoregressive models, a class of generative models that perform maximum likelihood estimation by defining an ordering over the input, are a natural fit for language and time series data and have yielded state-of-the-art results by training highly parallelizable deep sequence model architectures like Transformers [1] on the next-token prediction objective. This has motivated exploration of learning user embeddings using next event prediction on their ad activity sequences as the self-supervised pre-training objective [7, 12].

An interesting property of generative pre-training of Transformers is their enhanced performance with growing model size, data size and compute. Analysis of these scaling properties has garnered interest in the research community, with the primary focus so far being on natural language and computer vision data [16, 17]. In this work, we investigate the scaling properties for autoregressive pre-training of user activity sequence models in advertising. Rather than generalizing scaling laws in natural language processing to advertising, we believe user activity sequence models merit an independent scaling analysis, since they are different from text-based models in three significant ways. First, instead of a homogeneous time series of text tokens, a user activity sequence is a multi-dimensional time series where each event in the sequence can be described using a variety of feature types typically seen in advertising, spanning discrete, high cardinality, real-valued and natural language types.

Second, data size in advertising is upper bounded by the number of users interacting with the ad program. This is in contrast with scaling of text-based models where, while increasing model size, dataset size is considered to be unbounded as one could crawl more webpages to get additional data. Finally, since traffic patterns and user behavior in advertising are continuously evolving, advertising models need to be retrained continuously, making training cost and time a critical factor in deciding the scaling strategy for deployed settings.

The paper is structured as follows: Section 2 describes the related work, and Section 3 outlines a scalable autoregressive generative pre-training framework for user activity sequences. In Section 4, we analyze different scaling properties of the model with respect to model size, dataset size and compute. In Section 5, we present how improvement in test loss during pre-training translates to downstream performance on a supervised task of conversion prediction and an unsupervised task of bot detection in advertising. We discuss the key learnings and contrast them with scaling properties in other data domains in Section 6 and conclude in Section 7.

2. Related Work

Generative models aim to learn the input data distribution $p(x)$, and help either estimate the probability of a given data point or sample a data point from the input distribution [31, 32, 33]. In this paper, we are primarily concerned with autoregressive deep generative models. The autoregressive formulation factorizes learning the distribution $p(x)$ as the product of conditional probabilities of the current value given all previous values in a pre-defined ordering. This framework has been applied successfully across various domains such as image synthesis (PixelRNN [22], CPCv2 [26]), audio synthesis (WaveNet [25]) and text (GPT [23, 24]). Previous work has also applied a similar autoregressive framework to self-supervised modeling of user activity sequences [7, 2, 3, 12], with benefits across various downstream tasks.

Power-law based scaling properties for generative pre-training on text datasets using Transformer models were studied in [16]. These properties have been successfully used to create large language models such as GPT-3 [27], GPT-4 [28], PaLM [29], LLaMA [30], etc. Works to establish scaling laws for other data domains such as vision followed in quick succession [17, 18, 19]. More recently, there has been a line of work suggesting that these laws might be less universal than earlier suggested [18]. In addition, methods have been proposed to make the scaling exponential for certain tasks, by either pruning the data effectively [21] or pre-training with a different objective [20].

Self-supervised learning has emerged as an important technique in the domains of recommender systems [10], advertising [11] and fraud detection [12]. Particularly, in detection of fraud [12, 13], where we typically observe a lack of precise labels, pre-training representations help avoid label bias. There has been very little work towards exploring scaling behavior in these domains. [14] studied scaling behavior of DLRM-style recommender system models across parameters, compute and data to show that, unlike text data, model scaling does not contribute as much to performance improvements in recommender systems. Previous works on scaling laws in advertising use CLIP-based models and assume data access across multiple domains [15]. To the best of our knowledge, our work is the first to study scaling behavior of Transformers built on user activity sequences alone that uses vanilla autoregressive pre-training and also includes an evaluation of large models on downstream tasks relevant in advertising and fraud detection.

3. Modeling Framework

3.1. Constructing Input Sequences

We order ad events (clicks) from the user based on timestamps to construct the activity sequence. Each event in the sequence is described using multiple features, creating a multi-dimensional time series of the user's ad activity. To handle multiple feature types describing the ad event, we encode each feature using an embedding function which is learnt in an end-to-end manner with the model training objective. Real-valued features are converted to categorical ones using bucketing to handle their large range. Formally, let $S$ be the $n$-length sequence of events for a user entity $u$ ordered in time, where $x_i$ indicates the event in position $i$ in $S$. Let $[f_1, f_2, \ldots, f_k]$ be the feature set used for the event description and let $[E_1, E_2, \ldots, E_k]$ be the embedding functions for the corresponding features.

$$S = [x_1, x_2, \ldots, x_i, \ldots, x_n] \quad (1)$$

$$x_i = [f_1(x_i); f_2(x_i); \ldots; f_k(x_i)] \quad (2)$$

The descriptor of each ad event is a concatenation of associated feature embeddings. $C_i$ represents the concatenation of these embeddings for the event at position $i$.

$$C_i = \mathrm{concat}\big(E_1(f_1(x_i)), E_2(f_2(x_i)), \ldots, E_k(f_k(x_i))\big) \quad (3)$$

3.2. Training Objective

The time series of events $S$, represented by their concatenated feature representations $C$, is provided as input to an off-the-shelf autoregressive deep sequence model. The output representation at the last time step is taken as the output representation of the sequence.

$$R = M(C) \quad (4)$$

where $M$ denotes the sequence model and $R$ is the output representation (embedding) obtained at the final time step of the model.
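The input construction of Section 3.1 (Eqs. 1 to 3) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two features (`slot`, `price`), the bucket boundaries, the embedding sizes, and the randomly initialised tables, which in the actual framework would be learnt end-to-end with the training objective.

```python
import bisect
import random

def bucketize(value, boundaries):
    """Map a real value to a bucket index, given sorted bucket boundaries."""
    return bisect.bisect_right(boundaries, value)

def make_embedding_table(vocab_size, dim, seed):
    """Random stand-in for an embedding function E_j (learnt in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

# Hypothetical feature set: a categorical ad-slot id and a real-valued price.
PRICE_BOUNDARIES = [0.1, 1.0, 10.0]  # three boundaries give four buckets
TABLES = {
    "slot": make_embedding_table(vocab_size=5, dim=3, seed=0),
    "price_bucket": make_embedding_table(len(PRICE_BOUNDARIES) + 1, dim=2, seed=1),
}

def encode_event(event):
    """C_i: concatenation of the per-feature embeddings of one event (Eq. 3)."""
    slot_vec = TABLES["slot"][event["slot"]]
    price_vec = TABLES["price_bucket"][bucketize(event["price"], PRICE_BOUNDARIES)]
    return slot_vec + price_vec  # list concatenation

def encode_sequence(events):
    """Order events by timestamp (Eq. 1) and encode each one."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    return [encode_event(e) for e in ordered]

events = [
    {"timestamp": 20, "slot": 3, "price": 0.05},
    {"timestamp": 10, "slot": 1, "price": 2.5},
]
C = encode_sequence(events)
```

Here `C` is a list of per-event vectors in time order; a sequence model would consume it and expose the last time step's output as the user representation.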

The model parameters along with the embedding matrices are trained using next event prediction as the self-supervised objective. At each time step, the model predicts the probability of the next event given only the history, making the autoregressive property a necessary condition for the choice of the deep sequence model. We use the Transformer decoder block as the autoregressive model. The goal is to maximize the following likelihood,

$$L(\theta) = \sum_{u} \sum_{i} \log P(x_{i+1} \mid x_1, \ldots, x_i; \theta) \quad (5)$$

where for each user entity $u$, $P(x_{i+1} \mid x_1, \ldots, x_i)$ is the output probability of the next event at each time step and $\theta$ corresponds to the model and embedding matrix parameters. Assuming each feature of the predicted event to be independent given the history,

$$P(x_{i+1} \mid x_1, \ldots, x_i) = \prod_{j=1}^{k} P\big(f_j(x_{i+1}) \mid x_1, \ldots, x_i\big) \quad (6)$$

For the probability terms corresponding to low cardinality and bucketed real-valued feature inputs, full softmax can be computed without any computational bottleneck, and the cross entropy of the predicted distribution with the next event feature is used in the loss function.

$$\mathcal{L}^{CE}_{j}(i) = H\big(p_j(i+1), \hat{p}_j(i)\big) \quad (7)$$

where $H(p, q)$ is the cross entropy between probability distributions $p$ and $q$, $p_j(i+1)$ is the ground truth probability distribution for the $j$-th feature of the next event and $\hat{p}_j(i)$ is its predicted output probability distribution from the softmax function at time step $i$. To avoid the computational bottleneck in case of high cardinality and natural language features, contrastive predictive coding [5] is used, which classifies the ground truth feature value of the next time step against a set of randomly chosen negative examples directly in the embedding space.

The dot product between the predicted embedding and the target embedding (ground truth or negative samples) represents the logits, using which the cross entropy is computed.

$$\mathcal{L}^{CPC}_{j}(i) = -\log P\big(E_j(f_j(x_{i+1})) \mid \hat{e}_{i,j}, \{x_n\}\big) = -\log \frac{\exp\big(E_j(f_j(x_{i+1})) \cdot \hat{e}_{i,j}\big)}{\exp\big(E_j(f_j(x_{i+1})) \cdot \hat{e}_{i,j}\big) + \sum_{x_n} \exp\big(E_j(f_j(x_n)) \cdot \hat{e}_{i,j}\big)} \quad (8)$$

where $\hat{e}_{i,j}$ is the prediction for the next time step embedding for the high cardinality / natural language feature $f_j$ and $\{x_n\}$ is the set of events that form the negative samples.

The final loss function now consists of two parts: exact cross entropy loss for low cardinality features and contrastive loss for high cardinality and natural language features. Let $I_j$ be an indicator variable that takes the value 1 if feature $f_j$ is a high cardinality / natural language feature and 0 otherwise. The self-supervised loss function hence becomes:

$$\mathcal{L}_{SSL} = \frac{1}{n-1} \sum_{i=1}^{n-1} \sum_{j=1}^{k} \Big( (1 - I_j)\, \mathcal{L}^{CE}_{j}(i) + I_j\, \mathcal{L}^{CPC}_{j}(i) \Big) \quad (9)$$

3.3. Data and Hyperparameters

The dataset consists of user ad click sequences aggregated over a pre-defined time window for a large-scale advertising program. Only sequences above a minimum length are considered and the maximum sequence length is bounded to recent events. We split users into train, validation and test sets in an 80:10:10 ratio. The models are trained with TensorFlow in a distributed multi-machine setup with NVIDIA V100 GPUs using synchronous weight updates. The loss is optimized using the AdamW [35] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.95$. We clip the global norm of the gradients at 1.0. Decoupled weight decay with a rate of 0.1 is applied. Unless otherwise mentioned, we use a fixed learning rate of 1e-4 after an initial warmup schedule that steadily increases the learning rate from 0 to 1e-4 over the first epoch.
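To make the two-part training loss of Section 3.2 concrete, the sketch below computes a full-softmax cross entropy for a low cardinality feature and a contrastive term against sampled negatives for a high cardinality feature, then averages over time steps. It is a plain-Python illustration under assumed inputs: the dictionary layout, the function names and the toy vectors are hypothetical, and the real framework would compute these terms in batched tensor form.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy_loss(logits, target_index):
    """Full-softmax cross entropy against a one-hot target (low cardinality)."""
    return -log_softmax(logits)[target_index]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(predicted, positive, negatives):
    """Classify the true next-step embedding against negatives.

    Logits are dot products between the predicted embedding and each
    candidate; the positive candidate sits at index 0.
    """
    logits = [dot(predicted, positive)] + [dot(predicted, neg) for neg in negatives]
    return -log_softmax(logits)[0]

def step_loss(features):
    """Sum over features, switching on the high-cardinality indicator."""
    total = 0.0
    for feat in features:
        if feat["high_cardinality"]:
            total += contrastive_loss(feat["predicted"], feat["positive"], feat["negatives"])
        else:
            total += cross_entropy_loss(feat["logits"], feat["target"])
    return total

def sequence_loss(steps):
    """Average the per-step loss over the predicted time steps."""
    return sum(step_loss(features) for features in steps) / len(steps)

# Toy example: one time step with one feature of each kind.
one_step = [
    {"high_cardinality": False, "logits": [0.0, 0.0], "target": 0},
    {"high_cardinality": True, "predicted": [1.0, 0.0],
     "positive": [1.0, 0.0], "negatives": [[0.0, 1.0]]},
]
loss = sequence_loss([one_step])
```

The contrastive term only ever touches the positive and the sampled negatives, which is what removes the need for a softmax over the full vocabulary of a high cardinality feature.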

4. Scaling Analysis

4.1. Model Size

We scale the model size in terms of the number of non-embedding trainable parameters in the Transformer by increasing the number of layers, the latent state dimension and the number of heads. We vary the number of non-embedding parameters over 4 orders of magnitude and train each model until convergence on the entire training dataset, which is the upper bound of the available data. Table 1 shows the different model configurations and their test loss at convergence. We note that the performance varies only weakly with the individual layer hyperparameters but strongly with the overall non-embedding parameter count, consistent with [16].

The test loss at convergence follows a power-law relationship with the number of non-embedding parameters at constant dataset size, as shown in Figure 1. We extrapolate the power-law trend observed between 50k parameters and 25M parameters to 85M parameters and highlight that the estimated test loss of 7.429 closely matches the experimental value of 7.425. This implies that even at 85M parameters, we are not bottlenecked by the dataset size, indicating that a billion-scale parameter model is unlikely to overfit due to a data bottleneck. However, we acknowledge that this trend must eventually saturate, beyond which it would not be useful to further increase model size under the current training framework, as we are upper bounded in terms of organically available data.

[Figure 1: Test loss versus number of non-embedding parameters.]

4.2. Data Size

We analyze the impact of data scaling by considering different train dataset sizes, created by taking 0.1%, 1%, 25% and 100% of the available user sequence data. The learning rate is kept constant and we scale the number of GPUs with increased model size to keep the global batch size constant. We train three model sizes, with 410K, 3.1M and 12.6M trainable parameters, until convergence on varying dataset sizes and plot their test loss in Figure 2.

We obtain three key insights from Figure 2. First, we observe that larger models are more data efficient. That is, larger models require a smaller dataset to achieve a fixed test loss. Second, smaller models benefit more from increasing dataset size when compared to larger models. Finally, for a fixed model size, increasing dataset size shows diminishing returns in terms of test loss improvement, suggesting that to maximize performance, model size and dataset size must be scaled in tandem. However, in the practical setting of activity sequence models, where the dataset size is upper bounded, it would still be useful to train the largest possible model to maximize performance within the bounds suggested in Section 4.1. As larger model training requires significantly more compute, we explore the compute allocation strategy in the next section.
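The extrapolation check described in Section 4.1 (fit a power law on smaller models, then predict the loss of a larger one) can be sketched as a least-squares line fit in log-log space. The data points below are synthetic, generated from an assumed power law with made-up constants; they are not the paper's measurements, and the functional form loss = a * N^(-alpha) is one common parameterisation rather than the authors' confirmed choice.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by ordinary least squares on logs."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Evaluate the fitted power law at a given parameter count."""
    return a * size ** (-alpha)

# Synthetic illustration only: constants chosen arbitrarily.
TRUE_A, TRUE_ALPHA = 10.0, 0.02
sizes = [5e4, 4e5, 3e6, 2.5e7]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses)
extrapolated = predict_loss(a, alpha, 8.5e7)
```

With real measurements, comparing `extrapolated` against the observed loss of the held-out larger model is the check that signals whether the trend has started to saturate.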

4.3. Compute

In industry settings, training of models is bounded by monetary constraints. We use wall-clock GPU-hours on a homogeneous GPU setup (NVIDIA V100s) as the measure of compute, rather than the standard PetaFLOP-days, as the monetary cost incurred to train a model in a standard cloud setup is a function of GPU wall-clock time usage and not GPU utilization. In this section, we explore, for a fixed budget (monetary value or equivalent GPU-hours), the efficient scaling strategy for model size and data parallelism (global batch size) to achieve the lowest possible test loss. Scaling up model size at a fixed global batch size would require more GPUs to run in parallel, reducing the number of serial gradient update steps that can be performed in a fixed GPU-hour budget.

Alternatively, one could reduce the global batch size and the number of parallel GPUs for a model and increase the number of serial steps. We analyze this trade-off by fixing the number of GPU-hours and varying the configuration across different model sizes and global batch sizes in a way that GPU utilization stays maximized. We plot the test loss for different model sizes at maximum GPU utilization in Figure 3 for configurations detailed in Table 2. We note that the learning rate is scaled proportionately with the global batch size [36].

We draw two key insights from Figure 3. First, for a fixed number of parallel GPUs and wall-clock time, larger models reach a lower test loss, indicating that sample efficiency (1/(number of serial gradient updates × global batch size)) increases with model size for a target test loss. Second, for all model sizes, increasing the number of serial gradient updates is more effective than increasing batch size. Hence, the efficient scaling strategy would suggest scaling up model size while lowering the global batch size for a given compute budget. However, reducing batch size to extremely low numbers would make the gradient updates noisier. We empirically demonstrate in Table 3 that lowering batch size below a minimum threshold $B_{\min}$ for a larger model leads to worse performance than a smaller model at fixed compute.

While scaling up model size at $B_{\min}$ ensures efficient allocation of compute between data parallelism and serial steps, a larger model at $B_{\min}$ requires a certain wall-clock time before its loss outperforms smaller models, due to more serial gradient steps in smaller models early on in the training. We define $T_{\min}$ as the minimum wall-clock time required for a model with batch size $B_{\min}$ to outperform all smaller models trained at their individual $B_{\min}$ configurations for the same wall-clock time.

We empirically demonstrate the existence of $T_{\min}$ in Figure 4, where the larger 25M parameter model at batch size 16k eventually achieves a lower test loss than the smaller 410k parameter model with a more compute efficient configuration of batch size 5k, where both batch sizes are greater than $B_{\min}$. Further extending to a fixed compute budget, Table 4 demonstrates that a larger 25M parameter model with batch size 8K achieves a lower test loss at the end of 30 minutes compared to a smaller 410K parameter model with batch size 7K at the end of 120 minutes on the wall clock, indicating that the $T_{\min}$ for the 25M parameter model lies within the regime of the allocated compute budget even when data parallelism for the larger model was set at a more inefficient configuration than the smaller model.
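The trade-off between data parallelism and serial steps under a fixed GPU-hour budget can be made concrete with simple arithmetic. The sketch below is a deliberately simplified accounting, and all of its inputs are assumptions chosen for illustration: a fixed per-GPU batch size, a step time that does not change with the number of GPUs (communication overhead is ignored), and a linear learning-rate scaling rule in the spirit of [36]. None of the numbers come from the paper's tables.

```python
import math

def plan_run(budget_gpu_hours, global_batch, per_gpu_batch,
             seconds_per_step, base_lr, base_batch):
    """Translate a GPU-hour budget into wall-clock time and serial steps."""
    num_gpus = math.ceil(global_batch / per_gpu_batch)
    # The budget is spent num_gpus times faster when more GPUs run in parallel.
    wall_clock_hours = budget_gpu_hours / num_gpus
    serial_steps = int(wall_clock_hours * 3600 / seconds_per_step)
    # Learning rate scaled proportionately with the global batch size.
    learning_rate = base_lr * global_batch / base_batch
    return {
        "num_gpus": num_gpus,
        "wall_clock_hours": wall_clock_hours,
        "serial_steps": serial_steps,
        "learning_rate": learning_rate,
    }

# Same hypothetical 16 GPU-hour budget, two global batch sizes.
large_batch = plan_run(16, 8192, 1024, 0.5, 1e-4, 1024)
small_batch = plan_run(16, 4096, 1024, 0.5, 1e-4, 1024)
```

Under these assumptions both configurations process the same number of samples, but the smaller global batch gets twice as many serial gradient updates, which is the quantity the section argues matters more once the batch size is above the minimum threshold.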

5. Downstream Task Evaluation

We evaluate the performance of the learnt user representations on two downstream tasks: one where accurate labels are available for training a classifier, and another where no task-specific fine-tuning is possible due to lack of labels.

5.1. Linear Separability in Classification

In this experiment, we benchmark the user embeddings on the user conversion prediction task based on linear separability. We train a linear binary classifier on the learnt user embeddings (output of the last timestep in the sequence) to predict if the user converts, and evaluate the efficacy based on AUC-ROC. Higher AUC-ROC implies that the embeddings have better linear separation with respect to the downstream conversion label.
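A linear-separability probe of this kind can be sketched in a few lines: fit a logistic regression on frozen embeddings and score it with AUC-ROC. The example below is a toy illustration, not the authors' evaluation code. The four two-dimensional "embeddings", the labels, and the training settings are invented, and the pairwise AUC computation is a simple quadratic-time form suitable only for small inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Logistic regression by batch gradient descent on frozen embeddings."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b) - target
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wk - lr * g / len(X) for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def auc_roc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Toy embeddings: converters (label 1) lean towards the first dimension.
X = [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1], [1.0, 0.0]]
y = [0, 0, 1, 1]
w, b = train_linear_probe(X, y)
scores = [sum(wk * xk for wk, xk in zip(w, x)) + b for x in X]
auc = auc_roc(y, scores)
```

Because the classifier is linear and the embeddings are held fixed, any gain in AUC-ROC between two pre-trained models reflects the embeddings themselves rather than the capacity of the probe.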

5.2. Click Bot Detection

Due to the absence of accurate ground truth labels, supervised techniques fall short in bot detection scenarios.

While labeling individual samples accurately may not be possible, multiple domain-knowledge based heuristics can be applied to reliably evaluate if a given group of users is robotic. Hence, we cluster self-supervised user embeddings using k-means, and clusters of users are marked as robotic based on these heuristics.

We calibrate the heuristics to achieve a fixed False Positive Rate (FPR), which refers to the fraction of genuine human traffic flagged as robotic by the algorithm. Since we do not have ground truth labels, FPR is approximated by using converting users as a proxy for the distribution of human labels. The fraction of converting clicks that were marked as robotic is computed as FPR. We also define Invalidation Rate (IVR) as the fraction of total ad clicks flagged as robotic by the algorithm at the program level. For a fixed operating point FPR, the model with higher IVR indicates better robotic recall.

5.3. Results

We consider embeddings from models described in Section 4.1, where we scale the non-embedding parameter count over 4 orders of magnitude on the entire training data and train until convergence. Table 5 shows the downstream performance of the models on the conversion prediction and the robot detection tasks.

Unsurprisingly, lower test loss of the larger models translates to better downstream performance for both the supervised task of conversion prediction and the unsupervised task of robot detection. We note that scaling patterns on downstream tasks do not necessarily follow the power law, making it challenging to predict the potential performance gains from a larger model a priori.

Figure 5 shows the relative count of bot accounts flagged by individual models, split into different click sequence length buckets. It is evident that the larger models are highly effective in identifying bot activity, with low click bucket bot detection improving by 42% and medium click bucket bot detection improving by 20% across the model sizes considered. This indicates that larger models are able to learn better representations for smaller sequence lengths and help disambiguate more sophisticated bot patterns with limited data.

6. Discussion

We show that the test loss of activity sequence models trained using generative pre-training follows a power-law relationship with model size at constant dataset size, similar to observations made in the text, image and audio domains [16, 17, 34]. Unlike the text and image domains, where increasing dataset size is relatively easier by gathering data from the web, user activity sequence datasets have a hard upper bound on dataset size, governed by the number of users interacting with the ad program. Thus, increasing model sizes would eventually lead to overfitting, saturating the power-law curve. However, our data scaling experiments show that present model sizes do not show saturating behavior even on 1% dataset size, indicating that there is significant room for model scaling at our current dataset size. We also show that larger models are more data efficient, achieving a lower test loss at fixed dataset size, consistent with the trends observed in the text and image domains [16, 17], with a key distinction that smaller models benefit more from increased data in the activity sequence domain.

As monetary constraints are a key consideration in compute scaling in most industrial settings, we presented a strategy to allot fixed GPU-hours across model size and global batch size. In contrast to observations in natural language models [16], we observe that scaling serial gradient update steps is more effective than scaling batch size, as long as the batch size is above the minimum threshold $B_{\min}$. Compute efficient training of activity sequence models involves limiting the number of GPUs such that a global batch size of $B_{\min}$ is achieved, and picking a model size such that training is performed for at least the minimum wall-clock time $T_{\min}$ that the model needs to outperform smaller models. Thus, compute efficient training stops far short of convergence, as has also been observed for natural language and computer vision models. While larger models have been shown to be sample efficient [16, 17, 34], we show that the same translates to activity sequence models, even under an additional constraint of fixed GPU-hours.

Finally, we show that performance on the downstream tasks of bot detection and conversion prediction improves with generative pre-training of larger model sizes. While we obtain performance gains, they do not follow a power-law relationship, making it difficult to predict performance gains on business tasks with model size scaling. This observation is also consistent with findings in the text domain, where just scaling model size has shown significant improvements in downstream task performance [28] that may not always follow the power law.

7. Conclusion and Future Work

We presented model, data and compute based scaling properties for generative pre-training of user activity sequence Transformer models and demonstrated how scaling translates to better next event prediction efficacy, which in turn leads to better downstream performance on advertising tasks.

In future work, we plan to study scaling properties with respect to activity sequence lengths, by using longer time windows as a mechanism to scale the current bounded dataset size. We will also experiment with more efficient training strategies that help improve over the current power law, while reducing training costs. Finally, with recent work on joint representation learning of time-varying sequence data and fixed tabular data using masked language modeling [12], we will attempt to study if scaling properties from this work also generalize to other pre-training objectives.

References

[1] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 5998-6008. 2017.

[2] Agarwal, Rajat, Shailendra Agarwal, Agniva Som, and Hemant Kowshik. "Using Customer Ad Click Sequences to Identify Invalid Traffic in Sponsored Products." In Amazon Machine Learning Conference, 2020.

[3] Agarwal, Rajat, Agniva Som, Arvind Srinivasan, Jerin Francis, Anand Muralidhar, and Hemant Kowshik. "Self-supervised Representation Learning for User Ad Activity Sequences." In Amazon Machine Learning Conference, 2021.

[4] Gligorijevic, Djordje, Jelena Gligorijevic, and Aaron Flores. "Time-Aware Prospective Modeling of Users for Online Display Advertising." arXiv preprint arXiv:1911.05100 (2019).

[5] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).

[6] He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. "Masked autoencoders are scalable vision learners." arXiv preprint arXiv:2111.06377 (2021).

[7] Liao, Yiping. "On the Effectiveness of Self-supervised Pretraining for Modeling User Behavior Sequences." In AdKDD, 2020.

[8] Naumov, Maxim, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang et al. "Deep learning recommendation model for personalization and recommendation systems." arXiv preprint arXiv:1906.00091 (2019).

[9] Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin et al. "Tensorflow: A system for large-scale machine learning." In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283. 2016.

[10] Yao, Tiansheng, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon, Lichan Hong et al. "Self-supervised learning for large-scale item recommendations." In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4321-4330. 2021.

[11] Guo, Wei, Can Zhang, Zhicheng He, Jiarui Qin, Huifeng Guo, Bo Chen, Ruiming Tang, Xiuqiang He and Rui Zhang. "MISS: Multi-Interest Self-Supervised Learning Framework for Click-Through Rate Prediction." 2022 IEEE 38th International Conference on Data Engineering (ICDE) (2021): 727-740.

[12] Agarwal, Rajat, Anand Muralidhar, Agniva Som and Hemant Kowshik. "Self-supervised Representation Learning Across Sequential and Tabular Features Using Transformers." (2022).

[13] Chitlangia, Sharad, Anand Muralidhar and Rajat Agarwal. "Self Supervised Pre-training for Large Scale Tabular Data." (2022).

[14] Ardalani, Newsha, Carole-Jean Wu, Zeliang Chen, Bhargav Bhushanam and Adnan Aziz. "Understanding Scaling Laws for Recommendation Models." arXiv preprint arXiv:2208.08489 (2022).

[15] Shin, Kyuyong, Hanock Kwak, KyungHyun Kim, Su Young Kim and Max Nihlén Ramström. "Scaling Law for Recommendation Models: Towards General-purpose User Representations." arXiv preprint arXiv:2111.11294 (2021).

[16] Kaplan, Jared, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeff Wu and Dario Amodei. "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361 (2020).

[17] Zhai, Xiaohua, Alexander Kolesnikov, Neil Houlsby and Lucas Beyer. "Scaling Vision Transformers." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 1204-1213.

[18] Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and L. Sifre. "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556 (2022).

[19] Henighan, Tom, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun et al. "Scaling laws for autoregressive generative modeling." arXiv preprint arXiv:2010.14701 (2020).

[20] Tay, Yi, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung et al. "Ul2: Unifying language learning paradigms." In The Eleventh International Conference on Learning Representations. 2022.

[21] Sorscher, Ben, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. "Beyond neural scaling laws: beating power law scaling via data pruning." Advances in Neural Information Processing Systems 35 (2022): 19523-19536.

[22] Van Den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." In International Conference on Machine Learning, pp. 1747-1756. PMLR, 2016.

[23] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language understanding by generative pre-training." (2018).

[24] Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language models are unsupervised multitask learners." OpenAI blog 1, no. 8 (2019): 9.

[25] Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).

[26] Henaff, Olivier. "Data-efficient image recognition with contrastive predictive coding." In International Conference on Machine Learning, pp. 4182-4192. PMLR, 2020.

[27] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.

[28] OpenAI. "GPT-4 Technical Report." arXiv preprint arXiv:2303.08774 (2023).

[29] Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov and Noah Fiedel. "PaLM: Scaling Language Modeling with Pathways." arXiv preprint arXiv:2204.02311 (2022).

[30] Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave and Guillaume Lample. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).

[31] Creswell, Antonia, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. "Generative adversarial networks: An overview." IEEE Signal Processing Magazine 35, no. 1 (2018): 53-65.

[32] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).

[33] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.

[34] Pu, J., Yang, Y., Li, R., Elibol, O., Droppo, J. "Scaling Effect of Self-Supervised Speech Models." Proc. Interspeech 2021, 1084-1088, doi: 10.21437/Interspeech.2021-1935 (2021).

[35] Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." arXiv preprint arXiv:1711.05101 (2017).

[36] Goyal, Priya, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. "Accurate, large minibatch sgd: Training imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).