1. Introduction

On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models

Rohan Anil

Sandra Gadanho

Da Huang

Nijith Jacob

Zhuoshu Li

Dong Lin

Todd Phillips

Cristina Pop

Kevin Regan

Gil I. Shamir

Rakesh Shivanna

Qiqi Yan

Google Inc.

For industrial-scale advertising systems, prediction of ad click-through rate (CTR) is a central problem. Ad clicks constitute a significant class of user engagements and are often used as the primary signal for the usefulness of ads to users. Additionally, in cost-per-click advertising systems where advertisers are charged per click, click rate expectations feed directly into value estimation. Accordingly, CTR model development is a significant investment for most Internet advertising companies. Engineering for such problems requires many machine learning (ML) techniques suited to online learning that go well beyond traditional accuracy improvements, especially concerning eficiency, reproducibility, calibration, credit attribution. We present a case study of practical techniques deployed in a search ads CTR model at a large Internet company. This paper provides an industry case study highlighting important areas of current ML research and illustrating how impactful new ML methods are evaluated and made useful in a large-scale industrial setting.

eol>Personalization Recommender system Content optimization Content ranking Content diversity Causal bandit Contextual bandit View-through attribution Holistic optimization

1. Introduction

viewed video, search query, or other. Search advertising specifically looks at matching a query with an ad Ad click-through rate (CTR) prediction is a key com- . CTR models for recommendation specifically aim to ponent of online advertising systems that has a direct predict the probability (|) , where the input is impact on revenue, and continues to be an area of active an ad-query pair ⟨, ⟩ , potentially adorned with addiresearch [1, 2, 3, 4]. This paper presents a detailed case tional factors afecting CTR, especially related to user study to give the reader a ”tour of the factory floor” of a interface: how ads will be positioned and rendered on a production CTR prediction system, describing challenges results page (Section 6). specific to this category of large industrial ML systems Beyond surfacing maximally useful results, recomand highlighting techniques that have proven to work mender systems for ads have important additional caliwell in practice. bration requirements. Actual click labels are stochastic,

The production CTR prediction model consists of bil- reflecting noisy responses from users. For any given adlions of weights, trains on more than one hundred bil- query and binary label , we typically hope to achieve lion examples, and is required to perform inference at precisely (| ) ∶= ⟨ , ⟩∼ [ = | ] over some well over one hundred thousand requests per second. sample of examples (in test or training). While a typical The techniques described here balance accuracy improve- log-likelihood objective in supervised training will result ments with training and serving costs, without adding in zero aggregate calibration bias across a validation set, undue complexity: the model is the target of sustained per-example bias is often non-zero. and substantial R&D and must allow for efectively build- Ads pricing and allocation problems create the pering on top of what came before. example calibration requirement. Typically, predictions will flow through to an auction mechanism that incor1.1. CTR for Search Ads porates bids to determine advertiser pricing. Auction Recommendations pricing schemes (e.g, VCG [5]) rely on the relative value of various potential outcomes. This requires that preThe recommender problem surfaces a result or set of re- dictions for all potential choices of be well calibrated sults from a given corpus, for a given initial context. The with respect to each other. Additionally, unlike simple initial context may be a user demographic, previously- recommenders, ads systems frequently opt to show no ads. This requires estimating the value of individual ads relative to this ”null-set” of no ads, rather than simply maximizing for ad relevance.

Consider a query like ”yarn for sale”; estimated CTR for an ad from ”yarn-site-1.com” might be 15.3%. EstiORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) mated CTR for an ad from ”yarn-site-2.com” might be producibility. 10.4%. Though such estimates can be informed by the This paper makes the following contributions: 1) we semantic relevance of the websites, the requirements for discuss practical ML considerations from many perspecprecision are more than what one should expect from gen- tives including accuracy, eficiency and reproducibility, eral models of language. Additionally, click-through data 2) we detail the real-world application of techniques that is highly non-stationary: click prediction is fundamen- have improved eficiency and accuracy, in some cases tally an online recommendation problem. An expectation describing adaptations specific to online learning, and of 15.3% is not static ground truth in the same sense as, 3) we describe how models can better generalize across for example, translation or image recommendation; it is UI treatments through model factorization and bias condefinitively more subject to evolution over time. straints.

1.2. Outline 2. Model and Training Overview

For ads CTR predictors, minor improvements to model quality will often translate into improved user experience A major design choice is how to represent an ad-query and overall ads system gains. This motivates continuous pair . The semantic information in the language of the investments in model research and development. Theo- query and the ad headlines is the most critical component. retical and benchmark improvements from ML literature Usage of attention layers on top of raw text tokens may rarely transfer directly to problem-dependent settings of generate the most useful language embeddings in current real-world applications. As such, model research must literature [12], but we find better accuracy and eficiency be primarily empirical and experimental. Consequently, trade-ofs by combining variations of fully-connected a great deal of attention must be paid to the machine DNNs with simple feature generation such as bi-grams costs of model training experiments while evaluating and n-grams on sub-word units. The short nature of user new techniques. In Section 2 we first give a general queries and ad headlines is a contributing factor. Data overview of the model and training setup; Section 3 then is highly sparse for these features, with typically only a discusses eficiency concerns and details several suc- tiny fraction of non-zero feature values per example. cessfully deployed techniques. In Section 4, we survey All features are treated as categorical and mapped to applications of modern ML techniques targeted at improv- sparse embedding tables. Given an input , we concateing measures of accuracy and geared explicitly toward nate the embedding values for all features to form a vector very-large-scale models. Section 4.4 summarizes empiri- , the embedding input layer of our DNN. denotes a cal results roughly characterizing the relative impact of minibatch of embedding values across several examples. these techniques. Next, we formally describe a simplified version of

Deep neural networks (DNNs) provide substantial the model’s fully-connected neural network architecimprovements over previous methods in many appli- ture. Later sections will introduce variations to this arcations, including large-scale industry settings. How- chitecture that improve accuracy, eficiency, or reproever, non-convex optimization reveals (and exacerbates) ducibility. We feed into a fully-connected hidden layer a critical problem of prediction: irreproducibility 1 = ( 1) that performs a linear transformation of [6, 7, 8, 9, 10, 11]. Training the same model twice (iden- using weights 1 followed by non-linear activation tical architecture, hyper-parameters, training data) may . Hidden layers = ( −1 ) are stacked, with the lead to metrics of the second model being very difer- output of the th layer feeding into an output layer ent from the first. We distinguish between model ir- =̂ sigmoid( +1 ) that generates the model’s predicreproducibility, strictly related to predictions on fixed tion corresponding to a click estimate .̂ Model weights data, and system irreproducibility, where a deployed are optimized following min ∑ ℒ ( , ̂ ). We found irreproducible model afects important system metrics. ReLUs to be a good choice for the activation function; Section 5 characterizes the problem and describes im- Section 5 describes improvements using smoothed activaprovements to model irreproducibility. tion functions. The model is trained through supervised

An efective click prediction model must be able to learning with the logistic loss of the observed click label generalize across diferent UI treatments , including: with respect to .̂ Sections 4 and 7 describe additional where an ad is shown on the page and any changes to losses that have improved our model. Training uses synthe formatting of the ad (e.g., bolding specific text or chronous minibatch SGD on Tensor Processing Units adding an image). Section 6 describes a specific model fac- (TPUs) [13]: at each training step , compute gradients torization that improves UI generalization performance. of the loss on a batch of examples (ranging up to milFinally, Section 7 details a general-purpose technique lions of examples), and weights are optimized with an for adding bias constraints to the model that has been adaptive optimizer. We find that AdaGrad [ 14, 15] works applied to both improve generalization and system irre- well for optimizing both embedding weights and dense network weights. Moreover, In Section 4.2 discusses ac- for ML training is implemented via maximizing model curacy improvements from deploying a second-order throughput, subject to constraints on minimum bandoptimizer: Distributed Shampoo [16] for training dense width and maximum training latency. We find that renetwork weights, which to our knowledge, is the first quired bandwidth is most frequently governed by the known large-scale deployment in a production scale neu- number of researchers addressing a fixed task. For an ral network training system. impactful ads model at a large internet company, this may represent many dozens of engineers attempting in2.1. Online Optimization cremental progress on a single modelling task. Allowable training latency is a function of researcher preference, Given the non-stationarity of data in ads optimization, varying from hours to weeks in practice. Varying parwe find that online learning methods perform best in allelism (i.e., number of accelerator chips) in training practice [1]. Models train using a single sequential pass controls development latency. As in many systems, lowover logged examples in chronological order. Each model ered latency often comes at the expense of throughput. continues to process new query-ad examples as data ar- For example, using twice the number of chips speeds rives [17]. For evaluation, we use models’ predictions on up training, but most often does so sub-linearly (traineach example from before the example is trained on (i.e., ing is less than twice as fast) because of parallelization progressive validation) [18]. This setup has a number overhead. of practical advantages. Since all metrics are computed For any given ML advancement, immediate gains must before an example is trained on, we have an immediate be weighed against the long-term cost to future R&D. measure of generalization that reflects our deployment For instance, naively scaling up the size of a large DNN setup. Because we do not need to maintain a holdout might provide immediate accuracy but add prohibitive validation set, we can efectively use all data for training, cost to future training (Table 1 includes a comparison of leading to higher confidence measurements. This setup techniques and includes one such naive scaling baseline). allows the entire learning platform to be implemented as We have found that there are many techniques and a single-pass streaming algorithm, facilitating the use of model architectures from literature that ofer significant large datasets. improvements in model accuracy, but fail the test of whether these improvements are worth the trade-ofs 3. ML Eficiency (e.g., ensembling many models, or full stochastic variational Bayesian inference [20]). We have also found that Our CTR prediction system provides predictions for all many accuracy-improving ML techniques can be recast ads shown to users, scoring a large set of eligible ads for as eficiency-improving via adjusting model parameters billions of queries per day and requiring support for infer- (especially total number of weights) in order to lower ence at rates above 100,000 QPS. Any increase in compute training costs. Thus, when we evaluate a technique, we used for inference directly translates into substantial ad- are often interested in two tuning points: 1) what is the ditional deployment costs. Latency of inference is also improvement in accuracy when training cost is neutral critical for real-time CTR prediction and related auctions. and 2) what is the training cost improvement if model caAs we evaluate improvements to our model, we carefully pacity is lowered until accuracy is neutral. In our setting, weigh any accuracy improvements against increases in some techniques are much better at improving training inference cost. costs (e.g., distillation in Section 4.1.2) while others are

Model training costs are likewise important to consider. better at improving accuracy. Figure 1 illustrates these For continuous research with a fixed computational bud- two tuning axes. get, the most important axes for measuring costs are We survey some successfully deployed eficiency techbandwidth (number of models that can be trained con- niques in the remainder of this section. Section 3.1 details currently), latency (end-to-end evaluation time for a new the use of matrix factorization bottlenecks to approximodel), and throughput (models that can be trained per mate large matrix multiplication with reduced cost. Secunit time). tion 3.2 describes AutoML, an eficient RL-based architec

Where inference and training costs may difer, several ture search that is used to identify model configurations ML techniques are available to make trade-ofs. Distilla- that balance cost and accuracy. Section 3.3 discusses a tion is particularly useful for controlling inference costs set of efective sampling strategies to reduce data used or amortizing training costs (see Section 4.1.2). Tech- for training without hurting accuracy. niques related to adaptive network growth [19] can control training costs relative to a larger final model (with 3.1. Bottlenecks larger inference cost).

Eficient management of computational resources

One practical way to achieve accuracy is to scale up the widths of all the layers in the network. The wider they

are, the more non-linearities there are in the model, and 3.2. AutoML for Eficiency in practice this improves model accuracy. On the other hand, the size of the matrices involved in the loss and To develop an ads CTR prediction model architecture gradient calculations increases, making the underlying with optimal accuracy/cost trade-of, we typically have matmul computations slower. Unfortunately, the cost of to tune the embedding widths of dozens of features and matmul operations (naively) scale up quadratically in the layer widths for each layer in the DNN. Assuming even size of their inputs. To compute the output of a hidden just a small constant number of options for each such Tl×ahyeer‘wmi duelt=ripisl y(b-eatdt−ed1ro’pste)rraawtteihgoeynrsetyfopriceaa∈llcyhℝiis×nnp’tu,ctworsoetw-pefeiencrtf iovr−em1. wicnoitsdrtta-hec,tfeactbhtileveecsotcomalcebosinn.adFtuoocrrtiiantrldasudesiattrirocianhla-sslpcaaarclceehmiqteuocidctueklrlsye, sirteeaiasrcchnheost [21]. We find that carefully inserting bottleneck layers with multiple iterations [22, 23]. We have successfully of low-rank matrices between layers of non-linearities adopted neural architecture search based on weight shargreatly reduces scaling costs, with only a small loss of ing [24] to eficiently explore network configurations relative accuracy. (e.g., varying layer width, embedding dimension) to find

Applying singular value decomposition to ’s, we of- versions of our model that provide neutral accuracy with ten observe that the top half of singular values contribute decreased training and serving cost. As illustrated in to over 90% of the norm of singular values. This suggests Figure 2, this is achieved by three components: a weightsharing network, an RL controller, and constraints. tsphmu−ata1tellwreeend,uocwucaegnhsheatrope. p×Fr oo+∈rx iaℝm×fix× aetde, ,−,w∈i1fhℝwi×cehbs.cycaTaanlheebb,eoatmstilogebunnynieficctcaokonnftlscaftooyamrnert- caosnTstuhabein-nwineegtiwgaholltrc-ksashn.adIrniidntahgteinsaewrtcwahyoi,trewkcetbuucraeilnsdistnraatihsnueapsleelrac-rancnhedtswipdoaarctkee , compute scales only linearly with . Empirically, we architectures simultaneously in a single iteration and found that accuracy loss from this approximation was select a specific architecture by activating part of the indeed small. By carefully balancing the following two super-network with masking. This setup significantly refactors, we were able to leverage bottlenecks to achieve duces the number of exploration iterations from O(1000) better accuracy without increasing computation cost: (1) to O(1). increasing layer sizes toward better accuracy, at the cost The reinforcement learning controller maintains a samof more compute, and (2) inserting bottleneck layers to pling distribution, , over candidate networks. It samreduce compute, at a small loss of accuracy. Balancing ples a set of decisions ( 1, 2, ...) to activate a sub-network of these two can be done manually or via AutoML tech- at each training step. We then do a forward pass for the niques (discussed in the next section). A recent man- activated sub-network to compute loss and cost. Based ual application of this technique to the model (without on that, we estimate the reward value ( 1, 2, ...) and AutoML tuning) reduced time per training step by 7% conduct a policy gradient update using the REINFORCE without impacting accuracy (See Table 2 for a summary algorithm [25] as follows: of eficiency techniques).

+ 0 ⋅ (( 1, 2, ...) − ) ⋅ ∇ log ( 1, 2, ...| ), = where denotes the moving average value of the reward and 0 is the learning rate for the reinforcement learning algorithm. Through the update at each training step, the sampling rate of better architectures will gradually increase and the sampling distribution will eventually converge to a promising architecture. We select the architecture with maximum likelihood at the end of the training. Constraints specify how to compute the cost of the activated sub-network, which can typically be done by estimating the number of floating-point operations or running a pre-built hardware-aware neural cost model. The reinforcement learning controller incorporates the provided cost estimate into the reward (e.g., = accuracy + ⋅ | cost/target − 1|, where < 0 ) [24] in order to force the sampling distribution to converge to a cost-constrained point. In order to search for architectures with lower training cost but neutral accuracy, in our system we set up multiple AutoML tasks with diferent constraint targets (e.g. 85%/90%/95% of the baseline cost) and selected the one with neutral accuracy and smallest training cost. A recent application of this architecture search to the model reduced time per training step by 16% without reducing accuracy.

3.3. Data Sampling

Historical examples of clicks on search ads make up a large dataset that increases substantially every day. The diminishing returns of ever larger datasets dictate that it is not beneficial to retain all the data. The marginal value for improving model quality goes toward zero, and eventually does not justify any extra machine costs for training compute and data storage. Alongside using ML optimization techniques to improve ML eficiency, we also use data sampling to control training costs. Given that training is a single-pass over data in time-order, there are two ways to reduce the training dataset: 1) restricting the time range of data consumed; and 2) sampling the data within that range. Limiting training data to more recent periods is intuitive. As we extend our date range further back in time, the data becomes less relevant to future problems. Within any range, clicked examples are more infrequent and more important to our learning task; so we sample the non-clicked examples to achieve rough class balance. Since this is primarily for eficiency, exact class balance is unnecessary. A constant sampling rate (a constant class imbalance prior) can be used with a simple single-pass filter. To keep model predictions unbiased, importance weighting is used to up-weight negative examples by the inverse of the sampling rate. Two additional sampling strategies that have proved efective are as follows: • Sampling examples associated with a low logistic loss (typically examples with low estimated CTR and no click). • Sampling examples that are very unlikely to have been seen by the user based on their position on the page.

The thresholds for the conditions above are hand

tuned and chosen to maximize data reduction without hurting model accuracy. These strategies are implemented by applying a small, constant sampling rate to all examples meeting any of the conditions above. PseudoRandom sampling determines whether examples should be kept and re-weighted or simply discarded. This ensures that all training models train on the same data. This scheme may be viewed as a practical version of [26] for large problem instances with expensive evaluation. Simple random sampling allows us to keep model estimates unbiased with simple constant importance re-weighting. It is important to avoid very small sampling rates in this scheme, the consequent large up-weighting can lead to model instability. Re-weighting is particularly important for maintaining calibration, since these sampling strategies are directly correlated to labels.

For sampling strategies that involve knowing the loss on an example, calculating that loss would require running inference on the training example, removing most of the performance gains. For this reason, we use a proxy value based on a prediction made by a ”teacher model”. In this two-pass approach. We first train once over all data to compute losses and associated sampling rates, and then once on the sub-sampled data. The first pass uses the same teacher model for distillation (Section 4.1.2) and is only done once. Iterative research can then be performed solely on the sub-sampled data. While these latter models will have diferent losses per example, the ifrst pass loss-estimates still provide a good signal for the ‘dificulty’ of the training example and leads to good results in practice. Overall our combination of class rebalancing and loss-based sampling strategies reduces the data to < 25% of the original dataset for any given period without significant loss in accuracy.

4. Accuracy

Next we detail a set of techniques aimed at improving the accuracy of the system. We discuss: additional losses that better align ofline training-time metrics with important business metrics, the application of distillation to our online training setting, the adaptation of the Shampoo second-order optimizer to our model, and the use of Deep and & Cross networks.

4.1. Loss Engineering Loss engineering plays an important role in our system.

As the goal of our model is to predict whether an ad will be clicked, our model generally optimizes for logis- the model’s prediction is unbiased per example. More tic loss, often thought of as the cross-entropy between detail can be found in Section 7. Application of ranklosses model predictions and the binary task (click/no-click) to the model generated accuracy improvements of −0.81% labels for each example. Using logistic loss allows model with a slight increase in training cost of 1%. predictions to be unbiased so that the prediction can be interpreted directly as a calibrated probability. Bi- 4.1.2. Distillation. nary predictions can be improved by introducing soft prediction through distillation methods [27]. Beyond es- Distillation adds an additional auxiliary loss requiring timating the CTR per ad, it is important that the set of matching the predictions of a high-capacity teacher candidate ads for a particular query is correctly ranked model, treating teacher predictions as soft labels [ 27]. (such that ads with clicks have higher CTR than ads with- In our model, we use a two-pass online distillation out clicks), thus incorporating proper ranking losses is setup. On the first pass, a teacher model records its predicalso important. In this section, we discuss novel auxil- tions progressively before training on examples. Student iary losses and introduce multi-task and multi-objective models consume the teacher’s predictions while trainmethods for joint training with these losses ing on the second pass. Thus, the cost of generating the predictions from the single teacher can be amortized 4.1.1. Rank Losses across many students (without requiring the teacher to repeat inference to generate predictions). In addition We found that Area under the ROC curve computed per to improving accuracy, distillation can also be used for query (PerQueryAUC) is a metric well correlated with reducing training data costs. Since the high-capacity business metrics quantifying the overall performance teacher is trained once, it can be trained on a larger data of a model. In addition to using PerQueryAUC during set. Students benefit implicitly from the teachers prior evaluation, we also use a relaxation of this metric, i.e., knowledge of the larger training set, and so require trainrank-loss, as a second training loss in our model. There ing only smaller and more recent data. The addition of are many rank losses in the learning-to-rank family [28, distillation to the model improved accuracy by 0.41% 29]. We find one efective approximation is Ranknet loss without increasing training costs (in the student). [30], which is a pairwise logistic loss: − ∑ ∑ log(sigmoid( , )),

∈{ =1} ∈{ ≠1} where , are logit scores of two examples.

Rank losses should be trained jointly with logistic loss; there are several potential optimization setups. In one setup, we create a multi-objective optimization problem [31]: ℒ ( ) = 1ℒrank(yrank, s) + (1 − 1)ℒlogistic(y, s), 4.1.3. Curriculums of Losses In machine learning, curriculum learning [34] typically involves a model learning easy tasks first and gradually switching to harder tasks. We found that training on all classes of losses in the beginning of training increased model instability (manifesting as outlier gradients which cause quality to diverge). Thus, we apply an approach similar to curriculum learning to ramp up losses, starting with the binary logistic loss and gradually ramping up distillation and rank losses over the course of training. where s are logit scores for examples, yrank are ranking labels, y are the binary task labels, and 1 ∈ (0, 1) is the rank-loss weight. Another solution is to use multi-task learning [32, 33], where the model produces multiple diferent estimates for each loss.

Second-order optimization methods that use second derivatives and/or second-order statistics are known to have better convergence properties to first-order methods [35]. Yet to our knowledge, second-order methods are ℒ ( shared, logistic, rank) = rarely reported to be used in production ML systems for 1ℒrank(y, srank) + (1 − 1)ℒlogistic(y, slogistic), DNNs. Recent work on Distributed Shampoo [16, 36] has made second-order optimization feasible for our model where shared are weights shared between the two losses, by leveraging the heterogneous compute ofered by TPUs logistic are for the logistic loss output, and rank are and host-CPUs, and employing additional algorithmic for the rank-loss output. In this case, the ranking loss and eficiency improvements. afects the ”main” prediction slogistic as a ”regularizer” on In our system, Distributed Shampoo provided much shared. faster convergence with respect to training steps, and

As rank losses are not naturally calibrated predictors of yielded better accuracy when compared to standard adapclick probabilities, the model’s predictions will be biased. tive optimization techniques including AdaGrad [15], A strong bias correction component is needed to ensure

4.2. Second-order Optimization

Adam [37], Yogi [38], and LAMB [39]. While, second- described in [41]. order methods like Distributed Shampoo is known to Stability & Eficiency. Distributed Shampoo has higher provide faster convergence compared to first-order meth- computational complexity per step as it involves multiods in the literature - It often fails to provide competitive plication of large matrices for preconditioning and statiswall-clock time due to the computational overheads in tics/preconditioner computation. We address these overthe optimizer on smaller scale benchmarks. For our train- heads with several techniques in our deployment. For ing system, second-order optimization method was an example, the block-diagonalization suggested in [16] was ideal candidate due to large batch sizes used in training efective at reducing the computational complexity while which amortizes the cost of costly update rule. Train- also allowing the implementation of parallel updates for ing time only increased by approximately 10% and the each block in the data-parallel setting via weight-update improvements to model accuracy far outweighed the in- sharding [42]. This reduced the overall step time. Morecrease in training time. We next discuss some of the more over, optimizer overheads are independent of batch size, salient implementation details specific to our model. and thus we increased batch size to reduce overall com

Learning Rate Grafting. One of the main challenges in putational overhead. Finally, we found that condition online optimization is defining a learning rate schedule. number of statistics used for preconditioning can vary In contrast to training on static datasets, the number of in range reaching more than 1010. Because, numerical steps an online model will require is not known and may stability and robustness is of utmost importance in probe unbounded. Accordingly, popular learning rate sched- duction; we make use of double precision numerics. To ules from literature depending on fixed time horizons, compute the preconditioners, we use the CPUs attached such as cosine decay or exponential decay, perform worse to the TPUs to run inverse- th roots and exploit a faster in contrast to the implicit data-dependent adaptive sched- algorithm, the coupled Newton iteration for larger preule from AdaGrad [15]. As observed in literature [40], we conditioners [43] as in Figure 3. also find that AdaGrad’s implicit schedule works quite When integrated with the ad click prediction model well in the online setting; especially after the parame- the optimizer improved our primary measure of accuter (the initial accumulator value) is tuned. Accordingly, racy, Area under the ROC curve computed per query we bootstrap the schedule for Distributed Shampoo via (PerQueryAuc), by 0.44%. Accuracy improvements above grafting the per layer step size from AdaGrad. More pre- 0.1% are considered significant. For comparison: a naive cisely, we use the direction from Shampoo while using scaling of the deep network by 2x yields a PerQueryAUC the magnitude of step size from AdaGrad at a per-layer improvement of 0.13%. See Table 1 for a summary of granularity. An important feature of this bootstrapping accuracy technique results. is that it allowed us to inherit hyper-parameters from previous AdaGrad tunings to search for a Pareto optimal 4.3. Deep & Cross Network configuration.

Momentum. Another efective implementation choice Learning efective feature crosses is critical for recomis the combination of Nesterov-styled momentum with mender systems [3, 44]. We adopt an eficient variant of the preconditioned gradient. Our analysis suggests that DCNv2 [44] using bottlenecks. This is added between the momentum added modest gains on top of Shampoo embedding layer described in Section 2 and the DNN. without increasing the computational overhead while We next describe the Deep & Cross Network architecmarginally increasing the memory overhead. Computa- ture and its embedding layer input. We use a standard tional overhead was addressed via the approximations embedding projection layer for sparse categorical fea0.18% 0.44% techniques.∗ Distillation does not include teacher cost which, due to amortization, is a small fraction of overall training costs. tures. We project categorical feature from a higher dimensional sparse space to a lower dimensional dense space using ̃ = , where ∈ {0, 1} ; ∈ ℝ × is the learned projection matrix; ̃ is the dense embedding representation; and and represent the vocabulary and dense embedding sizes respectively. For multivalent features, we use average pooling of embedding vectors.

Embedding dimensions { } are tuned for eficiency and accuracy trade-ofs using AutoML (Section

3.2). Output of the embedding layer is a wide concatenated vector 0 = concat( 1̃, 2̃ … ̃ ) ∈ ℝ for features. For crosses, we adopt an eficient variant of [

44], applied directly on top of the embedding layer to explicitly learn feature crosses: = 2( 0 ⊙ −1 ) + −1 , where , −1 ∈ ℝ represent the output and input of the th cross layer, respectively; ∈ ℝ× and ∈ ℝ×

are the learned weight matrices leveraging bottlenecks (Section 3.1) for eficiency; 2 is a scalar, ramping up from 0 → 1 during initial training, allowing the model to first learn the embeddings and then the crosses in a curriculum fashion.

Furthermore, this ReZero initialization [45] also improves

model stability and reproducibility (Section 5).

In practice adding the Deep & Cross Network to the

model yielded an accuracy improvement of 0.18% with a minimal increase in training cost of 3%.

4.4. Summary of Eficiency and Accuracy Results

Below we share measurements of the relative impact of the previously discussed eficiency and accuracy techniques as applied to the production model. The goal is to give a very rough sense of the impact of these techniques and their accuracy vs. eficiency tradeofs. While precise measures of accuracy improvement on one particular model are not necessarily meaningful, we believe the coarse ranking of techniques and rough magnitude of results are interesting (and are consistent with our general experience).

The baseline 2x DNN size model doubles the number

of hidden layers. Note, that sparse embedding lookups add to the overall training cost, thus doubling the number layers does not proportionally increase the cost.

5. Irreproducibility

Irreproducibility, noted in Section 1, may not be easy to detect because it may appear in post deployment system metrics and not in progressive validation quality metrics. A pair of duplicate models may converge to two diferent optima of the highly non-convex objective, giving equal average accuracy, but diferent individual predictions, but with diferent downstream system/auction outcomes. Model deployment leads to further divergence, as ads selected by deployed models become part of subsequent training examples [17]. This can critically afect R&D: experimental models may appear beneficial, but gains may disappear when they are retrained and deployed in production. Theoretical analysis is complex even in the simple convex case, which is considered only in very recent work [46]. Many factors contribute to irreproducibility [47, 48, 49, 10, 50, 51], including random initialization, non-determinism in training due to highlyparallelized and highly-distributed training pipelines, numerical errors, hardware, and more. Slight deviations early in training may lead to very diferent models [ 52]. While standard training metrics do not expose system irreproducibility, we can use deviations of predictions on individual examples as a cheap proxy, allowing us to fail fast prior to evaluation at deployment-time. Common statistical metrics (standard deviation, various divergences) can be used [53, 54] but they require training many more models, which is undesirable at our scale. Instead, we use the Relative Prediction Diference (PD) [7, 9] metric △ Δ = 1/ ⋅ ∑ | ,̂1 − ,̂2 |/[( ,̂1 + ,̂2 )/2] , measuring absolute point-wise diference in model predictions for a pair of models (subscripts 1 and 2), normalized by the pair’s average prediction. Computing PD requires training a pair of models instead of one, but we have observed that reducing PD is suficient to improve reproducibility of important system metrics. In this section, we focus on methods to improve PD; Section 7 focuses on directly improving system metrics.

PDs may be as high as 20% for deep models. Perhaps surprisingly, standard methods such as fixed initialization, regularization, dropout, data augmentation, as well as new methods imposing constraints [55, 56] either failed to improve PD or improved PD at the cost of accuracy degradation. Techniques like warm-starting model weights to values of previously trained models may not be preferable because they can anchor the model to a potentially bad solution space and do not help the development cycle for newer more reproducible models for which there is no anchor.

Other techniques have shown varying levels of success. Ensembles [57], specifically self-ensembles [58], where we average predictions of multiple model duplicates (each initialized diferently), can reduce prediction variance and PD. However, maintaining ensembles in a production system with multiple components builds up substantial technical debt [59]. While some literature [60, 61, 62] describes accuracy advantages for ensembles, in our regime, ensembles degraded accuracy relative to equal-cost single networks. We believe this is because, unlike in the benchmark image models, examples in online CTR systems are visited once, and, more importantly, the learned model parameters are dominated by sparse embeddings. Relatedly, more sophisticated techniques based on ensembling and constraints can also improve PD [63, 7, 8].

Techniques described above trade accuracy and complexity for better reproducibility, requiring either ensembles or constraints. Further study and experimentation revealed that the popular use of Rectified Linear Unit (ReLU) activations contributes to increased PD. ReLU’s gradient discontinuity at 0 induces a highly non-convex loss landscape. Smoother activations, on the other hand, reduce the amount of non-convexity, and can lead to more reproducible models [9]. Empirical evaluations of various smooth activations [64, 65, 66, 67] have shown not only better reproducibility compared to ReLU, but also slightly better accuracy. The best reproducibilityaccuracy trade-ofs in our system were attained by the simple Smooth reLU (SmeLU) activation proposed in [9]. The function form is: SmeLU() =

0; ≤ − ⎧⎨ (+4) 2 ; || ≤ ⎩ ; ≥ . (1)

In our system, 3-component ensembles reduced PD

from 17% to 12% and anti-distillation reduced PD further to 10% with no accuracy loss. SmeLU allowed launching a non-ensemble model with PD less than 10% that also improved accuracy by 0.1%. System reproducibility metrics also improved to acceptable levels compared to the unacceptable levels of ReLU single component models.

6. Generalizing Across UI Treatments

One of the major factors in CTR performance of an ad is its UI treatment, including positioning, placement relative to other results on the page, and specific renderings such as bolded text or inlined images. A complex auction must explore not just the set of results to show, but how they should be positioned relative to other content, and how they should be individually rendered [68]. This exploration must take place eficiently over a combinatorially large space of possible treatments.

We solve this through model factorization, replacing estimated CTR with ( ⋅ ) , composed of a transfer function where , are separable models that output vectorized representations of the Quality and the UI, respectively, and are combined using an inner-product.

While , consisting of a large DNN and various feature embeddings, is a costly model, it needs to be evaluated only once per ad, irrespective of the number of UI treatments. In contrast, , being a much lighter model, can be evaluated hundreds of times per ad. Moreover, due to the relatively small feature space of the UI model, outputs can be cached to absorb a significant portion of lookup costs (as seen in Figure 4).

Separately from model performance requirements, accounting for the influence of UI treatments on CTR is also a crucial factor for model quality. Auction dynamics deliberately create strong correlations between individual ads and specific UI treatments. Results that are lower on the page may have low CTR regardless of their relevance to the query. Failure to properly disentangle these correlations creates inaccuracy when generalizing over UI treatments (e.g., estimating CTR if the same ad was shown higher on the page). Pricing and eligibility decisions depend crucially on CTR estimates of sub-optimal to our objective function, enforced on relevant slices of UIs that are rarely occurring in the wild. For instance, our either the training set or a separate, labelled dataset. This system shouldn’t show irrelevant ads, and so such sce- allows us reduce non-identifiability by anchoring model narios will not be in the training corpus, and so estimates loss to a desired part of the solution space (e.g., one that of their irrelevance (low CTR) will be out of distribution. satisfies calibration) (Figure 5a). By extension, we reduce But these estimates are needed to ensure the ads do not irreproducibility by anchoring a retrained model to the show. Even for relevant ads, there is a similar problem. same solution.

Performance of ads that rarely show in first position may Our technique is more lightweight than other methstill be used to set the price of those ads that often do ods used for large-scale, online training (counterfactual show in first position. This creates a specific generaliza- reasoning [69], variations of inverse propensity scoring tion problem related to UI, addressed in Section 7. [70, 71]): in practice, there are fewer parameters to tune,

Calibration is an important characteristic for large- and we simply add an additional term to our objective scale ads recommendation. We define calibration bias rather than changing the model structure. To address calas label minus prediction, and want this to be near zero ibration, [72] adjusts model predictions in a separate calper ad. A calibrated model allows us to use estimated ibration step using isotonic regression, a non-parametric CTR to determine the trade-of between showing and method. Our technique does calibration jointly with estinot showing an ad, and between showing one ad versus mation, and is more similar to methods which consider another; both calculations can be used in downstream eficient optimization of complex and augmented objectasks such as UI treatment selection, auction pricing, or tives (e.g., [73, 74]). Using additional constraints on the understanding of ad viewability. objective allows us to address a wide range of calibration

The related concept of credit attribution is similar to and credit attribution issues. counterfactual reasoning [69] or bias in implicit feedback [70]. It is a specific non-identifiability in model weights that can contribute to irreproducibility (Section 5). Con- 7. Bias Constraints sider an example to illustrate the UI efect (Section 6): assume that model A has seen many training examples 7.1. Online Optimization of Bias with high-CTR ads in high positions, and (incorrectly) Constraints learned that ad position most influences CTR. Model B, defined similar to A, trains first on the few examples where high-CTR ads appear in low positions, and (correctly) learns that something else (e.g., ad relevancy to query) is causing high CTR. Both models produce the same estimated CTR for these ads but for diferent reasons, and when they are deployed, model A will likely show fewer ads because it will not consider otherwise useful ads in lower positions; these models will show system irreproducibility.

In our system, we use a novel, general-purpose technique called bias constraints to address both calibration and credit attribution. We add calibration bias constraints We now optimize our original objective function with the constraint that ∀∀ ∈ , ( − ̂ ) = 0. Here, are subsets of the training set which we’d like to be calibrated (e.g., under-represented classes of data) or new training data that we may or may not optimize the original model weights over (e.g., out-of-distribution or of-policy data gathered from either randomized interventions or exploration scavenging [75, 70, 76]). To aid optimization, we ifrst transform this into an unconstrained optimization problem by introducing a dual variable , for each constraint and maximizing the Lagrangian relative to the dual variables. Next, instead of enforcing zero bias per example, we ask that the squared average bias across is zero. This reduces the number of dual variables to { }, and is equivalent to adding an L2 regularization on with a constraint of zero average bias. For a constant 3 controlling regularization, and tuned via typical hyperplacement, we will include it in { } ). For the model in Table 3, we saw substantial bias improvements on several data subsets related to out-of-distribution ad placement and more reproducibility with minimal accuracy impact parameter tuning techniques (e.g. grid search), our new when adding bias constraints. optimization is: Any degraded accuracy or stability is mitigated by com- Progressive validation and deployed system metrics reported

min max ∑ ℒ ( , ̂ ) + ∑

∑( ( − ̂ ) − =1 ∈

23 2) aren’t optimally selected. binations of the following tunings, ordered by impact: ramping up the bias constraint term, reducing the learning rate on { }, increasing 3, or adding more or finergrained constraints (breaking up ). We believe the first two can help normalize any diferences between the magnitude of the dual variables and other weights, and the latter two help lessen the strength of the bias term if

7.2. Bias Constraints for General Calibration

1 Bias 2 Bias 3 Bias -15% -75% -43%

Loss +0.03%

Ads/Query Churn -85% as a percent change for a bias constraint over the original model (negative is better). Ads/Query Churn records how much the percent diference in the number of ads shown above search results per query between two model retrains changes when deployed in similar conditions; we want this to be close to zero.

Viewing the bias constraints as anchoring loss rather

than changing the loss landscape (Figure 5a), we find that the technique does not fix model irreproducibility but rather mitigates system irreproducibility: we were able to cut the number of components in the ensemble by half and achieve the same level of reproducibility.

If we plot calibration bias across buckets of interesting variables, such as estimated CTR or other system met However, for several axes of interest, our system shows

rics, we expect a calibrated model to have uniform bias. 8. Conclusion higher bias at the ends of the range (Figure 5b). We apWe detailed a set of techniques for large-scale CTR preply bias constraints to this problem by defining to be diction that have proven to be truly efective “in proexamples in each bucket of, e.g., estimated CTR. Since duction”: balancing improvements to accuracy, training we don’t use the dual variables during inference, we can and deployment cost, system reproducibility and model include estimated CTR in our training objective. With complexity—along with describing approaches for generbias constraints, bias across buckets of interest becomes alizing across UI treatments. We hope that this brief visit much more uniform: variance is reduced by more than to the factory floor will be of interest to ML practitioners half. This can in turn improve accuracy of downstream of CTR prediction systems, recommender systems, online consumers of estimated CTR.

7.3. Exploratory Data and Bias Constraints

We can also use bias constraints to solve credit attribution for UI treatments. We pick by focusing on classes of examples that represent uncommon UI presentations for competitive queries where the ads shown may be quite diferent. For example, 1 might be examples where a high-CTR ad showed at the bottom of the page, 2 examples where a high-CTR ad showed in the second-tolast position on the page, etc. Depending on how model training is implemented, it may be easier to define in terms of existing model features (e.g., for a binary feature , we split one sum over into two sums). We choose { } to include features that generate partitions large enough to not impact convergence but small enough that we expect the bias per individual example will be driven to zero (e.g., if we think that query language impacts ad training systems, and more generally to those interested in large indust rial settings. ads, in: WWW, 2017 . [69] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. ibration: A simple way to improve click models, Charles, D. M. Chickering, E. Po rtugaly, D. Ray, CIKM (2018 ).

P. Simard, E. Snelson, Counterfactual reasoning and [73] E. Eban, M. Schain, A. Mackey, A. Gordon, R. A. learning systems: The example of computational Saurous, G. Elidan, Scalable learning of nonadvertising, JMLR (2013). decomposable objectives, in: AIStats, 2017. [70] T. Joachims, A. Swaminathan, T. Schnabel, Un- [74] G. S. Mann, A. McCallum, Simple, robust, scalable biased learning-to-rank with biased feedback, in: semi-supervised learning via expectation regularWSDM, 2017 . ization, in: ICML, 2007. [71] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, [75] X. Wang, M. Bendersky, D. Metzler, M. Najork, M. de Rijke, Large-scale validation of counterfac- Learning to rank with selection bias in personal tual learning methods: A test-bed, arXiv p reprint search, in: ACM SIGIR, 2016 .

arXiv:1612.00367 (2016). [76] J. Langford, A. Strehl, J. Wortman, Exploration [72] A. Borisov, J. Kiseleva, I. Markov, M. de Rijke, Cal- scaveng ing, in: ICML, 2008 .

facebook , 2014 . [1] H. B. McMahan , G.

Holt , D.

Sculley , M.

Young , D.

Ebner , J.

Grady , L.

Nie , T.

Phillips , E.

Davydov , D.

Golovin , et al., Ad click prediction: a view from the trenches , in: SIGKDD , 2013 . [2]

He ,

Pan ,

Jin ,

Xu ,

Liu ,

Xu ,

Shi ,

Atallah ,

Herbrich ,

Bowers , J. Q. n. Candela, Practical lessons from predicting clicks on ads at [3]

Zhou ,

Zhu ,

Song ,

Fan ,

Zhu ,

Ma ,

Yan ,

Jin ,

Li ,

Gai , Deep interest network for click-through rate prediction , in: SIGKDD , 2018 . [4]

Ling ,

Deng ,

Gu ,

Zhou ,

Li ,

Sun , Model ensemble for click prediction in bing search [5]

H. R.

Varian ,

Harris , The vcg auction in theory preprint arXiv:1511.05641 ( 2015 ).

and practice, American Economic Review ( 2014 ). [20]

Blundell ,

Cornebise ,

Kavukcuoglu , D. Wier[6]

M. W.

Dusenberry ,

Tran ,

Choi , J. Kemp, stra, Weight uncertainty in neural network , in: InJ. Nixon, G. Jerfel,

Heller ,

A. M.

Dai , Analyzing ternational conference on machine learning, PMLR, the role of model uncertainty for electronic health 2015 , pp. 1613 - 1622 .

records, in: CHIL, 2020 . [21]

Denil ,

Shakibi ,

Dinh ,

Ranzato , N. de Fre[7]

G. I.

Shamir , L. Coviello, Anti-distillation: Im- itas, Predicting parameters in deep learning, CoRR proving reproducibility of deep networks , arXiv (????).

preprint arXiv: 2010 . 09923 ( 2020 ). [22]

Zoph ,

Vasudevan ,

Shlens ,

Q. V.

Le , Learn[8]

G. I.

Shamir , L. Coviello, Distilling from ensem- ing transferable architectures for scalable image bles to improve reproducibility of neural networks, recognition , in: CVPR, 2018 .

2020 . [23]

Real ,

Aggarwal ,

Huang ,

Q. V.

Le , Regu[9]

G. I.

Shamir ,

Lin ,

Coviello , Smooth activa- larized evolution for image classifier architecture tions and reproducibility in deep networks, arXiv search , in: AAAI, 2019 .

preprint arXiv: 2010 . 09931 ( 2020 ). [24]

Bender ,

Liu ,

Chen , G. Chu, S. Cheng, P.-J. [10]

R. R.

Snapp , G. I. Shamir , Synthesizing irre- Kindermans,

Q. V.

Le , Can weight sharing outperproducibility in deep networks , arXiv preprint form random architecture search? an investigation arXiv:2102.10696 ( 2021 ). with tunas , in: CVPR , 2020 . [11] A. D'Amour , K.

Heller , D.

Moldovan , B.

Adlam , [25] R. J.

Williams , Simple statistical gradient-following B . Alipanahi , A.

Beutel , C.

Chen , J.

Deaton , J.

Eisen- algorithms for connectionist reinforcement learnstein , M. D. Hofman , F. Hormozdiari , N. Houlsby, ing, Machine learning ( 1992 ).

Hou ,

Jerfel ,

Karthikesalingam ,

Lucic , [26]

Fithian , T. Hastie, Local case-control sampling: Y . Ma,

McLean ,

Mincu ,

Mitani , A. Mon- Eficient subsampling in imbalanced data sets, The tanari ,

Nado ,

Natarajan ,

Nielson , T. F . Annals of Statistics ( 2014 ).

Osborne , R.

Raman , K.

Ramasamy , R.

Sayres , [27] G.

Hinton , O.

Vinyals , J.

Dean , Distilling the knowlJ . Schrouf,

Seneviratne ,

Sequeira , H. Suresh, edge in a neural network, in: NIPS Deep Learning V . Veitch,

Vladymyrov ,

Wang ,

Webster , and Representation Learning Workshop, 2015 .

Yadlowsky ,

Yun ,

Zhai ,

Sculley , Un- [28]

R. K.

Pasumarthi ,

Bruch ,

Wang ,

Li , M.

Benderspecification presents challenges for credibil- dersky,

M. Najork , J.

Pfeifer , N.

Golbandi , R. Anil, ity in modern machine learning, arXiv preprint S. Wolf, Tf-ranking: Scalable tensorflow library for arXiv: 2011 . 03395 ( 2020 ). learning-to-rank , in: SIGKDD , 2019 . [12]

Vaswani ,

Shazeer ,

Parmar , J. Uszkoreit, [29]

C. J.

Burges , From Ranknet to Lambdarank to LambL. Jones,

A. N.

Gomez , Ł. Kaiser, I. Polosukhin , At- daMart: An overview , Learning ( 2010 ).

tention is all you need , in: NeurIPS, 2017 . [30]

Burges ,

Shaked ,

Renshaw ,

Lazier , [13]

N. P.

Jouppi ,

D. H.

Yoon , G. Kurian,

Li ,

Patil ,

Deeds ,

Hamilton ,

Hullender ,

Learning J.

Laudon ,

Young ,

Patterson , A domain- to rank using gradient descent , in: ICML, 2005 .

specific supercomputer for training deep neural [31]

Sculley , Combined regression and ranking , in: networks, Communications of the ACM ( 2020 ). In KDD'10 , 2010 . [14] H. B. McMahan , M.

Streeter , Adaptive bound op- [32] R.

Caruana , Multitask learning, Machine Learning timization for online convex optimization , arXiv ( 1997 ).

preprint arXiv:1002.4908 ( 2010 ). [33]

Ruder , An overview of multi-task learning in [15]

Duchi , E. Hazan,

Singer , Adaptive subgradient deep neural networks , 2017 . arXiv: 1706 . 05098 .

methods for online learning and stochastic opti- [34]

Bengio ,

Louradour ,

Collobert , J. Weston, mization, JMLR ( 2011 ). Curriculum learning , in: ICML, 2009 . [16]

Anil ,

Gupta ,

Koren ,

Regan ,

Singer , [35]

Nocedal ,

S. J.

Wright , Numerical Optimization, Second order optimization made practical, Springer, 2006 .

https://arxiv.org/abs/ 2002 .09018 ( 2020 ). [36]

Gupta ,

Koren ,

Singer , Shampoo: Precon[17]

Swaminathan , T. Joachims, Batch learning from ditioned stochastic tensor optimization, in: ICML, logged bandit feedback through counterfactual risk 2018 .

minimization , JMLR ( 2015 ). [37]

D. P.

Kingma ,

Ba , Adam: A method for stochas[18]

Blum ,

Kalai ,

Langford , Beating the hold-out: tic optimization , arXiv preprint arXiv:1412 . 6980 Bounds for k-fold and progressive cross-validation , ( 2014 ).

in: COLT , 1999 . [38]

Zaheer ,

Reddi ,

Sachan ,

Kale , S. Ku[19]

Chen , I. Goodfellow , J. Shlens, Net2net: Ac- mar, Adaptive methods for nonconvex optimization, celerating learning via knowledge transfer , arXiv NeurIPS ( 2018 ). [39]

You ,

Li ,

Reddi ,

Hseu , S. Kumar, ble prediction variation from neuron activation S. Bhojanapalli ,

Song ,

Demmel , K. Keutzer, strength in recommender systems , arXiv preprint C. Hsieh, Large batch optimization for deep learn - arXiv: 2008 . 07032 ( 2020 ).

ing: Training bert in 76 minutes , arXiv preprint [54]

Yu ,

Chen ,

Lin ,

Shamir , J. Han, Dropout arXiv: 1904 . 00962 ( 2019 ). prediction variation estimation using neuron acti [40]

Agarwal ,

Anil , E. Hazan,

Koren , C. Zhang, vation strength, arXiv preprint arXiv:2110.06435 Disentangling adaptive gradient methods from ( 2021 ).

learning rates , arXiv preprint arXiv: 2002 . 11803 [55]

Bhojanapalli ,

K. J.

Wilber ,

Veit , A. S. Rawat , ( 2020 ). S. Kim,

A. K.

Menon ,

Kumar , On the reproducibil[41]

Sutskever ,

Martens , G. Dahl, G. Hinton, On the ity of neural network predictions , 2021 .

importance of initialization and momentum in deep [56] G. I. Shamir , Systems and methods for improved learning , in: ICML , 2013 . generalization, reproducibility, and stabilization of [42]

Xu ,

Lee ,

Chen ,

Choi ,

Hechtman , neural networks via error control code constraints, S. Wang, Automatic cross-replica sharding of 2018.

weight update in data-parallel training , arXiv [57]

T. G.

Dietterich , Ensemble methods in machine preprint arXiv: 2004 . 13336 ( 2020 ). learning, Lecture Notes in Computer Science ( 2000 ). [43] C.-H. Guo , N. J.

Higham , A schur-newton method [58] Z.

Allen-Zhu , Y.

Li , Towards understanding enfor the matrix\boldmath p th root and its inverse, semble, knowledge distillation and self-distillation SIAM Journal on Matrix Analysis and Applications in deep learning , arXiv preprint arXiv: 2012 . 09816 ( 2006 ). ( 2020 ). [44]

Wang ,

Shivanna , D. Cheng, S. Jain,

Lin , [59]

Sculley ,

Holt ,

Golovin ,

Davydov ,

Hong , E. Chi, Dcn v2: Improved deep & cross

Phillips ,

Ebner ,

Chaudhary ,

Young , Manetwork and practical lessons for web-scale learn- chine learning: The high interest credit card of ing to rank systems , in: WWW, 2021 . technical debt ( 2014 ). [45]

Bachlechner ,

B. P.

Majumder ,

Mao , G. Cottrell, [60]

Kondratyuk ,

Tan ,

Brown ,

Gong , J. McAuley , Rezero is all you need: Fast conver- When ensembling smaller models is more efigence at large depth , in: UAI , 2021 . cient than single large models , arXiv preprint [46]

Ahn ,

Jain ,

Ji ,

Kale ,

Netrapalli , G. I. arXiv : 2005 . 00570 ( 2020 ).

Shamir , Reproducibility in optimization: The- [61] E.

Lobacheva , N.

Chirkova , M.

Kodryan , D.

Vetrov, oretical framework and limits, arXiv preprint On power laws in deep ensembles , arXiv preprint arXiv:2202.04598 ( 2022 ). arXiv: 2007 . 08483 ( 2020 ). [47]

Fort ,

Hu ,

Lakshminarayanan , Deep en- [62]

Wang ,

Kondratyuk ,

Christiansen , K. M.

sembles: A loss landscape perspective , 2020 . Kitani,

Alon , E. Eban, Wisdom of committees: An arXiv: 1912 .02757. overlooked approach to faster and more accurate [48]

Frankle ,

G. K.

Dziugaite ,

Roy ,

Carbin , Linear models, arXiv preprint arXiv: 2012 . 01988 ( 2021 ).

mode connectivity and the lottery ticket hypothesis , [63]

Anil ,

Pereyra ,

Passos ,

Ormandi , G. E.

in: International Conference on Machine Learning , Dahl,

G. E.

Hinton , Large scale distributed neural 2020. network training through online distillation , arXiv [49]

C. J.

Shallue ,

Lee ,

Antognini , J. Sohl-Dickstein, preprint arXiv: 1804 . 03235 ( 2018 ).

Frostig ,

G. E.

Dahl , Measuring the efects of [64]

J. T.

Barron , Continuously diferentiable exponendata parallelism on neural network training, arXiv tial linear units , arXiv preprint arXiv:1704 .07483 preprint arXiv: 1811 . 03600 ( 2018 ). ( 2017 ). [50]

Summers ,

M. J.

Dinneen , On nondeterminism [65]

Hendrycks ,

Gimpel , Gaussian error linand instability in neural network optimization, ear units (gelus ), arXiv preprint arXiv:1606.08415 2021 . ( 2016 ). [51]

Zhuang ,

Zhang ,

S. L.

Song , S. Hooker, [66]

Ramachandran ,

Zoph ,

Q. V.

Le , SearchRandomness in neural network training: Char- ing for activation functions, arXiv preprint acterizing the impact of tooling , arXiv preprint arXiv:1710.05941 ( 2017 ).

arXiv:2106.11872 ( 2021 ). [67]

Zheng ,

Yang , W. Liu,

Liang ,

Li , Improv[52]

Achille ,

Rovere ,

Soatto , Critical learning ing deep neural networks using softplus units, in: periods in deep neural networks , arXiv preprint IJCNN , 2015 .

arXiv:1711.08856 ( 2017 ). [68]

Cavallo ,

Krishnamurthy ,

Sviridenko , C. A. [53]

Chen ,

Wang ,

Lin , D. Cheng, L. Hong , E. Chi, Wilkens, Sponsored search auctions with rich ads, C. Cui, Beyond point estimate: Inferring ensem - CoRR abs/1701 .05948 ( 2017 ). URL: http://arxiv.org/ abs/1701.05948. arXiv: 1701 . 05948 .