On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models Rohan Anil1 , Sandra Gadanho1 , Da Huang1 , Nijith Jacob1 , Zhuoshu Li1 , Dong Lin1 , Todd Phillips1 , Cristina Pop1 , Kevin Regan1 , Gil I. Shamir1 , Rakesh Shivanna1 and Qiqi Yan1 1 Google Inc. Abstract For industrial-scale advertising systems, prediction of ad click-through rate (CTR) is a central problem. Ad clicks constitute a significant class of user engagements and are often used as the primary signal for the usefulness of ads to users. Additionally, in cost-per-click advertising systems where advertisers are charged per click, click rate expectations feed directly into value estimation. Accordingly, CTR model development is a significant investment for most Internet advertising companies. Engineering for such problems requires many machine learning (ML) techniques suited to online learning that go well beyond traditional accuracy improvements, especially concerning efficiency, reproducibility, calibration, credit attribution. We present a case study of practical techniques deployed in a search ads CTR model at a large Internet company. This paper provides an industry case study highlighting important areas of current ML research and illustrating how impactful new ML methods are evaluated and made useful in a large-scale industrial setting. Keywords Personalization, Recommender system, Content optimization, Content ranking, Content diversity, Causal bandit, Contextual bandit, View-through attribution, Holistic optimization 1. Introduction viewed video, search query, or other. Search advertis- ing specifically looks at matching a query π‘ž with an ad Ad click-through rate (CTR) prediction is a key com- π‘Ž. CTR models for recommendation specifically aim to ponent of online advertising systems that has a direct predict the probability 𝑃(π‘π‘™π‘–π‘π‘˜|π‘₯), where the input π‘₯ is impact on revenue, and continues to be an area of active an ad-query pair βŸ¨π‘Ž, π‘žβŸ©, potentially adorned with addi- research [1, 2, 3, 4]. This paper presents a detailed case tional factors affecting CTR, especially related to user study to give the reader a ”tour of the factory floor” of a interface: how ads will be positioned and rendered on a production CTR prediction system, describing challenges results page (Section 6). specific to this category of large industrial ML systems Beyond surfacing maximally useful results, recom- and highlighting techniques that have proven to work mender systems for ads have important additional cali- well in practice. bration requirements. Actual click labels are stochastic, The production CTR prediction model consists of bil- reflecting noisy responses from users. For any given ad- lions of weights, trains on more than one hundred bil- query π‘₯𝑖 and binary label 𝑦𝑖 , we typically hope to achieve lion examples, and is required to perform inference at precisely 𝑃(π‘π‘™π‘–π‘π‘˜|π‘₯𝑖 ) ∢= π”ΌβŸ¨π‘₯ ,𝑦 ⟩∼𝐷 [𝑦𝑖 = π‘π‘™π‘–π‘π‘˜|π‘₯𝑖 ] over some 𝑖 𝑖 well over one hundred thousand requests per second. sample of examples 𝐷 (in test or training). While a typical The techniques described here balance accuracy improve- log-likelihood objective in supervised training will result ments with training and serving costs, without adding in zero aggregate calibration bias across a validation set, undue complexity: the model is the target of sustained per-example bias is often non-zero. and substantial R&D and must allow for effectively build- Ads pricing and allocation problems create the per- ing on top of what came before. example calibration requirement. Typically, predictions will flow through to an auction mechanism that incor- 1.1. CTR for Search Ads porates bids to determine advertiser pricing. Auction Recommendations pricing schemes (e.g, VCG [5]) rely on the relative value of various potential outcomes. This requires that pre- The recommender problem surfaces a result or set of re- dictions for all potential choices of π‘₯ be well calibrated sults from a given corpus, for a given initial context. The with respect to each other. Additionally, unlike simple initial context may be a user demographic, previously- recommenders, ads systems frequently opt to show no ads. This requires estimating the value of individual ads ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender relative to this ”null-set” of no ads, rather than simply Systems and User Modeling, jointly with the 16th ACM Conference on maximizing for ad relevance. Recommender Systems, September 23rd, 2022, Seattle, WA, USA Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Consider a query like ”yarn for sale”; estimated CTR Attribution 4.0 International (CC BY 4.0). CEUR CEUR Workshop Proceedings (CEUR-WS.org) Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 for an ad from ”yarn-site-1.com” might be 15.3%. Esti- mated CTR for an ad from ”yarn-site-2.com” might be producibility. 10.4%. Though such estimates can be informed by the This paper makes the following contributions: 1) we semantic relevance of the websites, the requirements for discuss practical ML considerations from many perspec- precision are more than what one should expect from gen- tives including accuracy, efficiency and reproducibility, eral models of language. Additionally, click-through data 2) we detail the real-world application of techniques that is highly non-stationary: click prediction is fundamen- have improved efficiency and accuracy, in some cases tally an online recommendation problem. An expectation describing adaptations specific to online learning, and of 15.3% is not static ground truth in the same sense as, 3) we describe how models can better generalize across for example, translation or image recommendation; it is UI treatments through model factorization and bias con- definitively more subject to evolution over time. straints. 1.2. Outline 2. Model and Training Overview For ads CTR predictors, minor improvements to model quality will often translate into improved user experience A major design choice is how to represent an ad-query and overall ads system gains. This motivates continuous pair π‘₯. The semantic information in the language of the investments in model research and development. Theo- query and the ad headlines is the most critical component. retical and benchmark improvements from ML literature Usage of attention layers on top of raw text tokens may rarely transfer directly to problem-dependent settings of generate the most useful language embeddings in current real-world applications. As such, model research must literature [12], but we find better accuracy and efficiency be primarily empirical and experimental. Consequently, trade-offs by combining variations of fully-connected a great deal of attention must be paid to the machine DNNs with simple feature generation such as bi-grams costs of model training experiments while evaluating and n-grams on sub-word units. The short nature of user new techniques. In Section 2 we first give a general queries and ad headlines is a contributing factor. Data overview of the model and training setup; Section 3 then is highly sparse for these features, with typically only a discusses efficiency concerns and details several suc- tiny fraction of non-zero feature values per example. cessfully deployed techniques. In Section 4, we survey All features are treated as categorical and mapped to applications of modern ML techniques targeted at improv- sparse embedding tables. Given an input π‘₯, we concate- ing measures of accuracy and geared explicitly toward nate the embedding values for all features to form a vector very-large-scale models. Section 4.4 summarizes empiri- 𝑒, the embedding input layer of our DNN. 𝐸 denotes a cal results roughly characterizing the relative impact of minibatch of embedding values 𝑒 across several examples. these techniques. Next, we formally describe a simplified version of Deep neural networks (DNNs) provide substantial the model’s fully-connected neural network architec- improvements over previous methods in many appli- ture. Later sections will introduce variations to this ar- cations, including large-scale industry settings. How- chitecture that improve accuracy, efficiency, or repro- ever, non-convex optimization reveals (and exacerbates) ducibility. We feed 𝐸 into a fully-connected hidden layer a critical problem of prediction: irreproducibility 𝐻1 = 𝜎(πΈπ‘Š1 ) that performs a linear transformation of [6, 7, 8, 9, 10, 11]. Training the same model twice (iden- 𝐸 using weights π‘Š1 followed by non-linear activation tical architecture, hyper-parameters, training data) may 𝜎. Hidden layers 𝐻𝑖 = 𝜎(π»π‘–βˆ’1 π‘Šπ‘– ) are stacked, with the lead to metrics of the second model being very differ- output of the π‘˜th layer feeding into an output layer ent from the first. We distinguish between model ir- 𝑦̂ = sigmoid(π»π‘˜ π‘Šπ‘˜+1 ) that generates the model’s predic- reproducibility, strictly related to predictions on fixed tion corresponding to a click estimate 𝑦.Μ‚ Model weights data, and system irreproducibility, where a deployed are optimized following minπ‘Š βˆ‘π‘– β„’ (𝑦𝑖 , 𝑦𝑖̂ ). We found irreproducible model affects important system metrics. ReLUs to be a good choice for the activation function; Section 5 characterizes the problem and describes im- Section 5 describes improvements using smoothed activa- provements to model irreproducibility. tion functions. The model is trained through supervised An effective click prediction model must be able to learning with the logistic loss of the observed click label generalize across different UI treatments, including: 𝑦 with respect to 𝑦.Μ‚ Sections 4 and 7 describe additional where an ad is shown on the page and any changes to losses that have improved our model. Training uses syn- the formatting of the ad (e.g., bolding specific text or chronous minibatch SGD on Tensor Processing Units adding an image). Section 6 describes a specific model fac- (TPUs) [13]: at each training step 𝑑, compute gradients torization that improves UI generalization performance. 𝐺𝑑 of the loss on a batch of examples (ranging up to mil- Finally, Section 7 details a general-purpose technique lions of examples), and weights are optimized with an for adding bias constraints to the model that has been adaptive optimizer. We find that AdaGrad [14, 15] works applied to both improve generalization and system irre- well for optimizing both embedding weights and dense network weights. Moreover, In Section 4.2 discusses ac- for ML training is implemented via maximizing model curacy improvements from deploying a second-order throughput, subject to constraints on minimum band- optimizer: Distributed Shampoo [16] for training dense width and maximum training latency. We find that re- network weights, which to our knowledge, is the first quired bandwidth is most frequently governed by the known large-scale deployment in a production scale neu- number of researchers addressing a fixed task. For an ral network training system. impactful ads model at a large internet company, this may represent many dozens of engineers attempting in- 2.1. Online Optimization cremental progress on a single modelling task. Allowable training latency is a function of researcher preference, Given the non-stationarity of data in ads optimization, varying from hours to weeks in practice. Varying par- we find that online learning methods perform best in allelism (i.e., number of accelerator chips) in training practice [1]. Models train using a single sequential pass controls development latency. As in many systems, low- over logged examples in chronological order. Each model ered latency often comes at the expense of throughput. continues to process new query-ad examples as data ar- For example, using twice the number of chips speeds rives [17]. For evaluation, we use models’ predictions on up training, but most often does so sub-linearly (train- each example from before the example is trained on (i.e., ing is less than twice as fast) because of parallelization progressive validation) [18]. This setup has a number overhead. of practical advantages. Since all metrics are computed For any given ML advancement, immediate gains must before an example is trained on, we have an immediate be weighed against the long-term cost to future R&D. measure of generalization that reflects our deployment For instance, naively scaling up the size of a large DNN setup. Because we do not need to maintain a holdout might provide immediate accuracy but add prohibitive validation set, we can effectively use all data for training, cost to future training (Table 1 includes a comparison of leading to higher confidence measurements. This setup techniques and includes one such naive scaling baseline). allows the entire learning platform to be implemented as We have found that there are many techniques and a single-pass streaming algorithm, facilitating the use of model architectures from literature that offer significant large datasets. improvements in model accuracy, but fail the test of whether these improvements are worth the trade-offs (e.g., ensembling many models, or full stochastic varia- 3. ML Efficiency tional Bayesian inference [20]). We have also found that Our CTR prediction system provides predictions for all many accuracy-improving ML techniques can be recast ads shown to users, scoring a large set of eligible ads for as efficiency-improving via adjusting model parameters billions of queries per day and requiring support for infer- (especially total number of weights) in order to lower ence at rates above 100,000 QPS. Any increase in compute training costs. Thus, when we evaluate a technique, we used for inference directly translates into substantial ad- are often interested in two tuning points: 1) what is the ditional deployment costs. Latency of inference is also improvement in accuracy when training cost is neutral critical for real-time CTR prediction and related auctions. and 2) what is the training cost improvement if model ca- As we evaluate improvements to our model, we carefully pacity is lowered until accuracy is neutral. In our setting, weigh any accuracy improvements against increases in some techniques are much better at improving training inference cost. costs (e.g., distillation in Section 4.1.2) while others are Model training costs are likewise important to consider. better at improving accuracy. Figure 1 illustrates these For continuous research with a fixed computational bud- two tuning axes. get, the most important axes for measuring costs are We survey some successfully deployed efficiency tech- bandwidth (number of models that can be trained con- niques in the remainder of this section. Section 3.1 details currently), latency (end-to-end evaluation time for a new the use of matrix factorization bottlenecks to approxi- model), and throughput (models that can be trained per mate large matrix multiplication with reduced cost. Sec- unit time). tion 3.2 describes AutoML, an efficient RL-based architec- Where inference and training costs may differ, several ture search that is used to identify model configurations ML techniques are available to make trade-offs. Distilla- that balance cost and accuracy. Section 3.3 discusses a tion is particularly useful for controlling inference costs set of effective sampling strategies to reduce data used or amortizing training costs (see Section 4.1.2). Tech- for training without hurting accuracy. niques related to adaptive network growth [19] can con- trol training costs relative to a larger final model (with 3.1. Bottlenecks larger inference cost). One practical way to achieve accuracy is to scale up the Efficient management of computational resources widths of all the layers in the network. The wider they Figure 1: ”Switch-backs” of incremental costly quality-improving techniques and Figure 2: Weight-sharing based NAS with cost constraints. efficiency methods. (Illustration not to any scale.) are, the more non-linearities there are in the model, and 3.2. AutoML for Efficiency in practice this improves model accuracy. On the other To develop an ads CTR prediction model architecture hand, the size of the matrices involved in the loss and with optimal accuracy/cost trade-off, we typically have gradient calculations increases, making the underlying to tune the embedding widths of dozens of features and matmul computations slower. Unfortunately, the cost of layer widths for each layer in the DNN. Assuming even matmul operations (naively) scale up quadratically in the just a small constant number of options for each such size of their inputs. To compute the output of a hidden width, the combinatorial search space quickly reaches layer 𝐻𝑖 = 𝜎(π»π‘–βˆ’1 π‘Šπ‘– ) where π‘Šπ‘– ∈ β„π‘šΓ—π‘› , we perform intractable scales. For industrial-scale models, it is not π‘š Γ— 𝑛 multiply-add operations for each input row in π»π‘–βˆ’1 . cost-effective to conduct traditional architecture search The β€˜wider is better’ strategy typically isn’t cost-effective with multiple iterations [22, 23]. We have successfully [21]. We find that carefully inserting bottleneck layers adopted neural architecture search based on weight shar- of low-rank matrices between layers of non-linearities ing [24] to efficiently explore network configurations greatly reduces scaling costs, with only a small loss of (e.g., varying layer width, embedding dimension) to find relative accuracy. versions of our model that provide neutral accuracy with Applying singular value decomposition to π‘Šπ‘– ’s, we of- decreased training and serving cost. As illustrated in ten observe that the top half of singular values contribute Figure 2, this is achieved by three components: a weight- to over 90% of the norm of singular values. This suggests sharing network, an RL controller, and constraints. that we can approximate π»π‘–βˆ’1 π‘Šπ‘– by a bottleneck layer The weight-sharing network builds a super-network π»π‘–βˆ’1 π‘ˆπ‘– 𝑉𝑖 , where π‘ˆπ‘– ∈ β„π‘šΓ—π‘˜ , 𝑉𝑖 ∈ β„π‘˜Γ—π‘› . The amount of com- containing all candidate architectures in the search space pute reduces to π‘š Γ— π‘˜ + π‘˜ Γ— 𝑛, which can be significant for as sub-networks. In this way, we can train all candidate small enough π‘˜. For a fixed π‘˜, if we scale π‘š, 𝑛 by constant architectures simultaneously in a single iteration and 𝑐, compute scales only linearly with 𝑐. Empirically, we select a specific architecture by activating part of the found that accuracy loss from this approximation was super-network with masking. This setup significantly re- indeed small. By carefully balancing the following two duces the number of exploration iterations from O(1000) factors, we were able to leverage bottlenecks to achieve to O(1). better accuracy without increasing computation cost: (1) The reinforcement learning controller maintains a sam- increasing layer sizes toward better accuracy, at the cost pling distribution, πœƒπ‘‘π‘–π‘ π‘‘ , over candidate networks. It sam- of more compute, and (2) inserting bottleneck layers to ples a set of decisions (𝑑1 , 𝑑2 , ...) to activate a sub-network reduce compute, at a small loss of accuracy. Balancing at each training step. We then do a forward pass for the of these two can be done manually or via AutoML tech- activated sub-network to compute loss and cost. Based niques (discussed in the next section). A recent man- on that, we estimate the reward value 𝑅(𝑑1 , 𝑑2 , ...) and ual application of this technique to the model (without conduct a policy gradient update using the REINFORCE AutoML tuning) reduced time per training step by 7% algorithm [25] as follows: without impacting accuracy (See Table 2 for a summary of efficiency techniques). πœƒπ‘‘π‘–π‘ π‘‘ = πœƒπ‘‘π‘–π‘ π‘‘ + 𝛼0 β‹… (𝑅(𝑑1 , 𝑑2 , ...) βˆ’ 𝑅) β‹… βˆ‡ log 𝑃(𝑑1 , 𝑑2 , ...|πœƒπ‘‘π‘–π‘ π‘‘ ), where 𝑅 denotes the moving average value of the reward β€’ Sampling examples that are very unlikely to have and 𝛼0 is the learning rate for the reinforcement learn- been seen by the user based on their position on ing algorithm. Through the update at each training step, the page. the sampling rate of better architectures will gradually increase and the sampling distribution will eventually The thresholds for the conditions above are hand- converge to a promising architecture. We select the ar- tuned and chosen to maximize data reduction without chitecture with maximum likelihood at the end of the hurting model accuracy. These strategies are imple- training. Constraints specify how to compute the cost mented by applying a small, constant sampling rate to all of the activated sub-network, which can typically be examples meeting any of the conditions above. Pseudo- done by estimating the number of floating-point opera- Random sampling determines whether examples should tions or running a pre-built hardware-aware neural cost be kept and re-weighted or simply discarded. This en- model. The reinforcement learning controller incorpo- sures that all training models train on the same data. This rates the provided cost estimate into the reward (e.g., scheme may be viewed as a practical version of [26] for 𝑅 = 𝑅accuracy + 𝛾 β‹… |cost/target βˆ’ 1|, where 𝛾 < 0) [24] in large problem instances with expensive evaluation. Sim- order to force the sampling distribution to converge to ple random sampling allows us to keep model estimates a cost-constrained point. In order to search for architec- unbiased with simple constant importance re-weighting. tures with lower training cost but neutral accuracy, in our It is important to avoid very small sampling rates in this system we set up multiple AutoML tasks with different scheme, the consequent large up-weighting can lead to constraint targets (e.g. 85%/90%/95% of the baseline cost) model instability. Re-weighting is particularly impor- and selected the one with neutral accuracy and smallest tant for maintaining calibration, since these sampling training cost. A recent application of this architecture strategies are directly correlated to labels. search to the model reduced time per training step by For sampling strategies that involve knowing the loss 16% without reducing accuracy. on an example, calculating that loss would require run- ning inference on the training example, removing most of the performance gains. For this reason, we use a proxy 3.3. Data Sampling value based on a prediction made by a ”teacher model”. Historical examples of clicks on search ads make up a In this two-pass approach. We first train once over all large dataset that increases substantially every day. The data to compute losses and associated sampling rates, and diminishing returns of ever larger datasets dictate that then once on the sub-sampled data. The first pass uses it is not beneficial to retain all the data. The marginal the same teacher model for distillation (Section 4.1.2) value for improving model quality goes toward zero, and and is only done once. Iterative research can then be eventually does not justify any extra machine costs for performed solely on the sub-sampled data. While these training compute and data storage. Alongside using ML latter models will have different losses per example, the optimization techniques to improve ML efficiency, we first pass loss-estimates still provide a good signal for also use data sampling to control training costs. Given the β€˜difficulty’ of the training example and leads to good that training is a single-pass over data in time-order, there results in practice. Overall our combination of class re- are two ways to reduce the training dataset: 1) restricting balancing and loss-based sampling strategies reduces the the time range of data consumed; and 2) sampling the data to < 25% of the original dataset for any given period data within that range. Limiting training data to more without significant loss in accuracy. recent periods is intuitive. As we extend our date range further back in time, the data becomes less relevant to future problems. Within any range, clicked examples 4. Accuracy are more infrequent and more important to our learning Next we detail a set of techniques aimed at improving the task; so we sample the non-clicked examples to achieve accuracy of the system. We discuss: additional losses that rough class balance. Since this is primarily for efficiency, better align offline training-time metrics with important exact class balance is unnecessary. A constant sampling business metrics, the application of distillation to our rate (a constant class imbalance prior) can be used with a online training setting, the adaptation of the Shampoo simple single-pass filter. To keep model predictions unbi- second-order optimizer to our model, and the use of Deep ased, importance weighting is used to up-weight negative and & Cross networks. examples by the inverse of the sampling rate. Two addi- tional sampling strategies that have proved effective are as follows: 4.1. Loss Engineering β€’ Sampling examples associated with a low logistic Loss engineering plays an important role in our system. loss (typically examples with low estimated CTR As the goal of our model is to predict whether an ad and no click). will be clicked, our model generally optimizes for logis- the model’s prediction is unbiased per example. More tic loss, often thought of as the cross-entropy between detail can be found in Section 7. Application of ranklosses model predictions and the binary task (click/no-click) to the model generated accuracy improvements of βˆ’0.81% labels for each example. Using logistic loss allows model with a slight increase in training cost of 1%. predictions to be unbiased so that the prediction can be interpreted directly as a calibrated probability. Bi- 4.1.2. Distillation. nary predictions can be improved by introducing soft prediction through distillation methods [27]. Beyond es- Distillation adds an additional auxiliary loss requiring timating the CTR per ad, it is important that the set of matching the predictions of a high-capacity teacher candidate ads for a particular query is correctly ranked model, treating teacher predictions as soft labels [27]. (such that ads with clicks have higher CTR than ads with- In our model, we use a two-pass online distillation out clicks), thus incorporating proper ranking losses is setup. On the first pass, a teacher model records its predic- also important. In this section, we discuss novel auxil- tions progressively before training on examples. Student iary losses and introduce multi-task and multi-objective models consume the teacher’s predictions while train- methods for joint training with these losses ing on the second pass. Thus, the cost of generating the predictions from the single teacher can be amortized across many students (without requiring the teacher to 4.1.1. Rank Losses repeat inference to generate predictions). In addition We found that Area under the ROC curve computed per to improving accuracy, distillation can also be used for query (PerQueryAUC) is a metric well correlated with reducing training data costs. Since the high-capacity business metrics quantifying the overall performance teacher is trained once, it can be trained on a larger data of a model. In addition to using PerQueryAUC during set. Students benefit implicitly from the teachers prior evaluation, we also use a relaxation of this metric, i.e., knowledge of the larger training set, and so require train- rank-loss, as a second training loss in our model. There ing only smaller and more recent data. The addition of are many rank losses in the learning-to-rank family [28, distillation to the model improved accuracy by 0.41% 29]. We find one effective approximation is Ranknet loss without increasing training costs (in the student). [30], which is a pairwise logistic loss: 4.1.3. Curriculums of Losses βˆ’ βˆ‘ βˆ‘ log(sigmoid(𝑠𝑖 , 𝑠𝑗 )), π‘–βˆˆ{𝑦𝑖 =1} π‘—βˆˆ{𝑦𝑗 β‰ 1} In machine learning, curriculum learning [34] typically involves a model learning easy tasks first and gradually where 𝑠𝑖 , 𝑠𝑗 are logit scores of two examples. switching to harder tasks. We found that training on all Rank losses should be trained jointly with logistic loss; classes of losses in the beginning of training increased there are several potential optimization setups. In one model instability (manifesting as outlier gradients which setup, we create a multi-objective optimization problem cause quality to diverge). Thus, we apply an approach [31]: similar to curriculum learning to ramp up losses, starting with the binary logistic loss and gradually ramping up β„’ (π‘Š ) = 𝛼1 β„’rank (yrank , s) + (1 βˆ’ 𝛼1 )β„’logistic (y, s), distillation and rank losses over the course of training. where s are logit scores for examples, yrank are ranking labels, y are the binary task labels, and 𝛼1 ∈ (0, 1) is the 4.2. Second-order Optimization rank-loss weight. Another solution is to use multi-task Second-order optimization methods that use second learning [32, 33], where the model produces multiple derivatives and/or second-order statistics are known to different estimates 𝑠 for each loss. have better convergence properties to first-order meth- ods [35]. Yet to our knowledge, second-order methods are β„’ (π‘Šshared , π‘Šlogistic , π‘Šrank ) = rarely reported to be used in production ML systems for 𝛼1 β„’rank (y, srank ) + (1 βˆ’ 𝛼1 )β„’logistic (y, slogistic ), DNNs. Recent work on Distributed Shampoo [16, 36] has made second-order optimization feasible for our model where π‘Šshared are weights shared between the two losses, by leveraging the heterogneous compute offered by TPUs π‘Šlogistic are for the logistic loss output, and π‘Šrank are and host-CPUs, and employing additional algorithmic for the rank-loss output. In this case, the ranking loss and efficiency improvements. affects the ”main” prediction slogistic as a ”regularizer” on In our system, Distributed Shampoo provided much π‘Šshared . faster convergence with respect to training steps, and As rank losses are not naturally calibrated predictors of yielded better accuracy when compared to standard adap- click probabilities, the model’s predictions will be biased. tive optimization techniques including AdaGrad [15], A strong bias correction component is needed to ensure   ̐ Μ‘ 9Γ¨Δ†Γ€ Figure 3: Distributed Shampoo [16]: inverse-𝑝 π‘‘β„Ž root computations in double precision runs every 𝑁 steps and asynchronously pipelined on all CPU cores attached to the TPU accelerators. Adam [37], Yogi [38], and LAMB [39]. While, second- described in [41]. order methods like Distributed Shampoo is known to Stability & Efficiency. Distributed Shampoo has higher provide faster convergence compared to first-order meth- computational complexity per step as it involves multi- ods in the literature - It often fails to provide competitive plication of large matrices for preconditioning and statis- wall-clock time due to the computational overheads in tics/preconditioner computation. We address these over- the optimizer on smaller scale benchmarks. For our train- heads with several techniques in our deployment. For ing system, second-order optimization method was an example, the block-diagonalization suggested in [16] was ideal candidate due to large batch sizes used in training effective at reducing the computational complexity while which amortizes the cost of costly update rule. Train- also allowing the implementation of parallel updates for ing time only increased by approximately 10% and the each block in the data-parallel setting via weight-update improvements to model accuracy far outweighed the in- sharding [42]. This reduced the overall step time. More- crease in training time. We next discuss some of the more over, optimizer overheads are independent of batch size, salient implementation details specific to our model. and thus we increased batch size to reduce overall com- Learning Rate Grafting. One of the main challenges in putational overhead. Finally, we found that condition online optimization is defining a learning rate schedule. number of statistics used for preconditioning can vary In contrast to training on static datasets, the number of in range reaching more than 1010 . Because, numerical steps an online model will require is not known and may stability and robustness is of utmost importance in pro- be unbounded. Accordingly, popular learning rate sched- duction; we make use of double precision numerics. To ules from literature depending on fixed time horizons, compute the preconditioners, we use the CPUs attached such as cosine decay or exponential decay, perform worse to the TPUs to run inverse-𝑝th roots and exploit a faster in contrast to the implicit data-dependent adaptive sched- algorithm, the coupled Newton iteration for larger pre- ule from AdaGrad [15]. As observed in literature [40], we conditioners [43] as in Figure 3. also find that AdaGrad’s implicit schedule works quite When integrated with the ad click prediction model well in the online setting; especially after the πœ– parame- the optimizer improved our primary measure of accu- ter (the initial accumulator value) is tuned. Accordingly, racy, Area under the ROC curve computed per query we bootstrap the schedule for Distributed Shampoo via (PerQueryAuc), by 0.44%. Accuracy improvements above grafting the per layer step size from AdaGrad. More pre- 0.1% are considered significant. For comparison: a naive cisely, we use the direction from Shampoo while using scaling of the deep network by 2x yields a PerQueryAUC the magnitude of step size from AdaGrad at a per-layer improvement of 0.13%. See Table 1 for a summary of granularity. An important feature of this bootstrapping accuracy technique results. is that it allowed us to inherit hyper-parameters from previous AdaGrad tunings to search for a Pareto optimal 4.3. Deep & Cross Network configuration. Momentum. Another effective implementation choice Learning effective feature crosses is critical for recom- is the combination of Nesterov-styled momentum with mender systems [3, 44]. We adopt an efficient variant of the preconditioned gradient. Our analysis suggests that DCNv2 [44] using bottlenecks. This is added between the momentum added modest gains on top of Shampoo embedding layer 𝑒 described in Section 2 and the DNN. without increasing the computational overhead while We next describe the Deep & Cross Network architec- marginally increasing the memory overhead. Computa- ture and its embedding layer input. We use a standard tional overhead was addressed via the approximations embedding projection layer for sparse categorical fea- Technique Accuracy Training Cost Inference Cost Improvement Increase Increase Training Cost Deep & Cross Network 0.18% 3% 1% Technique Decrease Shampoo Optimizer 0.44% 10% 0% Distillation 0.46% <1%βˆ— 0% Bottlenecks 7% Rank Losses 0.81% <1% 0% AutoML 16% Data Sampling 75% Baseline: 2x DNN Size 0.13% 36% 10% Table 2 Table 1 Training cost improvements of Accuracy improvement and training/inference costs for accuracy improving applied techniques. techniques.βˆ— Distillation does not include teacher cost which, due to amortization, is a small fraction of overall training costs. tures. We project categorical feature 𝑖 from a higher of hidden layers. Note, that sparse embedding lookups dimensional sparse space to a lower dimensional dense add to the overall training cost, thus doubling the number space using 𝑒𝑖̃ = π‘Šπ‘– π‘₯𝑖 , where π‘₯𝑖 ∈ {0, 1}𝑣𝑖 ; π‘Šπ‘– ∈ β„π‘šπ‘– ×𝑣𝑖 is layers does not proportionally increase the cost. the learned projection matrix; 𝑒𝑖̃ is the dense embedding representation; and 𝑣𝑖 and π‘šπ‘– represent the vocabulary and dense embedding sizes respectively. For multivalent 5. Irreproducibility features, we use average pooling of embedding vectors. Irreproducibility, noted in Section 1, may not be easy to Embedding dimensions {π‘šπ‘– } are tuned for efficiency and detect because it may appear in post deployment sys- accuracy trade-offs using AutoML (Section 3.2). Output tem metrics and not in progressive validation quality of the embedding layer is a wide concatenated vector metrics. A pair of duplicate models may converge to 𝑒0 = concat(𝑒1Μƒ , 𝑒2Μƒ … 𝑒𝐹̃ ) ∈ β„π‘š for 𝐹 features. For crosses, two different optima of the highly non-convex objective, we adopt an efficient variant of [44], applied directly on giving equal average accuracy, but different individual top of the embedding layer to explicitly learn feature predictions, but with different downstream system/auc- crosses: 𝑒𝑖 = 𝛼2 (𝑒0 βŠ™ π‘ˆπ‘– 𝑉𝑖 π‘’π‘–βˆ’1 ) + π‘’π‘–βˆ’1 , where 𝑒𝑖 , π‘’π‘–βˆ’1 ∈ β„π‘š tion outcomes. Model deployment leads to further diver- represent the output and input of the 𝑖th cross layer, re-gence, as ads selected by deployed models become part spectively; π‘ˆπ‘– ∈ β„π‘šΓ—π‘˜ and 𝑉𝑖 ∈ β„π‘˜Γ—π‘š are the learned of subsequent training examples [17]. This can critically weight matrices leveraging bottlenecks (Section 3.1) for affect R&D: experimental models may appear beneficial, efficiency; 𝛼2 is a scalar, ramping up from 0 β†’ 1 during but gains may disappear when they are retrained and initial training, allowing the model to first learn the em- deployed in production. Theoretical analysis is complex beddings and then the crosses in a curriculum fashion. even in the simple convex case, which is considered only Furthermore, this ReZero initialization [45] also improvesin very recent work [46]. Many factors contribute to ir- model stability and reproducibility (Section 5). reproducibility [47, 48, 49, 10, 50, 51], including random In practice adding the Deep & Cross Network to the initialization, non-determinism in training due to highly- model yielded an accuracy improvement of 0.18% with parallelized and highly-distributed training pipelines, nu- a minimal increase in training cost of 3%. merical errors, hardware, and more. Slight deviations early in training may lead to very different models [52]. 4.4. Summary of Efficiency and Accuracy While standard training metrics do not expose system Results irreproducibility, we can use deviations of predictions on individual examples as a cheap proxy, allowing us to fail Below we share measurements of the relative impact of fast prior to evaluation at deployment-time. Common sta- the previously discussed efficiency and accuracy tech- tistical metrics (standard deviation, various divergences) niques as applied to the production model. The goal is can be used [53, 54] but they require training many more to give a very rough sense of the impact of these tech- models, which is undesirable at our scale. Instead, we use niques and their accuracy vs. efficiency tradeoffs. While the Relative Prediction Difference (PD) [7, 9] metric precise measures of accuracy improvement on one par- β–³ ticular model are not necessarily meaningful, we believe Ξ”π‘Ÿ = 1/𝑀 β‹… βˆ‘ |𝑦𝑖,1 Μ‚ βˆ’ 𝑦𝑖,2 Μ‚ |/[(𝑦𝑖,1 Μ‚ + 𝑦𝑖,2 Μ‚ )/2] the coarse ranking of techniques and rough magnitude 𝑖 of results are interesting (and are consistent with our , measuring absolute point-wise difference in model pre- general experience). dictions for a pair of models (subscripts 1 and 2), nor- The baseline 2x DNN size model doubles the number malized by the pair’s average prediction. Computing PD requires training a pair of models instead of one, but we have observed that reducing PD is sufficient to im- prove reproducibility of important system metrics. In this section, we focus on methods to improve PD; Section 7 focuses on directly improving system metrics. PDs may be as high as 20% for deep models. Per- haps surprisingly, standard methods such as fixed ini- tialization, regularization, dropout, data augmentation, as well as new methods imposing constraints [55, 56] either failed to improve PD or improved PD at the cost Figure 4: Model factorization into separable Quality and UI of accuracy degradation. Techniques like warm-starting models with estimated CTR ∢= 𝜏 (𝑄 β‹… π‘ˆ ) model weights to values of previously trained models may not be preferable because they can anchor the model to a potentially bad solution space and do not help the a non-ensemble model with PD less than 10% that also development cycle for newer more reproducible models improved accuracy by 0.1%. System reproducibility met- for which there is no anchor. rics also improved to acceptable levels compared to the Other techniques have shown varying levels of suc- unacceptable levels of ReLU single component models. cess. Ensembles [57], specifically self-ensembles [58], where we average predictions of multiple model dupli- cates (each initialized differently), can reduce prediction 6. Generalizing Across UI variance and PD. However, maintaining ensembles in a production system with multiple components builds up Treatments substantial technical debt [59]. While some literature One of the major factors in CTR performance of an ad is [60, 61, 62] describes accuracy advantages for ensembles, its UI treatment, including positioning, placement rela- in our regime, ensembles degraded accuracy relative to tive to other results on the page, and specific renderings equal-cost single networks. We believe this is because, such as bolded text or inlined images. A complex auc- unlike in the benchmark image models, examples in on- tion must explore not just the set of results to show, but line CTR systems are visited once, and, more importantly, how they should be positioned relative to other content, the learned model parameters are dominated by sparse and how they should be individually rendered [68]. This embeddings. Relatedly, more sophisticated techniques exploration must take place efficiently over a combinato- based on ensembling and constraints can also improve rially large space of possible treatments. PD [63, 7, 8]. We solve this through model factorization, replacing Techniques described above trade accuracy and com- estimated CTR with 𝜏 (𝑄 β‹… π‘ˆ ), composed of a transfer plexity for better reproducibility, requiring either ensem- function 𝜏 where 𝑄, π‘ˆ are separable models that output bles or constraints. Further study and experimentation vectorized representations of the Quality and the UI, re- revealed that the popular use of Rectified Linear Unit spectively, and are combined using an inner-product. (ReLU) activations contributes to increased PD. ReLU’s While 𝑄, consisting of a large DNN and various feature gradient discontinuity at 0 induces a highly non-convex embeddings, is a costly model, it needs to be evaluated loss landscape. Smoother activations, on the other hand, only once per ad, irrespective of the number of UI treat- reduce the amount of non-convexity, and can lead to ments. In contrast, π‘ˆ, being a much lighter model, can be more reproducible models [9]. Empirical evaluations of evaluated hundreds of times per ad. Moreover, due to the various smooth activations [64, 65, 66, 67] have shown relatively small feature space of the UI model, outputs not only better reproducibility compared to ReLU, but can be cached to absorb a significant portion of lookup also slightly better accuracy. The best reproducibility- costs (as seen in Figure 4). accuracy trade-offs in our system were attained by the Separately from model performance requirements, ac- simple Smooth reLU (SmeLU) activation proposed in [9]. counting for the influence of UI treatments on CTR is The function form is: also a crucial factor for model quality. Auction dynamics 0; 𝑧 ≀ βˆ’π›½ deliberately create strong correlations between individ- ⎧ (𝑧+𝛽)2 ual ads and specific UI treatments. Results that are lower 𝑓SmeLU (𝑧) = ; |𝑧| ≀ 𝛽 (1) on the page may have low CTR regardless of their rele- ⎨ 4𝛽 ⎩ 𝑧; 𝑧 β‰₯ 𝛽. vance to the query. Failure to properly disentangle these In our system, 3-component ensembles reduced PD correlations creates inaccuracy when generalizing over from 17% to 12% and anti-distillation reduced PD further UI treatments (e.g., estimating CTR if the same ad was to 10% with no accuracy loss. SmeLU allowed launching shown higher on the page). Pricing and eligibility deci- Figure 5: (a) Loss landscape for a model with non-identifiability across two weights and how bias constraints help find the right solution: we add additional criteria (red and orange curves) such that we choose the correct solution at optimum loss (dark blue curve). (b) Calibration bias across buckets of estimated CTR. For calibrated predictions, we expect uniform bias (black curve). Whereas a model with the original objective function is biased for certain buckets of estimated CTR (blue curve), we can get much closer to uniform with bias constraints (red curve). sions depend crucially on CTR estimates of sub-optimal to our objective function, enforced on relevant slices of UIs that are rarely occurring in the wild. For instance, our either the training set or a separate, labelled dataset. This system shouldn’t show irrelevant ads, and so such sce- allows us reduce non-identifiability by anchoring model narios will not be in the training corpus, and so estimates loss to a desired part of the solution space (e.g., one that of their irrelevance (low CTR) will be out of distribution. satisfies calibration) (Figure 5a). By extension, we reduce But these estimates are needed to ensure the ads do not irreproducibility by anchoring a retrained model to the show. Even for relevant ads, there is a similar problem. same solution. Performance of ads that rarely show in first position may Our technique is more lightweight than other meth- still be used to set the price of those ads that often do ods used for large-scale, online training (counterfactual show in first position. This creates a specific generaliza- reasoning [69], variations of inverse propensity scoring tion problem related to UI, addressed in Section 7. [70, 71]): in practice, there are fewer parameters to tune, Calibration is an important characteristic for large- and we simply add an additional term to our objective scale ads recommendation. We define calibration bias rather than changing the model structure. To address cal- as label minus prediction, and want this to be near zero ibration, [72] adjusts model predictions in a separate cal- per ad. A calibrated model allows us to use estimated ibration step using isotonic regression, a non-parametric CTR to determine the trade-off between showing and method. Our technique does calibration jointly with esti- not showing an ad, and between showing one ad versus mation, and is more similar to methods which consider another; both calculations can be used in downstream efficient optimization of complex and augmented objec- tasks such as UI treatment selection, auction pricing, or tives (e.g., [73, 74]). Using additional constraints on the understanding of ad viewability. objective allows us to address a wide range of calibration The related concept of credit attribution is similar to and credit attribution issues. counterfactual reasoning [69] or bias in implicit feedback [70]. It is a specific non-identifiability in model weights that can contribute to irreproducibility (Section 5). Con- 7. Bias Constraints sider an example to illustrate the UI effect (Section 6): assume that model A has seen many training examples 7.1. Online Optimization of Bias with high-CTR ads in high positions, and (incorrectly) Constraints learned that ad position most influences CTR. Model B, We now optimize our original objective function with defined similar to A, trains first on the few examples the constraint that βˆ€π‘˜βˆ€π‘– ∈ π‘†π‘˜ , (𝑦𝑖 βˆ’ 𝑦𝑖̂ ) = 0. Here, π‘†π‘˜ are where high-CTR ads appear in low positions, and (cor- subsets of the training set which we’d like to be calibrated rectly) learns that something else (e.g., ad relevancy to (e.g., under-represented classes of data) or new training query) is causing high CTR. Both models produce the data that we may or may not optimize the original model same estimated CTR for these ads but for different rea- weights over (e.g., out-of-distribution or off-policy data sons, and when they are deployed, model A will likely gathered from either randomized interventions or explo- show fewer ads because it will not consider otherwise ration scavenging [75, 70, 76]). To aid optimization, we useful ads in lower positions; these models will show first transform this into an unconstrained optimization system irreproducibility. problem by introducing a dual variable πœ†π‘˜,𝑖 for each con- In our system, we use a novel, general-purpose tech- straint and maximizing the Lagrangian relative to the nique called bias constraints to address both calibration dual variables. Next, instead of enforcing zero bias per and credit attribution. We add calibration bias constraints example, we ask that the squared average bias across π‘†π‘˜ is zero. This reduces the number of dual variables to {πœ†π‘˜ }, placement, we will include it in {𝑓 }). For the model in and is equivalent to adding an L2 regularization on πœ†π‘˜ Table 3, we saw substantial bias improvements on several with a constraint of zero average bias. For a constant 𝛼3 data subsets π‘†π‘˜ related to out-of-distribution ad placement controlling regularization, and tuned via typical hyper- and more reproducibility with minimal accuracy impact parameter tuning techniques (e.g. grid search), our new when adding bias constraints. optimization is: 𝑆1 Bias 𝑆2 Bias 𝑆3 Bias Loss Ads/Query Churn 𝐾 𝛼 -15% -75% -43% +0.03% -85% min max βˆ‘ β„’ (𝑦𝑖 , 𝑦𝑖̂ ) + βˆ‘ βˆ‘(πœ†π‘˜ (𝑦𝑖 βˆ’ 𝑦𝑖̂ ) βˆ’ 3 πœ†π‘˜2 ) π‘Š πœ†π‘˜ 𝑖 π‘˜=1 π‘–βˆˆπ‘† 2 π‘˜ Table 3 Any degraded accuracy or stability is mitigated by com- Progressive validation and deployed system metrics reported binations of the following tunings, ordered by impact: as a percent change for a bias constraint over the original ramping up the bias constraint term, reducing the learn- model (negative is better). Ads/Query Churn records how ing rate on {πœ†π‘˜ }, increasing 𝛼3 , or adding more or finer- much the percent difference in the number of ads shown above search results per query between two model retrains grained constraints (breaking up π‘†π‘˜ ). We believe the first changes when deployed in similar conditions; we want this to two can help normalize any differences between the mag- be close to zero. nitude of the dual variables and other weights, and the latter two help lessen the strength of the bias term if π‘†π‘˜ Viewing the bias constraints as anchoring loss rather aren’t optimally selected. than changing the loss landscape (Figure 5a), we find that the technique does not fix model irreproducibility but 7.2. Bias Constraints for General rather mitigates system irreproducibility: we were able Calibration to cut the number of components in the ensemble by half and achieve the same level of reproducibility. If we plot calibration bias across buckets of interesting variables, such as estimated CTR or other system met- rics, we expect a calibrated model to have uniform bias. 8. Conclusion However, for several axes of interest, our system shows higher bias at the ends of the range (Figure 5b). We ap- We detailed a set of techniques for large-scale CTR pre- ply bias constraints to this problem by defining π‘†π‘˜ to be diction that have proven to be truly effective β€œin pro- examples in each bucket of, e.g., estimated CTR. Since duction”: balancing improvements to accuracy, training we don’t use the dual variables during inference, we can and deployment cost, system reproducibility and model include estimated CTR in our training objective. With complexityβ€”along with describing approaches for gener- bias constraints, bias across buckets of interest becomes alizing across UI treatments. We hope that this brief visit much more uniform: variance is reduced by more than to the factory floor will be of interest to ML practitioners half. This can in turn improve accuracy of downstream of CTR prediction systems, recommender systems, online consumers of estimated CTR. training systems, and more generally to those interested in large industrial settings. 7.3. Exploratory Data and Bias Constraints References We can also use bias constraints to solve credit attribution [1] H. B. McMahan, G. Holt, D. Sculley, M. Young, for UI treatments. We pick π‘†π‘˜ by focusing on classes D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, of examples that represent uncommon UI presentations D. Golovin, et al., Ad click prediction: a view from for competitive queries where the ads shown may be the trenches, in: SIGKDD, 2013. quite different. For example, 𝑆1 might be examples where [2] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, a high-CTR ad showed at the bottom of the page, 𝑆2 A. Atallah, R. Herbrich, S. Bowers, J. Q. n. Candela, examples where a high-CTR ad showed in the second-to- Practical lessons from predicting clicks on ads at last position on the page, etc. Depending on how model facebook, 2014. training is implemented, it may be easier to define π‘†π‘˜ in [3] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, terms of existing model features (e.g., for a binary feature Y. Yan, J. Jin, H. Li, K. Gai, Deep interest network for 𝑓, we split one sum over π‘†π‘˜ into two sums). We choose {𝑓 } click-through rate prediction, in: SIGKDD, 2018. to include features that generate partitions large enough [4] X. Ling, W. Deng, C. Gu, H. Zhou, C. Li, F. Sun, to not impact convergence but small enough that we Model ensemble for click prediction in bing search expect the bias per individual example will be driven to ads, in: WWW, 2017. zero (e.g., if we think that query language impacts ad [5] H. R. Varian, C. Harris, The vcg auction in theory preprint arXiv:1511.05641 (2015). and practice, American Economic Review (2014). [20] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wier- [6] M. W. Dusenberry, D. Tran, E. Choi, J. Kemp, stra, Weight uncertainty in neural network, in: In- J. Nixon, G. Jerfel, K. Heller, A. M. Dai, Analyzing ternational conference on machine learning, PMLR, the role of model uncertainty for electronic health 2015, pp. 1613–1622. records, in: CHIL, 2020. [21] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, N. de Fre- [7] G. I. Shamir, L. Coviello, Anti-distillation: Im- itas, Predicting parameters in deep learning, CoRR proving reproducibility of deep networks, arXiv (????). preprint arXiv:2010.09923 (2020). [22] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learn- [8] G. I. Shamir, L. Coviello, Distilling from ensem- ing transferable architectures for scalable image bles to improve reproducibility of neural networks, recognition, in: CVPR, 2018. 2020. [23] E. Real, A. Aggarwal, Y. Huang, Q. V. Le, Regu- [9] G. I. Shamir, D. Lin, L. Coviello, Smooth activa- larized evolution for image classifier architecture tions and reproducibility in deep networks, arXiv search, in: AAAI, 2019. preprint arXiv:2010.09931 (2020). [24] G. Bender, H. Liu, B. Chen, G. Chu, S. Cheng, P.-J. [10] R. R. Snapp, G. I. Shamir, Synthesizing irre- Kindermans, Q. V. Le, Can weight sharing outper- producibility in deep networks, arXiv preprint form random architecture search? an investigation arXiv:2102.10696 (2021). with tunas, in: CVPR, 2020. [11] A. D’Amour, K. Heller, D. Moldovan, B. Adlam, [25] R. J. Williams, Simple statistical gradient-following B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisen- algorithms for connectionist reinforcement learn- stein, M. D. Hoffman, F. Hormozdiari, N. Houlsby, ing, Machine learning (1992). S. Hou, G. Jerfel, A. Karthikesalingam, M. Lucic, [26] W. Fithian, T. Hastie, Local case-control sampling: Y. Ma, C. McLean, D. Mincu, A. Mitani, A. Mon- Efficient subsampling in imbalanced data sets, The tanari, Z. Nado, V. Natarajan, C. Nielson, T. F. Annals of Statistics (2014). Osborne, R. Raman, K. Ramasamy, R. Sayres, [27] G. Hinton, O. Vinyals, J. Dean, Distilling the knowl- J. Schrouff, M. Seneviratne, S. Sequeira, H. Suresh, edge in a neural network, in: NIPS Deep Learning V. Veitch, M. Vladymyrov, X. Wang, K. Webster, and Representation Learning Workshop, 2015. S. Yadlowsky, T. Yun, X. Zhai, D. Sculley, Un- [28] R. K. Pasumarthi, S. Bruch, X. Wang, C. Li, M. Ben- derspecification presents challenges for credibil- dersky, M. Najork, J. Pfeifer, N. Golbandi, R. Anil, ity in modern machine learning, arXiv preprint S. Wolf, Tf-ranking: Scalable tensorflow library for arXiv:2011.03395 (2020). learning-to-rank, in: SIGKDD, 2019. [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, [29] C. J. Burges, From Ranknet to Lambdarank to Lamb- L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At- daMart: An overview, Learning (2010). tention is all you need, in: NeurIPS, 2017. [30] C. Burges, T. Shaked, E. Renshaw, A. Lazier, [13] N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, M. Deeds, N. Hamilton, G. Hullender, Learning J. Laudon, C. Young, D. Patterson, A domain- to rank using gradient descent, in: ICML, 2005. specific supercomputer for training deep neural [31] D. Sculley, Combined regression and ranking, in: networks, Communications of the ACM (2020). In KDD’10, 2010. [14] H. B. McMahan, M. Streeter, Adaptive bound op- [32] R. Caruana, Multitask learning, Machine Learning timization for online convex optimization, arXiv (1997). preprint arXiv:1002.4908 (2010). [33] S. Ruder, An overview of multi-task learning in [15] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient deep neural networks, 2017. arXiv:1706.05098 . methods for online learning and stochastic opti- [34] Y. Bengio, J. Louradour, R. Collobert, J. Weston, mization, JMLR (2011). Curriculum learning, in: ICML, 2009. [16] R. Anil, V. Gupta, T. Koren, K. Regan, Y. Singer, [35] J. Nocedal, S. J. Wright, Numerical Optimization, Second order optimization made practical, Springer, 2006. https://arxiv.org/abs/2002.09018 (2020). [36] V. Gupta, T. Koren, Y. Singer, Shampoo: Precon- [17] A. Swaminathan, T. Joachims, Batch learning from ditioned stochastic tensor optimization, in: ICML, logged bandit feedback through counterfactual risk 2018. minimization, JMLR (2015). [37] D. P. Kingma, J. Ba, Adam: A method for stochas- [18] A. Blum, A. Kalai, J. Langford, Beating the hold-out: tic optimization, arXiv preprint arXiv:1412.6980 Bounds for k-fold and progressive cross-validation, (2014). in: COLT, 1999. [38] M. Zaheer, S. Reddi, D. Sachan, S. Kale, S. Ku- [19] T. Chen, I. Goodfellow, J. Shlens, Net2net: Ac- mar, Adaptive methods for nonconvex optimization, celerating learning via knowledge transfer, arXiv NeurIPS (2018). [39] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, ble prediction variation from neuron activation S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, strength in recommender systems, arXiv preprint C. Hsieh, Large batch optimization for deep learn- arXiv:2008.07032 (2020). ing: Training bert in 76 minutes, arXiv preprint [54] H. Yu, Z. Chen, D. Lin, G. Shamir, J. Han, Dropout arXiv:1904.00962 (2019). prediction variation estimation using neuron acti- [40] N. Agarwal, R. Anil, E. Hazan, T. Koren, C. Zhang, vation strength, arXiv preprint arXiv:2110.06435 Disentangling adaptive gradient methods from (2021). learning rates, arXiv preprint arXiv:2002.11803 [55] S. Bhojanapalli, K. J. Wilber, A. Veit, A. S. Rawat, (2020). S. Kim, A. K. Menon, S. Kumar, On the reproducibil- [41] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the ity of neural network predictions, 2021. importance of initialization and momentum in deep [56] G. I. Shamir, Systems and methods for improved learning, in: ICML, 2013. generalization, reproducibility, and stabilization of [42] Y. Xu, H. Lee, D. Chen, H. Choi, B. Hechtman, neural networks via error control code constraints, S. Wang, Automatic cross-replica sharding of 2018. weight update in data-parallel training, arXiv [57] T. G. Dietterich, Ensemble methods in machine preprint arXiv:2004.13336 (2020). learning, Lecture Notes in Computer Science (2000). [43] C.-H. Guo, N. J. Higham, A schur–newton method [58] Z. Allen-Zhu, Y. Li, Towards understanding en- for the matrix\boldmath p th root and its inverse, semble, knowledge distillation and self-distillation SIAM Journal on Matrix Analysis and Applications in deep learning, arXiv preprint arXiv:2012.09816 (2006). (2020). [44] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, [59] D. Sculley, G. Holt, D. Golovin, E. Davydov, L. Hong, E. Chi, Dcn v2: Improved deep & cross T. Phillips, D. Ebner, V. Chaudhary, M. Young, Ma- network and practical lessons for web-scale learn- chine learning: The high interest credit card of ing to rank systems, in: WWW, 2021. technical debt (2014). [45] T. Bachlechner, B. P. Majumder, H. Mao, G. Cottrell, [60] D. Kondratyuk, M. Tan, M. Brown, B. Gong, J. McAuley, Rezero is all you need: Fast conver- When ensembling smaller models is more effi- gence at large depth, in: UAI, 2021. cient than single large models, arXiv preprint [46] K. Ahn, P. Jain, Z. Ji, S. Kale, P. Netrapalli, G. I. arXiv:2005.00570 (2020). Shamir, Reproducibility in optimization: The- [61] E. Lobacheva, N. Chirkova, M. Kodryan, D. Vetrov, oretical framework and limits, arXiv preprint On power laws in deep ensembles, arXiv preprint arXiv:2202.04598 (2022). arXiv:2007.08483 (2020). [47] S. Fort, H. Hu, B. Lakshminarayanan, Deep en- [62] X. Wang, D. Kondratyuk, E. Christiansen, K. M. sembles: A loss landscape perspective, 2020. Kitani, Y. Alon, E. Eban, Wisdom of committees: An arXiv:1912.02757 . overlooked approach to faster and more accurate [48] J. Frankle, G. K. Dziugaite, D. Roy, M. Carbin, Linear models, arXiv preprint arXiv:2012.01988 (2021). mode connectivity and the lottery ticket hypothesis, [63] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. in: International Conference on Machine Learning, Dahl, G. E. Hinton, Large scale distributed neural 2020. network training through online distillation, arXiv [49] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, preprint arXiv:1804.03235 (2018). R. Frostig, G. E. Dahl, Measuring the effects of [64] J. T. Barron, Continuously differentiable exponen- data parallelism on neural network training, arXiv tial linear units, arXiv preprint arXiv:1704.07483 preprint arXiv:1811.03600 (2018). (2017). [50] C. Summers, M. J. Dinneen, On nondeterminism [65] D. Hendrycks, K. Gimpel, Gaussian error lin- and instability in neural network optimization, ear units (gelus), arXiv preprint arXiv:1606.08415 2021. (2016). [51] D. Zhuang, X. Zhang, S. L. Song, S. Hooker, [66] P. Ramachandran, B. Zoph, Q. V. Le, Search- Randomness in neural network training: Char- ing for activation functions, arXiv preprint acterizing the impact of tooling, arXiv preprint arXiv:1710.05941 (2017). arXiv:2106.11872 (2021). [67] H. Zheng, Z. Yang, W. Liu, J. Liang, Y. Li, Improv- [52] A. Achille, M. Rovere, S. Soatto, Critical learning ing deep neural networks using softplus units, in: periods in deep neural networks, arXiv preprint IJCNN, 2015. arXiv:1711.08856 (2017). [68] R. Cavallo, P. Krishnamurthy, M. Sviridenko, C. A. [53] Z. Chen, Y. Wang, D. Lin, D. Cheng, L. Hong, E. Chi, Wilkens, Sponsored search auctions with rich ads, C. Cui, Beyond point estimate: Inferring ensem- CoRR abs/1701.05948 (2017). URL: http://arxiv.org/ abs/1701.05948. arXiv:1701.05948 . [69] L. Bottou, J. Peters, J. QuiΓ±onero-Candela, D. X. ibration: A simple way to improve click models, Charles, D. M. Chickering, E. Portugaly, D. Ray, CIKM (2018). P. Simard, E. Snelson, Counterfactual reasoning and [73] E. Eban, M. Schain, A. Mackey, A. Gordon, R. A. learning systems: The example of computational Saurous, G. Elidan, Scalable learning of non- advertising, JMLR (2013). decomposable objectives, in: AIStats, 2017. [70] T. Joachims, A. Swaminathan, T. Schnabel, Un- [74] G. S. Mann, A. McCallum, Simple, robust, scalable biased learning-to-rank with biased feedback, in: semi-supervised learning via expectation regular- WSDM, 2017. ization, in: ICML, 2007. [71] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, [75] X. Wang, M. Bendersky, D. Metzler, M. Najork, M. de Rijke, Large-scale validation of counterfac- Learning to rank with selection bias in personal tual learning methods: A test-bed, arXiv preprint search, in: ACM SIGIR, 2016. arXiv:1612.00367 (2016). [76] J. Langford, A. Strehl, J. Wortman, Exploration [72] A. Borisov, J. Kiseleva, I. Markov, M. de Rijke, Cal- scavenging, in: ICML, 2008.