On the Factory Floor: ML Engineering for
Industrial-Scale Ads Recommendation Models
Rohan Anil1 , Sandra Gadanho1 , Da Huang1 , Nijith Jacob1 , Zhuoshu Li1 , Dong Lin1 ,
Todd Phillips1 , Cristina Pop1 , Kevin Regan1 , Gil I. Shamir1 , Rakesh Shivanna1 and Qiqi Yan1
1
    Google Inc.


                                       Abstract
                                       For industrial-scale advertising systems, prediction of ad click-through rate (CTR) is a central problem. Ad clicks constitute a
                                       significant class of user engagements and are often used as the primary signal for the usefulness of ads to users. Additionally,
                                       in cost-per-click advertising systems where advertisers are charged per click, click rate expectations feed directly into value
                                       estimation. Accordingly, CTR model development is a significant investment for most Internet advertising companies.
                                       Engineering for such problems requires many machine learning (ML) techniques suited to online learning that go well
                                       beyond traditional accuracy improvements, especially concerning efficiency, reproducibility, calibration, credit attribution.
                                       We present a case study of practical techniques deployed in a search ads CTR model at a large Internet company. This paper
                                       provides an industry case study highlighting important areas of current ML research and illustrating how impactful new ML
                                       methods are evaluated and made useful in a large-scale industrial setting.

                                       Keywords
                                       Personalization, Recommender system, Content optimization, Content ranking, Content diversity, Causal bandit, Contextual
                                       bandit, View-through attribution, Holistic optimization


1. Introduction                                                                                       viewed video, search query, or other. Search advertis-
                                                                                                      ing specifically looks at matching a query 𝑞 with an ad
Ad click-through rate (CTR) prediction is a key com- 𝑎. CTR models for recommendation specifically aim to
ponent of online advertising systems that has a direct predict the probability 𝑃(𝑐𝑙𝑖𝑐𝑘|𝑥), where the input 𝑥 is
impact on revenue, and continues to be an area of active an ad-query pair ⟨𝑎, 𝑞⟩, potentially adorned with addi-
research [1, 2, 3, 4]. This paper presents a detailed case tional factors affecting CTR, especially related to user
study to give the reader a ”tour of the factory floor” of a interface: how ads will be positioned and rendered on a
production CTR prediction system, describing challenges results page (Section 6).
specific to this category of large industrial ML systems                                                 Beyond surfacing maximally useful results, recom-
and highlighting techniques that have proven to work mender systems for ads have important additional cali-
well in practice.                                                                                     bration requirements. Actual click labels are stochastic,
   The production CTR prediction model consists of bil- reflecting noisy responses from users. For any given ad-
lions of weights, trains on more than one hundred bil- query 𝑥𝑖 and binary label 𝑦𝑖 , we typically hope to achieve
lion examples, and is required to perform inference at precisely 𝑃(𝑐𝑙𝑖𝑐𝑘|𝑥𝑖 ) ∶= 𝔼⟨𝑥 ,𝑦 ⟩∼𝐷 [𝑦𝑖 = 𝑐𝑙𝑖𝑐𝑘|𝑥𝑖 ] over some
                                                                                                                                 𝑖 𝑖
well over one hundred thousand requests per second. sample of examples 𝐷 (in test or training). While a typical
The techniques described here balance accuracy improve- log-likelihood objective in supervised training will result
ments with training and serving costs, without adding in zero aggregate calibration bias across a validation set,
undue complexity: the model is the target of sustained per-example bias is often non-zero.
and substantial R&D and must allow for effectively build-                                                Ads pricing and allocation problems create the per-
ing on top of what came before.                                                                       example calibration requirement. Typically, predictions
                                                                                                      will flow through to an auction mechanism that incor-
1.1. CTR for Search Ads                                                                               porates bids to determine advertiser pricing. Auction
       Recommendations                                                                                pricing schemes (e.g, VCG [5]) rely on the relative value
                                                                                                      of various potential outcomes. This requires that pre-
The recommender problem surfaces a result or set of re- dictions for all potential choices of 𝑥 be well calibrated
sults from a given corpus, for a given initial context. The with respect to each other. Additionally, unlike simple
initial context may be a user demographic, previously- recommenders, ads systems frequently opt to show no
                                                                                                      ads. This requires estimating the value of individual ads
ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender relative to this ”null-set” of no ads, rather than simply
Systems and User Modeling, jointly with the 16th ACM Conference on maximizing for ad relevance.
Recommender Systems, September 23rd, 2022, Seattle, WA, USA
         © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License    Consider a query like ”yarn for sale”; estimated CTR
         Attribution 4.0 International (CC BY 4.0).
    CEUR

         CEUR Workshop Proceedings (CEUR-WS.org)
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                                                                                      for an ad from ”yarn-site-1.com” might be 15.3%. Esti-
mated CTR for an ad from ”yarn-site-2.com” might be           producibility.
10.4%. Though such estimates can be informed by the              This paper makes the following contributions: 1) we
semantic relevance of the websites, the requirements for      discuss practical ML considerations from many perspec-
precision are more than what one should expect from gen-      tives including accuracy, efficiency and reproducibility,
eral models of language. Additionally, click-through data     2) we detail the real-world application of techniques that
is highly non-stationary: click prediction is fundamen-       have improved efficiency and accuracy, in some cases
tally an online recommendation problem. An expectation        describing adaptations specific to online learning, and
of 15.3% is not static ground truth in the same sense as,     3) we describe how models can better generalize across
for example, translation or image recommendation; it is       UI treatments through model factorization and bias con-
definitively more subject to evolution over time.             straints.

1.2. Outline                                                  2. Model and Training Overview
For ads CTR predictors, minor improvements to model
quality will often translate into improved user experience    A major design choice is how to represent an ad-query
and overall ads system gains. This motivates continuous       pair 𝑥. The semantic information in the language of the
investments in model research and development. Theo-          query and the ad headlines is the most critical component.
retical and benchmark improvements from ML literature         Usage of attention layers on top of raw text tokens may
rarely transfer directly to problem-dependent settings of     generate the most useful language embeddings in current
real-world applications. As such, model research must         literature [12], but we find better accuracy and efficiency
be primarily empirical and experimental. Consequently,        trade-offs by combining variations of fully-connected
a great deal of attention must be paid to the machine         DNNs with simple feature generation such as bi-grams
costs of model training experiments while evaluating          and n-grams on sub-word units. The short nature of user
new techniques. In Section 2 we first give a general          queries and ad headlines is a contributing factor. Data
overview of the model and training setup; Section 3 then      is highly sparse for these features, with typically only a
discusses efficiency concerns and details several suc-        tiny fraction of non-zero feature values per example.
cessfully deployed techniques. In Section 4, we survey            All features are treated as categorical and mapped to
applications of modern ML techniques targeted at improv-      sparse embedding tables. Given an input 𝑥, we concate-
ing measures of accuracy and geared explicitly toward         nate the embedding values for all features to form a vector
very-large-scale models. Section 4.4 summarizes empiri-       𝑒, the embedding input layer of our DNN. 𝐸 denotes a
cal results roughly characterizing the relative impact of     minibatch of embedding values 𝑒 across several examples.
these techniques.                                                 Next, we formally describe a simplified version of
   Deep neural networks (DNNs) provide substantial            the model’s fully-connected neural network architec-
improvements over previous methods in many appli-             ture. Later sections will introduce variations to this ar-
cations, including large-scale industry settings. How-        chitecture that improve accuracy, efficiency, or repro-
ever, non-convex optimization reveals (and exacerbates)       ducibility. We feed 𝐸 into a fully-connected hidden layer
a critical problem of prediction: irreproducibility           𝐻1 = 𝜎(𝐸𝑊1 ) that performs a linear transformation of
[6, 7, 8, 9, 10, 11]. Training the same model twice (iden-    𝐸 using weights 𝑊1 followed by non-linear activation
tical architecture, hyper-parameters, training data) may      𝜎. Hidden layers 𝐻𝑖 = 𝜎(𝐻𝑖−1 𝑊𝑖 ) are stacked, with the
lead to metrics of the second model being very differ-        output of the 𝑘th layer feeding into an output layer
ent from the first. We distinguish between model ir-          𝑦̂ = sigmoid(𝐻𝑘 𝑊𝑘+1 ) that generates the model’s predic-
reproducibility, strictly related to predictions on fixed     tion corresponding to a click estimate 𝑦.̂ Model weights
data, and system irreproducibility, where a deployed          are optimized following min𝑊 ∑𝑖 ℒ (𝑦𝑖 , 𝑦𝑖̂ ). We found
irreproducible model affects important system metrics.        ReLUs to be a good choice for the activation function;
Section 5 characterizes the problem and describes im-         Section 5 describes improvements using smoothed activa-
provements to model irreproducibility.                        tion functions. The model is trained through supervised
   An effective click prediction model must be able to        learning with the logistic loss of the observed click label
generalize across different UI treatments, including:         𝑦 with respect to 𝑦.̂ Sections 4 and 7 describe additional
where an ad is shown on the page and any changes to           losses that have improved our model. Training uses syn-
the formatting of the ad (e.g., bolding specific text or      chronous minibatch SGD on Tensor Processing Units
adding an image). Section 6 describes a specific model fac-   (TPUs) [13]: at each training step 𝑡, compute gradients
torization that improves UI generalization performance.       𝐺𝑡 of the loss on a batch of examples (ranging up to mil-
Finally, Section 7 details a general-purpose technique        lions of examples), and weights are optimized with an
for adding bias constraints to the model that has been        adaptive optimizer. We find that AdaGrad [14, 15] works
applied to both improve generalization and system irre-       well for optimizing both embedding weights and dense
network weights. Moreover, In Section 4.2 discusses ac-         for ML training is implemented via maximizing model
curacy improvements from deploying a second-order               throughput, subject to constraints on minimum band-
optimizer: Distributed Shampoo [16] for training dense          width and maximum training latency. We find that re-
network weights, which to our knowledge, is the first           quired bandwidth is most frequently governed by the
known large-scale deployment in a production scale neu-         number of researchers addressing a fixed task. For an
ral network training system.                                    impactful ads model at a large internet company, this
                                                                may represent many dozens of engineers attempting in-
2.1. Online Optimization                                        cremental progress on a single modelling task. Allowable
                                                                training latency is a function of researcher preference,
Given the non-stationarity of data in ads optimization,         varying from hours to weeks in practice. Varying par-
we find that online learning methods perform best in            allelism (i.e., number of accelerator chips) in training
practice [1]. Models train using a single sequential pass       controls development latency. As in many systems, low-
over logged examples in chronological order. Each model         ered latency often comes at the expense of throughput.
continues to process new query-ad examples as data ar-          For example, using twice the number of chips speeds
rives [17]. For evaluation, we use models’ predictions on       up training, but most often does so sub-linearly (train-
each example from before the example is trained on (i.e.,       ing is less than twice as fast) because of parallelization
progressive validation) [18]. This setup has a number           overhead.
of practical advantages. Since all metrics are computed            For any given ML advancement, immediate gains must
before an example is trained on, we have an immediate           be weighed against the long-term cost to future R&D.
measure of generalization that reflects our deployment          For instance, naively scaling up the size of a large DNN
setup. Because we do not need to maintain a holdout             might provide immediate accuracy but add prohibitive
validation set, we can effectively use all data for training,   cost to future training (Table 1 includes a comparison of
leading to higher confidence measurements. This setup           techniques and includes one such naive scaling baseline).
allows the entire learning platform to be implemented as           We have found that there are many techniques and
a single-pass streaming algorithm, facilitating the use of      model architectures from literature that offer significant
large datasets.                                                 improvements in model accuracy, but fail the test of
                                                                whether these improvements are worth the trade-offs
                                                                (e.g., ensembling many models, or full stochastic varia-
3. ML Efficiency                                                tional Bayesian inference [20]). We have also found that
Our CTR prediction system provides predictions for all          many accuracy-improving ML techniques can be recast
ads shown to users, scoring a large set of eligible ads for     as efficiency-improving via adjusting model parameters
billions of queries per day and requiring support for infer-    (especially total number of weights) in order to lower
ence at rates above 100,000 QPS. Any increase in compute        training costs. Thus, when we evaluate a technique, we
used for inference directly translates into substantial ad-     are often interested in two tuning points: 1) what is the
ditional deployment costs. Latency of inference is also         improvement in accuracy when training cost is neutral
critical for real-time CTR prediction and related auctions.     and 2) what is the training cost improvement if model ca-
As we evaluate improvements to our model, we carefully          pacity is lowered until accuracy is neutral. In our setting,
weigh any accuracy improvements against increases in            some techniques are much better at improving training
inference cost.                                                 costs (e.g., distillation in Section 4.1.2) while others are
   Model training costs are likewise important to consider.     better at improving accuracy. Figure 1 illustrates these
For continuous research with a fixed computational bud-         two tuning axes.
get, the most important axes for measuring costs are               We survey some successfully deployed efficiency tech-
bandwidth (number of models that can be trained con-            niques in the remainder of this section. Section 3.1 details
currently), latency (end-to-end evaluation time for a new       the use of matrix factorization bottlenecks to approxi-
model), and throughput (models that can be trained per          mate large matrix multiplication with reduced cost. Sec-
unit time).                                                     tion 3.2 describes AutoML, an efficient RL-based architec-
   Where inference and training costs may differ, several       ture search that is used to identify model configurations
ML techniques are available to make trade-offs. Distilla-       that balance cost and accuracy. Section 3.3 discusses a
tion is particularly useful for controlling inference costs     set of effective sampling strategies to reduce data used
or amortizing training costs (see Section 4.1.2). Tech-         for training without hurting accuracy.
niques related to adaptive network growth [19] can con-
trol training costs relative to a larger final model (with 3.1. Bottlenecks
larger inference cost).
                                                             One practical way to achieve accuracy is to scale up the
   Efficient management of computational resources
                                                             widths of all the layers in the network. The wider they
Figure 1: ”Switch-backs” of incremental
costly quality-improving techniques and Figure 2: Weight-sharing based NAS with cost constraints.
efficiency methods. (Illustration not to
any scale.)


are, the more non-linearities there are in the model, and       3.2. AutoML for Efficiency
in practice this improves model accuracy. On the other
                                                                To develop an ads CTR prediction model architecture
hand, the size of the matrices involved in the loss and
                                                                with optimal accuracy/cost trade-off, we typically have
gradient calculations increases, making the underlying
                                                                to tune the embedding widths of dozens of features and
matmul computations slower. Unfortunately, the cost of
                                                                layer widths for each layer in the DNN. Assuming even
matmul operations (naively) scale up quadratically in the
                                                                just a small constant number of options for each such
size of their inputs. To compute the output of a hidden
                                                                width, the combinatorial search space quickly reaches
layer 𝐻𝑖 = 𝜎(𝐻𝑖−1 𝑊𝑖 ) where 𝑊𝑖 ∈ ℝ𝑚×𝑛 , we perform
                                                                intractable scales. For industrial-scale models, it is not
𝑚 × 𝑛 multiply-add operations for each input row in 𝐻𝑖−1 .
                                                                cost-effective to conduct traditional architecture search
The ‘wider is better’ strategy typically isn’t cost-effective
                                                                with multiple iterations [22, 23]. We have successfully
[21]. We find that carefully inserting bottleneck layers
                                                                adopted neural architecture search based on weight shar-
of low-rank matrices between layers of non-linearities
                                                                ing [24] to efficiently explore network configurations
greatly reduces scaling costs, with only a small loss of
                                                                (e.g., varying layer width, embedding dimension) to find
relative accuracy.
                                                                versions of our model that provide neutral accuracy with
   Applying singular value decomposition to 𝑊𝑖 ’s, we of-
                                                                decreased training and serving cost. As illustrated in
ten observe that the top half of singular values contribute
                                                                Figure 2, this is achieved by three components: a weight-
to over 90% of the norm of singular values. This suggests
                                                                sharing network, an RL controller, and constraints.
that we can approximate 𝐻𝑖−1 𝑊𝑖 by a bottleneck layer
                                                                   The weight-sharing network builds a super-network
𝐻𝑖−1 𝑈𝑖 𝑉𝑖 , where 𝑈𝑖 ∈ ℝ𝑚×𝑘 , 𝑉𝑖 ∈ ℝ𝑘×𝑛 . The amount of com-
                                                                containing all candidate architectures in the search space
pute reduces to 𝑚 × 𝑘 + 𝑘 × 𝑛, which can be significant for
                                                                as sub-networks. In this way, we can train all candidate
small enough 𝑘. For a fixed 𝑘, if we scale 𝑚, 𝑛 by constant
                                                                architectures simultaneously in a single iteration and
𝑐, compute scales only linearly with 𝑐. Empirically, we
                                                                select a specific architecture by activating part of the
found that accuracy loss from this approximation was
                                                                super-network with masking. This setup significantly re-
indeed small. By carefully balancing the following two
                                                                duces the number of exploration iterations from O(1000)
factors, we were able to leverage bottlenecks to achieve
                                                                to O(1).
better accuracy without increasing computation cost: (1)
                                                                   The reinforcement learning controller maintains a sam-
increasing layer sizes toward better accuracy, at the cost
                                                                pling distribution, 𝜃𝑑𝑖𝑠𝑡 , over candidate networks. It sam-
of more compute, and (2) inserting bottleneck layers to
                                                                ples a set of decisions (𝑑1 , 𝑑2 , ...) to activate a sub-network
reduce compute, at a small loss of accuracy. Balancing
                                                                at each training step. We then do a forward pass for the
of these two can be done manually or via AutoML tech-
                                                                activated sub-network to compute loss and cost. Based
niques (discussed in the next section). A recent man-
                                                                on that, we estimate the reward value 𝑅(𝑑1 , 𝑑2 , ...) and
ual application of this technique to the model (without
                                                                conduct a policy gradient update using the REINFORCE
AutoML tuning) reduced time per training step by 7%
                                                                algorithm [25] as follows:
without impacting accuracy (See Table 2 for a summary
of efficiency techniques).                                      𝜃𝑑𝑖𝑠𝑡 = 𝜃𝑑𝑖𝑠𝑡 + 𝛼0 ⋅ (𝑅(𝑑1 , 𝑑2 , ...) − 𝑅) ⋅ ∇ log 𝑃(𝑑1 , 𝑑2 , ...|𝜃𝑑𝑖𝑠𝑡 ),
where 𝑅 denotes the moving average value of the reward               • Sampling examples that are very unlikely to have
and 𝛼0 is the learning rate for the reinforcement learn-               been seen by the user based on their position on
ing algorithm. Through the update at each training step,               the page.
the sampling rate of better architectures will gradually
increase and the sampling distribution will eventually              The thresholds for the conditions above are hand-
converge to a promising architecture. We select the ar-         tuned and chosen to maximize data reduction without
chitecture with maximum likelihood at the end of the            hurting model accuracy. These strategies are imple-
training. Constraints specify how to compute the cost           mented by applying a small, constant sampling rate to all
of the activated sub-network, which can typically be            examples meeting any of the conditions above. Pseudo-
done by estimating the number of floating-point opera-          Random sampling determines whether examples should
tions or running a pre-built hardware-aware neural cost         be kept and re-weighted or simply discarded. This en-
model. The reinforcement learning controller incorpo-           sures that all training models train on the same data. This
rates the provided cost estimate into the reward (e.g.,         scheme may be viewed as a practical version of [26] for
𝑅 = 𝑅accuracy + 𝛾 ⋅ |cost/target − 1|, where 𝛾 < 0) [24] in     large problem instances with expensive evaluation. Sim-
order to force the sampling distribution to converge to         ple random sampling allows us to keep model estimates
a cost-constrained point. In order to search for architec-      unbiased with simple constant importance re-weighting.
tures with lower training cost but neutral accuracy, in our     It is important to avoid very small sampling rates in this
system we set up multiple AutoML tasks with different           scheme, the consequent large up-weighting can lead to
constraint targets (e.g. 85%/90%/95% of the baseline cost)      model instability. Re-weighting is particularly impor-
and selected the one with neutral accuracy and smallest         tant for maintaining calibration, since these sampling
training cost. A recent application of this architecture        strategies are directly correlated to labels.
search to the model reduced time per training step by               For sampling strategies that involve knowing the loss
16% without reducing accuracy.                                  on an example, calculating that loss would require run-
                                                                ning inference on the training example, removing most
                                                                of the performance gains. For this reason, we use a proxy
3.3. Data Sampling                                              value based on a prediction made by a ”teacher model”.
Historical examples of clicks on search ads make up a           In this two-pass approach. We first train once over all
large dataset that increases substantially every day. The       data to compute losses and associated sampling rates, and
diminishing returns of ever larger datasets dictate that        then once on the sub-sampled data. The first pass uses
it is not beneficial to retain all the data. The marginal       the same teacher model for distillation (Section 4.1.2)
value for improving model quality goes toward zero, and         and is only done once. Iterative research can then be
eventually does not justify any extra machine costs for         performed solely on the sub-sampled data. While these
training compute and data storage. Alongside using ML           latter models will have different losses per example, the
optimization techniques to improve ML efficiency, we            first pass loss-estimates still provide a good signal for
also use data sampling to control training costs. Given         the ‘difficulty’ of the training example and leads to good
that training is a single-pass over data in time-order, there   results in practice. Overall our combination of class re-
are two ways to reduce the training dataset: 1) restricting     balancing and loss-based sampling strategies reduces the
the time range of data consumed; and 2) sampling the            data to < 25% of the original dataset for any given period
data within that range. Limiting training data to more          without significant loss in accuracy.
recent periods is intuitive. As we extend our date range
further back in time, the data becomes less relevant to
future problems. Within any range, clicked examples
                                                                4. Accuracy
are more infrequent and more important to our learning          Next we detail a set of techniques aimed at improving the
task; so we sample the non-clicked examples to achieve          accuracy of the system. We discuss: additional losses that
rough class balance. Since this is primarily for efficiency,    better align offline training-time metrics with important
exact class balance is unnecessary. A constant sampling         business metrics, the application of distillation to our
rate (a constant class imbalance prior) can be used with a      online training setting, the adaptation of the Shampoo
simple single-pass filter. To keep model predictions unbi-      second-order optimizer to our model, and the use of Deep
ased, importance weighting is used to up-weight negative        and & Cross networks.
examples by the inverse of the sampling rate. Two addi-
tional sampling strategies that have proved effective are
as follows:                                                     4.1. Loss Engineering
     • Sampling examples associated with a low logistic         Loss engineering plays an important role in our system.
       loss (typically examples with low estimated CTR          As the goal of our model is to predict whether an ad
       and no click).
will be clicked, our model generally optimizes for logis-      the model’s prediction is unbiased per example. More
tic loss, often thought of as the cross-entropy between        detail can be found in Section 7. Application of ranklosses
model predictions and the binary task (click/no-click)         to the model generated accuracy improvements of −0.81%
labels for each example. Using logistic loss allows model      with a slight increase in training cost of 1%.
predictions to be unbiased so that the prediction can
be interpreted directly as a calibrated probability. Bi-       4.1.2. Distillation.
nary predictions can be improved by introducing soft
prediction through distillation methods [27]. Beyond es-      Distillation adds an additional auxiliary loss requiring
timating the CTR per ad, it is important that the set of      matching the predictions of a high-capacity teacher
candidate ads for a particular query is correctly ranked      model, treating teacher predictions as soft labels [27].
(such that ads with clicks have higher CTR than ads with-     In our model, we use a two-pass online distillation
out clicks), thus incorporating proper ranking losses is      setup. On the first pass, a teacher model records its predic-
also important. In this section, we discuss novel auxil-      tions progressively before training on examples. Student
iary losses and introduce multi-task and multi-objective      models consume the teacher’s predictions while train-
methods for joint training with these losses                  ing on the second pass. Thus, the cost of generating
                                                              the predictions from the single teacher can be amortized
                                                              across many students (without requiring the teacher to
4.1.1. Rank Losses
                                                              repeat inference to generate predictions). In addition
We found that Area under the ROC curve computed per to improving accuracy, distillation can also be used for
query (PerQueryAUC) is a metric well correlated with reducing training data costs. Since the high-capacity
business metrics quantifying the overall performance teacher is trained once, it can be trained on a larger data
of a model. In addition to using PerQueryAUC during set. Students benefit implicitly from the teachers prior
evaluation, we also use a relaxation of this metric, i.e., knowledge of the larger training set, and so require train-
rank-loss, as a second training loss in our model. There ing only smaller and more recent data. The addition of
are many rank losses in the learning-to-rank family [28, distillation to the model improved accuracy by 0.41%
29]. We find one effective approximation is Ranknet loss without increasing training costs (in the student).
[30], which is a pairwise logistic loss:
                                                              4.1.3. Curriculums of Losses
              − ∑ ∑ log(sigmoid(𝑠𝑖 , 𝑠𝑗 )),
                𝑖∈{𝑦𝑖 =1} 𝑗∈{𝑦𝑗 ≠1}                           In machine learning, curriculum learning [34] typically
                                                              involves a model learning easy tasks first and gradually
where 𝑠𝑖 , 𝑠𝑗 are logit scores of two examples.               switching to harder tasks. We found that training on all
   Rank losses should be trained jointly with logistic loss; classes of losses in the beginning of training increased
there are several potential optimization setups. In one model instability (manifesting as outlier gradients which
setup, we create a multi-objective optimization problem cause quality to diverge). Thus, we apply an approach
[31]:                                                         similar to curriculum learning to ramp up losses, starting
                                                              with the binary logistic loss and gradually ramping up
   ℒ (𝑊 ) = 𝛼1 ℒrank (yrank , s) + (1 − 𝛼1 )ℒlogistic (y, s), distillation and rank losses over the course of training.
where s are logit scores for examples, yrank are ranking
labels, y are the binary task labels, and 𝛼1 ∈ (0, 1) is the   4.2. Second-order Optimization
rank-loss weight. Another solution is to use multi-task            Second-order optimization methods that use second
learning [32, 33], where the model produces multiple               derivatives and/or second-order statistics are known to
different estimates 𝑠 for each loss.                               have better convergence properties to first-order meth-
                                                                   ods [35]. Yet to our knowledge, second-order methods are
  ℒ (𝑊shared , 𝑊logistic , 𝑊rank ) =
                                                                   rarely reported to be used in production ML systems for
        𝛼1 ℒrank (y, srank ) + (1 − 𝛼1 )ℒlogistic (y, slogistic ), DNNs. Recent work on Distributed Shampoo [16, 36] has
                                                                   made second-order optimization feasible for our model
where 𝑊shared are weights shared between the two losses, by leveraging the heterogneous compute offered by TPUs
𝑊logistic are for the logistic loss output, and 𝑊rank are and host-CPUs, and employing additional algorithmic
for the rank-loss output. In this case, the ranking loss and efficiency improvements.
affects the ”main” prediction slogistic as a ”regularizer” on         In our system, Distributed Shampoo provided much
𝑊shared .                                                          faster convergence with respect to training steps, and
   As rank losses are not naturally calibrated predictors of yielded better accuracy when compared to standard adap-
click probabilities, the model’s predictions will be biased. tive optimization techniques including AdaGrad [15],
A strong bias correction component is needed to ensure
                                                                 
                                                                 
                                                                ̐
                                                                 ̑


                                                          9èĆä
Figure 3: Distributed Shampoo [16]: inverse-𝑝 𝑡ℎ root computations in double precision runs every 𝑁 steps and asynchronously
pipelined on all CPU cores attached to the TPU accelerators.


Adam [37], Yogi [38], and LAMB [39]. While, second-              described in [41].
order methods like Distributed Shampoo is known to                  Stability & Efficiency. Distributed Shampoo has higher
provide faster convergence compared to first-order meth-         computational complexity per step as it involves multi-
ods in the literature - It often fails to provide competitive    plication of large matrices for preconditioning and statis-
wall-clock time due to the computational overheads in            tics/preconditioner computation. We address these over-
the optimizer on smaller scale benchmarks. For our train-        heads with several techniques in our deployment. For
ing system, second-order optimization method was an              example, the block-diagonalization suggested in [16] was
ideal candidate due to large batch sizes used in training        effective at reducing the computational complexity while
which amortizes the cost of costly update rule. Train-           also allowing the implementation of parallel updates for
ing time only increased by approximately 10% and the             each block in the data-parallel setting via weight-update
improvements to model accuracy far outweighed the in-            sharding [42]. This reduced the overall step time. More-
crease in training time. We next discuss some of the more        over, optimizer overheads are independent of batch size,
salient implementation details specific to our model.            and thus we increased batch size to reduce overall com-
   Learning Rate Grafting. One of the main challenges in         putational overhead. Finally, we found that condition
online optimization is defining a learning rate schedule.        number of statistics used for preconditioning can vary
In contrast to training on static datasets, the number of        in range reaching more than 1010 . Because, numerical
steps an online model will require is not known and may          stability and robustness is of utmost importance in pro-
be unbounded. Accordingly, popular learning rate sched-          duction; we make use of double precision numerics. To
ules from literature depending on fixed time horizons,           compute the preconditioners, we use the CPUs attached
such as cosine decay or exponential decay, perform worse         to the TPUs to run inverse-𝑝th roots and exploit a faster
in contrast to the implicit data-dependent adaptive sched-       algorithm, the coupled Newton iteration for larger pre-
ule from AdaGrad [15]. As observed in literature [40], we        conditioners [43] as in Figure 3.
also find that AdaGrad’s implicit schedule works quite              When integrated with the ad click prediction model
well in the online setting; especially after the 𝜖 parame-       the optimizer improved our primary measure of accu-
ter (the initial accumulator value) is tuned. Accordingly,       racy, Area under the ROC curve computed per query
we bootstrap the schedule for Distributed Shampoo via            (PerQueryAuc), by 0.44%. Accuracy improvements above
grafting the per layer step size from AdaGrad. More pre-         0.1% are considered significant. For comparison: a naive
cisely, we use the direction from Shampoo while using            scaling of the deep network by 2x yields a PerQueryAUC
the magnitude of step size from AdaGrad at a per-layer           improvement of 0.13%. See Table 1 for a summary of
granularity. An important feature of this bootstrapping          accuracy technique results.
is that it allowed us to inherit hyper-parameters from
previous AdaGrad tunings to search for a Pareto optimal          4.3. Deep & Cross Network
configuration.
   Momentum. Another effective implementation choice             Learning effective feature crosses is critical for recom-
is the combination of Nesterov-styled momentum with              mender systems [3, 44]. We adopt an efficient variant of
the preconditioned gradient. Our analysis suggests that          DCNv2 [44] using bottlenecks. This is added between the
momentum added modest gains on top of Shampoo                    embedding layer 𝑒 described in Section 2 and the DNN.
without increasing the computational overhead while              We next describe the Deep & Cross Network architec-
marginally increasing the memory overhead. Computa-              ture and its embedding layer input. We use a standard
tional overhead was addressed via the approximations             embedding projection layer for sparse categorical fea-
         Technique              Accuracy       Training Cost        Inference Cost
                              Improvement        Increase              Increase
                                                                                                               Training Cost
 Deep & Cross Network             0.18%              3%                  1%                    Technique
                                                                                                                 Decrease
 Shampoo Optimizer                0.44%             10%                  0%
 Distillation                     0.46%             <1%∗                 0%                  Bottlenecks            7%
 Rank Losses                      0.81%             <1%                  0%                  AutoML                 16%
                                                                                             Data Sampling          75%
 Baseline: 2x DNN Size            0.13%              36%                 10%

                                                                                           Table 2
Table 1                                                                                    Training cost improvements of
Accuracy improvement and training/inference costs for accuracy improving                   applied techniques.
techniques.∗ Distillation does not include teacher cost which, due to amortization,
is a small fraction of overall training costs.


tures. We project categorical feature 𝑖 from a higher                of hidden layers. Note, that sparse embedding lookups
dimensional sparse space to a lower dimensional dense                add to the overall training cost, thus doubling the number
space using 𝑒𝑖̃ = 𝑊𝑖 𝑥𝑖 , where 𝑥𝑖 ∈ {0, 1}𝑣𝑖 ; 𝑊𝑖 ∈ ℝ𝑚𝑖 ×𝑣𝑖 is      layers does not proportionally increase the cost.
the learned projection matrix; 𝑒𝑖̃ is the dense embedding
representation; and 𝑣𝑖 and 𝑚𝑖 represent the vocabulary
and dense embedding sizes respectively. For multivalent              5. Irreproducibility
features, we use average pooling of embedding vectors.
                                                          Irreproducibility, noted in Section 1, may not be easy to
Embedding dimensions {𝑚𝑖 } are tuned for efficiency and
                                                          detect because it may appear in post deployment sys-
accuracy trade-offs using AutoML (Section 3.2). Output
                                                          tem metrics and not in progressive validation quality
of the embedding layer is a wide concatenated vector
                                                          metrics. A pair of duplicate models may converge to
𝑒0 = concat(𝑒1̃ , 𝑒2̃ … 𝑒𝐹̃ ) ∈ ℝ𝑚 for 𝐹 features. For crosses,
                                                          two different optima of the highly non-convex objective,
we adopt an efficient variant of [44], applied directly on
                                                          giving equal average accuracy, but different individual
top of the embedding layer to explicitly learn feature
                                                          predictions, but with different downstream system/auc-
crosses: 𝑒𝑖 = 𝛼2 (𝑒0 ⊙ 𝑈𝑖 𝑉𝑖 𝑒𝑖−1 ) + 𝑒𝑖−1 , where 𝑒𝑖 , 𝑒𝑖−1 ∈ ℝ𝑚
                                                          tion outcomes. Model deployment leads to further diver-
represent the output and input of the 𝑖th cross layer, re-gence, as ads selected by deployed models become part
spectively; 𝑈𝑖 ∈ ℝ𝑚×𝑘 and 𝑉𝑖 ∈ ℝ𝑘×𝑚 are the learned       of subsequent training examples [17]. This can critically
weight matrices leveraging bottlenecks (Section 3.1) for  affect R&D: experimental models may appear beneficial,
efficiency; 𝛼2 is a scalar, ramping up from 0 → 1 during  but gains may disappear when they are retrained and
initial training, allowing the model to first learn the em-
                                                          deployed in production. Theoretical analysis is complex
beddings and then the crosses in a curriculum fashion.    even in the simple convex case, which is considered only
Furthermore, this ReZero initialization [45] also improvesin very recent work [46]. Many factors contribute to ir-
model stability and reproducibility (Section 5).          reproducibility [47, 48, 49, 10, 50, 51], including random
   In practice adding the Deep & Cross Network to the     initialization, non-determinism in training due to highly-
model yielded an accuracy improvement of 0.18% with       parallelized and highly-distributed training pipelines, nu-
a minimal increase in training cost of 3%.                merical errors, hardware, and more. Slight deviations
                                                          early in training may lead to very different models [52].
4.4. Summary of Efficiency and Accuracy While standard training metrics do not expose system
       Results                                            irreproducibility, we can use deviations of predictions on
                                                          individual examples as a cheap proxy, allowing us to fail
Below we share measurements of the relative impact of fast prior to evaluation at deployment-time. Common sta-
the previously discussed efficiency and accuracy tech- tistical metrics (standard deviation, various divergences)
niques as applied to the production model. The goal is can be used [53, 54] but they require training many more
to give a very rough sense of the impact of these tech- models, which is undesirable at our scale. Instead, we use
niques and their accuracy vs. efficiency tradeoffs. While the Relative Prediction Difference (PD) [7, 9] metric
precise measures of accuracy improvement on one par-
                                                                       △
ticular model are not necessarily meaningful, we believe           Δ𝑟 = 1/𝑀 ⋅ ∑ |𝑦𝑖,1
                                                                                    ̂ − 𝑦𝑖,2
                                                                                           ̂ |/[(𝑦𝑖,1
                                                                                                  ̂ + 𝑦𝑖,2
                                                                                                        ̂ )/2]
the coarse ranking of techniques and rough magnitude                            𝑖
of results are interesting (and are consistent with our
                                                          , measuring absolute point-wise difference in model pre-
general experience).
                                                          dictions for a pair of models (subscripts 1 and 2), nor-
   The baseline 2x DNN size model doubles the number
                                                          malized by the pair’s average prediction. Computing PD
requires training a pair of models instead of one, but
we have observed that reducing PD is sufficient to im-
prove reproducibility of important system metrics. In this
section, we focus on methods to improve PD; Section 7
focuses on directly improving system metrics.
   PDs may be as high as 20% for deep models. Per-
haps surprisingly, standard methods such as fixed ini-
tialization, regularization, dropout, data augmentation,
as well as new methods imposing constraints [55, 56]
either failed to improve PD or improved PD at the cost        Figure 4: Model factorization into separable Quality and UI
of accuracy degradation. Techniques like warm-starting        models with estimated CTR ∶= 𝜏 (𝑄 ⋅ 𝑈 )
model weights to values of previously trained models
may not be preferable because they can anchor the model
to a potentially bad solution space and do not help the
                                                              a non-ensemble model with PD less than 10% that also
development cycle for newer more reproducible models
                                                              improved accuracy by 0.1%. System reproducibility met-
for which there is no anchor.
                                                              rics also improved to acceptable levels compared to the
   Other techniques have shown varying levels of suc-
                                                              unacceptable levels of ReLU single component models.
cess. Ensembles [57], specifically self-ensembles [58],
where we average predictions of multiple model dupli-
cates (each initialized differently), can reduce prediction   6. Generalizing Across UI
variance and PD. However, maintaining ensembles in a
production system with multiple components builds up             Treatments
substantial technical debt [59]. While some literature
                                                              One of the major factors in CTR performance of an ad is
[60, 61, 62] describes accuracy advantages for ensembles,
                                                              its UI treatment, including positioning, placement rela-
in our regime, ensembles degraded accuracy relative to
                                                              tive to other results on the page, and specific renderings
equal-cost single networks. We believe this is because,
                                                              such as bolded text or inlined images. A complex auc-
unlike in the benchmark image models, examples in on-
                                                              tion must explore not just the set of results to show, but
line CTR systems are visited once, and, more importantly,
                                                              how they should be positioned relative to other content,
the learned model parameters are dominated by sparse
                                                              and how they should be individually rendered [68]. This
embeddings. Relatedly, more sophisticated techniques
                                                              exploration must take place efficiently over a combinato-
based on ensembling and constraints can also improve
                                                              rially large space of possible treatments.
PD [63, 7, 8].
                                                                 We solve this through model factorization, replacing
   Techniques described above trade accuracy and com-
                                                              estimated CTR with 𝜏 (𝑄 ⋅ 𝑈 ), composed of a transfer
plexity for better reproducibility, requiring either ensem-
                                                              function 𝜏 where 𝑄, 𝑈 are separable models that output
bles or constraints. Further study and experimentation
                                                              vectorized representations of the Quality and the UI, re-
revealed that the popular use of Rectified Linear Unit
                                                              spectively, and are combined using an inner-product.
(ReLU) activations contributes to increased PD. ReLU’s
                                                              While 𝑄, consisting of a large DNN and various feature
gradient discontinuity at 0 induces a highly non-convex
                                                              embeddings, is a costly model, it needs to be evaluated
loss landscape. Smoother activations, on the other hand,
                                                              only once per ad, irrespective of the number of UI treat-
reduce the amount of non-convexity, and can lead to
                                                              ments. In contrast, 𝑈, being a much lighter model, can be
more reproducible models [9]. Empirical evaluations of
                                                              evaluated hundreds of times per ad. Moreover, due to the
various smooth activations [64, 65, 66, 67] have shown
                                                              relatively small feature space of the UI model, outputs
not only better reproducibility compared to ReLU, but
                                                              can be cached to absorb a significant portion of lookup
also slightly better accuracy. The best reproducibility-
                                                              costs (as seen in Figure 4).
accuracy trade-offs in our system were attained by the
                                                                 Separately from model performance requirements, ac-
simple Smooth reLU (SmeLU) activation proposed in [9].
                                                              counting for the influence of UI treatments on CTR is
The function form is:
                                                              also a crucial factor for model quality. Auction dynamics
                                  0;    𝑧 ≤ −𝛽                deliberately create strong correlations between individ-
                           ⎧ (𝑧+𝛽)2                           ual ads and specific UI treatments. Results that are lower
            𝑓SmeLU (𝑧) =            ;   |𝑧| ≤ 𝛽        (1)    on the page may have low CTR regardless of their rele-
                           ⎨ 4𝛽
                           ⎩      𝑧;    𝑧 ≥ 𝛽.                vance to the query. Failure to properly disentangle these
   In our system, 3-component ensembles reduced PD            correlations creates inaccuracy when generalizing over
from 17% to 12% and anti-distillation reduced PD further      UI treatments (e.g., estimating CTR if the same ad was
to 10% with no accuracy loss. SmeLU allowed launching         shown higher on the page). Pricing and eligibility deci-
Figure 5: (a) Loss landscape for a model with non-identifiability across two weights and how bias constraints help find the
right solution: we add additional criteria (red and orange curves) such that we choose the correct solution at optimum loss
(dark blue curve). (b) Calibration bias across buckets of estimated CTR. For calibrated predictions, we expect uniform bias
(black curve). Whereas a model with the original objective function is biased for certain buckets of estimated CTR (blue curve),
we can get much closer to uniform with bias constraints (red curve).


sions depend crucially on CTR estimates of sub-optimal           to our objective function, enforced on relevant slices of
UIs that are rarely occurring in the wild. For instance, our     either the training set or a separate, labelled dataset. This
system shouldn’t show irrelevant ads, and so such sce-           allows us reduce non-identifiability by anchoring model
narios will not be in the training corpus, and so estimates      loss to a desired part of the solution space (e.g., one that
of their irrelevance (low CTR) will be out of distribution.      satisfies calibration) (Figure 5a). By extension, we reduce
But these estimates are needed to ensure the ads do not          irreproducibility by anchoring a retrained model to the
show. Even for relevant ads, there is a similar problem.         same solution.
Performance of ads that rarely show in first position may           Our technique is more lightweight than other meth-
still be used to set the price of those ads that often do        ods used for large-scale, online training (counterfactual
show in first position. This creates a specific generaliza-      reasoning [69], variations of inverse propensity scoring
tion problem related to UI, addressed in Section 7.              [70, 71]): in practice, there are fewer parameters to tune,
   Calibration is an important characteristic for large-         and we simply add an additional term to our objective
scale ads recommendation. We define calibration bias             rather than changing the model structure. To address cal-
as label minus prediction, and want this to be near zero         ibration, [72] adjusts model predictions in a separate cal-
per ad. A calibrated model allows us to use estimated            ibration step using isotonic regression, a non-parametric
CTR to determine the trade-off between showing and               method. Our technique does calibration jointly with esti-
not showing an ad, and between showing one ad versus             mation, and is more similar to methods which consider
another; both calculations can be used in downstream             efficient optimization of complex and augmented objec-
tasks such as UI treatment selection, auction pricing, or        tives (e.g., [73, 74]). Using additional constraints on the
understanding of ad viewability.                                 objective allows us to address a wide range of calibration
   The related concept of credit attribution is similar to       and credit attribution issues.
counterfactual reasoning [69] or bias in implicit feedback
[70]. It is a specific non-identifiability in model weights
that can contribute to irreproducibility (Section 5). Con-       7. Bias Constraints
sider an example to illustrate the UI effect (Section 6):
assume that model A has seen many training examples              7.1. Online Optimization of Bias
with high-CTR ads in high positions, and (incorrectly)                Constraints
learned that ad position most influences CTR. Model B,
                                                                 We now optimize our original objective function with
defined similar to A, trains first on the few examples
                                                                 the constraint that ∀𝑘∀𝑖 ∈ 𝑆𝑘 , (𝑦𝑖 − 𝑦𝑖̂ ) = 0. Here, 𝑆𝑘 are
where high-CTR ads appear in low positions, and (cor-
                                                                 subsets of the training set which we’d like to be calibrated
rectly) learns that something else (e.g., ad relevancy to
                                                                 (e.g., under-represented classes of data) or new training
query) is causing high CTR. Both models produce the
                                                                 data that we may or may not optimize the original model
same estimated CTR for these ads but for different rea-
                                                                 weights over (e.g., out-of-distribution or off-policy data
sons, and when they are deployed, model A will likely
                                                                 gathered from either randomized interventions or explo-
show fewer ads because it will not consider otherwise
                                                                 ration scavenging [75, 70, 76]). To aid optimization, we
useful ads in lower positions; these models will show
                                                                 first transform this into an unconstrained optimization
system irreproducibility.
                                                                 problem by introducing a dual variable 𝜆𝑘,𝑖 for each con-
   In our system, we use a novel, general-purpose tech-
                                                                 straint and maximizing the Lagrangian relative to the
nique called bias constraints to address both calibration
                                                                 dual variables. Next, instead of enforcing zero bias per
and credit attribution. We add calibration bias constraints
                                                                 example, we ask that the squared average bias across 𝑆𝑘
is zero. This reduces the number of dual variables to {𝜆𝑘 },   placement, we will include it in {𝑓 }). For the model in
and is equivalent to adding an L2 regularization on 𝜆𝑘         Table 3, we saw substantial bias improvements on several
with a constraint of zero average bias. For a constant 𝛼3      data subsets 𝑆𝑘 related to out-of-distribution ad placement
controlling regularization, and tuned via typical hyper-       and more reproducibility with minimal accuracy impact
parameter tuning techniques (e.g. grid search), our new        when adding bias constraints.
optimization is:
                                                                  𝑆1 Bias 𝑆2 Bias 𝑆3 Bias Loss Ads/Query Churn
                             𝐾
                                                 𝛼                 -15%    -75%    -43% +0.03%      -85%
   min max ∑ ℒ (𝑦𝑖 , 𝑦𝑖̂ ) + ∑ ∑(𝜆𝑘 (𝑦𝑖 − 𝑦𝑖̂ ) − 3 𝜆𝑘2 )
    𝑊 𝜆𝑘   𝑖                 𝑘=1 𝑖∈𝑆
                                                  2
                                   𝑘
                                                               Table 3
Any degraded accuracy or stability is mitigated by com-        Progressive validation and deployed system metrics reported
binations of the following tunings, ordered by impact:         as a percent change for a bias constraint over the original
ramping up the bias constraint term, reducing the learn-       model (negative is better). Ads/Query Churn records how
ing rate on {𝜆𝑘 }, increasing 𝛼3 , or adding more or finer-    much the percent difference in the number of ads shown
                                                               above search results per query between two model retrains
grained constraints (breaking up 𝑆𝑘 ). We believe the first
                                                               changes when deployed in similar conditions; we want this to
two can help normalize any differences between the mag-        be close to zero.
nitude of the dual variables and other weights, and the
latter two help lessen the strength of the bias term if 𝑆𝑘
                                                                  Viewing the bias constraints as anchoring loss rather
aren’t optimally selected.
                                                               than changing the loss landscape (Figure 5a), we find that
                                                               the technique does not fix model irreproducibility but
7.2. Bias Constraints for General                              rather mitigates system irreproducibility: we were able
     Calibration                                               to cut the number of components in the ensemble by half
                                                               and achieve the same level of reproducibility.
If we plot calibration bias across buckets of interesting
variables, such as estimated CTR or other system met-
rics, we expect a calibrated model to have uniform bias.       8. Conclusion
However, for several axes of interest, our system shows
higher bias at the ends of the range (Figure 5b). We ap-       We detailed a set of techniques for large-scale CTR pre-
ply bias constraints to this problem by defining 𝑆𝑘 to be      diction that have proven to be truly effective “in pro-
examples in each bucket of, e.g., estimated CTR. Since         duction”: balancing improvements to accuracy, training
we don’t use the dual variables during inference, we can       and deployment cost, system reproducibility and model
include estimated CTR in our training objective. With          complexity—along with describing approaches for gener-
bias constraints, bias across buckets of interest becomes      alizing across UI treatments. We hope that this brief visit
much more uniform: variance is reduced by more than            to the factory floor will be of interest to ML practitioners
half. This can in turn improve accuracy of downstream          of CTR prediction systems, recommender systems, online
consumers of estimated CTR.                                    training systems, and more generally to those interested
                                                               in large industrial settings.
7.3. Exploratory Data and Bias
     Constraints                                               References
We can also use bias constraints to solve credit attribution
                                                                [1] H. B. McMahan, G. Holt, D. Sculley, M. Young,
for UI treatments. We pick 𝑆𝑘 by focusing on classes
                                                                    D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov,
of examples that represent uncommon UI presentations
                                                                    D. Golovin, et al., Ad click prediction: a view from
for competitive queries where the ads shown may be
                                                                    the trenches, in: SIGKDD, 2013.
quite different. For example, 𝑆1 might be examples where
                                                                [2] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi,
a high-CTR ad showed at the bottom of the page, 𝑆2
                                                                    A. Atallah, R. Herbrich, S. Bowers, J. Q. n. Candela,
examples where a high-CTR ad showed in the second-to-
                                                                    Practical lessons from predicting clicks on ads at
last position on the page, etc. Depending on how model
                                                                    facebook, 2014.
training is implemented, it may be easier to define 𝑆𝑘 in
                                                                [3] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma,
terms of existing model features (e.g., for a binary feature
                                                                    Y. Yan, J. Jin, H. Li, K. Gai, Deep interest network for
𝑓, we split one sum over 𝑆𝑘 into two sums). We choose {𝑓 }
                                                                    click-through rate prediction, in: SIGKDD, 2018.
to include features that generate partitions large enough
                                                                [4] X. Ling, W. Deng, C. Gu, H. Zhou, C. Li, F. Sun,
to not impact convergence but small enough that we
                                                                    Model ensemble for click prediction in bing search
expect the bias per individual example will be driven to
                                                                    ads, in: WWW, 2017.
zero (e.g., if we think that query language impacts ad
 [5] H. R. Varian, C. Harris, The vcg auction in theory            preprint arXiv:1511.05641 (2015).
     and practice, American Economic Review (2014).           [20] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wier-
 [6] M. W. Dusenberry, D. Tran, E. Choi, J. Kemp,                  stra, Weight uncertainty in neural network, in: In-
     J. Nixon, G. Jerfel, K. Heller, A. M. Dai, Analyzing          ternational conference on machine learning, PMLR,
     the role of model uncertainty for electronic health           2015, pp. 1613–1622.
     records, in: CHIL, 2020.                                 [21] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, N. de Fre-
 [7] G. I. Shamir, L. Coviello, Anti-distillation: Im-             itas, Predicting parameters in deep learning, CoRR
     proving reproducibility of deep networks, arXiv               (????).
     preprint arXiv:2010.09923 (2020).                        [22] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learn-
 [8] G. I. Shamir, L. Coviello, Distilling from ensem-             ing transferable architectures for scalable image
     bles to improve reproducibility of neural networks,           recognition, in: CVPR, 2018.
     2020.                                                    [23] E. Real, A. Aggarwal, Y. Huang, Q. V. Le, Regu-
 [9] G. I. Shamir, D. Lin, L. Coviello, Smooth activa-             larized evolution for image classifier architecture
     tions and reproducibility in deep networks, arXiv             search, in: AAAI, 2019.
     preprint arXiv:2010.09931 (2020).                        [24] G. Bender, H. Liu, B. Chen, G. Chu, S. Cheng, P.-J.
[10] R. R. Snapp, G. I. Shamir, Synthesizing irre-                 Kindermans, Q. V. Le, Can weight sharing outper-
     producibility in deep networks, arXiv preprint                form random architecture search? an investigation
     arXiv:2102.10696 (2021).                                      with tunas, in: CVPR, 2020.
[11] A. D’Amour, K. Heller, D. Moldovan, B. Adlam,            [25] R. J. Williams, Simple statistical gradient-following
     B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisen-        algorithms for connectionist reinforcement learn-
     stein, M. D. Hoffman, F. Hormozdiari, N. Houlsby,             ing, Machine learning (1992).
     S. Hou, G. Jerfel, A. Karthikesalingam, M. Lucic,        [26] W. Fithian, T. Hastie, Local case-control sampling:
     Y. Ma, C. McLean, D. Mincu, A. Mitani, A. Mon-                Efficient subsampling in imbalanced data sets, The
     tanari, Z. Nado, V. Natarajan, C. Nielson, T. F.              Annals of Statistics (2014).
     Osborne, R. Raman, K. Ramasamy, R. Sayres,               [27] G. Hinton, O. Vinyals, J. Dean, Distilling the knowl-
     J. Schrouff, M. Seneviratne, S. Sequeira, H. Suresh,          edge in a neural network, in: NIPS Deep Learning
     V. Veitch, M. Vladymyrov, X. Wang, K. Webster,                and Representation Learning Workshop, 2015.
     S. Yadlowsky, T. Yun, X. Zhai, D. Sculley, Un-           [28] R. K. Pasumarthi, S. Bruch, X. Wang, C. Li, M. Ben-
     derspecification presents challenges for credibil-            dersky, M. Najork, J. Pfeifer, N. Golbandi, R. Anil,
     ity in modern machine learning, arXiv preprint                S. Wolf, Tf-ranking: Scalable tensorflow library for
     arXiv:2011.03395 (2020).                                      learning-to-rank, in: SIGKDD, 2019.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,         [29] C. J. Burges, From Ranknet to Lambdarank to Lamb-
     L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At-          daMart: An overview, Learning (2010).
     tention is all you need, in: NeurIPS, 2017.              [30] C. Burges, T. Shaked, E. Renshaw, A. Lazier,
[13] N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil,         M. Deeds, N. Hamilton, G. Hullender, Learning
     J. Laudon, C. Young, D. Patterson, A domain-                  to rank using gradient descent, in: ICML, 2005.
     specific supercomputer for training deep neural          [31] D. Sculley, Combined regression and ranking, in:
     networks, Communications of the ACM (2020).                   In KDD’10, 2010.
[14] H. B. McMahan, M. Streeter, Adaptive bound op-           [32] R. Caruana, Multitask learning, Machine Learning
     timization for online convex optimization, arXiv              (1997).
     preprint arXiv:1002.4908 (2010).                         [33] S. Ruder, An overview of multi-task learning in
[15] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient           deep neural networks, 2017. arXiv:1706.05098 .
     methods for online learning and stochastic opti-         [34] Y. Bengio, J. Louradour, R. Collobert, J. Weston,
     mization, JMLR (2011).                                        Curriculum learning, in: ICML, 2009.
[16] R. Anil, V. Gupta, T. Koren, K. Regan, Y. Singer,        [35] J. Nocedal, S. J. Wright, Numerical Optimization,
     Second order optimization made practical,                     Springer, 2006.
     https://arxiv.org/abs/2002.09018 (2020).                 [36] V. Gupta, T. Koren, Y. Singer, Shampoo: Precon-
[17] A. Swaminathan, T. Joachims, Batch learning from              ditioned stochastic tensor optimization, in: ICML,
     logged bandit feedback through counterfactual risk            2018.
     minimization, JMLR (2015).                               [37] D. P. Kingma, J. Ba, Adam: A method for stochas-
[18] A. Blum, A. Kalai, J. Langford, Beating the hold-out:         tic optimization, arXiv preprint arXiv:1412.6980
     Bounds for k-fold and progressive cross-validation,           (2014).
     in: COLT, 1999.                                          [38] M. Zaheer, S. Reddi, D. Sachan, S. Kale, S. Ku-
[19] T. Chen, I. Goodfellow, J. Shlens, Net2net: Ac-               mar, Adaptive methods for nonconvex optimization,
     celerating learning via knowledge transfer, arXiv             NeurIPS (2018).
[39] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar,                    ble prediction variation from neuron activation
     S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer,               strength in recommender systems, arXiv preprint
     C. Hsieh, Large batch optimization for deep learn-             arXiv:2008.07032 (2020).
     ing: Training bert in 76 minutes, arXiv preprint          [54] H. Yu, Z. Chen, D. Lin, G. Shamir, J. Han, Dropout
     arXiv:1904.00962 (2019).                                       prediction variation estimation using neuron acti-
[40] N. Agarwal, R. Anil, E. Hazan, T. Koren, C. Zhang,             vation strength, arXiv preprint arXiv:2110.06435
     Disentangling adaptive gradient methods from                   (2021).
     learning rates, arXiv preprint arXiv:2002.11803           [55] S. Bhojanapalli, K. J. Wilber, A. Veit, A. S. Rawat,
     (2020).                                                        S. Kim, A. K. Menon, S. Kumar, On the reproducibil-
[41] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the           ity of neural network predictions, 2021.
     importance of initialization and momentum in deep         [56] G. I. Shamir, Systems and methods for improved
     learning, in: ICML, 2013.                                      generalization, reproducibility, and stabilization of
[42] Y. Xu, H. Lee, D. Chen, H. Choi, B. Hechtman,                  neural networks via error control code constraints,
     S. Wang, Automatic cross-replica sharding of                   2018.
     weight update in data-parallel training, arXiv            [57] T. G. Dietterich, Ensemble methods in machine
     preprint arXiv:2004.13336 (2020).                              learning, Lecture Notes in Computer Science (2000).
[43] C.-H. Guo, N. J. Higham, A schur–newton method            [58] Z. Allen-Zhu, Y. Li, Towards understanding en-
     for the matrix\boldmath p th root and its inverse,             semble, knowledge distillation and self-distillation
     SIAM Journal on Matrix Analysis and Applications               in deep learning, arXiv preprint arXiv:2012.09816
     (2006).                                                        (2020).
[44] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin,          [59] D. Sculley, G. Holt, D. Golovin, E. Davydov,
     L. Hong, E. Chi, Dcn v2: Improved deep & cross                 T. Phillips, D. Ebner, V. Chaudhary, M. Young, Ma-
     network and practical lessons for web-scale learn-             chine learning: The high interest credit card of
     ing to rank systems, in: WWW, 2021.                            technical debt (2014).
[45] T. Bachlechner, B. P. Majumder, H. Mao, G. Cottrell,      [60] D. Kondratyuk, M. Tan, M. Brown, B. Gong,
     J. McAuley, Rezero is all you need: Fast conver-               When ensembling smaller models is more effi-
     gence at large depth, in: UAI, 2021.                           cient than single large models, arXiv preprint
[46] K. Ahn, P. Jain, Z. Ji, S. Kale, P. Netrapalli, G. I.          arXiv:2005.00570 (2020).
     Shamir, Reproducibility in optimization: The-             [61] E. Lobacheva, N. Chirkova, M. Kodryan, D. Vetrov,
     oretical framework and limits, arXiv preprint                  On power laws in deep ensembles, arXiv preprint
     arXiv:2202.04598 (2022).                                       arXiv:2007.08483 (2020).
[47] S. Fort, H. Hu, B. Lakshminarayanan, Deep en-             [62] X. Wang, D. Kondratyuk, E. Christiansen, K. M.
     sembles: A loss landscape perspective, 2020.                   Kitani, Y. Alon, E. Eban, Wisdom of committees: An
     arXiv:1912.02757 .                                             overlooked approach to faster and more accurate
[48] J. Frankle, G. K. Dziugaite, D. Roy, M. Carbin, Linear         models, arXiv preprint arXiv:2012.01988 (2021).
     mode connectivity and the lottery ticket hypothesis,      [63] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E.
     in: International Conference on Machine Learning,              Dahl, G. E. Hinton, Large scale distributed neural
     2020.                                                          network training through online distillation, arXiv
[49] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein,        preprint arXiv:1804.03235 (2018).
     R. Frostig, G. E. Dahl, Measuring the effects of          [64] J. T. Barron, Continuously differentiable exponen-
     data parallelism on neural network training, arXiv             tial linear units, arXiv preprint arXiv:1704.07483
     preprint arXiv:1811.03600 (2018).                              (2017).
[50] C. Summers, M. J. Dinneen, On nondeterminism              [65] D. Hendrycks, K. Gimpel, Gaussian error lin-
     and instability in neural network optimization,                ear units (gelus), arXiv preprint arXiv:1606.08415
     2021.                                                          (2016).
[51] D. Zhuang, X. Zhang, S. L. Song, S. Hooker,               [66] P. Ramachandran, B. Zoph, Q. V. Le, Search-
     Randomness in neural network training: Char-                   ing for activation functions,       arXiv preprint
     acterizing the impact of tooling, arXiv preprint               arXiv:1710.05941 (2017).
     arXiv:2106.11872 (2021).                                  [67] H. Zheng, Z. Yang, W. Liu, J. Liang, Y. Li, Improv-
[52] A. Achille, M. Rovere, S. Soatto, Critical learning            ing deep neural networks using softplus units, in:
     periods in deep neural networks, arXiv preprint                IJCNN, 2015.
     arXiv:1711.08856 (2017).                                  [68] R. Cavallo, P. Krishnamurthy, M. Sviridenko, C. A.
[53] Z. Chen, Y. Wang, D. Lin, D. Cheng, L. Hong, E. Chi,           Wilkens, Sponsored search auctions with rich ads,
     C. Cui, Beyond point estimate: Inferring ensem-                CoRR abs/1701.05948 (2017). URL: http://arxiv.org/
                                                                    abs/1701.05948. arXiv:1701.05948 .
[69] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X.            ibration: A simple way to improve click models,
     Charles, D. M. Chickering, E. Portugaly, D. Ray,             CIKM (2018).
     P. Simard, E. Snelson, Counterfactual reasoning and     [73] E. Eban, M. Schain, A. Mackey, A. Gordon, R. A.
     learning systems: The example of computational               Saurous, G. Elidan, Scalable learning of non-
     advertising, JMLR (2013).                                    decomposable objectives, in: AIStats, 2017.
[70] T. Joachims, A. Swaminathan, T. Schnabel, Un-           [74] G. S. Mann, A. McCallum, Simple, robust, scalable
     biased learning-to-rank with biased feedback, in:            semi-supervised learning via expectation regular-
     WSDM, 2017.                                                  ization, in: ICML, 2007.
[71] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims,       [75] X. Wang, M. Bendersky, D. Metzler, M. Najork,
     M. de Rijke, Large-scale validation of counterfac-           Learning to rank with selection bias in personal
     tual learning methods: A test-bed, arXiv preprint            search, in: ACM SIGIR, 2016.
     arXiv:1612.00367 (2016).                                [76] J. Langford, A. Strehl, J. Wortman, Exploration
[72] A. Borisov, J. Kiseleva, I. Markov, M. de Rijke, Cal-        scavenging, in: ICML, 2008.