<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rohan Anil</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandra Gadanho</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Da Huang</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nijith Jacob</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhuoshu Li</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dong Lin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Todd Phillips</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Pop</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Regan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gil I. Shamir</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rakesh Shivanna</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiqi Yan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Google Inc.</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>For industrial-scale advertising systems, prediction of ad click-through rate (CTR) is a central problem. Ad clicks constitute a significant class of user engagements and are often used as the primary signal for the usefulness of ads to users. Additionally, in cost-per-click advertising systems where advertisers are charged per click, click rate expectations feed directly into value estimation. Accordingly, CTR model development is a significant investment for most Internet advertising companies. Engineering for such problems requires many machine learning (ML) techniques suited to online learning that go well beyond traditional accuracy improvements, especially concerning eficiency, reproducibility, calibration, credit attribution. We present a case study of practical techniques deployed in a search ads CTR model at a large Internet company. This paper provides an industry case study highlighting important areas of current ML research and illustrating how impactful new ML methods are evaluated and made useful in a large-scale industrial setting.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Personalization</kwd>
        <kwd>Recommender system</kwd>
        <kwd>Content optimization</kwd>
        <kwd>Content ranking</kwd>
        <kwd>Content diversity</kwd>
        <kwd>Causal bandit</kwd>
        <kwd>Contextual bandit</kwd>
        <kwd>View-through attribution</kwd>
        <kwd>Holistic optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>viewed video, search query, or other. Search
advertising specifically looks at matching a query  with an ad
Ad click-through rate (CTR) prediction is a key com-  . CTR models for recommendation specifically aim to
ponent of online advertising systems that has a direct predict the probability  (|) , where the input  is
impact on revenue, and continues to be an area of active an ad-query pair ⟨, ⟩ , potentially adorned with
addiresearch [1, 2, 3, 4]. This paper presents a detailed case tional factors afecting CTR, especially related to user
study to give the reader a ”tour of the factory floor” of a interface: how ads will be positioned and rendered on a
production CTR prediction system, describing challenges results page (Section 6).
specific to this category of large industrial ML systems Beyond surfacing maximally useful results,
recomand highlighting techniques that have proven to work mender systems for ads have important additional
caliwell in practice. bration requirements. Actual click labels are stochastic,</p>
      <p>The production CTR prediction model consists of bil- reflecting noisy responses from users. For any given
adlions of weights, trains on more than one hundred bil- query   and binary label   , we typically hope to achieve
lion examples, and is required to perform inference at precisely  (|  ) ∶=  ⟨  ,  ⟩∼ [  = |  ] over some
well over one hundred thousand requests per second. sample of examples  (in test or training). While a typical
The techniques described here balance accuracy improve- log-likelihood objective in supervised training will result
ments with training and serving costs, without adding in zero aggregate calibration bias across a validation set,
undue complexity: the model is the target of sustained per-example bias is often non-zero.
and substantial R&amp;D and must allow for efectively build- Ads pricing and allocation problems create the
pering on top of what came before. example calibration requirement. Typically, predictions
will flow through to an auction mechanism that
incor1.1. CTR for Search Ads porates bids to determine advertiser pricing. Auction
Recommendations pricing schemes (e.g, VCG [5]) rely on the relative value
of various potential outcomes. This requires that
preThe recommender problem surfaces a result or set of re- dictions for all potential choices of  be well calibrated
sults from a given corpus, for a given initial context. The with respect to each other. Additionally, unlike simple
initial context may be a user demographic, previously- recommenders, ads systems frequently opt to show no
ads. This requires estimating the value of individual ads
relative to this ”null-set” of no ads, rather than simply
maximizing for ad relevance.</p>
      <p>Consider a query like ”yarn for sale”; estimated CTR
for an ad from ”yarn-site-1.com” might be 15.3%.
EstiORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender
Systems and User Modeling, jointly with the 16th ACM Conference on
Recommender Systems, September 23rd, 2022, Seattle, WA, USA</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
mated CTR for an ad from ”yarn-site-2.com” might be producibility.
10.4%. Though such estimates can be informed by the This paper makes the following contributions: 1) we
semantic relevance of the websites, the requirements for discuss practical ML considerations from many
perspecprecision are more than what one should expect from gen- tives including accuracy, eficiency and reproducibility,
eral models of language. Additionally, click-through data 2) we detail the real-world application of techniques that
is highly non-stationary: click prediction is fundamen- have improved eficiency and accuracy, in some cases
tally an online recommendation problem. An expectation describing adaptations specific to online learning, and
of 15.3% is not static ground truth in the same sense as, 3) we describe how models can better generalize across
for example, translation or image recommendation; it is UI treatments through model factorization and bias
condefinitively more subject to evolution over time. straints.</p>
      <sec id="sec-1-1">
        <title>1.2. Outline</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Model and Training Overview</title>
      <p>For ads CTR predictors, minor improvements to model
quality will often translate into improved user experience A major design choice is how to represent an ad-query
and overall ads system gains. This motivates continuous pair  . The semantic information in the language of the
investments in model research and development. Theo- query and the ad headlines is the most critical component.
retical and benchmark improvements from ML literature Usage of attention layers on top of raw text tokens may
rarely transfer directly to problem-dependent settings of generate the most useful language embeddings in current
real-world applications. As such, model research must literature [12], but we find better accuracy and eficiency
be primarily empirical and experimental. Consequently, trade-ofs by combining variations of fully-connected
a great deal of attention must be paid to the machine DNNs with simple feature generation such as bi-grams
costs of model training experiments while evaluating and n-grams on sub-word units. The short nature of user
new techniques. In Section 2 we first give a general queries and ad headlines is a contributing factor. Data
overview of the model and training setup; Section 3 then is highly sparse for these features, with typically only a
discusses eficiency concerns and details several suc- tiny fraction of non-zero feature values per example.
cessfully deployed techniques. In Section 4, we survey All features are treated as categorical and mapped to
applications of modern ML techniques targeted at improv- sparse embedding tables. Given an input  , we
concateing measures of accuracy and geared explicitly toward nate the embedding values for all features to form a vector
very-large-scale models. Section 4.4 summarizes empiri-  , the embedding input layer of our DNN.  denotes a
cal results roughly characterizing the relative impact of minibatch of embedding values  across several examples.
these techniques. Next, we formally describe a simplified version of</p>
      <p>Deep neural networks (DNNs) provide substantial the model’s fully-connected neural network
architecimprovements over previous methods in many appli- ture. Later sections will introduce variations to this
arcations, including large-scale industry settings. How- chitecture that improve accuracy, eficiency, or
reproever, non-convex optimization reveals (and exacerbates) ducibility. We feed  into a fully-connected hidden layer
a critical problem of prediction: irreproducibility  1 =  ( 1) that performs a linear transformation of
[6, 7, 8, 9, 10, 11]. Training the same model twice (iden-  using weights  1 followed by non-linear activation
tical architecture, hyper-parameters, training data) may  . Hidden layers   =  ( −1   ) are stacked, with the
lead to metrics of the second model being very difer- output of the  th layer feeding into an output layer
ent from the first. We distinguish between model ir-  =̂ sigmoid(   +1 ) that generates the model’s
predicreproducibility, strictly related to predictions on fixed tion corresponding to a click estimate  .̂ Model weights
data, and system irreproducibility, where a deployed are optimized following min ∑ ℒ (  ,  ̂ ). We found
irreproducible model afects important system metrics. ReLUs to be a good choice for the activation function;
Section 5 characterizes the problem and describes im- Section 5 describes improvements using smoothed
activaprovements to model irreproducibility. tion functions. The model is trained through supervised</p>
      <p>An efective click prediction model must be able to learning with the logistic loss of the observed click label
generalize across diferent UI treatments , including:  with respect to  .̂ Sections 4 and 7 describe additional
where an ad is shown on the page and any changes to losses that have improved our model. Training uses
synthe formatting of the ad (e.g., bolding specific text or chronous minibatch SGD on Tensor Processing Units
adding an image). Section 6 describes a specific model fac- (TPUs) [13]: at each training step  , compute gradients
torization that improves UI generalization performance.   of the loss on a batch of examples (ranging up to
milFinally, Section 7 details a general-purpose technique lions of examples), and weights are optimized with an
for adding bias constraints to the model that has been adaptive optimizer. We find that AdaGrad [ 14, 15] works
applied to both improve generalization and system irre- well for optimizing both embedding weights and dense
network weights. Moreover, In Section 4.2 discusses ac- for ML training is implemented via maximizing model
curacy improvements from deploying a second-order throughput, subject to constraints on minimum
bandoptimizer: Distributed Shampoo [16] for training dense width and maximum training latency. We find that
renetwork weights, which to our knowledge, is the first quired bandwidth is most frequently governed by the
known large-scale deployment in a production scale neu- number of researchers addressing a fixed task. For an
ral network training system. impactful ads model at a large internet company, this
may represent many dozens of engineers attempting
in2.1. Online Optimization cremental progress on a single modelling task. Allowable
training latency is a function of researcher preference,
Given the non-stationarity of data in ads optimization, varying from hours to weeks in practice. Varying
parwe find that online learning methods perform best in allelism (i.e., number of accelerator chips) in training
practice [1]. Models train using a single sequential pass controls development latency. As in many systems,
lowover logged examples in chronological order. Each model ered latency often comes at the expense of throughput.
continues to process new query-ad examples as data ar- For example, using twice the number of chips speeds
rives [17]. For evaluation, we use models’ predictions on up training, but most often does so sub-linearly
(traineach example from before the example is trained on (i.e., ing is less than twice as fast) because of parallelization
progressive validation) [18]. This setup has a number overhead.
of practical advantages. Since all metrics are computed For any given ML advancement, immediate gains must
before an example is trained on, we have an immediate be weighed against the long-term cost to future R&amp;D.
measure of generalization that reflects our deployment For instance, naively scaling up the size of a large DNN
setup. Because we do not need to maintain a holdout might provide immediate accuracy but add prohibitive
validation set, we can efectively use all data for training, cost to future training (Table 1 includes a comparison of
leading to higher confidence measurements. This setup techniques and includes one such naive scaling baseline).
allows the entire learning platform to be implemented as We have found that there are many techniques and
a single-pass streaming algorithm, facilitating the use of model architectures from literature that ofer significant
large datasets. improvements in model accuracy, but fail the test of
whether these improvements are worth the trade-ofs
3. ML Eficiency (e.g., ensembling many models, or full stochastic
variational Bayesian inference [20]). We have also found that
Our CTR prediction system provides predictions for all many accuracy-improving ML techniques can be recast
ads shown to users, scoring a large set of eligible ads for as eficiency-improving via adjusting model parameters
billions of queries per day and requiring support for infer- (especially total number of weights) in order to lower
ence at rates above 100,000 QPS. Any increase in compute training costs. Thus, when we evaluate a technique, we
used for inference directly translates into substantial ad- are often interested in two tuning points: 1) what is the
ditional deployment costs. Latency of inference is also improvement in accuracy when training cost is neutral
critical for real-time CTR prediction and related auctions. and 2) what is the training cost improvement if model
caAs we evaluate improvements to our model, we carefully pacity is lowered until accuracy is neutral. In our setting,
weigh any accuracy improvements against increases in some techniques are much better at improving training
inference cost. costs (e.g., distillation in Section 4.1.2) while others are</p>
      <p>Model training costs are likewise important to consider. better at improving accuracy. Figure 1 illustrates these
For continuous research with a fixed computational bud- two tuning axes.
get, the most important axes for measuring costs are We survey some successfully deployed eficiency
techbandwidth (number of models that can be trained con- niques in the remainder of this section. Section 3.1 details
currently), latency (end-to-end evaluation time for a new the use of matrix factorization bottlenecks to
approximodel), and throughput (models that can be trained per mate large matrix multiplication with reduced cost.
Secunit time). tion 3.2 describes AutoML, an eficient RL-based
architec</p>
      <p>Where inference and training costs may difer, several ture search that is used to identify model configurations
ML techniques are available to make trade-ofs. Distilla- that balance cost and accuracy. Section 3.3 discusses a
tion is particularly useful for controlling inference costs set of efective sampling strategies to reduce data used
or amortizing training costs (see Section 4.1.2). Tech- for training without hurting accuracy.
niques related to adaptive network growth [19] can
control training costs relative to a larger final model (with 3.1. Bottlenecks
larger inference cost).</p>
      <p>Eficient management of computational resources</p>
      <sec id="sec-2-1">
        <title>One practical way to achieve accuracy is to scale up the widths of all the layers in the network. The wider they</title>
        <p>are, the more non-linearities there are in the model, and 3.2. AutoML for Eficiency
in practice this improves model accuracy. On the other
hand, the size of the matrices involved in the loss and To develop an ads CTR prediction model architecture
gradient calculations increases, making the underlying with optimal accuracy/cost trade-of, we typically have
matmul computations slower. Unfortunately, the cost of to tune the embedding widths of dozens of features and
matmul operations (naively) scale up quadratically in the layer widths for each layer in the DNN. Assuming even
size of their inputs. To compute the output of a hidden just a small constant number of options for each such
Tl×ahyeer‘wmi duelt=ripisl y(b-eatdt−ed1ro’pste)rraawtteihgoeynrsetyfopriceaa∈llcyhℝiis×nnp’tu,ctworsoetw-pefeiencrtf iovr−em1. wicnoitsdrtta-hec,tfeactbhtileveecsotcomalcebosinn.adFtuoocrrtiiantrldasudesiattrirocianhla-sslpcaaarclceehmiqteuocidctueklrlsye, sirteeaiasrcchnheost
[21]. We find that carefully inserting bottleneck layers with multiple iterations [22, 23]. We have successfully
of low-rank matrices between layers of non-linearities adopted neural architecture search based on weight
shargreatly reduces scaling costs, with only a small loss of ing [24] to eficiently explore network configurations
relative accuracy. (e.g., varying layer width, embedding dimension) to find</p>
        <p>Applying singular value decomposition to   ’s, we of- versions of our model that provide neutral accuracy with
ten observe that the top half of singular values contribute decreased training and serving cost. As illustrated in
to over 90% of the norm of singular values. This suggests Figure 2, this is achieved by three components: a
weightsharing network, an RL controller, and constraints.
tsphmu−ata1tellwreeend,uocwucaegnhsheatrope. p×Fr  oo+∈rx iaℝm×fix× aetde,    ,−,w∈i1fhℝwi×cehbs.cycaTaanlheebb,eoatmstilogebunnynieficctcaokonnftlscaftooyamrnert- caosnTstuhabein-nwineegtiwgaholltrc-ksashn.adIrniidntahgteinsaewrtcwahyoi,trewkcetbuucraeilnsdistnraatihsnueapsleelrac-rancnhedtswipdoaarctkee
 , compute scales only linearly with  . Empirically, we architectures simultaneously in a single iteration and
found that accuracy loss from this approximation was select a specific architecture by activating part of the
indeed small. By carefully balancing the following two super-network with masking. This setup significantly
refactors, we were able to leverage bottlenecks to achieve duces the number of exploration iterations from O(1000)
better accuracy without increasing computation cost: (1) to O(1).
increasing layer sizes toward better accuracy, at the cost The reinforcement learning controller maintains a
samof more compute, and (2) inserting bottleneck layers to pling distribution,   , over candidate networks. It
samreduce compute, at a small loss of accuracy. Balancing ples a set of decisions ( 1,  2, ...) to activate a sub-network
of these two can be done manually or via AutoML tech- at each training step. We then do a forward pass for the
niques (discussed in the next section). A recent man- activated sub-network to compute loss and cost. Based
ual application of this technique to the model (without on that, we estimate the reward value ( 1,  2, ...) and
AutoML tuning) reduced time per training step by 7% conduct a policy gradient update using the REINFORCE
without impacting accuracy (See Table 2 for a summary algorithm [25] as follows:
of eficiency techniques).</p>
        <p>+  0 ⋅ (( 1,  2, ...) − ) ⋅ ∇ log  ( 1,  2, ...|  ),
 
=  
where  denotes the moving average value of the reward
and  0 is the learning rate for the reinforcement
learning algorithm. Through the update at each training step,
the sampling rate of better architectures will gradually
increase and the sampling distribution will eventually
converge to a promising architecture. We select the
architecture with maximum likelihood at the end of the
training. Constraints specify how to compute the cost
of the activated sub-network, which can typically be
done by estimating the number of floating-point
operations or running a pre-built hardware-aware neural cost
model. The reinforcement learning controller
incorporates the provided cost estimate into the reward (e.g.,
 =  accuracy +  ⋅ | cost/target − 1|, where  &lt; 0 ) [24] in
order to force the sampling distribution to converge to
a cost-constrained point. In order to search for
architectures with lower training cost but neutral accuracy, in our
system we set up multiple AutoML tasks with diferent
constraint targets (e.g. 85%/90%/95% of the baseline cost)
and selected the one with neutral accuracy and smallest
training cost. A recent application of this architecture
search to the model reduced time per training step by
16% without reducing accuracy.</p>
        <sec id="sec-2-1-1">
          <title>3.3. Data Sampling</title>
          <p>Historical examples of clicks on search ads make up a
large dataset that increases substantially every day. The
diminishing returns of ever larger datasets dictate that
it is not beneficial to retain all the data. The marginal
value for improving model quality goes toward zero, and
eventually does not justify any extra machine costs for
training compute and data storage. Alongside using ML
optimization techniques to improve ML eficiency, we
also use data sampling to control training costs. Given
that training is a single-pass over data in time-order, there
are two ways to reduce the training dataset: 1) restricting
the time range of data consumed; and 2) sampling the
data within that range. Limiting training data to more
recent periods is intuitive. As we extend our date range
further back in time, the data becomes less relevant to
future problems. Within any range, clicked examples
are more infrequent and more important to our learning
task; so we sample the non-clicked examples to achieve
rough class balance. Since this is primarily for eficiency,
exact class balance is unnecessary. A constant sampling
rate (a constant class imbalance prior) can be used with a
simple single-pass filter. To keep model predictions
unbiased, importance weighting is used to up-weight negative
examples by the inverse of the sampling rate. Two
additional sampling strategies that have proved efective are
as follows:
• Sampling examples associated with a low logistic
loss (typically examples with low estimated CTR
and no click).
• Sampling examples that are very unlikely to have
been seen by the user based on their position on
the page.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>The thresholds for the conditions above are hand</title>
        <p>tuned and chosen to maximize data reduction without
hurting model accuracy. These strategies are
implemented by applying a small, constant sampling rate to all
examples meeting any of the conditions above.
PseudoRandom sampling determines whether examples should
be kept and re-weighted or simply discarded. This
ensures that all training models train on the same data. This
scheme may be viewed as a practical version of [26] for
large problem instances with expensive evaluation.
Simple random sampling allows us to keep model estimates
unbiased with simple constant importance re-weighting.
It is important to avoid very small sampling rates in this
scheme, the consequent large up-weighting can lead to
model instability. Re-weighting is particularly
important for maintaining calibration, since these sampling
strategies are directly correlated to labels.</p>
        <p>For sampling strategies that involve knowing the loss
on an example, calculating that loss would require
running inference on the training example, removing most
of the performance gains. For this reason, we use a proxy
value based on a prediction made by a ”teacher model”.
In this two-pass approach. We first train once over all
data to compute losses and associated sampling rates, and
then once on the sub-sampled data. The first pass uses
the same teacher model for distillation (Section 4.1.2)
and is only done once. Iterative research can then be
performed solely on the sub-sampled data. While these
latter models will have diferent losses per example, the
ifrst pass loss-estimates still provide a good signal for
the ‘dificulty’ of the training example and leads to good
results in practice. Overall our combination of class
rebalancing and loss-based sampling strategies reduces the
data to &lt; 25% of the original dataset for any given period
without significant loss in accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Accuracy</title>
      <p>Next we detail a set of techniques aimed at improving the
accuracy of the system. We discuss: additional losses that
better align ofline training-time metrics with important
business metrics, the application of distillation to our
online training setting, the adaptation of the Shampoo
second-order optimizer to our model, and the use of Deep
and &amp; Cross networks.</p>
      <sec id="sec-3-1">
        <title>4.1. Loss Engineering</title>
        <sec id="sec-3-1-1">
          <title>Loss engineering plays an important role in our system.</title>
          <p>As the goal of our model is to predict whether an ad
will be clicked, our model generally optimizes for logis- the model’s prediction is unbiased per example. More
tic loss, often thought of as the cross-entropy between detail can be found in Section 7. Application of ranklosses
model predictions and the binary task (click/no-click) to the model generated accuracy improvements of −0.81%
labels for each example. Using logistic loss allows model with a slight increase in training cost of 1%.
predictions to be unbiased so that the prediction can
be interpreted directly as a calibrated probability. Bi- 4.1.2. Distillation.
nary predictions can be improved by introducing soft
prediction through distillation methods [27]. Beyond es- Distillation adds an additional auxiliary loss requiring
timating the CTR per ad, it is important that the set of matching the predictions of a high-capacity teacher
candidate ads for a particular query is correctly ranked model, treating teacher predictions as soft labels [ 27].
(such that ads with clicks have higher CTR than ads with- In our model, we use a two-pass online distillation
out clicks), thus incorporating proper ranking losses is setup. On the first pass, a teacher model records its
predicalso important. In this section, we discuss novel auxil- tions progressively before training on examples. Student
iary losses and introduce multi-task and multi-objective models consume the teacher’s predictions while
trainmethods for joint training with these losses ing on the second pass. Thus, the cost of generating
the predictions from the single teacher can be amortized
4.1.1. Rank Losses across many students (without requiring the teacher to
repeat inference to generate predictions). In addition
We found that Area under the ROC curve computed per to improving accuracy, distillation can also be used for
query (PerQueryAUC) is a metric well correlated with reducing training data costs. Since the high-capacity
business metrics quantifying the overall performance teacher is trained once, it can be trained on a larger data
of a model. In addition to using PerQueryAUC during set. Students benefit implicitly from the teachers prior
evaluation, we also use a relaxation of this metric, i.e., knowledge of the larger training set, and so require
trainrank-loss, as a second training loss in our model. There ing only smaller and more recent data. The addition of
are many rank losses in the learning-to-rank family [28, distillation to the model improved accuracy by 0.41%
29]. We find one efective approximation is Ranknet loss without increasing training costs (in the student).
[30], which is a pairwise logistic loss:
− ∑ ∑ log(sigmoid(  ,   )),</p>
          <p>∈{  =1} ∈{  ≠1}
where   ,   are logit scores of two examples.</p>
          <p>Rank losses should be trained jointly with logistic loss;
there are several potential optimization setups. In one
setup, we create a multi-objective optimization problem
[31]:
ℒ ( ) =  1ℒrank(yrank, s) + (1 −  1)ℒlogistic(y, s),
4.1.3. Curriculums of Losses
In machine learning, curriculum learning [34] typically
involves a model learning easy tasks first and gradually
switching to harder tasks. We found that training on all
classes of losses in the beginning of training increased
model instability (manifesting as outlier gradients which
cause quality to diverge). Thus, we apply an approach
similar to curriculum learning to ramp up losses, starting
with the binary logistic loss and gradually ramping up
distillation and rank losses over the course of training.
where s are logit scores for examples, yrank are ranking
labels, y are the binary task labels, and  1 ∈ (0, 1) is the
rank-loss weight. Another solution is to use multi-task
learning [32, 33], where the model produces multiple
diferent estimates  for each loss.</p>
          <p>Second-order optimization methods that use second
derivatives and/or second-order statistics are known to
have better convergence properties to first-order
methods [35]. Yet to our knowledge, second-order methods are
ℒ ( shared,  logistic,  rank) = rarely reported to be used in production ML systems for
 1ℒrank(y, srank) + (1 −  1)ℒlogistic(y, slogistic), DNNs. Recent work on Distributed Shampoo [16, 36] has
made second-order optimization feasible for our model
where  shared are weights shared between the two losses, by leveraging the heterogneous compute ofered by TPUs
 logistic are for the logistic loss output, and  rank are and host-CPUs, and employing additional algorithmic
for the rank-loss output. In this case, the ranking loss and eficiency improvements.
afects the ”main” prediction slogistic as a ”regularizer” on In our system, Distributed Shampoo provided much
 shared. faster convergence with respect to training steps, and</p>
          <p>As rank losses are not naturally calibrated predictors of yielded better accuracy when compared to standard
adapclick probabilities, the model’s predictions will be biased. tive optimization techniques including AdaGrad [15],
A strong bias correction component is needed to ensure</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Second-order Optimization</title>
        <p>Adam [37], Yogi [38], and LAMB [39]. While, second- described in [41].
order methods like Distributed Shampoo is known to Stability &amp; Eficiency. Distributed Shampoo has higher
provide faster convergence compared to first-order meth- computational complexity per step as it involves
multiods in the literature - It often fails to provide competitive plication of large matrices for preconditioning and
statiswall-clock time due to the computational overheads in tics/preconditioner computation. We address these
overthe optimizer on smaller scale benchmarks. For our train- heads with several techniques in our deployment. For
ing system, second-order optimization method was an example, the block-diagonalization suggested in [16] was
ideal candidate due to large batch sizes used in training efective at reducing the computational complexity while
which amortizes the cost of costly update rule. Train- also allowing the implementation of parallel updates for
ing time only increased by approximately 10% and the each block in the data-parallel setting via weight-update
improvements to model accuracy far outweighed the in- sharding [42]. This reduced the overall step time.
Morecrease in training time. We next discuss some of the more over, optimizer overheads are independent of batch size,
salient implementation details specific to our model. and thus we increased batch size to reduce overall
com</p>
        <p>Learning Rate Grafting. One of the main challenges in putational overhead. Finally, we found that condition
online optimization is defining a learning rate schedule. number of statistics used for preconditioning can vary
In contrast to training on static datasets, the number of in range reaching more than 1010. Because, numerical
steps an online model will require is not known and may stability and robustness is of utmost importance in
probe unbounded. Accordingly, popular learning rate sched- duction; we make use of double precision numerics. To
ules from literature depending on fixed time horizons, compute the preconditioners, we use the CPUs attached
such as cosine decay or exponential decay, perform worse to the TPUs to run inverse- th roots and exploit a faster
in contrast to the implicit data-dependent adaptive sched- algorithm, the coupled Newton iteration for larger
preule from AdaGrad [15]. As observed in literature [40], we conditioners [43] as in Figure 3.
also find that AdaGrad’s implicit schedule works quite When integrated with the ad click prediction model
well in the online setting; especially after the  parame- the optimizer improved our primary measure of
accuter (the initial accumulator value) is tuned. Accordingly, racy, Area under the ROC curve computed per query
we bootstrap the schedule for Distributed Shampoo via (PerQueryAuc), by 0.44%. Accuracy improvements above
grafting the per layer step size from AdaGrad. More pre- 0.1% are considered significant. For comparison: a naive
cisely, we use the direction from Shampoo while using scaling of the deep network by 2x yields a PerQueryAUC
the magnitude of step size from AdaGrad at a per-layer improvement of 0.13%. See Table 1 for a summary of
granularity. An important feature of this bootstrapping accuracy technique results.
is that it allowed us to inherit hyper-parameters from
previous AdaGrad tunings to search for a Pareto optimal 4.3. Deep &amp; Cross Network
configuration.</p>
        <p>Momentum. Another efective implementation choice Learning efective feature crosses is critical for
recomis the combination of Nesterov-styled momentum with mender systems [3, 44]. We adopt an eficient variant of
the preconditioned gradient. Our analysis suggests that DCNv2 [44] using bottlenecks. This is added between the
momentum added modest gains on top of Shampoo embedding layer  described in Section 2 and the DNN.
without increasing the computational overhead while We next describe the Deep &amp; Cross Network
architecmarginally increasing the memory overhead. Computa- ture and its embedding layer input. We use a standard
tional overhead was addressed via the approximations embedding projection layer for sparse categorical
fea0.18%
0.44%
techniques.∗ Distillation does not include teacher cost which, due to amortization,
is a small fraction of overall training costs.
tures. We project categorical feature  from a higher
dimensional sparse space to a lower dimensional dense
space using  ̃ =     , where   ∈ {0, 1}  ;   ∈ ℝ  ×  is
the learned projection matrix;  ̃ is the dense embedding
representation; and   and   represent the vocabulary
and dense embedding sizes respectively. For multivalent
features, we use average pooling of embedding vectors.</p>
        <p>Embedding dimensions {  } are tuned for eficiency and
accuracy trade-ofs using AutoML (Section</p>
        <p>3.2). Output
of the embedding layer is a wide concatenated vector
 0 = concat( 1̃,  2̃ …  ̃ ) ∈ ℝ for  features. For crosses,
we adopt an eficient variant of [</p>
        <p>44], applied directly on
top of the embedding layer to explicitly learn feature
crosses:   =  2( 0 ⊙      −1 ) +  −1 , where   ,  −1 ∈ ℝ
represent the output and input of the  th cross layer,
respectively;   ∈ ℝ×
and   ∈ ℝ×</p>
        <p>are the learned
weight matrices leveraging bottlenecks (Section 3.1) for
eficiency;  2 is a scalar, ramping up from 0 → 1 during
initial training, allowing the model to first learn the
embeddings and then the crosses in a curriculum fashion.</p>
        <sec id="sec-3-2-1">
          <title>Furthermore, this ReZero initialization [45] also improves</title>
          <p>model stability and reproducibility (Section 5).</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>In practice adding the Deep &amp; Cross Network to the</title>
          <p>model yielded an accuracy improvement of 0.18% with
a minimal increase in training cost of 3%.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Summary of Eficiency and Accuracy</title>
      </sec>
      <sec id="sec-3-4">
        <title>Results</title>
        <p>Below we share measurements of the relative impact of
the previously discussed eficiency and accuracy
techniques as applied to the production model. The goal is
to give a very rough sense of the impact of these
techniques and their accuracy vs. eficiency tradeofs. While
precise measures of accuracy improvement on one
particular model are not necessarily meaningful, we believe
the coarse ranking of techniques and rough magnitude
of results are interesting (and are consistent with our
general experience).</p>
        <sec id="sec-3-4-1">
          <title>The baseline 2x DNN size model doubles the number</title>
          <p>of hidden layers. Note, that sparse embedding lookups
add to the overall training cost, thus doubling the number
layers does not proportionally increase the cost.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Irreproducibility</title>
      <p>Irreproducibility, noted in Section 1, may not be easy to
detect because it may appear in post deployment
system metrics and not in progressive validation quality
metrics. A pair of duplicate models may converge to
two diferent optima of the highly non-convex objective,
giving equal average accuracy, but diferent individual
predictions, but with diferent downstream
system/auction outcomes. Model deployment leads to further
divergence, as ads selected by deployed models become part
of subsequent training examples [17]. This can critically
afect R&amp;D: experimental models may appear beneficial,
but gains may disappear when they are retrained and
deployed in production. Theoretical analysis is complex
even in the simple convex case, which is considered only
in very recent work [46]. Many factors contribute to
irreproducibility [47, 48, 49, 10, 50, 51], including random
initialization, non-determinism in training due to
highlyparallelized and highly-distributed training pipelines,
numerical errors, hardware, and more. Slight deviations
early in training may lead to very diferent models [ 52].
While standard training metrics do not expose system
irreproducibility, we can use deviations of predictions on
individual examples as a cheap proxy, allowing us to fail
fast prior to evaluation at deployment-time. Common
statistical metrics (standard deviation, various divergences)
can be used [53, 54] but they require training many more
models, which is undesirable at our scale. Instead, we use
the Relative Prediction Diference
(PD) [7, 9] metric
△
Δ = 1/ ⋅
∑ | ,̂1 −  ,̂2 |/[( ,̂1 +  ,̂2 )/2]

, measuring absolute point-wise diference in model
predictions for a pair of models (subscripts 1 and 2),
normalized by the pair’s average prediction. Computing PD
requires training a pair of models instead of one, but
we have observed that reducing PD is suficient to
improve reproducibility of important system metrics. In this
section, we focus on methods to improve PD; Section 7
focuses on directly improving system metrics.</p>
      <p>PDs may be as high as 20% for deep models.
Perhaps surprisingly, standard methods such as fixed
initialization, regularization, dropout, data augmentation,
as well as new methods imposing constraints [55, 56]
either failed to improve PD or improved PD at the cost
of accuracy degradation. Techniques like warm-starting
model weights to values of previously trained models
may not be preferable because they can anchor the model
to a potentially bad solution space and do not help the
development cycle for newer more reproducible models
for which there is no anchor.</p>
      <p>Other techniques have shown varying levels of
success. Ensembles [57], specifically self-ensembles [58],
where we average predictions of multiple model
duplicates (each initialized diferently), can reduce prediction
variance and PD. However, maintaining ensembles in a
production system with multiple components builds up
substantial technical debt [59]. While some literature
[60, 61, 62] describes accuracy advantages for ensembles,
in our regime, ensembles degraded accuracy relative to
equal-cost single networks. We believe this is because,
unlike in the benchmark image models, examples in
online CTR systems are visited once, and, more importantly,
the learned model parameters are dominated by sparse
embeddings. Relatedly, more sophisticated techniques
based on ensembling and constraints can also improve
PD [63, 7, 8].</p>
      <p>Techniques described above trade accuracy and
complexity for better reproducibility, requiring either
ensembles or constraints. Further study and experimentation
revealed that the popular use of Rectified Linear Unit
(ReLU) activations contributes to increased PD. ReLU’s
gradient discontinuity at 0 induces a highly non-convex
loss landscape. Smoother activations, on the other hand,
reduce the amount of non-convexity, and can lead to
more reproducible models [9]. Empirical evaluations of
various smooth activations [64, 65, 66, 67] have shown
not only better reproducibility compared to ReLU, but
also slightly better accuracy. The best
reproducibilityaccuracy trade-ofs in our system were attained by the
simple Smooth reLU (SmeLU) activation proposed in [9].
The function form is:
 SmeLU() =</p>
      <p>0;  ≤ −
⎧⎨ (+4) 2 ; || ≤ 
⎩ ;  ≥ .
(1)</p>
      <sec id="sec-4-1">
        <title>In our system, 3-component ensembles reduced PD</title>
        <p>from 17% to 12% and anti-distillation reduced PD further
to 10% with no accuracy loss. SmeLU allowed launching
a non-ensemble model with PD less than 10% that also
improved accuracy by 0.1%. System reproducibility
metrics also improved to acceptable levels compared to the
unacceptable levels of ReLU single component models.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Generalizing Across UI</title>
    </sec>
    <sec id="sec-6">
      <title>Treatments</title>
      <p>One of the major factors in CTR performance of an ad is
its UI treatment, including positioning, placement
relative to other results on the page, and specific renderings
such as bolded text or inlined images. A complex
auction must explore not just the set of results to show, but
how they should be positioned relative to other content,
and how they should be individually rendered [68]. This
exploration must take place eficiently over a
combinatorially large space of possible treatments.</p>
      <p>We solve this through model factorization, replacing
estimated CTR with  ( ⋅  ) , composed of a transfer
function  where  ,  are separable models that output
vectorized representations of the Quality and the UI,
respectively, and are combined using an inner-product.</p>
      <p>While  , consisting of a large DNN and various feature
embeddings, is a costly model, it needs to be evaluated
only once per ad, irrespective of the number of UI
treatments. In contrast,  , being a much lighter model, can be
evaluated hundreds of times per ad. Moreover, due to the
relatively small feature space of the UI model, outputs
can be cached to absorb a significant portion of lookup
costs (as seen in Figure 4).</p>
      <p>Separately from model performance requirements,
accounting for the influence of UI treatments on CTR is
also a crucial factor for model quality. Auction dynamics
deliberately create strong correlations between
individual ads and specific UI treatments. Results that are lower
on the page may have low CTR regardless of their
relevance to the query. Failure to properly disentangle these
correlations creates inaccuracy when generalizing over
UI treatments (e.g., estimating CTR if the same ad was
shown higher on the page). Pricing and eligibility
decisions depend crucially on CTR estimates of sub-optimal to our objective function, enforced on relevant slices of
UIs that are rarely occurring in the wild. For instance, our either the training set or a separate, labelled dataset. This
system shouldn’t show irrelevant ads, and so such sce- allows us reduce non-identifiability by anchoring model
narios will not be in the training corpus, and so estimates loss to a desired part of the solution space (e.g., one that
of their irrelevance (low CTR) will be out of distribution. satisfies calibration) (Figure 5a). By extension, we reduce
But these estimates are needed to ensure the ads do not irreproducibility by anchoring a retrained model to the
show. Even for relevant ads, there is a similar problem. same solution.</p>
      <p>Performance of ads that rarely show in first position may Our technique is more lightweight than other
methstill be used to set the price of those ads that often do ods used for large-scale, online training (counterfactual
show in first position. This creates a specific generaliza- reasoning [69], variations of inverse propensity scoring
tion problem related to UI, addressed in Section 7. [70, 71]): in practice, there are fewer parameters to tune,</p>
      <p>Calibration is an important characteristic for large- and we simply add an additional term to our objective
scale ads recommendation. We define calibration bias rather than changing the model structure. To address
calas label minus prediction, and want this to be near zero ibration, [72] adjusts model predictions in a separate
calper ad. A calibrated model allows us to use estimated ibration step using isotonic regression, a non-parametric
CTR to determine the trade-of between showing and method. Our technique does calibration jointly with
estinot showing an ad, and between showing one ad versus mation, and is more similar to methods which consider
another; both calculations can be used in downstream eficient optimization of complex and augmented
objectasks such as UI treatment selection, auction pricing, or tives (e.g., [73, 74]). Using additional constraints on the
understanding of ad viewability. objective allows us to address a wide range of calibration</p>
      <p>The related concept of credit attribution is similar to and credit attribution issues.
counterfactual reasoning [69] or bias in implicit feedback
[70]. It is a specific non-identifiability in model weights
that can contribute to irreproducibility (Section 5). Con- 7. Bias Constraints
sider an example to illustrate the UI efect (Section 6):
assume that model A has seen many training examples 7.1. Online Optimization of Bias
with high-CTR ads in high positions, and (incorrectly) Constraints
learned that ad position most influences CTR. Model B,
defined similar to A, trains first on the few examples
where high-CTR ads appear in low positions, and
(correctly) learns that something else (e.g., ad relevancy to
query) is causing high CTR. Both models produce the
same estimated CTR for these ads but for diferent
reasons, and when they are deployed, model A will likely
show fewer ads because it will not consider otherwise
useful ads in lower positions; these models will show
system irreproducibility.</p>
      <p>In our system, we use a novel, general-purpose
technique called bias constraints to address both calibration
and credit attribution. We add calibration bias constraints
We now optimize our original objective function with
the constraint that ∀∀ ∈   , (  −  ̂ ) = 0. Here,   are
subsets of the training set which we’d like to be calibrated
(e.g., under-represented classes of data) or new training
data that we may or may not optimize the original model
weights over (e.g., out-of-distribution or of-policy data
gathered from either randomized interventions or
exploration scavenging [75, 70, 76]). To aid optimization, we
ifrst transform this into an unconstrained optimization
problem by introducing a dual variable  , for each
constraint and maximizing the Lagrangian relative to the
dual variables. Next, instead of enforcing zero bias per
example, we ask that the squared average bias across  
is zero. This reduces the number of dual variables to {  },
and is equivalent to adding an L2 regularization on  
with a constraint of zero average bias. For a constant  3
controlling regularization, and tuned via typical
hyperplacement, we will include it in { } ). For the model in
Table 3, we saw substantial bias improvements on several
data subsets   related to out-of-distribution ad placement
and more reproducibility with minimal accuracy impact
parameter tuning techniques (e.g. grid search), our new
when adding bias constraints.
optimization is:
Any degraded accuracy or stability is mitigated by com- Progressive validation and deployed system metrics reported</p>
      <p>min max ∑ ℒ (  ,  ̂ ) + ∑</p>
      <p>∑(  (  −  ̂ ) −

=1 ∈</p>
      <p>23  2)
aren’t optimally selected.
binations of the following tunings, ordered by impact:
ramping up the bias constraint term, reducing the
learning rate on {  }, increasing  3, or adding more or
finergrained constraints (breaking up   ). We believe the first
two can help normalize any diferences between the
magnitude of the dual variables and other weights, and the
latter two help lessen the strength of the bias term if</p>
      <sec id="sec-6-1">
        <title>7.2. Bias Constraints for General</title>
      </sec>
      <sec id="sec-6-2">
        <title>Calibration</title>
        <p>1 Bias  2 Bias  3 Bias
-15%
-75%
-43%</p>
        <p>Loss
+0.03%</p>
        <p>Ads/Query Churn
-85%
as a percent change for a bias constraint over the original
model (negative is better). Ads/Query Churn records how
much the percent diference in the number of ads shown
above search results per query between two model retrains
changes when deployed in similar conditions; we want this to
be close to zero.</p>
        <sec id="sec-6-2-1">
          <title>Viewing the bias constraints as anchoring loss rather</title>
          <p>than changing the loss landscape (Figure 5a), we find that
the technique does not fix model irreproducibility but
rather mitigates system irreproducibility: we were able
to cut the number of components in the ensemble by half
and achieve the same level of reproducibility.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>If we plot calibration bias across buckets of interesting variables, such as estimated CTR or other system met</title>
        </sec>
        <sec id="sec-6-2-3">
          <title>However, for several axes of interest, our system shows</title>
          <p>rics, we expect a calibrated model to have uniform bias. 8. Conclusion
higher bias at the ends of the range (Figure 5b). We
apWe detailed a set of techniques for large-scale CTR
preply bias constraints to this problem by defining 
 to be
diction that have proven to be truly efective “in
proexamples in each bucket of, e.g., estimated CTR. Since
duction”: balancing improvements to accuracy, training
we don’t use the dual variables during inference, we can
and deployment cost, system reproducibility and model
include estimated CTR in our training objective. With
complexity—along with describing approaches for
generbias constraints, bias across buckets of interest becomes
alizing across UI treatments. We hope that this brief visit
much more uniform: variance is reduced by more than
to the factory floor will be of interest to ML practitioners
half. This can in turn improve accuracy of downstream
of CTR prediction systems, recommender systems, online
consumers of estimated CTR.</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>7.3. Exploratory Data and Bias</title>
      </sec>
      <sec id="sec-6-4">
        <title>Constraints</title>
        <p>
          We can also use bias constraints to solve credit attribution
for UI treatments. We pick   by focusing on classes
of examples that represent uncommon UI presentations
for competitive queries where the ads shown may be
quite diferent. For example,  1 might be examples where
a high-CTR ad showed at the bottom of the page,  2
examples where a high-CTR ad showed in the
second-tolast position on the page, etc. Depending on how model
training is implemented, it may be easier to define   in
terms of existing model features (e.g., for a binary feature
 , we split one sum over   into two sums). We choose { }
to include features that generate partitions large enough
to not impact convergence but small enough that we
expect the bias per individual example will be driven to
zero (e.g., if we think that query language impacts ad
training systems, and more generally to those interested
in large indust
          <xref ref-type="bibr" rid="ref25">rial settings.
ads, in: WWW, 2017</xref>
          .
[69] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. ibration: A simple way to improve click models,
Charles, D. M. Chickering, E. Po
          <xref ref-type="bibr" rid="ref25">rtugaly, D. Ray, CIKM (2018</xref>
          ).
        </p>
        <p>
          P. Simard, E. Snelson, Counterfactual reasoning and [73] E. Eban, M. Schain, A. Mackey, A. Gordon, R. A.
learning systems: The example of computational Saurous, G. Elidan, Scalable learning of
nonadvertising, JMLR (2013). decomposable objectives, in: AIStats, 2017.
[70] T. Joachims, A. Swaminathan, T. Schnabel, Un- [74] G. S. Mann, A. McCallum, Simple, robust, scalable
biased learning-to-rank with biased feedback, in: semi-supervised learning via expectation
          <xref ref-type="bibr" rid="ref25">regularWSDM, 2017</xref>
          . ization, in: ICML, 2007.
[71] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, [75] X. Wang, M. Bendersky, D. Metzler, M. Najork,
M. de Rijke, Large-scale validation of counterfac- Learning to rank with selection bias in personal
tual learning methods: A test-bed, arXiv p
          <xref ref-type="bibr" rid="ref25">reprint search, in: ACM SIGIR, 2016</xref>
          .
        </p>
        <p>
          arXiv:1612.00367 (2016). [76] J. Langford, A. Strehl, J. Wortman, Exploration
[72] A. Borisov, J. Kiseleva, I. Markov, M. de Rijke, Cal- scaveng
          <xref ref-type="bibr" rid="ref16">ing, in: ICML, 2008</xref>
          .
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>facebook</surname>
          </string-name>
          ,
          <year>2014</year>
          . [1]
          <string-name>
            <surname>H. B. McMahan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Holt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Ebner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Grady</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Davydov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Golovin</surname>
          </string-name>
          , et al.,
          <article-title>Ad click prediction: a view from the trenches</article-title>
          ,
          <source>in: SIGKDD</source>
          ,
          <year>2013</year>
          . [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Atallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Herbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowers</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Q. n. Candela,</surname>
          </string-name>
          <article-title>Practical lessons from predicting clicks on ads</article-title>
          at [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gai</surname>
          </string-name>
          ,
          <article-title>Deep interest network for click-through rate prediction</article-title>
          ,
          <source>in: SIGKDD</source>
          ,
          <year>2018</year>
          . [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Model ensemble for click prediction in bing search</article-title>
          [5]
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Varian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <source>The vcg auction in theory preprint arXiv:1511.05641</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          and practice, American Economic Review (
          <year>2014</year>
          ). [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cornebise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , D. Wier[6]
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Dusenberry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <article-title>Kemp, stra, Weight uncertainty in neural network</article-title>
          , in: InJ. Nixon, G. Jerfel,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Analyzing ternational conference on machine learning, PMLR, the role of model uncertainty for electronic health 2015</article-title>
          , pp.
          <fpage>1613</fpage>
          -
          <lpage>1622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          records, in: CHIL,
          <year>2020</year>
          . [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Denil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shakibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          , N. de Fre[7]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Shamir</surname>
          </string-name>
          , L. Coviello,
          <article-title>Anti-distillation: Im- itas, Predicting parameters in deep learning, CoRR proving reproducibility of deep networks</article-title>
          , arXiv (????).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          preprint arXiv:
          <year>2010</year>
          .
          <volume>09923</volume>
          (
          <year>2020</year>
          ). [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Learn[8]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Shamir</surname>
          </string-name>
          , L. Coviello,
          <article-title>Distilling from ensem- ing transferable architectures for scalable image bles to improve reproducibility of neural networks, recognition</article-title>
          , in: CVPR,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <year>2020</year>
          . [23]
          <string-name>
            <given-names>E.</given-names>
            <surname>Real</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Regu[9]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Shamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coviello</surname>
          </string-name>
          ,
          <article-title>Smooth activa- larized evolution for image classifier architecture tions and reproducibility in deep networks, arXiv search</article-title>
          , in: AAAI,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          preprint arXiv:
          <year>2010</year>
          .
          <volume>09931</volume>
          (
          <year>2020</year>
          ). [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Chu, S. Cheng, P.-J. [10]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Snapp</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. I. Shamir</surname>
          </string-name>
          , Synthesizing irre- Kindermans,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Can weight sharing outperproducibility in deep networks</article-title>
          ,
          <source>arXiv preprint form random architecture search? an investigation arXiv:2102.10696</source>
          (
          <year>2021</year>
          ).
          <article-title>with tunas</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2020</year>
          . [11]
          <string-name>
            <surname>A. D'Amour</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Heller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Moldovan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Adlam</surname>
            , [25]
            <given-names>R. J.</given-names>
          </string-name>
          <string-name>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Simple statistical gradient-following B</article-title>
          .
          <string-name>
            <surname>Alipanahi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beutel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Deaton</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Eisen- algorithms for connectionist reinforcement learnstein</article-title>
          , M. D.
          <string-name>
            <surname>Hofman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Hormozdiari</surname>
          </string-name>
          , N. Houlsby, ing,
          <source>Machine learning</source>
          (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jerfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karthikesalingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucic</surname>
          </string-name>
          , [26]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fithian</surname>
          </string-name>
          , T. Hastie,
          <article-title>Local case-control sampling: Y</article-title>
          . Ma,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mincu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitani</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Mon- Eficient subsampling in imbalanced data sets, The tanari</article-title>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nielson</surname>
          </string-name>
          ,
          <string-name>
            <surname>T. F</surname>
          </string-name>
          . Annals of Statistics (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Osborne</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Raman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Ramasamy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Sayres</surname>
            , [27]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowlJ</article-title>
          . Schrouf,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sequeira</surname>
          </string-name>
          , H. Suresh,
          <article-title>edge in a neural network, in: NIPS Deep Learning V</article-title>
          . Veitch,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vladymyrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Webster</surname>
          </string-name>
          , and Representation Learning Workshop,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Yadlowsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          , Un- [28]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Pasumarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Benderspecification presents challenges for credibil- dersky,</article-title>
          <string-name>
            <surname>M. Najork</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pfeifer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Golbandi</surname>
          </string-name>
          , R. Anil,
          <article-title>ity in modern machine learning, arXiv preprint S. Wolf, Tf-ranking: Scalable tensorflow library for arXiv:</article-title>
          <year>2011</year>
          .
          <volume>03395</volume>
          (
          <year>2020</year>
          ).
          <article-title>learning-to-rank</article-title>
          ,
          <source>in: SIGKDD</source>
          ,
          <year>2019</year>
          . [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit, [29]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Burges</surname>
          </string-name>
          , From Ranknet to Lambdarank to LambL. Jones,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , At- daMart:
          <article-title>An overview</article-title>
          ,
          <source>Learning</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>tention is all you need</article-title>
          , in: NeurIPS,
          <year>2017</year>
          . [30]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shaked</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Renshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazier</surname>
          </string-name>
          , [13]
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Jouppi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Yoon</surname>
          </string-name>
          , G. Kurian,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deeds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hullender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Learning J.</given-names>
            <surname>Laudon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <article-title>A domain- to rank using gradient descent</article-title>
          , in: ICML,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>specific supercomputer for training deep neural</article-title>
          [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <article-title>Combined regression and ranking</article-title>
          , in: networks,
          <source>Communications of the ACM</source>
          (
          <year>2020</year>
          ).
          <source>In KDD'10</source>
          ,
          <year>2010</year>
          . [14]
          <string-name>
            <surname>H. B. McMahan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Streeter</surname>
            , Adaptive bound op- [32]
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Caruana</surname>
          </string-name>
          ,
          <article-title>Multitask learning, Machine Learning timization for online convex optimization</article-title>
          ,
          <source>arXiv</source>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>preprint arXiv:1002.4908</source>
          (
          <year>2010</year>
          ). [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <article-title>An overview of multi-task learning</article-title>
          in [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duchi</surname>
          </string-name>
          , E. Hazan,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <source>Adaptive subgradient deep neural networks</source>
          ,
          <year>2017</year>
          . arXiv:
          <volume>1706</volume>
          .
          <fpage>05098</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>methods for online learning</article-title>
          and stochastic opti- [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          , J. Weston, mization,
          <source>JMLR</source>
          (
          <year>2011</year>
          ).
          <article-title>Curriculum learning</article-title>
          , in: ICML,
          <year>2009</year>
          . [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Regan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          , [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nocedal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Wright</surname>
          </string-name>
          , Numerical Optimization, Second order optimization made practical, Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          https://arxiv.org/abs/
          <year>2002</year>
          .09018 (
          <year>2020</year>
          ). [36]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          , Shampoo: Precon[17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          , T. Joachims,
          <article-title>Batch learning from ditioned stochastic tensor optimization, in: ICML, logged bandit feedback through counterfactual risk 2018</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>minimization</surname>
          </string-name>
          ,
          <source>JMLR</source>
          (
          <year>2015</year>
          ). [37]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochas[18]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Blum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <article-title>Beating the hold-out: tic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412</source>
          .
          <article-title>6980 Bounds for k-fold and progressive cross-validation</article-title>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>in: COLT</source>
          ,
          <year>1999</year>
          . [38]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kale</surname>
          </string-name>
          , S. Ku[19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Shlens,</surname>
          </string-name>
          <article-title>Net2net: Ac- mar, Adaptive methods for nonconvex optimization, celerating learning via knowledge transfer</article-title>
          ,
          <source>arXiv NeurIPS</source>
          (
          <year>2018</year>
          ). [39]
          <string-name>
            <given-names>Y.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hseu</surname>
          </string-name>
          , S. Kumar,
          <article-title>ble prediction variation from neuron activation S. Bhojanapalli</article-title>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Demmel</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <article-title>Keutzer, strength in recommender systems</article-title>
          , arXiv preprint C.
          <article-title>Hsieh, Large batch optimization for deep learn</article-title>
          - arXiv:
          <year>2008</year>
          .
          <volume>07032</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>ing: Training bert in 76 minutes</article-title>
          , arXiv preprint [54]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shamir</surname>
          </string-name>
          , J. Han, Dropout arXiv:
          <year>1904</year>
          .
          <volume>00962</volume>
          (
          <year>2019</year>
          ).
          <article-title>prediction variation estimation using neuron acti</article-title>
          [40]
          <string-name>
            <given-names>N.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          , E. Hazan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Koren</surname>
          </string-name>
          , C. Zhang, vation strength,
          <source>arXiv preprint arXiv:2110.06435 Disentangling adaptive gradient methods from</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>learning rates</article-title>
          , arXiv preprint arXiv:
          <year>2002</year>
          .
          <volume>11803</volume>
          [55]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhojanapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Wilber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veit</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. S. Rawat</surname>
          </string-name>
          , (
          <year>2020</year>
          ). S. Kim,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , On the reproducibil[41]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martens</surname>
          </string-name>
          , G. Dahl, G. Hinton,
          <source>On the ity of neural network predictions</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>importance of initialization and</article-title>
          momentum in deep [56]
          <string-name>
            <surname>G. I. Shamir</surname>
          </string-name>
          ,
          <article-title>Systems and methods for improved learning</article-title>
          ,
          <source>in: ICML</source>
          ,
          <year>2013</year>
          . generalization, reproducibility, and stabilization of [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hechtman</surname>
          </string-name>
          ,
          <article-title>neural networks via error control code constraints, S. Wang, Automatic cross-replica sharding of 2018.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>weight update in data-parallel training</article-title>
          , arXiv [57]
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          , Ensemble methods in machine preprint arXiv:
          <year>2004</year>
          .
          <volume>13336</volume>
          (
          <year>2020</year>
          ).
          <source>learning, Lecture Notes in Computer Science</source>
          (
          <year>2000</year>
          ). [43]
          <string-name>
            <surname>C.-H. Guo</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          <string-name>
            <surname>Higham</surname>
            , A schur-newton method [58]
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Allen-Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards understanding enfor the matrix\boldmath p th root and its inverse, semble, knowledge distillation and self-distillation SIAM Journal on Matrix Analysis and Applications in deep learning</article-title>
          , arXiv preprint arXiv:
          <year>2012</year>
          .
          <volume>09816</volume>
          (
          <year>2006</year>
          ).
          <article-title>(</article-title>
          <year>2020</year>
          ). [44]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shivanna</surname>
          </string-name>
          , D. Cheng, S. Jain,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , [59]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Golovin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Davydov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          , E. Chi, Dcn v2:
          <article-title>Improved deep</article-title>
          &amp; cross
          <string-name>
            <given-names>T.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <article-title>Manetwork and practical lessons for web-scale learn- chine learning: The high interest credit card of ing to rank systems</article-title>
          , in: WWW,
          <year>2021</year>
          . technical debt (
          <year>2014</year>
          ). [45]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bachlechner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          , G. Cottrell, [60]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondratyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. McAuley</surname>
          </string-name>
          ,
          <article-title>Rezero is all you need: Fast conver- When ensembling smaller models is more efigence at large depth</article-title>
          ,
          <source>in: UAI</source>
          ,
          <year>2021</year>
          .
          <article-title>cient than single large models</article-title>
          , arXiv preprint [46]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Netrapalli</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. I. arXiv</surname>
          </string-name>
          :
          <year>2005</year>
          .
          <volume>00570</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Shamir</surname>
            , Reproducibility in optimization: The- [61]
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Lobacheva</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Chirkova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Kodryan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Vetrov, oretical framework and limits, arXiv preprint On power laws in deep ensembles</article-title>
          ,
          <source>arXiv preprint arXiv:2202.04598</source>
          (
          <year>2022</year>
          ). arXiv:
          <year>2007</year>
          .
          <volume>08483</volume>
          (
          <year>2020</year>
          ). [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lakshminarayanan</surname>
          </string-name>
          , Deep en- [62]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondratyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Christiansen</surname>
          </string-name>
          , K. M.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>sembles: A loss landscape perspective</article-title>
          ,
          <year>2020</year>
          . Kitani,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Alon</surname>
          </string-name>
          , E. Eban, Wisdom of committees: An arXiv:
          <year>1912</year>
          .02757.
          <article-title>overlooked approach to faster</article-title>
          and more accurate [48]
          <string-name>
            <given-names>J.</given-names>
            <surname>Frankle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Dziugaite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carbin</surname>
          </string-name>
          , Linear models, arXiv preprint arXiv:
          <year>2012</year>
          .
          <year>01988</year>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>mode connectivity and the lottery ticket hypothesis</article-title>
          , [63]
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pereyra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ormandi</surname>
          </string-name>
          , G. E.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>in: International Conference on Machine Learning</source>
          , Dahl,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Large scale distributed neural 2020. network training through online distillation</article-title>
          , arXiv [49]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Shallue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Antognini</surname>
          </string-name>
          , J. Sohl-Dickstein, preprint arXiv:
          <year>1804</year>
          .
          <volume>03235</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Frostig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          , Measuring the efects of [64]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Barron</surname>
          </string-name>
          ,
          <article-title>Continuously diferentiable exponendata parallelism on neural network training, arXiv tial linear units</article-title>
          ,
          <source>arXiv preprint arXiv:1704</source>
          .07483 preprint arXiv:
          <year>1811</year>
          .
          <volume>03600</volume>
          (
          <year>2018</year>
          ).
          <article-title>(</article-title>
          <year>2017</year>
          ). [50]
          <string-name>
            <given-names>C.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Dinneen</surname>
          </string-name>
          , On nondeterminism [65]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <article-title>Gaussian error linand instability in neural network optimization, ear units (gelus</article-title>
          ),
          <source>arXiv preprint arXiv:1606.08415</source>
          <year>2021</year>
          . (
          <year>2016</year>
          ). [51]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Song</surname>
          </string-name>
          , S. Hooker, [66]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>SearchRandomness in neural network training: Char- ing for activation functions, arXiv preprint acterizing the impact of tooling</article-title>
          ,
          <source>arXiv preprint arXiv:1710.05941</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>arXiv:2106.11872</source>
          (
          <year>2021</year>
          ). [67]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          , Improv[52]
          <string-name>
            <given-names>A.</given-names>
            <surname>Achille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rovere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soatto</surname>
          </string-name>
          ,
          <article-title>Critical learning ing deep neural networks using softplus units, in: periods in deep neural networks</article-title>
          ,
          <source>arXiv preprint IJCNN</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <source>arXiv:1711.08856</source>
          (
          <year>2017</year>
          ). [68]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cavallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sviridenko</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. A.</surname>
          </string-name>
          [53]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , D. Cheng, L.
          <string-name>
            <surname>Hong</surname>
          </string-name>
          , E. Chi, Wilkens,
          <article-title>Sponsored search auctions with rich ads, C. Cui, Beyond point estimate: Inferring ensem</article-title>
          -
          <source>CoRR abs/1701</source>
          .05948 (
          <year>2017</year>
          ). URL: http://arxiv.org/ abs/1701.05948. arXiv:
          <volume>1701</volume>
          .
          <fpage>05948</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>