<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Eficient and Efective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tian Tan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Huertas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Buyer Risk Prevention - ML, WW Customer Trust</institution>
          ,
          <addr-line>Amazon, Seattle, WA 98109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Gradient boosting decision trees (GBDTs) are widely applied on tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making and for avoiding costly mistakes to ensure an interpretable and safe deployment of tree-based models. Recently, Bayesian ensemble of GBDT models is used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines GB with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB gets trapped easily on a particular mode of the Bayesian posterior, just like other forms of SG-MCMCs. Therefore, a single SGLB model can often fail to produce uncertainty estimates of high-fidelity. To address this problem, we present Cyclical SGLB (cSGLB) which incorporates a Cyclical Gradient schedule in the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore high multimodal posterior distributions. As a result, cSGLB can eficiently quantify uncertainty in GB with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the eficiency and efectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;uncertainty quantification</kwd>
        <kwd>gradient boosting decision trees</kwd>
        <kwd>Bayesian inference</kwd>
        <kwd>out-of-domain (OOD) detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        eficiently on GBDT predictions can therefore not only
improve model interpretability in production but also
With the rapid growth of data and computing power, ensure a safer deployment of ML systems, especially for
machine learning (ML) has been gaining a lot of new high-risk applications.
applications in areas not imagined before. The more Uncertainty quantification (UQ) has been widely
studubiquitous ML systems become, it is inevitable to see ied for neural networks under the Bayesian framework
applications in very sensitive and high-risk fields. This [8], however, it is relatively under-explored for tree-based
expands to a numerous areas like criminal recidivism models. Although calibrated probability estimation trees
[1], medical follow-ups [2] and autonomous-systems [3]. [9, 10] can be used for UQ, they have not been studied
While these systems might be very broad, they share a from a Bayesian perspective. Recently, Bayesian
ensemcommon need, and that is to have a certain degree of con- ble methods were extended to measure uncertainty in
ifdence on ML predictions. A proven successful way to GBDTs by leveraging a new algorithm called stochastic
build confidence in critical systems is uncertainty estima- gradient Langevin boosting (SGLB) [11]. Specifically, two
tion. Research has shown that humans are more likely to SGLB-based approaches were introduced for UQ [12]: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
agree with a system if they get access to the correspond- SGLB ensemble, which trains multiple SGLB models in
ing uncertainty, and this holds true regardless of shape parallel, and (
        <xref ref-type="bibr" rid="ref3">2</xref>
        ) SGLB virtual ensemble, which constructs
and variance as the approach itself is model and task ag- a virtual ensemble using only a single SGLB model where
nostic [4]. Since the most common data type in real-world each member in the ensemble is a "truncated" sub-model
ML applications is tabular [5], our work in this paper fo- [12]. Although both approaches are theoretically sound,
cuses specifically on uncertainty quantification for the there is clearly a trade-of between quality and eficiency
state-of-the-art gradient boosting decision trees (GBDTs) in practice. SGLB (real) ensemble is believed to be
ac[6, 7], which are known to outperform deep learning (DL) curate as it can characterize the Bayesian posterior well
methods on tabular data, both in accuracy and tuning by running independent models in parallel. However,
requirements [5]. Measuring uncertainty efectively and it is almost infeasible to deploy such an ensemble in
real-world production due to its high computational and
The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), maintenance costs. SGLB virtual ensemble greatly
imFebruary 13 - 14, 2023, Washington, D.C., USA proves the eficiency, however, it often gets stuck on a
* Corresponding author. single mode of the Bayesian posterior and can produce
("C. tHiaunetart@asa)m;qaqzzohna.coo@mam(Ta.zToann.c);ocmar(lQoh.uZeh@aoa)mazon.com downgraded uncertainty estimates [13, 14]. To better
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License balance between quality and eficiency and to facilitate
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
perior performance in detecting distributional/domain
shifts in real-world tabular data streams.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Bayesian ML and approximate Bayesian inference
provide a principled representation of uncertainty. One
popular family of approaches to inference in Bayesian ML
are stochastic gradient Markov Chain Monte Carlo
(SGMCMC) methods [17, 18, 19, 20, 13], which are used to
Figure 1: Illustration of proposed cyclical schedule on gradi- efectively sample models (or model parameters) from
ent scales for SGLB algorithm. the Bayesian posterior. Uncertainty then comes naturally
by measuring the "discrepancy" in predictions from the
sampled models which are regarded as posterior samples.
the usage of uncertainty-enabled ML systems, an impor- Recently, stochastic gradient Langevin boosting (SGLB)
tant question remains: how can we make a single SGLB [11] was proposed by combining gradient boosting with
explore efectively diefrent modes of a posterior given a SG-MCMC. As its name suggests, the Markov chain
genlimited computational budget? erated by SGLB obeys a special form of the stochastic</p>
      <p>
        In this paper, we address the question above by com- gradient Langevin dynamics (SGLD) [11, 17], which
imbining SGLB virtual ensemble with advanced sampling plies that SGLB is able to generate samples from the true
techniques from Bayesian DL [14, 13, 15, 16]. Inspired by Bayesian posterior asymptotically. Leveraging this
propthe ideas in [13], we propose to use a scaler (or scaling erty, Malinin et al. [12] proposed to use (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) SGLB
ensemfactor) on gradients that follows a cyclical schedule dur- ble or (
        <xref ref-type="bibr" rid="ref3">2</xref>
        ) SGLB virtual ensemble to measure uncertainty
ing the course of SGLB training. The cyclical schedule is in GBDTs. Essentially, SGLB ensemble corresponds to
illustrated in Figure 1, and consequently, we name the re- running multiple SG-MCMCs in parallel and each chain
sulting algorithm Cyclical SGLB (cSGLB). Similar to [13], (or SGLB) is initialized independently with a diferent
raneach cycle in cSGLB contains two stages: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Exploration: dom seed. Since SGLB allows us to sample from the true
when the scaler is large, we treat this stage as a warm posterior, the ensemble with multiple samples gives a
restart from the previous cycle, enabling the model/sam- high-fidelity approximation to the Bayesian posterior. In
pler to follow the gradients closely and to escape from contrast, SGLB virtual ensemble only trains a single SGLB
the current local mode. (
        <xref ref-type="bibr" rid="ref3">2</xref>
        ) Sampling: when the gradient model and it uses multiple truncated sub-models to form
scaler is small, the scale of injected Gaussian noise in the a (virtual) ensemble. The key idea is essentially extracting
SGLB procedure becomes relatively large, encouraging multiple samples from a single-chain SG-MCMC instead
the sampler to fully characterize one local mode. We of running multiple chains in parallel.
collect one sample (or truncated sub-model) to build the In theory, SGLB or single-chain SG-MCMC converges
virtual ensemble at the end of each cycle. The cyclical asymptotically to the target distribution and should
begradient schedule therefore helps cSGLB efectively ex- have similarly to the multi-chain SGLB ensemble in the
plore diferent modes of a posterior while maintaining the limit, but it can sufer from a bounded estimation error in
same level of eficiency of a virtual ensemble. Moreover, limited time [21]. Moreover, it is often believed that the
inspired by a recent study [16] showing that "diversified" posterior is highly multimodal in the parametric space
posterior may provide a tighter generalization bound, we of modern ML models [13], since there are potentially
present another simple approach to encourage diversity many diferent sets of parameters that can describe the
in samples obtained from running cSGLB via data boot- training data equally well. The real ensemble can explore
strapping. We name this variant Cyclical Bootstrapped diferent modes of the posterior by running in parallel
SGLB (cbSGLB). independent chains, providing a complete picture of the
      </p>
      <p>We extensively experiment with our proposed algo- distribution as the number of chains increases. However,
rithms and compare the performance against SGLB en- a single-chain SG-MCMC often gets stuck easily on a
semble and the original SGLB virtual ensemble. Partic- single mode of the posterior [13, 14], failing to cover the
ularly, we show that our cyclical gradient schedule can full spectrum of the distribution.
help explore efectively multimodal distributions, cSGLB In this paper, we extend the ideas behind Cyclical
SGis capable of producing uncertainty estimates that are MCMC (cSG-MCMC) in DL [13] to sampling from a
treebetter aligned with SGLB real ensemble, and cSGLB/cbS- based SGLB model, which promotes new mode discovery
GLB outperforms the SGLB baseline with a large margin during training. Diferent from cSG-MCMC that puts
on out-of-domain (OOD) data detection, indicating a su- a cyclical schedule on step size, we propose to use a
cyclical schedule on gradient scale. We also point out and
justify the diference and our design choice in Appendix
over tree structures.
vector  ∈ R . Simply put, (|) defines a distribution
A. In addition, we propose a simple strategy to further
As with other GBDT algorithms, SGLB constructs an
encourage diversity in samples obtained from a single
ensemble of decision trees iteratively. At each iteration
chain by data bootstrapping. At the beginning of each  , we compute unbiased gradient estimates ˆ such that
bootstrapped data consistently during the exploration
vectors , 
(0 ,  ), where 0 ,  denote zero
cycle (see Fig.1), we construct a bootstrapped dataset
that is a random subset of the training, and use that
stage to update the GBDT model. The "bias" induced by
data bootstrapping also amounts to posterior tempering
[14, 15, 22, 23, 13].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <sec id="sec-3-1">
        <title>3.1. General Setup</title>
        <p>′</p>
        <p>∼ 
E[ˆ ] = (  ( Θ^</p>
        <p>(), ))=1 ∈ R using the
current model Θ^  , and sample independently two normal
vector and identity matrix in R , respectively. Then, a
base learner (or tree structure)  is picked by drawing
one sample from (|ˆ +</p>
        <p>learning rate (or step size) and  &gt;</p>
        <p>0 is a parameter often
referred as inverse difusion temperature. Next, we
estimate the parameters  * (at tree leaves) of the sampled
base learner by solving the following optimization:
√︁ 2  ′), where  &gt; 0 is a
Given a set of  training data points sampled
from an unknown distribution  on  ×  , i.e.,
(1, 1), . . . , ( ,  ) ∼ 
function (, ) :  ×  →
denoted as  , and a loss</p>
        <p>R where  denotes the
space of predictions, our goal is to minimize the
empirical loss ℒ( | ) := 
ditive ensemble of decision trees ℋ := {ℎ(,  ) :
In this paper, we only consider ℱ corresponding to ad- lution as  *
 ×</p>
        <p>R →</p>
        <p>R,  ∈ }, where  is an index set and
ℎ</p>
        <p>has parameters  . Decision trees are built by
partitioning recursively the feature space into disjoint regions
(called leaves). Each region is assigned a value that is used
to estimate the response of  in the corresponding feature
subspace. Let’s denote these regions by  ’s, then we
have ℎ(,  ) = ∑︀   1{ ∈  }, where 1{·} denotes
indicator function. Therefore, given the tree structure,
decision tree ℎ is a linear function of its parameters  .</p>
        <p>It is often assumed that the set  is finite</p>
        <p>because the
training data is finite [ 11, 12], e.g., there exists only a
ifnite number of ways to partition the training data.
Owing to the linear dependence of ℎ on   and the finite
els from ℋ as a linear model Θ() = () Θ for some
feature map () :  → R and Θ</p>
        <p>∈ R denotes the
parameters of the entire ensemble [11]. Hence, in the
subsequent discussion, we will simply denote the parameters
of the GBDT model obtained at iteration  as Θ ˆ  , and
additionally define a linear mapping
that converts   to predictions (ℎ(,  ))=1.</p>
        <p>: R → R
3.2. SGLB
assumption of , we can represent any ensemble of mod- learning) and  ′ (used for tree sampling), and the model
SGLB combines stochastic gradient boosting (SGB) [7]
where Γ = Γ
 &gt; 0 is a regularization matrix which
with stochastic gradient Langevin dynamics (SGLD) [17].
depends on a particular tree construction algorithm or the
Following notations used in the original paper [11], we
choice of tuple ℬ := {ℋ, (|)} [11]. Note that since
characterize the SGB procedure by a tuple ℬ := {ℋ, }, the GBDT model is linear and can be fully determined
where ℋ again is the set of base learners and (|) is a
distribution over indices  ∈  conditioned on a gradient
ℒ(Θ | ) interchangeably.
by parameters Θ , we simply use notation ℒ( | ) and
minimize ||  ||22</p>
        <p>..</p>
        <p>∈R
∈ argmin || − ˆ +
√︂ 2

 −   ||2,
2</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
which returns the minimum norm solution that fits
best to the perturbed "noisy" version of negative
gradients. The optimization above has a closed form
so= − Φ  (ˆ +
√︁ 2  ), where Φ  :=
        </p>
        <p>(  )</p>
        <p>+ , + denotes pseudo-inverse. For
decision trees, Φ   essentially corresponds to averaging the
gradient estimates  in each leaf node of the tree. Lastly,</p>
        <sec id="sec-3-1-1">
          <title>SGLB algorithm updates the ensemble model by</title>
          <p>
            Θ^  +1 (· ) := (1 −  )Θ^  (· ) + ℎ  (· ,  * ),
(
            <xref ref-type="bibr" rid="ref3">2</xref>
            )
where  is a regularization parameter that "shrinks" the
currently built model when updating the ensemble. At
a high-level, SGLB is a stochastic GB algorithm with
Gaussian noise injected into gradient estimates, which
encourages the algorithm to explore a larger area in the
functional space to find a better fit for the given data.
The independence between noise  (used for parameter
shrinking by  in Eqn.(
            <xref ref-type="bibr" rid="ref3">2</xref>
            ) are technical details needed for
establishing theoretical results and rigorous analysis of
SGLB [11]. All the procedures of SGLB are also present
in our proposed cSGLB in Algo. 1 (with our additional
modifications highlighted in blue).
          </p>
          <p>One can show that the parameters of SGLB Θ ˆ  at each
iteration form a Markov chain that weakly converges to
the following stationary distribution:</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Posterior Sampling</title>
        <p>We consider here a standard Bayesian learning
framework [8] that treats parameters Θ</p>
        <p>as random variables
and places a prior (Θ)</p>
        <p>over Θ . In addition, we consider
the GBDT model Θ as a probabilistic model and
explicand 
− 
with parameters Θ . The entropy of the predictive posterior estimates total
unpredictions can be made by taking an average over
the ensemble, often known as predictive posterior or
Bayesian model average (BMA):
(|,  ) = E(Θ| )[ (|; Θ)] ≈ 1 ∑︀</p>
        <p>=1  (|; Θ ()).
certainty (TU) in predictions, which can be further
decomposed into two distinct types of uncertainty: knowledge
uncertainty and data uncertainty.1 (a) Knowledge
uncertainty (KU) arises due to the lack of knowledge about the
data generation process (or the unknown distribution
). KU is expected to be large in regions (in the feature
space) where we do not have suficient training data. (b)
Data uncertainty (DU) arises due to the inherent
stochasticity within the data generation process, and it is high
in regions with class overlaps. In applications like active
learning [25], reinforcement learning (RL) [26], and OOD
DU (or TU), and the following equation can be used in
practice to compute and connect them via mutual
infor=1 log (|, Θ) . Then, the limiting distribu- detection, it is desirable to measure KU separately from
mation [27]:</p>
        <p>I(; Θ|,  )
Knowledge U⏞ncertainty
⏟
∝ exp (︁ log ( |Θ) −
∝ ( |Θ) (Θ) ,
1
2 ||
ΘΓ</p>
        <p>2
||2
︁)</p>
        <p>(4)
der Gaussian prior (Θ) =
which is proportional to the true posterior (Θ | )
un (0, Γ)</p>
        <p>[11].</p>
        <p>Now, consider a Bayesian ensemble of probabilistic
models { (|; Θ ())}=1 where each model is trained
independently by running SGLB. Since each Θ () is
guaranteed to be sampled from (Θ | ) by Eqn.(4), the
ensemble {Θ ()

}=1 with  samples yields a "discrete"
approximation to the posterior (Θ | ). This is
exactly the idea behind SGLB ensemble [12], which learns
 independent SGLB models in parallel with diferent
random seeds. Although the approximation improves
as  increases, the computational cost also increases
linearly with . To alleviate the computational
burden, SGLB virtual ensemble [12] builds a Bayesian
virtual ensemble by sampling multiple times from a
singlechain SGLB model. Because samples from the same
chain are highly correlated, SGLB virtual ensemble
proations. More specifically, the parameters are sampled
by {Θ () =⌊ 2 ⌋ = {Θ ˆ (+⌊ 2 ⌋),  = 1, . . . , ⌊ 2 ⌋},</p>
        <p>}=1
i.e., appending one member to the ensemble every 
iterations while constructing one SGLB model using 
iterations of gradient boosting. Notice that no sampling
is performed during the first half of iterations (  &lt;
since Eqn.(4) holds only asymptotically. For large  and
, the virtual ensemble should behave similarly to the</p>
        <p>/2)</p>
        <sec id="sec-3-2-1">
          <title>SGLB real ensemble in the limit theoretically.</title>
          <p>poses to sample one member Θ () every  &gt; 1 iter- 4.1. Promoting Mode Discovery via</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Uncertainty Estimation</title>
        <sec id="sec-3-3-1">
          <title>Once</title>
          <p>the</p>
          <p>
            Bayesian
(virtual)
ensemble
{ (|; Θ ())}=1, Θ ()
∼
(Θ | ) is learned,
tasks.
(
            <xref ref-type="bibr" rid="ref5">5</xref>
            )
(6)
          </p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Cyclical Gradient Scheduling</title>
        <p>Instead of building a cumbersome true ensemble of SGLB
models, the virtual ensemble of SGLB greatly improves
the eficiency by training only a single model.
However, similar to other types of SG-MCMC in Bayesian DL
[13, 14, 15], single-chain SGLB gets trapped easily on a
particular single mode of the posterior. To eficiently
explore diferent modes of the multimodal posterior and
effectively measure uncertainty in GBDT predictions with
1KU is also named epistemic uncertainty and DU is also called
aleatoric uncertainty.
2See paper [12] for equations computing KU and DU in regression
⏞

= H((|,  )) − E(Θ| )[H( (|; Θ))]
⏟ Total Uncertainty
⏞
⏟</p>
        <p>Expected Data Uncertainty

 =1
≈ H
︁( 1 ∑︁  (|; Θ()))︁ −
 =1
1 ∑︁ H(︁  (|; Θ()))︁ ,
where I(; ) denotes the mutual information between
random variables A and B, and H(· ) denotes entropy. The
diference between TU and DU measures the
disagreement among members in the ensemble and estimates the
knowledge uncertainty.2</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Cyclical Stochastic Gradient</title>
    </sec>
    <sec id="sec-5">
      <title>Langevin Boosting (cSGLB)</title>
      <p>a single chain, we propose a simple remedy that places a
step size within the context of Bayesian DL. In fact,
cScyclical cosine schedule on gradient scale during training,
GLB extends the idea behind cSG-MCMC to tree-based
as illustrated in Fig.1. Specifically, the scaling factor at
iteration  is defined as:
GBDT models. We also summarize key diferences
between our design of cSGLB and cSG-MCMC in Appendix
(7)
4.2. Enhancing Sample Diversity via
noises from mini-batch training [15, 14]. Tree-based mod- the Bayesian posterior. By increasing the temperature  ,
︁(  max [︁ cos(</p>
      <p>2
mod (,  )

)+1]︁,  min ,
︁)</p>
      <p>A.
  = max
where  max ≥</p>
      <p>1 is the maximum of the scaler or the
(|  ˆ +
initial value of  0,  is the user-defined cycle length, and
 min defines the minimum of the scaler, e.g.,  min =
1 or 0.5, since decaying the gradients to arbitrarily small
could be harmful for performance. Putting together, this
amounts to sampling the tree structure and learning the
tree leaf parameters with the (re)scaled gradients:  ∼
√︁ 2  ′) and  * = − Φ  (  ˆ +

√︁ 2  ).</p>
      <p />
      <p>
        Similar to Cyclical SG-MCMC [13], we define two
stages within each cycle: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) Exploration when the
com
      </p>
      <p>
        mod (, ) is smaller
pleted portion of a cycle ( ) =
than a given threshold: ( ) &lt;  , and (
        <xref ref-type="bibr" rid="ref3">2</xref>
        ) Sampling
when ( )
      </p>
      <p>
        ≥  , and  ∈ (
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ) balances the portion
between exploration and sampling. We obtain one sample
from the chain at the end of each cycle, i.e., the virtual
ensemble is built by {Θ () =⌊  ⌋ = {
}=1
      </p>
      <p>Θ ˆ − 1,  =
1, . . . , ⌊  ⌋}. Large gradients at the beginning of a cycle
provide enough perturbation and encourage the model
to escape from the current mode, while decreasing the
gradient scale inside one cycle makes the sampler better
characterize the density of the local mode. Moreover,
many prior works in Bayesian NNs proposed to apply a
certain form of preconditioning to compensate sampling
els can usually digest the full-batch (full dataset  ) per
iteration by leveraging modern multi-core processors
and multi-threading. Therefore, we directly use the
fullbatch GB in all sampling stages, while leaving the option
of random data subsampling in exploration stages to the
users if training time is a concern.</p>
      <p>Combining cyclical gradient scaling with SGLB, we
expect that our new Cyclical SGLB (cSGLB) algorithm
could inherit most (if not all) theoretical properties of the
original SGLB algorithm. Conceptually, with a proper
choice of  max,  min and cycle length , the sample
obtained at 
=  −</p>
      <p>1 from the Markov chain Θ^ 
generated by Algo. 1 (w/o bootstrap) can be approximately
seen as a random draw from the limiting distribution
with small bounded errors. Also, each next cycle can be
viewed as a warm restart from its previous cycle, and
thus no errors shall be accumulated into the subsequent
cycles (at sampling time 
=  −</p>
      <p>1). We left
rigorous analysis and proofs of our propositions for future
work. Empirically, we show in our experiments that the
cyclical gradient scaling achieves similar efects in
exploring a multimodal distribution when compared with
cSG-MCMC which places a similar cyclical schedule on</p>
      <sec id="sec-5-1">
        <title>Bootstrapping</title>
        <p>
          Recent work [16] provided a compelling analysis that the
Bayesian posterior is not optimal under model
misspeciifcation 3, where the performance of the true posterior is
dominated by an alternative non-Bayesian posterior that
explicitly encourages diversity among ensemble
member predictions. Inspired by these results, we propose a
simple strategy that promotes diversity among samples
obtained from cSGLB by data bootstrapping. At the
beginning of each cycle, we sample randomly a Bernoulli mask
of size  , i.e.,  := { [] ∼
{0, 1} , and  ∈ (
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ) defines the percentage of
data being used. In the following exploration stage, we
mask out gradients ˆ by taking an element-wise
product with  , i.e., ˆ ⊙  . The mask  and the mask-out
operation are used consistently throughout the
exploration stage ( ) &lt;  , and  only gets updated until
the end of the cycle. This design amounts to learning
with a bootstrapped subsample of the data in each
cycle. Since the model would observe consistently less
data than the original  , it also amounts to posterior
()}=1 ∈
1/ with some temperature
tempering (( |Θ) (Θ))
 &gt; 1, resulting in a warm posterior that is softer than
we expect to see increased density on the paths/corridors
connecting diferent modes of the posterior [ 28, 29],
further facilitating the sampler to escape from the current
local mode. By using a relatively large  ∈ (0.8, 1), the
tempering efects would carry over into the sampling
stage. Therefore, the bootstrapping mechanism helps
improve the sample diversity from cSGLB, and we name
this variant Cyclical Bootstrapped SGLB (cbSGLB).
        </p>
        <p>Lastly, we summarize our proposed cSGLB (plus
bootstrap option) in Algo. 1 and highlight our modifications
on top of SGLB in blue.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experiments</title>
      <sec id="sec-6-1">
        <title>5.1. Experiments on Synthetic Data</title>
        <p>
          We validate and qualitatively evaluate the proposed
gradient scheduling and our cSGLB algorithm on two synthetic
problems: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) a synthetic multimodal dataset in [13], and
3The function class in-use does not contain the unknown
groundtruth function.
        </p>
        <sec id="sec-6-1-1">
          <title>Algorithm 1: Cyclical (Bootstrapped) SGLB</title>
          <p>Input: dataset  , learning rate  &gt; 0, inverse
temperature  &gt;
0, regularization  &gt;
0,
number of iterations  &gt; 0, cycle length  &gt; 1,
scaler limits  max,  min &gt; 0, stage threshold
 &gt;</p>
          <p>0, mask probability  &gt; 0, boolean
indicator 
vector
for  in [0, 1, . . . ,  − 1] do</p>
          <p>if  then
Initialize Θ^ 0 (· ) = 0,  = 1 ∈ R as all-ones
if
end
if
end</p>
          <p>mod (, ) = 0 then
Sample  ∈ R with
 [] ∼ ()
mod (, )</p>
          <p>Set  = 1</p>
          <p>≥  then
end
Compute gradient scaler:   =
max(  max [cos( 
2</p>
          <p>mod (, ) ) + 1],  min)
Estimate gradients ˆ using Θ^  (· ) and  :
ˆ = (  (Θ^</p>
          <p>(), ))=1 ∈ R
Sample noise ,  ′ ∼</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>Sample tree structure:</title>
          <p>∼
︀( ⃒⃒   (ˆ ⊙  ) +
(0 ,  )
√︁ 2  ′)︀</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>Estimate leaf/parameter values:</title>
          <p>* = − Φ  (︀   (ˆ ⊙  ) +</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>Update GBDT model:</title>
          <p>√︁ 2  )︀

Θ^  +1 (· ) = (1 −  )Θ^  (· ) + ℎ  (· ,  * )
end</p>
          <p>
            Return: Θ^  (· )
(
            <xref ref-type="bibr" rid="ref3">2</xref>
            ) a multi-class Spiral dataset in [12]. Due to limited
space, we include experimental details in Appendix B.
Synthetic Multimodal Data
We first demonstrate
the ability of cyclical gradient scaling for sampling from a
Specifically, we compare (i) the original SG-MCMC with
SGLD (denoted as SGLD) and two SGLD variants: (ii)
SGLD with Cyclical schedule on Learning Rate (denoted
as clr-SGLD) [13] and (iii) SGLD with Cyclical schedule
on Gradient scale (denoted as cg-SGLD (ours)). We
reproduced the results for SGLD and clr-SGLD in paper
a fair comparison, each chain runs for 50 iterations
and both clr-SGLD and cg-SGLD have 30 cycles. Fig.2
shows the estimated density using diferent sampling
strategies. SGLD gets trapped in local modes depending
on the random initial position, and increasing the noise
scale does not solve the problem. In contrast, clr-SGLD
and cg-SGLD can explore and locate roughly 7 −
ent modes of the distribution, showing that our cg-SGLD
          </p>
        </sec>
        <sec id="sec-6-1-5">
          <title>8 difer</title>
          <p>can achieve the state-of-the-art performance in exploring
multimodal distributions. Moreover, cg-SGLD has
benefits in implementation over clr-SGLD when combined
with SGLB. The SGLB algorithm was made available in
the CatBoost library [30], which only supports a fixed lr.
All our proposed enhancements can be implemented with
a "user-defined loss function" available in CatBoost
without touching the source code, making it straightforward
to reproduce our algorithms.</p>
          <p>Multi-Class Spiral Data
After validating the eficacy
of cyclical gradient scheduling on sampling from
multimodal distributions, we are now ready to experiment
with cSGLB. Specifically, we compare the following
algorithms on a 3-class classification task called "Spiral" in
[12]: (i) SGLB ensemble, where we denote by ens with
 models, (ii) SGLB virtual ensemble, simply denoted
by SGLB, and (iii) cSGLB virtual ensemble, denoted by
cSGLB (ours). We again reproduced the results in [12]
with code released by the authors, and Fig.3 shows the
estimated KU on Spiral test set. As noted in [12], we see
that knowledge uncertainty due to decision-boundary
"jitter" exists in both ens20 and cSGLB, and the "jitter"
afects cSGLB more as the estimated KU is "noisy" at the
decision boundary. Nevertheless, cSGLB (with only a
single model) is significantly more eficient than
ens20 and
is able to greatly improve upon SGLB in capturing high</p>
        </sec>
        <sec id="sec-6-1-6">
          <title>KU in regions with no training data.</title>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Experiments on Real-World Weather</title>
      </sec>
      <sec id="sec-6-3">
        <title>Prediction Data</title>
        <p>
          Lastly, we evaluate our proposed methods on the public
Shifts Weather Prediction dataset [31]. We select the
classification task where the ML model is asked to predict
the precipitation class at a particular geolocation and
from weather station measurements and forecast
models. The full dataset is partitioned in a canonical fashion
and contains in-domain (ID) training, development and
evaluation datasets as well as out-of-domain (OOD)
development and evaluation datasets. Importantly, the ID
data and the OOD data are separated in time and
conMild; OOD: Snow, Polar), making the Shifts dataset an
ideal testbed for gauging the robustness of ML model and
the quality of uncertainty estimation. To further
facilitate our experimentation, we conducted the following
multimodal distribution on a 2D mixture of 25 Gaussians. timestamp, given heterogeneous tabular features derived
[13] with code released by the authors, and built our cg- sist of non-overlapping climate types (ID: Tropical, Dry,
(a) Target
(b) SGLD
(c) N-SGLD
(d) clr-SGLD
(e) cg-SGLD (ours)
[31], and the results are summarized in Table 1. For
predictive performance, we report the classification accuracy
and macro F1 using BMA on both the ID and the OOD
evaluation datasets. We can see the following efects: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          )
Virtual ensembles with a longer chain slightly
outperform the real ensembles on the ID data. (
          <xref ref-type="bibr" rid="ref3">2</xref>
          ) Our proposed
(a) Spiral (b) ens20 coSfGmLeBthaonddscobnSGthLeBOpOerDfordmatasl.igHhotlwyewveorr,sethtihsaunstuhaellryesist
not a concern in practice since the model is not trained
with data from OOD domains and would not be used to
solve the OOD prediction tasks in a practical scenario.
        </p>
        <p>
          As long as the domain shifts can be reliably detected (via
uncertainty), proactive decisions can be made to avoid
costly mistakes due to model errors. (
          <xref ref-type="bibr" rid="ref4">3</xref>
          ) Our proposed
(c) SGLB (d) cSGLB data bootstrapping mechanism is capable of improving
the performance on the OOD data (cbSGLB &gt; cSGLB). In
Figure 3: Spiral dataset and estimated knowledge uncertain- addition, we include the F1-AUC metric (on the combined
ties. Each diferent color in (a) represents a diferent class. ID&amp;OOD evaluation sets) introduced in [31] to jointly
assess the predictive power and uncertainty quality. The
F1-AUC can be increased by either having a stronger
predata preprocessing: (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) feature selection to keep only dictive model or by improving the correlation between
the top 40 features by importance (out of 123 available uncertainty and error. Consistent with the findings in
features), where the feature importance is determined by [31], total uncertainty (TU) correlates more with errors
a CatBoost classifier with 1 trees trained on the entire than knowledge uncertainty (KU) as shown by the
F1training set; (
          <xref ref-type="bibr" rid="ref3">2</xref>
          ) dropping minority classes to keep only AUC scores. More specifically, we see that the F1-AUC
the major 3 precipitation classes, i.e., class 0, 10, and 20 is quite similar across the board when measured by TU,
out of the 9 available classes from the original dataset; although cSGLB/cbSGLB has slightly worse predictive
(
          <xref ref-type="bibr" rid="ref4">3</xref>
          ) random data sampling to keep 200 (medium-sized) power on the OOD segment. When F1-AUC is measured
data in the final training set. Again, the purpose of our by KU, our cSGLB/cbSGLB is capable of producing KU
data preprocessing is for speeding up experimentation, estimates that relate more closely to model errors than
and we believe that the observations and findings in this KU from the SGLB baseline.
study are generalizable to the original full dataset. For At last, we present the OOD detection ROC-AUC
permodel building, 30 independent SGLB models (each of formance on the evaluation data by using KU estimates.
1 trees) were trained and used to construct real ensem- Our cSGLB/cbSGLB outperforms the SGLB baseline with
bles  for  ∈ {3, 5, 10, 30}. SGLB/cSGLB/cbSGLB a large margin on the OOD detection task, and even
virtual ensembles were built by sampling 10 members achieves a comparable performance to the real
ensemfrom a single-chain with 2 trees. Hence, 10 is 5× ble 10, which is 5× more expensive. This highlights
more expensive in computation and memory than a vir- that our cSGLB/cbSGLB can produce high-fidelity KU
tual ensemble. Additional details regarding our data and estimates to detect domain (or distributional) shifts with
models are included in Appendix C. a single model, and that our proposed cyclical gradient
        </p>
        <p>We compare various methods on their predictive per- scheduling is efective in exploring diferent modes of a
formance and on uncertainty quantification following posterior. In real-world industrial applications, detecting
OOD data or domain shifts in an eficient way is often
crucial to ensure a safe deployment and operation of ML
systems. Observing consistently high uncertainty
(especially KU) from model predictions indicates that the
patterns of new incoming data have deviated from the
training. This often provides a strong signal for model
refresh, ensuring that the ML system can be updated in
time to avoid errors and operate safely in its "comfort
zone" (with relatively low predictive uncertainty).
arXiv preprint arXiv:1902.03932 (2019). PMLR, 2018, pp. 1309–1318.
[14] F. Wenzel, K. Roth, B. S. Veeling, J. Świątkowski, L. Tran, [30] A. V. Dorogush, V. Ershov, A. Gulin, Catboost:
gradiS. Mandt, J. Snoek, T. Salimans, R. Jenatton, S. Nowozin, ent boosting with categorical features support, arXiv
How good is the bayes posterior in deep neural net- preprint arXiv:1810.11363 (2018).</p>
        <p>works really?, arXiv preprint arXiv:2002.02405 (2020). [31] A. Malinin, N. Band, G. Chesnokov, Y. Gal, M. J.
[15] P. Izmailov, S. Vikram, M. D. Hofman, A. G. G. Wil- Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova,
son, What are bayesian neural network posteriors really I. Provilkov, V. Raina, et al., Shifts: A dataset of real
dislike?, in: International Conference on Machine Learn- tributional shift across multiple large-scale tasks, arXiv
ing, PMLR, 2021, pp. 4629–4640. preprint arXiv:2107.07455 (2021).
[16] A. Masegosa, Learning under model misspecification:</p>
        <p>Applications to variational and ensemble methods,
Advances in Neural Information Processing Systems 33 A. Comparison between cSGLB
(2020) 5479–5491.
[17] M. Welling, Y. W. Teh, Bayesian learning via stochastic and cSG-MCMC
gradient langevin dynamics, in: Proceedings of the 28th
international conference on machine learning (ICML- The proposed Cyclical SGLB algorithm combines SGLB
11), Citeseer, 2011, pp. 681–688. with cSG-MCMC [13] to efectively explore diferent
[18] T. Chen, E. Fox, C. Guestrin, Stochastic gradient hamil- modes of a highly multimodal posterior distribution. In
tonian monte carlo, in: International conference on ma- this section, we summarize some key diferences between
chine learning, PMLR, 2014, pp. 1683–1691. our design and the original cSG-MCMC algorithm.
[19] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel,</p>
        <p>
          H. Neven, Bayesian sampling using stochastic gradient (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) cSG-MCMC is a sampling algorithm designed for
thermostats, Advances in neural information process- Bayesian NNs, while cSGLB is built for GBDT models.
ing systems 27 (2014). In deep learning, full-batch gradient descent is usually
[20] C. Li, C. Chen, D. Carlson, L. Carin, Preconditioned not feasible, and techniques have been developed to
exstochastic gradient langevin dynamics for deep neural plicitly compensate mini-batch noises, such as
precondinetworks, in: Thirtieth AAAI Conference on Artificial tioning [14]. Some also suggested applying an additional
Intelligence, 2016. correction step called Metropolis-Hastings [15].
Tree[21] Y. W. Teh, A. H. Thiery, S. J. Vollmer, Consistency and based GB models can easily scale up to large industrial
lfuctuations for stochastic gradient langevin dynamics, datasets and digest the full training set at each iteration.
The Journal of Machine Learning Research 17 (2016) Therefore, our cSGLB uses full-batch GB in the sampling
193–225.
[22] A. G. Wilson, P. Izmailov, Bayesian deep learning and a stage of each cycle to ensure high-quality samples being
probabilistic perspective of generalization, Advances in generated. (
          <xref ref-type="bibr" rid="ref3">2</xref>
          ) cSGLB puts a cyclical schedule on
graneural information processing systems 33 (2020) 4697– dient scale while cSG-MCMC puts a schedule on step
4708. size. In addition, the original cSG-MCMC completely
re[23] A. Ashukha, A. Lyzhov, D. Molchanov, D. Vetrov, Pitfalls moved the injected Gaussian noises in the exploration
of in-domain uncertainty estimation and ensembling in stage, and cSG-MCMC reduced to regular stochastic
gradeep learning, arXiv preprint arXiv:2002.06470 (2020). dient descent (SGD) during the period of exploration.
[24] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, Although the authors claimed that this amounts to
posA. Schuler, Ngboost: Natural gradient boosting for prob- terior temping which is commonly used in DL domain,
abilistic prediction, in: International Conference on Ma- the implementation of cSG-MCMC algorithm does not
[25] cBh. iSneettLleesa,rnAicntgiv,ePMleaLrRn,in20g2l0it,eprpa.tu2r6e90su–r2v7e0y0.(2009). follow closely/strictly the dynamics of SGLD during the
[26] T. Tan, Z. Xiong, V. R. Dwaracherla, Parameterized in- exploration stage. In contract, we keep the injected noise
dexed value function for eficient exploration in rein- term unchanged during the course of learning. Our
deforcement learning, in: Proceedings of the AAAI Con- sign achieved similar efects compared with the step-size
ference on Artificial Intelligence, volume 34, 2020, pp. scheduling on a synthetic experiment, and we also
en5948–5955. sure that cSGLB follows the dynamics of SGLD (more or
[27] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, less) at every iteration step. (
          <xref ref-type="bibr" rid="ref4">3</xref>
          ) Lastly, the gradient
scalS. Udluft, Decomposition of uncertainty for active learn- ing (instead of step-size scheduling) has implementation
ing and reliable reinforcement learning in stochastic sys- benefits. The SGLB algorithm is made available in the
tems, stat 1050 (2017) 11. CatBoost library [30], which only supports a constant
[28] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, step size (or learning rate). Our proposed cyclical
gradiA. G. Wilson, Loss surfaces, mode connectivity, and fast
ensembling of dnns, Advances in neural information ent scaling (and data bootstrapping) can be implemented
processing systems 31 (2018). easily with the "user-defined loss function" available in
[29] F. Draxler, K. Veschgini, M. Salmhofer, F. Hamprecht, the CatBoost package, without modifying a single line of
Essentially no barriers in neural network energy land- the source code.
scape, in: International conference on machine learning,
eval-ID
eval-OOD
200,000
0
33.97
35.07
25.13
26.58
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>B. Synthetic Data</title>
      <sec id="sec-7-1">
        <title>B.1. Synthetic Multimodal Distribution</title>
        <p>The ground truth density of the distribution is
=1
25
 () = ∑︁ 1
25  (| , Σ) ,
it a virtual ensemble of 20 members. For cSGLB
virtual ensemble, we set  = 0.05, cycle length  = 200,
 max = 10,  min = 1, making it a virtual ensemble of
2000/200 = 10 members. For cbSGLB with
bootstrapping, we additionally set exploration proportion  = 0.9
and mask probability  = 0.66.</p>
        <p>C. Real-World Shifts Data
where  = {− 4, − 2, 0, 2, 4} × {− 4, − 2, 0, 2, 4} and</p>
        <p>︂[ 0.03 0 ]︂ Data Summary A detailed summary of our final
parΣ = 0 0.03 . We used the code released by the au- titioning of the Weather Prediction dataset is included in
thors [13] to generate our results. Specifically, the SGLD Table 2.
was trained with a decaying lr   = 0.05( + 1)− 0.55,
and clr-SGLD was learned with a cyclical lr schedule Experimental Details We used default parameter
setwith initial value  0 = 0.09 and exploration proportion tings for SGLB models as suggested in the original paper
 = 0.25. For our cg-SGLD, we fixed lr  = 0.01 and [12] for uncertainty quantification except that we set
Gaussian noise scale as 0.4, and set  max = 10,  min = the subsample rate to 0.8 for stochastic gradient
boost1. The "noisy" version of SGLD (or NoisySGLD/N-SGLD) ing. The real SGLB ensemble consists of (up to) 30 SGLB
was trained with a fixed lr  = 0.02 and noise scale 5.0 models trained with diferent seeds, each of 1K trees. In
(roughly 10× larger than the noise scale used in the other order to get more samples from a single chain, the
virmethods). Each chain was trained for 50 iterations and tual ensembles of SGLB and our cSGLB/cbSGLB were
both clr-SGLD and cg-SGLD had 30 cycles. The results learned with a single model of 2K trees. We set learning
and findings are robust to random seeds, and similar re- rate for all models to  = 0.05, and tree depth to 6. For
sults were observed with diferent seeds. We refer the SGLB virtual ensemble, each 100th model from interval
interested readers to the original paper [13] for results of [1000, 2000] was added to the ensemble, making it a
virSGLD and clr-SGLD in parallel (or multi-chain) settings. tual ensemble of 10 members. cSGLB and cbSGLB shared
the same parameters with the SGLB counterpart. In
adB.2. Synthetic Spiral Dataset dition, for cSGLB/cbSGLB virtual ensemble, we set cycle
length  = 200,  max = 10.0,  min = 1.0, making it a
All experiments were conducted using CatBoost [30], one virtual ensemble of 2000/200 = 10 members. For
simof the-state-of-the-art libraries for GBDTs. The ensemble plicity, cSGLB used full-batch gradient boosting at each
of SGLB (ens20) contains 20 independent (with diferent iteration step. In contrast, for cbSGLB with
bootstraprandom seeds) models with 1K trees each. The learning ping, we set exploration proportion  = 0.8, i.e., 80% of
rate is  = 0.1, tree depth is 6, and random_strength = a cycle was treated as exploration, and set mask
prob100 and border_count = 128. The SGLB virtual ensem- ability  = 0.6 in the exploration stage. For model
ble and cSGLB virtual ensemble are trained with the and parameter selection, we only used the in-domain
same parameters except that we increase the number (ID) development set and did not use the out-of-domain
of trees to 2K and lower the lr for cSGLB to  = 0.05. (OOD) development set. Although this may potentially
Thus, the virtual ensemble is 10× more eficient in com- lower our reported performance on the OOD evaluation
putation and memory than the actual SGLB ensemble. set, we believe that it reflects better a real-world learning
For SGLB virtual ensemble, each 50th model from in- scenario where the shifted data is often unobserved and
terval [1000, 2000] is added to the ensemble, making unavailable at training time.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dressel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Farid</surname>
          </string-name>
          ,
          <article-title>The accuracy, fairness, and limits of predicting recidivism</article-title>
          ,
          <source>Science Advances</source>
          <volume>4</volume>
          (
          <year>2018</year>
          )
          <article-title>eaao5580</article-title>
          . URL: https://www.science.org/doi/abs/10.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          1126/sciadv.aao5580. doi:
          <volume>10</volume>
          .1126/sciadv.aao5580.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brooks</surname>
          </string-name>
          , E. Brynjolfsson,
          <string-name>
            <given-names>R.</given-names>
            <surname>Calo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kalyanakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kraus</surname>
          </string-name>
          , et al.,
          <source>Artificial intelligence and life in 2030, One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel</source>
          <volume>52</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Serban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Poll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Visser</surname>
          </string-name>
          ,
          <article-title>Towards using probabilis6. Conclusion tic models to design software systems with inherent uncertainty (</article-title>
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2008</year>
          .03046.
          <article-title>We present cyclical gradient scheduling</article-title>
          and Cyclical doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>2008</year>
          .03046.
          <article-title>SGLB for eficiently and efectively quantifying uncer-</article-title>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>McGrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zytek</surname>
          </string-name>
          , I. Lage, H. Lakkaraju,
          <article-title>tainty in gradient boosting with a single model, and pro- When does uncertainty matter?: Understanding the impose a data bootstrapping scheme to enhance diversity pact of predictive uncertainty in ml assisted decision in posterior samples. We show empirically that our al- making (</article-title>
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2011</year>
          .06167.
          <article-title>gorithms have superior performance over the state</article-title>
          -of- doi:10.48550/ARXIV.
          <year>2011</year>
          .
          <volume>06167</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shwartz-Ziv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Armon</surname>
          </string-name>
          ,
          <article-title>Tabular data: Deep learning the-art SGLB, especially in quantifying knowledge un- is not all you need</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.
          <article-title>certainty and for OOD detection</article-title>
          .
          <volume>03253</volume>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2106.03253.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          , A. G. Wilson,
          <article-title>Cyclical stochastic gradient mcmc for bayesian deep learning</article-title>
          ,
          <volume>40</volume>
          .
          <fpage>92</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>