Efficient and Effective Uncertainty Quantification in
Gradient Boosting via Cyclical Gradient MCMC
Tian Tan1,*, Carlos Huertas1 and Qi Zhao1
1 Buyer Risk Prevention - ML, WW Customer Trust, Amazon, Seattle, WA 98109, USA


Abstract
Gradient boosting decision trees (GBDTs) are widely applied to tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making and for avoiding costly mistakes, ensuring an interpretable and safe deployment of tree-based models. Recently, Bayesian ensembles of GBDT models have been used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines GB with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB gets trapped easily on a particular mode of the Bayesian posterior, just like other forms of SG-MCMC. Therefore, a single SGLB model can often fail to produce uncertainty estimates of high fidelity. To address this problem, we present Cyclical SGLB (cSGLB), which incorporates a cyclical gradient schedule into the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore highly multimodal posterior distributions. As a result, cSGLB can efficiently quantify uncertainty in GB with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the efficiency and effectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.

Keywords
uncertainty quantification, gradient boosting decision trees, Bayesian inference, out-of-domain (OOD) detection



The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), February 13-14, 2023, Washington, D.C., USA
* Corresponding author.
tianta@amazon.com (T. Tan); carlohue@amazon.com (C. Huertas); qqzhao@amazon.com (Q. Zhao)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

With the rapid growth of data and computing power, machine learning (ML) has been gaining many new applications in areas not imagined before. As ML systems become more ubiquitous, it is inevitable to see applications in very sensitive and high-risk fields, spanning numerous areas such as criminal recidivism [1], medical follow-ups [2], and autonomous systems [3]. While these systems may be very diverse, they share a common need: a certain degree of confidence in ML predictions. A proven way to build confidence in critical systems is uncertainty estimation. Research has shown that humans are more likely to agree with a system if they have access to the corresponding uncertainty, and this holds regardless of shape and variance, as the approach itself is model- and task-agnostic [4]. Since the most common data type in real-world ML applications is tabular [5], our work in this paper focuses specifically on uncertainty quantification for the state-of-the-art gradient boosting decision trees (GBDTs) [6, 7], which are known to outperform deep learning (DL) methods on tabular data, both in accuracy and in tuning requirements [5]. Measuring uncertainty effectively and efficiently on GBDT predictions can therefore not only improve model interpretability in production but also ensure a safer deployment of ML systems, especially for high-risk applications.

Uncertainty quantification (UQ) has been widely studied for neural networks under the Bayesian framework [8]; however, it is relatively under-explored for tree-based models. Although calibrated probability estimation trees [9, 10] can be used for UQ, they have not been studied from a Bayesian perspective. Recently, Bayesian ensemble methods were extended to measure uncertainty in GBDTs by leveraging a new algorithm called stochastic gradient Langevin boosting (SGLB) [11]. Specifically, two SGLB-based approaches were introduced for UQ [12]: (1) the SGLB ensemble, which trains multiple SGLB models in parallel, and (2) the SGLB virtual ensemble, which constructs a virtual ensemble using only a single SGLB model, where each member of the ensemble is a "truncated" sub-model [12]. Although both approaches are theoretically sound, there is clearly a trade-off between quality and efficiency in practice. The SGLB (real) ensemble is believed to be accurate, as it can characterize the Bayesian posterior well by running independent models in parallel. However, it is almost infeasible to deploy such an ensemble in real-world production due to its high computational and maintenance costs. The SGLB virtual ensemble greatly improves efficiency; however, it often gets stuck on a single mode of the Bayesian posterior and can produce degraded uncertainty estimates [13, 14]. To better balance quality and efficiency, and to facilitate
the usage of uncertainty-enabled ML systems, an important question remains: how can we make a single SGLB chain explore different modes of a posterior effectively, given a limited computational budget?

In this paper, we address the question above by combining the SGLB virtual ensemble with advanced sampling techniques from Bayesian DL [14, 13, 15, 16]. Inspired by the ideas in [13], we propose to use a scaler (or scaling factor) on gradients that follows a cyclical schedule during the course of SGLB training. The cyclical schedule is illustrated in Figure 1, and consequently, we name the resulting algorithm Cyclical SGLB (cSGLB). Similar to [13], each cycle in cSGLB contains two stages: (1) Exploration: when the scaler is large, we treat this stage as a warm restart from the previous cycle, enabling the model/sampler to follow the gradients closely and to escape from the current local mode. (2) Sampling: when the gradient scaler is small, the scale of the injected Gaussian noise in the SGLB procedure becomes relatively large, encouraging the sampler to fully characterize one local mode. We collect one sample (or truncated sub-model) to build the virtual ensemble at the end of each cycle. The cyclical gradient schedule therefore helps cSGLB effectively explore different modes of a posterior while maintaining the same level of efficiency as a virtual ensemble. Moreover, inspired by a recent study [16] showing that a "diversified" posterior may provide a tighter generalization bound, we present another simple approach to encourage diversity in samples obtained from running cSGLB via data bootstrapping. We name this variant Cyclical Bootstrapped SGLB (cbSGLB).

We extensively experiment with our proposed algorithms and compare their performance against the SGLB ensemble and the original SGLB virtual ensemble. In particular, we show that our cyclical gradient schedule can help explore multimodal distributions effectively, that cSGLB is capable of producing uncertainty estimates that are better aligned with the SGLB real ensemble, and that cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on out-of-domain (OOD) data detection, indicating a superior performance in detecting distributional/domain shifts in real-world tabular data streams.

Figure 1: Illustration of the proposed cyclical schedule on gradient scales for the SGLB algorithm.


2. Related Work

Bayesian ML and approximate Bayesian inference provide a principled representation of uncertainty. One popular family of approaches to inference in Bayesian ML are stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods [17, 18, 19, 20, 13], which are used to effectively sample models (or model parameters) from the Bayesian posterior. Uncertainty then comes naturally by measuring the "discrepancy" in predictions from the sampled models, which are regarded as posterior samples. Recently, stochastic gradient Langevin boosting (SGLB) [11] was proposed by combining gradient boosting with SG-MCMC. As its name suggests, the Markov chain generated by SGLB obeys a special form of stochastic gradient Langevin dynamics (SGLD) [11, 17], which implies that SGLB is able to generate samples from the true Bayesian posterior asymptotically. Leveraging this property, Malinin et al. [12] proposed to use (1) an SGLB ensemble or (2) an SGLB virtual ensemble to measure uncertainty in GBDTs. Essentially, the SGLB ensemble corresponds to running multiple SG-MCMC chains in parallel, where each chain (or SGLB model) is initialized independently with a different random seed. Since SGLB allows us to sample from the true posterior, the ensemble with multiple samples gives a high-fidelity approximation to the Bayesian posterior. In contrast, the SGLB virtual ensemble only trains a single SGLB model and uses multiple truncated sub-models to form a (virtual) ensemble. The key idea is essentially to extract multiple samples from a single-chain SG-MCMC instead of running multiple chains in parallel.

In theory, SGLB or single-chain SG-MCMC converges asymptotically to the target distribution and should behave similarly to the multi-chain SGLB ensemble in the limit, but it can suffer from a bounded estimation error in limited time [21]. Moreover, it is often believed that the posterior is highly multimodal in the parametric space of modern ML models [13], since there are potentially many different sets of parameters that can describe the training data equally well. The real ensemble can explore different modes of the posterior by running independent chains in parallel, providing a more complete picture of the distribution as the number of chains increases. However, a single-chain SG-MCMC often gets stuck easily on a single mode of the posterior [13, 14], failing to cover the full spectrum of the distribution.

In this paper, we extend the ideas behind Cyclical SG-MCMC (cSG-MCMC) in DL [13] to sampling from a tree-based SGLB model, which promotes new mode discovery during training. Different from cSG-MCMC, which puts a cyclical schedule on the step size, we propose to use a
cyclical schedule on the gradient scale. We also point out and justify this difference and our design choice in Appendix A. In addition, we propose a simple strategy to further encourage diversity in samples obtained from a single chain via data bootstrapping. At the beginning of each cycle (see Fig.1), we construct a bootstrapped dataset that is a random subset of the training data, and use that bootstrapped data consistently during the exploration stage to update the GBDT model. The "bias" induced by data bootstrapping also amounts to posterior tempering [14, 15, 22, 23, 13].


3. Preliminaries

3.1. General Setup

Given a set of N training data points sampled from an unknown distribution 𝒟 on 𝒳 × 𝒴, i.e., (x₁, y₁), ..., (x_N, y_N) ∼ 𝒟, denoted as 𝒟_N, and a loss function L(z, y) : 𝒵 × 𝒴 → R, where 𝒵 denotes the space of predictions, our goal is to minimize the empirical loss ℒ(f|𝒟_N) := (1/N) Σ_{i=1}^N L(f(x_i), y_i) over functions f belonging to some family ℱ ⊂ {f : 𝒳 → 𝒵}. In this paper, we only consider ℱ corresponding to additive ensembles of decision trees ℋ := {h_s(x, θ_s) : 𝒳 × R^{m_s} → R, s ∈ S}, where S is an index set and h_s has parameters θ_s. Decision trees are built by recursively partitioning the feature space into disjoint regions (called leaves). Each region is assigned a value that is used to estimate the response y in the corresponding feature subspace. Denoting these regions by R_j, we have h(x, θ) = Σ_j θ_j 1{x ∈ R_j}, where 1{·} denotes the indicator function. Therefore, given the tree structure, a decision tree h_s is a linear function of its parameters θ_s. It is often assumed that the set S is finite because the training data is finite [11, 12], i.e., there exists only a finite number of ways to partition the training data. Owing to the linear dependence of h_s on θ_s and the finiteness assumption on S, we can represent any ensemble of models from ℋ as a linear model f_Θ(x) = φ(x)ᵀΘ for some feature map φ(x) : 𝒳 → R^m, where Θ ∈ R^m denotes the parameters of the entire ensemble [11]. Hence, in the subsequent discussion, we will simply denote the parameters of the GBDT model obtained at iteration τ as Θ̂_τ, and additionally define a linear mapping H_s : R^{m_s} → R^N that converts θ_s to predictions (h_s(x_i, θ_s))_{i=1}^N.

3.2. SGLB

SGLB combines stochastic gradient boosting (SGB) [7] with stochastic gradient Langevin dynamics (SGLD) [17]. Following the notation used in the original paper [11], we characterize the SGB procedure by a tuple ℬ := {ℋ, p}, where ℋ again is the set of base learners and p(s|g) is a distribution over indices s ∈ S conditioned on a gradient vector g ∈ R^N. Simply put, p(s|g) defines a distribution over tree structures.

As with other GBDT algorithms, SGLB constructs an ensemble of decision trees iteratively. At each iteration τ, we compute unbiased gradient estimates ĝ_τ such that E[ĝ_τ] = (∂/∂f L(f_Θ̂_τ(x_i), y_i))_{i=1}^N ∈ R^N using the current model f_Θ̂_τ, and independently sample two normal vectors ζ, ζ′ ∼ 𝒩(0_N, I_N), where 0_N and I_N denote the zero vector and the identity matrix in R^N, respectively. Then, a base learner (or tree structure) s_τ is picked by drawing one sample from p(s | ĝ_τ + √(2N/(εβ)) ζ′), where ε > 0 is a learning rate (or step size) and β > 0 is a parameter often referred to as the inverse diffusion temperature. Next, we estimate the parameters θ*^{s_τ} (at the tree leaves) of the sampled base learner by solving the following optimization:

    \min \|\theta^{s_\tau}\|_2^2 \quad \text{s.t.} \quad \theta^{s_\tau} \in \mathrm{argmin}_{\theta \in \mathbb{R}^{m_{s_\tau}}} \Big\| -\hat{g}_\tau + \sqrt{\tfrac{2N}{\epsilon\beta}}\,\zeta - H_{s_\tau}\theta \Big\|_2^2,    (1)

which returns the minimum-norm solution that best fits the perturbed "noisy" version of the negative gradients. The optimization above has the closed-form solution θ*^{s_τ} = −Φ_{s_τ}(ĝ_τ + √(2N/(εβ)) ζ), where Φ_{s_τ} := (H_{s_τ}ᵀ H_{s_τ})⁺ H_{s_τ}ᵀ and ⁺ denotes the pseudo-inverse. For decision trees, Φ_{s_τ} g essentially corresponds to averaging the gradient estimates g in each leaf node of the tree. Lastly, the SGLB algorithm updates the ensemble model by

    f_{\hat\Theta_{\tau+1}}(\cdot) := (1 - \gamma\epsilon)\, f_{\hat\Theta_\tau}(\cdot) + \epsilon\, h^{s_\tau}(\cdot, \theta_*^{s_\tau}),    (2)

where γ is a regularization parameter that "shrinks" the currently built model when updating the ensemble. At a high level, SGLB is a stochastic GB algorithm with Gaussian noise injected into the gradient estimates, which encourages the algorithm to explore a larger area of the functional space to find a better fit for the given data. The independence between the noise ζ (used for parameter learning) and ζ′ (used for tree sampling), and the model shrinkage by γ in Eqn.(2), are technical details needed for establishing the theoretical results and rigorous analysis of SGLB [11]. All the procedures of SGLB are also present in our proposed cSGLB in Algo. 1 (with our additional modifications highlighted in blue).

One can show that the SGLB parameters Θ̂_τ at each iteration form a Markov chain that weakly converges to the following stationary distribution:

    p^*_\beta(\Theta) \propto \exp\big(-\beta\,\mathcal{L}(\Theta|\mathcal{D}_N) - \beta\gamma\,\|\Gamma\Theta\|_2^2\big),    (3)

where Γ = Γᵀ > 0 is a regularization matrix which depends on the particular tree construction algorithm, i.e., on the choice of the tuple ℬ := {ℋ, p(s|g)} [11]. Note that since the GBDT model is linear and can be fully determined by the parameters Θ, we use the notations ℒ(f|𝒟_N) and ℒ(Θ|𝒟_N) interchangeably.
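To make the update rule above concrete, the following is a minimal NumPy sketch of one SGLB iteration for a squared-error loss. It is an illustration under our own simplifications, not the CatBoost implementation: the sampled tree structure s_τ is abstracted as a precomputed assignment of training examples to leaves (leaf_ids), and the default hyperparameters are arbitrary.

    import numpy as np

    def sglb_iteration(preds, y, leaf_ids, eps=0.1, beta=1.0, gamma=1e-3, rng=None):
        """One SGLB step (Eqns. (1)-(2)) for the squared-error loss L = 0.5*(f - y)^2,
        with a fixed leaf assignment `leaf_ids` standing in for the sampled structure s_tau."""
        rng = rng or np.random.default_rng(0)
        N = len(y)
        grad = preds - y                                          # dL/df at the current model
        zeta = rng.standard_normal(N)                             # Gaussian noise for leaf-value estimation
        target = -grad + np.sqrt(2.0 * N / (eps * beta)) * zeta   # perturbed negative gradients (Eqn. (1))
        # Phi_{s_tau} applied to a vector is the per-leaf average: the minimum-norm least-squares fit.
        leaf_values = np.zeros(leaf_ids.max() + 1)
        for leaf in np.unique(leaf_ids):
            leaf_values[leaf] = target[leaf_ids == leaf].mean()
        tree_output = leaf_values[leaf_ids]                       # h^{s_tau}(x_i, theta_*) for every example
        return (1.0 - gamma * eps) * preds + eps * tree_output    # shrink-and-add update (Eqn. (2))

The per-leaf average implements Φ_{s_τ} because, for a leaf-indicator matrix H_{s_τ}, the expression (H_{s_τ}ᵀH_{s_τ})⁺H_{s_τ}ᵀ reduces to averaging within each leaf.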
3.3. Posterior Sampling

We consider here a standard Bayesian learning framework [8] that treats the parameters Θ as random variables and places a prior p(Θ) over Θ. In addition, we consider the GBDT model f_Θ as a probabilistic model and explicitly denote it by P(y|x; Θ) with parameters Θ. This is naturally valid for classification, as GBDT models by construction return a distribution over class labels. For regression, one can leverage the NGBoost algorithm [24] to return the mean and variance of a Gaussian distribution over the target y for a given input x.

For the purpose of uncertainty estimation, we aim to estimate or approximate the Bayesian posterior p(Θ|𝒟_N). To that end, we can choose β = N and γ = 1/(2N) and use the negative log-likelihood as the loss function, ℒ(Θ|𝒟_N) = E_{𝒟_N}[− log p(y|x, Θ)] = −(1/N) Σ_{i=1}^N log p(y_i|x_i, Θ). Then, the limiting distribution of SGLB can be explicitly expressed as

    p^*_\beta(\Theta) \propto \exp\Big(\log p(\mathcal{D}_N|\Theta) - \tfrac{1}{2}\|\Gamma\Theta\|_2^2\Big) \propto p(\mathcal{D}_N|\Theta)\, p(\Theta),    (4)

which is proportional to the true posterior p(Θ|𝒟_N) under the Gaussian prior p(Θ) = 𝒩(0_m, Γ) [11].

Now, consider a Bayesian ensemble of probabilistic models {P(y|x; Θ^(k))}_{k=1}^K, where each model is trained independently by running SGLB. Since each Θ^(k) is guaranteed to be sampled from p(Θ|𝒟_N) by Eqn.(4), the ensemble {Θ^(k)}_{k=1}^K with K samples yields a "discrete" approximation to the posterior p(Θ|𝒟_N). This is exactly the idea behind the SGLB ensemble [12], which learns K independent SGLB models in parallel with different random seeds. Although the approximation improves as K increases, the computational cost also increases linearly with K. To alleviate the computational burden, the SGLB virtual ensemble [12] builds a Bayesian virtual ensemble by sampling multiple times from a single-chain SGLB model. Because samples from the same chain are highly correlated, the SGLB virtual ensemble samples one member Θ^(k) every C > 1 iterations. More specifically, the parameters are sampled as {Θ^(k)}_{k=1}^{K=⌊𝒯/(2C)⌋} = {Θ̂_{C(k+⌊𝒯/(2C)⌋)}, k = 1, ..., ⌊𝒯/(2C)⌋}, i.e., one member is appended to the ensemble every C iterations while constructing one SGLB model over 𝒯 iterations of gradient boosting. Notice that no sampling is performed during the first half of the iterations (τ < 𝒯/2), since Eqn.(4) holds only asymptotically. For large C and K, the virtual ensemble should, in theory, behave similarly to the SGLB real ensemble in the limit.
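As a small illustration of the member-selection rule above (plain index bookkeeping, not a call into any particular library), the snippet below lists which boosting iterations are snapshotted to form the SGLB virtual ensemble; Section 4 replaces this rule with one snapshot at the end of each cycle.

    def sglb_virtual_ensemble_iterations(T, C):
        """Boosting iterations whose models form the SGLB virtual ensemble:
        skip the first half of the chain, then keep every C-th model."""
        K = T // (2 * C)
        offset = T // (2 * C)
        return [C * (k + offset) for k in range(1, K + 1)]

    # Example: T = 1000 boosting iterations, one snapshot every C = 100 iterations
    # gives K = 5 members taken at iterations [600, 700, 800, 900, 1000].
    print(sglb_virtual_ensemble_iterations(1000, 100))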
3.4. Uncertainty Estimation

Once the Bayesian (virtual) ensemble {P(y|x; Θ^(k))}_{k=1}^K, with Θ^(k) ∼ p(Θ|𝒟_N), is learned, predictions can be made by averaging over the ensemble, which is often known as the predictive posterior or Bayesian model average (BMA):

    p(y|x, \mathcal{D}_N) = \mathbb{E}_{p(\Theta|\mathcal{D}_N)}[P(y|x;\Theta)] \approx \frac{1}{K}\sum_{k=1}^{K} P(y|x;\Theta^{(k)}).    (5)

The entropy of the predictive posterior estimates the total uncertainty (TU) in predictions, which can be further decomposed into two distinct types of uncertainty: knowledge uncertainty and data uncertainty.¹ (a) Knowledge uncertainty (KU) arises due to the lack of knowledge about the data generation process (or the unknown distribution 𝒟). KU is expected to be large in regions (of the feature space) where we do not have sufficient training data. (b) Data uncertainty (DU) arises due to the inherent stochasticity within the data generation process, and it is high in regions with class overlap. In applications like active learning [25], reinforcement learning (RL) [26], and OOD detection, it is desirable to measure KU separately from DU (or TU), and the following equation can be used in practice to compute and connect them via mutual information [27]:

    \underbrace{\mathcal{I}(y;\Theta\,|\,x,\mathcal{D}_N)}_{\text{Knowledge Uncertainty}}
      = \underbrace{\mathcal{H}\big(p(y|x,\mathcal{D}_N)\big)}_{\text{Total Uncertainty}}
      - \underbrace{\mathbb{E}_{p(\Theta|\mathcal{D}_N)}\big[\mathcal{H}\big(P(y|x;\Theta)\big)\big]}_{\text{Expected Data Uncertainty}}
      \approx \mathcal{H}\Big(\frac{1}{K}\sum_{k=1}^{K} P(y|x;\Theta^{(k)})\Big) - \frac{1}{K}\sum_{k=1}^{K}\mathcal{H}\big(P(y|x;\Theta^{(k)})\big),    (6)

where I(A; B) denotes the mutual information between random variables A and B, and H(·) denotes entropy. The difference between TU and expected DU measures the disagreement among members of the ensemble and estimates the knowledge uncertainty.²

¹ KU is also called epistemic uncertainty, and DU is also called aleatoric uncertainty.
² See [12] for the equations computing KU and DU in regression tasks.
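The decomposition in Eqn.(6) is straightforward to compute from the per-member class probabilities. Below is a minimal NumPy sketch; the array layout and function name are our own convention, not taken from the authors' code.

    import numpy as np

    def uncertainty_decomposition(probs, eps=1e-12):
        """probs: array of shape (K, num_classes) holding P(y|x; Theta^(k)) for one input x.
        Returns (total, expected_data, knowledge) uncertainty following Eqns. (5)-(6)."""
        probs = np.asarray(probs, dtype=float)
        bma = probs.mean(axis=0)                                     # Eqn. (5): Bayesian model average
        total = -np.sum(bma * np.log(bma + eps))                     # H(BMA) = total uncertainty
        expected_data = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))  # E_k[H(P_k)]
        knowledge = total - expected_data                            # mutual information, Eqn. (6)
        return total, expected_data, knowledge

    # Two ensemble members that disagree strongly -> high knowledge uncertainty.
    print(uncertainty_decomposition([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]))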
4. Cyclical Stochastic Gradient Langevin Boosting (cSGLB)

4.1. Promoting Mode Discovery via Cyclical Gradient Scheduling

Instead of building a cumbersome true ensemble of SGLB models, the virtual ensemble of SGLB greatly improves efficiency by training only a single model. However, similar to other types of SG-MCMC in Bayesian DL [13, 14, 15], single-chain SGLB gets trapped easily on a particular mode of the posterior. To efficiently explore different modes of the multimodal posterior and effectively measure uncertainty in GBDT predictions with a single chain, we propose a simple remedy that places a cyclical cosine schedule on the gradient scale during training, as illustrated in Fig.1. Specifically, the scaling factor at iteration τ is defined as

    \alpha_\tau = \max\Big( \frac{\alpha_{\max}}{2}\Big[\cos\Big(\frac{\pi\,\mathrm{mod}(\tau, C)}{C}\Big) + 1\Big],\; \alpha_{\min} \Big),    (7)

where α_max ≥ 1 is the maximum of the scaler (i.e., the initial value α_0), C is the user-defined cycle length, and α_min defines the minimum of the scaler, e.g., α_min = 1 or 0.5, since decaying the gradients to arbitrarily small values could be harmful for performance. Put together, this amounts to sampling the tree structure and learning the tree leaf parameters with the (re)scaled gradients: s_τ ∼ p(s | α_τ ĝ_τ + √(2N/(εβ)) ζ′) and θ*^{s_τ} = −Φ_{s_τ}(α_τ ĝ_τ + √(2N/(εβ)) ζ).
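For concreteness, Eqn.(7) can be computed as follows; this is a small stand-alone sketch whose default α_max is an arbitrary choice for illustration.

    import math

    def cyclical_grad_scale(tau, C, alpha_max=10.0, alpha_min=1.0):
        """Gradient scaler alpha_tau of Eqn. (7): a cosine that restarts every C iterations,
        clipped from below at alpha_min."""
        return max(alpha_max / 2.0 * (math.cos(math.pi * (tau % C) / C) + 1.0), alpha_min)

    # First cycle with C = 100: starts at alpha_max, decays towards alpha_min, then restarts.
    print([round(cyclical_grad_scale(t, 100), 2) for t in (0, 25, 50, 75, 99, 100)])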
Similar to Cyclical SG-MCMC [13], we define two stages within each cycle: (1) Exploration, when the completed portion of a cycle, 𝒜(τ) = mod(τ, C)/C, is smaller than a given threshold, i.e., 𝒜(τ) < η, and (2) Sampling, when 𝒜(τ) ≥ η, where η ∈ (0, 1) balances the proportion of exploration and sampling. We obtain one sample from the chain at the end of each cycle, i.e., the virtual ensemble is built as {Θ^(k)}_{k=1}^{K=⌊𝒯/C⌋} = {Θ̂_{Ck−1}, k = 1, ..., ⌊𝒯/C⌋}. Large gradients at the beginning of a cycle provide enough perturbation to encourage the model to escape from the current mode, while decreasing the gradient scale within a cycle lets the sampler better characterize the density of the local mode. Moreover, many prior works on Bayesian NNs proposed to apply a certain form of preconditioning to compensate for sampling noise from mini-batch training [15, 14]. Tree-based models can usually digest the full batch (the full dataset 𝒟_N) per iteration by leveraging modern multi-core processors and multi-threading. Therefore, we directly use full-batch GB in all sampling stages, while leaving the option of random data subsampling in exploration stages to the user if training time is a concern.

Combining cyclical gradient scaling with SGLB, we expect our new Cyclical SGLB (cSGLB) algorithm to inherit most (if not all) theoretical properties of the original SGLB algorithm. Conceptually, with a proper choice of α_max, α_min and cycle length C, the sample obtained at τ = C − 1 from the Markov chain f_Θ̂_τ generated by Algo. 1 (w/o bootstrap) can approximately be seen as a random draw from the limiting distribution with small bounded errors. Also, each cycle can be viewed as a warm restart from its previous cycle, and thus no errors should accumulate into subsequent cycles (at the sampling times τ = Ck − 1). We leave rigorous analysis and proofs of these propositions for future work. Empirically, we show in our experiments that cyclical gradient scaling achieves effects similar to cSG-MCMC, which places an analogous cyclical schedule on the step size within the context of Bayesian DL, in exploring a multimodal distribution. In fact, cSGLB extends the idea behind cSG-MCMC to tree-based GBDT models. We also summarize key differences between our design of cSGLB and cSG-MCMC in Appendix A.
4.2. Enhancing Sample Diversity via Bootstrapping

Recent work [16] provided a compelling analysis showing that the Bayesian posterior is not optimal under model misspecification³: the performance of the true posterior is dominated by an alternative, non-Bayesian posterior that explicitly encourages diversity among ensemble member predictions. Inspired by these results, we propose a simple strategy that promotes diversity among samples obtained from cSGLB via data bootstrapping. At the beginning of each cycle, we randomly sample a Bernoulli mask of size N, i.e., ν := {ν[i] ∼ Bernoulli(p_bm)}_{i=1}^N ∈ {0, 1}^N, where p_bm ∈ (0, 1) defines the percentage of data being used. In the following exploration stage, we mask out the gradients ĝ_τ by taking an element-wise product with ν, i.e., ĝ_τ ⊙ ν. The mask ν and the mask-out operation are used consistently throughout the exploration stage 𝒜(τ) < η, and ν is not updated again until the end of the cycle. This design amounts to learning with a bootstrapped subsample of the data in each cycle. Since the model consistently observes less data than the original 𝒟_N, it also amounts to posterior tempering, (p(𝒟_N|Θ)p(Θ))^{1/T} with some temperature T > 1, resulting in a warm posterior that is softer than the Bayesian posterior. By increasing the temperature T, we expect to see increased density on the paths/corridors connecting different modes of the posterior [28, 29], further facilitating the sampler's escape from the current local mode. By using a relatively large η ∈ (0.8, 1), the tempering effects carry over into the sampling stage. The bootstrapping mechanism therefore helps improve the diversity of samples from cSGLB, and we name this variant Cyclical Bootstrapped SGLB (cbSGLB).

Lastly, we summarize our proposed cSGLB (plus the bootstrap option) in Algo. 1 and highlight our modifications on top of SGLB in blue.

³ The function class in use does not contain the unknown ground-truth function.
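A minimal sketch of the cbSGLB masking logic described above (stand-alone NumPy, not tied to any particular GBDT library): the mask is redrawn at the start of each cycle, applied to the per-example gradients during exploration, and switched off once the sampling stage begins.

    import numpy as np

    def cbsglb_mask(tau, C, eta, N, p_bm, state, rng):
        """Return the Bernoulli mask nu for iteration tau.
        `state` is a dict carrying the current mask across iterations of one cycle."""
        progress = (tau % C) / C                     # A(tau): completed portion of the cycle
        if tau % C == 0:                             # start of a cycle: draw a fresh bootstrap mask
            state["nu"] = rng.binomial(1, p_bm, size=N).astype(float)
        if progress >= eta:                          # sampling stage: use the full data again
            return np.ones(N)
        return state["nu"]                           # exploration stage: keep the cycle's mask

    # Usage inside a boosting loop (per-example gradients g of shape (N,)):
    rng, state = np.random.default_rng(0), {}
    N, C, eta, p_bm = 8, 4, 0.75, 0.7
    for tau in range(8):
        g = np.ones(N)                               # placeholder per-example gradients
        masked_g = g * cbsglb_mask(tau, C, eta, N, p_bm, state, rng)   # g_tau masked by nu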
Algorithm 1: Cyclical (Bootstrapped) SGLB

  Input: dataset 𝒟_N, learning rate ε > 0, inverse temperature β > 0, regularization γ > 0, number of iterations 𝒯 > 0, cycle length C > 1, scaler limits α_max, α_min > 0, stage threshold η > 0, mask probability p_bm > 0, boolean indicator bootstrap

  Initialize f_Θ̂_0(·) = 0, ν = 1_N ∈ R^N (all-ones vector)
  for τ in [0, 1, ..., 𝒯 − 1] do
      if bootstrap then
          if 𝒜(τ) = mod(τ, C)/C = 0 then
              Sample ν ∈ R^N with ν[i] ∼ Bernoulli(p_bm)
          end
          if 𝒜(τ) ≥ η then
              Set ν = 1_N
          end
      end
      Compute gradient scaler: α_τ = max( (α_max/2)[cos(π mod(τ, C)/C) + 1], α_min )
      Estimate gradients using f_Θ̂_τ(·) and 𝒟_N: ĝ_τ = (∂/∂f L(f_Θ̂_τ(x_i), y_i))_{i=1}^N ∈ R^N
      Sample noise ζ, ζ′ ∼ 𝒩(0_N, I_N)
      Sample tree structure: s_τ ∼ p(s | α_τ(ĝ_τ ⊙ ν) + √(2N/(εβ)) ζ′)
      Estimate leaf/parameter values: θ*^{s_τ} = −Φ_{s_τ}(α_τ(ĝ_τ ⊙ ν) + √(2N/(εβ)) ζ)
      Update GBDT model: f_Θ̂_{τ+1}(·) = (1 − γε) f_Θ̂_τ(·) + ε h^{s_τ}(·, θ*^{s_τ})
  end
  Return: f_Θ̂_𝒯(·)


5. Experiments

5.1. Experiments on Synthetic Data

We validate and qualitatively evaluate the proposed gradient scheduling and our cSGLB algorithm on two synthetic problems: (1) a synthetic multimodal dataset from [13], and (2) a multi-class Spiral dataset from [12]. Due to limited space, we include experimental details in Appendix B.

Synthetic Multimodal Data  We first demonstrate the ability of cyclical gradient scaling to sample from a multimodal distribution, using a 2D mixture of 25 Gaussians. Specifically, we compare (i) the original SG-MCMC with SGLD (denoted SGLD) and two SGLD variants: (ii) SGLD with a cyclical schedule on the learning rate (denoted clr-SGLD) [13] and (iii) SGLD with a cyclical schedule on the gradient scale (denoted cg-SGLD (ours)). We reproduced the results for SGLD and clr-SGLD in [13] with code released by the authors, and built our cg-SGLD on top of it. In addition, we experimented with a "noisy" version of SGLD with a fixed lr and a 10× larger noise scale (denoted NoisySGLD/N-SGLD). For a fair comparison, each chain runs for 50k iterations, and both clr-SGLD and cg-SGLD use 30 cycles. Fig.2 shows the estimated density under the different sampling strategies. SGLD gets trapped in local modes depending on the random initial position, and increasing the noise scale does not solve the problem. In contrast, clr-SGLD and cg-SGLD can explore and locate roughly 7-8 different modes of the distribution, showing that our cg-SGLD achieves state-of-the-art performance in exploring multimodal distributions. Moreover, cg-SGLD has implementation benefits over clr-SGLD when combined with SGLB. The SGLB algorithm is available in the CatBoost library [30], which only supports a fixed lr. All our proposed enhancements can be implemented with a "user-defined loss function" available in CatBoost, without touching the source code, making it straightforward to reproduce our algorithms.

Figure 2: Sampling from a 5 by 5 mixture of 25 Gaussians. Panels: (a) Target, (b) SGLD, (c) N-SGLD, (d) clr-SGLD, (e) cg-SGLD (ours).
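As a rough illustration of the "user-defined loss function" route mentioned above, the sketch below scales the first- and second-order derivatives of a binary log loss by the cyclical factor of Eqn.(7). The calc_ders_range method follows CatBoost's documented custom-objective examples, but this is our own assumption about how the schedule could be wired in, not the authors' implementation; in particular, the iteration counter presumes one objective call per boosting iteration, which may not hold under CatBoost's threading.

    import math

    class CyclicalLoglossObjective:
        """Binary log loss whose derivatives are rescaled by the cyclical factor alpha_tau."""

        def __init__(self, cycle_len=100, alpha_max=10.0, alpha_min=1.0):
            self.cycle_len, self.alpha_max, self.alpha_min = cycle_len, alpha_max, alpha_min
            self.tau = 0  # assumed to advance once per boosting iteration (see caveat above)

        def _alpha(self):
            phase = math.pi * (self.tau % self.cycle_len) / self.cycle_len
            return max(self.alpha_max / 2.0 * (math.cos(phase) + 1.0), self.alpha_min)

        def calc_ders_range(self, approxes, targets, weights):
            alpha = self._alpha()
            self.tau += 1
            result = []
            for approx, target in zip(approxes, targets):
                p = 1.0 / (1.0 + math.exp(-approx))
                der1 = alpha * (target - p)          # scaled first derivative
                der2 = alpha * (-p * (1.0 - p))      # scaled second derivative
                result.append((der1, der2))
            return result

    # Hypothetical usage: CatBoostClassifier(loss_function=CyclicalLoglossObjective(), ...)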



Multi-Class Spiral Data  After validating the efficacy of cyclical gradient scheduling for sampling from multimodal distributions, we are now ready to experiment with cSGLB. Specifically, we compare the following algorithms on a 3-class classification task called "Spiral" from [12]: (i) the SGLB ensemble with K models, denoted ensK, (ii) the SGLB virtual ensemble, simply denoted SGLB, and (iii) the cSGLB virtual ensemble, denoted cSGLB (ours). We again reproduced the results in [12] with code released by the authors, and Fig.3 shows the estimated KU on the Spiral test set. As noted in [12], knowledge uncertainty due to decision-boundary "jitter" exists in both ens20 and cSGLB, and the "jitter" affects cSGLB more, as its estimated KU is "noisy" at the decision boundary. Nevertheless, cSGLB (with only a single model) is significantly more efficient than ens20 and greatly improves upon SGLB in capturing high KU in regions with no training data.

Figure 3: Spiral dataset and estimated knowledge uncertainties. Panels: (a) Spiral, (b) ens20, (c) SGLB, (d) cSGLB. Each color in (a) represents a different class.

5.2. Experiments on Real-World Weather Prediction Data

Lastly, we evaluate our proposed methods on the public Shifts Weather Prediction dataset [31]. We select the classification task, where the ML model is asked to predict the precipitation class at a particular geolocation and timestamp, given heterogeneous tabular features derived from weather station measurements and forecast models. The full dataset is partitioned in a canonical fashion and contains in-domain (ID) training, development and evaluation datasets as well as out-of-domain (OOD) development and evaluation datasets. Importantly, the ID data and the OOD data are separated in time and consist of non-overlapping climate types (ID: Tropical, Dry, Mild; OOD: Snow, Polar), making the Shifts dataset an ideal testbed for gauging the robustness of an ML model and the quality of its uncertainty estimates. To further facilitate our experimentation, we conducted the following data preprocessing: (1) feature selection to keep only the top 40 features by importance (out of 123 available features), where feature importance is determined by a CatBoost classifier with 1K trees trained on the entire training set; (2) dropping minority classes to keep only the 3 major precipitation classes, i.e., classes 0, 10, and 20 out of the 9 classes in the original dataset; and (3) random data sampling to keep 200K (medium-sized) examples in the final training set. Again, the purpose of our data preprocessing is to speed up experimentation, and we believe that the observations and findings in this study generalize to the original full dataset. For model building, 30 independent SGLB models (each with 1K trees) were trained and used to construct real ensembles ensK for K ∈ {3, 5, 10, 30}. SGLB/cSGLB/cbSGLB virtual ensembles were built by sampling 10 members from a single chain with 2K trees. Hence, ens10 is 5× more expensive in computation and memory than a virtual ensemble. Additional details regarding our data and models are included in Appendix C.

We compare the various methods on their predictive performance and on uncertainty quantification following [31], and the results are summarized in Table 1. For predictive performance, we report the classification accuracy and macro F1 using BMA on both the ID and the OOD evaluation datasets. We observe the following effects: (1) Virtual ensembles with a longer chain slightly outperform the real ensembles on the ID data. (2) Our proposed cSGLB and cbSGLB perform slightly worse than the other methods on the OOD data. However, this is usually not a concern in practice, since the model is not trained with data from the OOD domains and would not be used to solve the OOD prediction tasks in a practical scenario. As long as the domain shifts can be reliably detected (via uncertainty), proactive decisions can be made to avoid costly mistakes due to model errors. (3) Our proposed data bootstrapping mechanism is capable of improving the performance on the OOD data (cbSGLB > cSGLB). In addition, we include the F1-AUC metric (on the combined ID & OOD evaluation sets) introduced in [31] to jointly assess predictive power and uncertainty quality. F1-AUC can be increased either by having a stronger predictive model or by improving the correlation between uncertainty and error. Consistent with the findings in [31], total uncertainty (TU) correlates more with errors than knowledge uncertainty (KU), as shown by the F1-AUC scores. More specifically, we see that the F1-AUC is quite similar across the board when measured by TU, although cSGLB/cbSGLB has slightly worse predictive power on the OOD segment. When F1-AUC is measured by KU, our cSGLB/cbSGLB produces KU estimates that relate more closely to model errors than the KU from the SGLB baseline.
Metric           Data       ens3           ens5           ens10          ens30   SGLB           cSGLB          cbSGLB
Accuracy (%)↑    eval-ID    65.24 ± 0.02   65.25 ± 0.02   65.26 ± 0.01   65.26   65.63 ± 0.01   65.93 ± 0.01   65.59 ± 0.01
                 eval-OOD   52.51 ± 0.06   52.55 ± 0.06   52.55 ± 0.03   52.54   52.50 ± 0.14   50.72 ± 0.17   51.49 ± 0.11
Macro F1 (%)↑    eval-ID    63.28 ± 0.02   63.29 ± 0.02   63.30 ± 0.01   63.30   63.69 ± 0.01   64.06 ± 0.02   63.67 ± 0.01
                 eval-OOD   52.36 ± 0.07   52.39 ± 0.06   52.38 ± 0.03   52.38   52.36 ± 0.13   50.64 ± 0.16   51.41 ± 0.15
F1-AUC (%)↑      TU         57.18 ± 0.01   57.19 ± 0.01   57.20 ± 0.01   57.20   57.26 ± 0.01   56.82 ± 0.03   56.95 ± 0.04
                 KU         52.94 ± 0.03   53.71 ± 0.03   54.33 ± 0.02   54.83   52.55 ± 0.08   53.60 ± 0.08   53.31 ± 0.14
OOD-AUC (%)↑     KU         65.72 ± 0.31   68.60 ± 0.21   71.15 ± 0.16   73.26   67.32 ± 0.70   71.45 ± 0.67   71.70 ± 0.91

Table 1
Comparing predictive performance & uncertainty measures of various methods on Shifts Weather data. Mean ± Std Dev over 3 seeds. The best performance among virtual ensembles is highlighted in bold.



Finally, we present the OOD detection ROC-AUC performance on the evaluation data using the KU estimates. Our cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on the OOD detection task, and even achieves performance comparable to the real ensemble ens10, which is 5× more expensive. This highlights that our cSGLB/cbSGLB can produce high-fidelity KU estimates to detect domain (or distributional) shifts with a single model, and that our proposed cyclical gradient scheduling is effective in exploring different modes of a posterior. In real-world industrial applications, detecting OOD data or domain shifts in an efficient way is often crucial to ensure a safe deployment and operation of ML systems. Observing consistently high uncertainty (especially KU) in model predictions indicates that the patterns of new incoming data have deviated from the training data. This often provides a strong signal for a model refresh, ensuring that the ML system can be updated in time to avoid errors and operate safely in its "comfort zone" (with relatively low predictive uncertainty).
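The OOD-AUC reported in Table 1 amounts to treating the per-example knowledge uncertainty as an OOD score; a minimal sketch with scikit-learn is shown below (variable names are our own).

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def ood_detection_auc(ku_id, ku_ood):
        """ROC-AUC for separating OOD from ID examples using knowledge uncertainty as the score."""
        scores = np.concatenate([ku_id, ku_ood])
        labels = np.concatenate([np.zeros(len(ku_id)), np.ones(len(ku_ood))])  # 1 = OOD
        return roc_auc_score(labels, scores)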
6. Conclusion

We present cyclical gradient scheduling and Cyclical SGLB for efficiently and effectively quantifying uncertainty in gradient boosting with a single model, and propose a data bootstrapping scheme to enhance diversity in posterior samples. We show empirically that our algorithms outperform the state-of-the-art SGLB, especially in quantifying knowledge uncertainty and for OOD detection.

Accurately quantifying uncertainty in ML predictions can yield many benefits in real-world applications. Uncertainty directly measures the confidence in model predictions, not only improving interpretability but also ensuring that costly mistakes can be avoided proactively. Consistently observing high uncertainty in predictions on an incoming data stream often provides a reliable signal of data distributional shifts and/or of the model becoming obsolete. Uncertainty is also a key concept in optimal control and decision making, where it can be leveraged to experiment with different actions/decisions at controllable cost in search of the best operating point of an ML system. We believe that our work on efficient uncertainty quantification for GBDTs facilitates the adoption of uncertainty-enabled and uncertainty-aware ML systems, and more broadly promotes ML safety and a safe and trustworthy deployment of ML models in production.


References

[1] J. Dressel, H. Farid, The accuracy, fairness, and limits of predicting recidivism, Science Advances 4 (2018) eaao5580. URL: https://www.science.org/doi/abs/10.1126/sciadv.aao5580. doi:10.1126/sciadv.aao5580.
[2] P. Stone, R. Brooks, E. Brynjolfsson, R. Calo, O. Etzioni, G. Hager, J. Hirschberg, S. Kalyanakrishnan, E. Kamar, S. Kraus, et al., Artificial intelligence and life in 2030, One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel 52 (2016).
[3] A. Serban, E. Poll, J. Visser, Towards using probabilistic models to design software systems with inherent uncertainty (2020). URL: https://arxiv.org/abs/2008.03046. doi:10.48550/ARXIV.2008.03046.
[4] S. McGrath, P. Mehta, A. Zytek, I. Lage, H. Lakkaraju, When does uncertainty matter?: Understanding the impact of predictive uncertainty in ML assisted decision making (2020). URL: https://arxiv.org/abs/2011.06167. doi:10.48550/ARXIV.2011.06167.
[5] R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, 2021. URL: https://arxiv.org/abs/2106.03253. doi:10.48550/ARXIV.2106.03253.
[6] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics (2001) 1189–1232.
[7] J. H. Friedman, Stochastic gradient boosting, Computational Statistics & Data Analysis 38 (2002) 367–378.
[8] A. Malinin, Uncertainty estimation in deep learning with application to spoken language assessment, Ph.D. thesis, University of Cambridge, 2019.
[9] F. Provost, P. Domingos, Tree induction for probability-based ranking, Machine Learning 52 (2003) 199–215.
[10] U. Johansson, T. Löfström, H. Boström, Calibrating probability estimation trees using Venn-Abers predictors, in: Proceedings of the 2019 SIAM International Conference on Data Mining, SIAM, 2019, pp. 28–36.
[11] A. Ustimenko, L. Prokhorenkova, SGLB: Stochastic gradient Langevin boosting, in: International Conference on Machine Learning, PMLR, 2021, pp. 10487–10496.
[12] A. Malinin, L. Prokhorenkova, A. Ustimenko, Uncertainty in gradient boosting via ensembles, arXiv preprint arXiv:2006.10562 (2020).
                                                                [13] R. Zhang, C. Li, J. Zhang, C. Chen, A. G. Wilson, Cycli-
                                                                     cal stochastic gradient mcmc for bayesian deep learning,
 [3] …tic models to design software systems with inherent uncertainty (2020). URL: https://arxiv.org/abs/2008.03046. doi:10.48550/ARXIV.2008.03046.
 [4] S. McGrath, P. Mehta, A. Zytek, I. Lage, H. Lakkaraju, When does uncertainty matter?: Understanding the impact of predictive uncertainty in ML-assisted decision making (2020). URL: https://arxiv.org/abs/2011.06167. doi:10.48550/ARXIV.2011.06167.
 [5] R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, 2021. URL: https://arxiv.org/abs/2106.03253. doi:10.48550/ARXIV.2106.03253.
 [6] J. H. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics (2001) 1189–1232.
 [7] J. H. Friedman, Stochastic gradient boosting, Computational Statistics & Data Analysis 38 (2002) 367–378.
 [8] A. Malinin, Uncertainty estimation in deep learning with application to spoken language assessment, Ph.D. thesis, University of Cambridge, 2019.
 [9] F. Provost, P. Domingos, Tree induction for probability-based ranking, Machine Learning 52 (2003) 199–215.
[10] U. Johansson, T. Löfström, H. Boström, Calibrating probability estimation trees using Venn-Abers predictors, in: Proceedings of the 2019 SIAM International Conference on Data Mining, SIAM, 2019, pp. 28–36.
[11] A. Ustimenko, L. Prokhorenkova, SGLB: Stochastic gradient Langevin boosting, in: International Conference on Machine Learning, PMLR, 2021, pp. 10487–10496.
[12] A. Malinin, L. Prokhorenkova, A. Ustimenko, Uncertainty in gradient boosting via ensembles, arXiv preprint arXiv:2006.10562 (2020).
[13] R. Zhang, C. Li, J. Zhang, C. Chen, A. G. Wilson, Cyclical stochastic gradient MCMC for Bayesian deep learning, arXiv preprint arXiv:1902.03932 (2019).
[14] F. Wenzel, K. Roth, B. S. Veeling, J. Świątkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, S. Nowozin, How good is the Bayes posterior in deep neural networks really?, arXiv preprint arXiv:2002.02405 (2020).
[15] P. Izmailov, S. Vikram, M. D. Hoffman, A. G. G. Wilson, What are Bayesian neural network posteriors really like?, in: International Conference on Machine Learning, PMLR, 2021, pp. 4629–4640.
[16] A. Masegosa, Learning under model misspecification: Applications to variational and ensemble methods, Advances in Neural Information Processing Systems 33 (2020) 5479–5491.
[17] M. Welling, Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), Citeseer, 2011, pp. 681–688.
[18] T. Chen, E. Fox, C. Guestrin, Stochastic gradient Hamiltonian Monte Carlo, in: International Conference on Machine Learning, PMLR, 2014, pp. 1683–1691.
[19] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, H. Neven, Bayesian sampling using stochastic gradient thermostats, Advances in Neural Information Processing Systems 27 (2014).
[20] C. Li, C. Chen, D. Carlson, L. Carin, Preconditioned stochastic gradient Langevin dynamics for deep neural networks, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] Y. W. Teh, A. H. Thiery, S. J. Vollmer, Consistency and fluctuations for stochastic gradient Langevin dynamics, The Journal of Machine Learning Research 17 (2016) 193–225.
[22] A. G. Wilson, P. Izmailov, Bayesian deep learning and a probabilistic perspective of generalization, Advances in Neural Information Processing Systems 33 (2020) 4697–4708.
[23] A. Ashukha, A. Lyzhov, D. Molchanov, D. Vetrov, Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, arXiv preprint arXiv:2002.06470 (2020).
[24] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, A. Schuler, NGBoost: Natural gradient boosting for probabilistic prediction, in: International Conference on Machine Learning, PMLR, 2020, pp. 2690–2700.
[25] B. Settles, Active learning literature survey (2009).
[26] T. Tan, Z. Xiong, V. R. Dwaracherla, Parameterized indexed value function for efficient exploration in reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 5948–5955.
[27] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, S. Udluft, Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems, stat 1050 (2017) 11.
[28] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, A. G. Wilson, Loss surfaces, mode connectivity, and fast ensembling of DNNs, Advances in Neural Information Processing Systems 31 (2018).
[29] F. Draxler, K. Veschgini, M. Salmhofer, F. Hamprecht, Essentially no barriers in neural network energy landscape, in: International Conference on Machine Learning, PMLR, 2018, pp. 1309–1318.
[30] A. V. Dorogush, V. Ershov, A. Gulin, CatBoost: Gradient boosting with categorical features support, arXiv preprint arXiv:1810.11363 (2018).
[31] A. Malinin, N. Band, G. Chesnokov, Y. Gal, M. J. Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova, I. Provilkov, V. Raina, et al., Shifts: A dataset of real distributional shift across multiple large-scale tasks, arXiv preprint arXiv:2107.07455 (2021).
A. Comparison between cSGLB and cSG-MCMC

The proposed Cyclical SGLB algorithm combines SGLB with cSG-MCMC [13] to effectively explore different modes of a highly multimodal posterior distribution. In this section, we summarize some key differences between our design and the original cSG-MCMC algorithm.

(1) cSG-MCMC is a sampling algorithm designed for Bayesian NNs, while cSGLB is built for GBDT models. In deep learning, full-batch gradient descent is usually infeasible, and techniques such as preconditioning [14] have been developed to explicitly compensate for mini-batch noise; others have suggested applying an additional Metropolis-Hastings correction step [15]. Tree-based GB models, by contrast, easily scale to large industrial datasets and can digest the full training set at each iteration. Our cSGLB therefore uses full-batch GB in the sampling stage of each cycle to ensure that high-quality samples are generated.

(2) cSGLB places the cyclical schedule on the gradient scale, whereas cSG-MCMC schedules the step size. Moreover, the original cSG-MCMC removes the injected Gaussian noise entirely during the exploration stage, so it reduces to regular stochastic gradient descent (SGD) in that period. Although the authors argue that this amounts to posterior tempering, which is common in the deep learning domain, the cSG-MCMC implementation does not strictly follow the SGLD dynamics during exploration. In contrast, we keep the injected noise term unchanged throughout learning. On a synthetic experiment, our design achieves effects similar to step-size scheduling, while ensuring that cSGLB (approximately) follows the SGLD dynamics at every iteration step.

(3) Lastly, gradient scaling (instead of step-size scheduling) has practical implementation benefits. The SGLB algorithm is made available in the CatBoost library [30], which only supports a constant step size (learning rate). Our proposed cyclical gradient scaling (and data bootstrapping) can be implemented easily through the "user-defined loss function" interface of the CatBoost package, without modifying a single line of the source code.
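To make the last point concrete, the following is a minimal sketch, not the released implementation, of how a cyclical gradient scale could be injected through CatBoost's user-defined objective interface. The cosine-style form of the schedule, the choice to scale only the first-order derivatives, and the assumption that calc_ders_range is invoked once per boosting iteration on the full training set are all our own simplifications.

import math

def cyclical_scale(t, cycle_len, a_min, a_max):
    # Cosine-style decay from a_max to a_min within each cycle of length
    # cycle_len (assumed form; the exact schedule is defined in the main text).
    frac = (t % cycle_len) / cycle_len
    return a_min + 0.5 * (a_max - a_min) * (1.0 + math.cos(math.pi * frac))

class CyclicalLoglossObjective:
    # Binary log-loss whose per-object gradients are rescaled by a cyclical factor.
    def __init__(self, cycle_len=200, a_min=1.0, a_max=10.0):
        self.cycle_len, self.a_min, self.a_max = cycle_len, a_min, a_max
        self.t = 0  # assumes one call per (full-batch) boosting iteration

    def calc_ders_range(self, approxes, targets, weights):
        alpha = cyclical_scale(self.t, self.cycle_len, self.a_min, self.a_max)
        self.t += 1
        result = []
        for i in range(len(targets)):
            p = 1.0 / (1.0 + math.exp(-approxes[i]))
            der1 = alpha * (targets[i] - p)   # first derivative, cyclically scaled
            der2 = -p * (1.0 - p)             # second derivative left unscaled
            if weights is not None:
                der1, der2 = der1 * weights[i], der2 * weights[i]
            result.append((der1, der2))
        return result

# Hypothetical usage with CatBoost's custom-objective mechanism:
# from catboost import CatBoostClassifier
# model = CatBoostClassifier(iterations=2000, learning_rate=0.05, depth=6,
#                            loss_function=CyclicalLoglossObjective(),
#                            eval_metric='Logloss', verbose=False)

Since the scale multiplies the gradients rather than the step size, the constant learning rate exposed by the library can be left untouched, which is exactly the convenience argued for in point (3).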
Data         # of samples                                                      % of class
             Total      Tropical    Dry        Mild       Snow      Polar      Class 0   Class 10   Class 20
train-ID     200,000    26,786      45,357     127,857    0         0          40.92     33.71      25.37
dev-ID       46,279     6,196       10,505     29,578     0         0          40.90     33.97      25.13
dev-OOD      46,555     0           0          0          46,555    0          38.35     35.07      26.58
eval-ID      518,587    69,245      117,981    331,361    0         0          40.98     33.71      25.30
eval-OOD     524,048    0           0          0          479,952   44,096     33.05     35.70      31.24

Table 2
Data summary of our partitioning of the Weather Prediction dataset. Sample counts are broken down by climate type; the last three columns give the class distribution in percent.


B. Synthetic Data

B.1. Synthetic Multimodal Distribution

The ground-truth density of the distribution is

    F(x) = (1/25) Σ_{i=1}^{25} N(x | μ_i, Σ),

where the means μ_i form the 5 × 5 grid {−4, −2, 0, 2, 4} × {−4, −2, 0, 2, 4} and Σ = diag(0.03, 0.03). We used the code released by the authors of [13] to generate our results. Specifically, SGLD was trained with a decaying learning rate ε_τ = 0.05 (τ + 1)^(−0.55), and clr-SGLD was trained with a cyclical learning-rate schedule with initial value ε_0 = 0.09 and exploration proportion η = 0.25. For our cg-SGLD, we fixed the learning rate at ε = 0.01 and the Gaussian noise scale at 0.4, and set α_max = 10, α_min = 1. The "noisy" version of SGLD (NoisySGLD/N-SGLD) was trained with a fixed learning rate ε = 0.02 and noise scale 5.0 (roughly 10× larger than the noise scale used by the other methods). Each chain was trained for 50k iterations, and both clr-SGLD and cg-SGLD used 30 cycles. The results and findings are robust to random seeds, and similar results were observed with different seeds. We refer interested readers to the original paper [13] for results of SGLD and clr-SGLD in parallel (multi-chain) settings.
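For reference, the ground-truth mixture above can be reproduced with a few lines of NumPy. The sketch below is our own illustration (the function names density and sample are ours) for evaluating F(x) and drawing points from it.

import numpy as np

# 25 equally weighted Gaussians with means on the grid {-4,-2,0,2,4}^2
# and shared covariance 0.03 * I, matching the density F(x) above.
MUS = np.array([[mx, my] for mx in (-4, -2, 0, 2, 4)
                         for my in (-4, -2, 0, 2, 4)], dtype=float)
COV = 0.03 * np.eye(2)

def density(x):
    # F(x) = (1/25) * sum_i N(x | mu_i, COV) for a single 2-d point x.
    diff = x - MUS                                      # shape (25, 2)
    inv_cov, det_cov = np.linalg.inv(COV), np.linalg.det(COV)
    quad = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    comps = np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(det_cov))
    return comps.mean()

def sample(n, seed=0):
    # Draw n points by picking a component uniformly, then sampling from it.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(MUS), size=n)
    noise = rng.multivariate_normal(np.zeros(2), COV, size=n)
    return MUS[idx] + noise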
B.2. Synthetic Spiral Dataset

All experiments were conducted using CatBoost [30], one of the state-of-the-art libraries for GBDTs. The SGLB ensemble (ens20) contains 20 independent models (trained with different random seeds) of 1K trees each. The learning rate is ε = 0.1, the tree depth is 6, random_strength = 100, and border_count = 128. The SGLB virtual ensemble and the cSGLB virtual ensemble are trained with the same parameters, except that we increase the number of trees to 2K and lower the learning rate for cSGLB to ε = 0.05. Thus, a virtual ensemble is 10× more efficient in computation and memory than the actual SGLB ensemble. For the SGLB virtual ensemble, every 50th model from the interval [1000, 2000] is added to the ensemble, making it a virtual ensemble of 20 members. For the cSGLB virtual ensemble, we set ε = 0.05, cycle length C = 200, α_max = 10, α_min = 1, making it a virtual ensemble of 2000/200 = 10 members. For cbSGLB with bootstrapping, we additionally set the exploration proportion η = 0.9 and the mask probability p_bm = 0.66.
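As an illustration of the configuration above, the snippet below sketches how the 20-member SGLB ensemble could be trained with CatBoost. The posterior_sampling=True flag (which we assume enables CatBoost's SGLB mode) and the bare training-data arguments are our assumptions, not the authors' released setup.

from catboost import CatBoostClassifier

def train_sglb_ens20(X_train, y_train):
    # 20 independent SGLB models (different seeds), 1K trees each,
    # using the hyper-parameters listed in B.2.
    ensemble = []
    for seed in range(20):
        model = CatBoostClassifier(
            iterations=1000,
            learning_rate=0.1,
            depth=6,
            random_strength=100,
            border_count=128,
            posterior_sampling=True,  # assumed flag for SGLB mode
            random_seed=seed,
            verbose=False,
        )
        model.fit(X_train, y_train)
        ensemble.append(model)
    return ensemble

A virtual ensemble, by contrast, is read out of a single 2K-tree model by keeping snapshots from the interval [1000, 2000], which is what makes it roughly 10× cheaper in both computation and memory.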
C. Real-World Shifts Data

Data Summary  A detailed summary of our final partitioning of the Weather Prediction dataset is given in Table 2.

Experimental Details  We used the default parameter settings for SGLB models as suggested in the original paper on uncertainty quantification [12], except that we set the subsample rate to 0.8 for stochastic gradient boosting. The real SGLB ensemble consists of (up to) 30 SGLB models trained with different seeds, each of 1K trees. To obtain more samples from a single chain, the virtual ensembles of SGLB and of our cSGLB/cbSGLB were learned as a single model of 2K trees. We set the learning rate of all models to ε = 0.05 and the tree depth to 6. For the SGLB virtual ensemble, every 100th model from the interval [1000, 2000] was added to the ensemble, making it a virtual ensemble of 10 members. cSGLB and cbSGLB shared the same parameters as their SGLB counterpart. In addition, for the cSGLB/cbSGLB virtual ensembles, we set the cycle length C = 200, α_max = 10.0, and α_min = 1.0, again yielding a virtual ensemble of 2000/200 = 10 members. For simplicity, cSGLB used full-batch gradient boosting at every iteration step. In contrast, for cbSGLB with bootstrapping, we set the exploration proportion to η = 0.8, i.e., 80% of each cycle was treated as exploration, and the mask probability to p_bm = 0.6 in the exploration stage. For model and parameter selection we used only the in-domain (ID) development set and did not use the out-of-domain (OOD) development set. Although this may lower our reported performance on the OOD evaluation set, we believe it better reflects a real-world learning scenario in which the shifted data is often unobserved and unavailable at training time.
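To clarify the bootstrapping schedule used by cbSGLB, the sketch below shows one way the exploration-stage mask could be realized. Interpreting the mask probability as the per-example probability of being dropped, and returning the mask as per-example weights, are our assumptions rather than details stated in the text.

import numpy as np

def bootstrap_weights(t, n_examples, cycle_len=200, eta=0.8, p_bm=0.6, seed=None):
    # During the exploration stage (the first eta fraction of each cycle)
    # every training example is independently masked out with probability
    # p_bm; during the remaining sampling stage the full training set is used.
    rng = np.random.default_rng(t if seed is None else seed)
    in_exploration = (t % cycle_len) < eta * cycle_len
    if not in_exploration:
        return np.ones(n_examples)
    keep = rng.random(n_examples) >= p_bm
    return keep.astype(float)

In practice such a mask can simply be multiplied into the per-example derivatives inside the user-defined loss from Appendix A, so that masked examples contribute no gradient during exploration.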