=Paper=
{{Paper
|id=Vol-3381/paper_47
|storemode=property
|title=Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC
|pdfUrl=https://ceur-ws.org/Vol-3381/47.pdf
|volume=Vol-3381
|authors=Tian Tan,Carlos Huertas,Qi Zhao
|dblpUrl=https://dblp.org/rec/conf/aaai/TanHZ23
}}
==Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC==
Tian Tan*, Carlos Huertas and Qi Zhao
Buyer Risk Prevention - ML, WW Customer Trust, Amazon, Seattle, WA 98109, USA
Abstract
Gradient boosting decision trees (GBDTs) are widely applied to tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making and for avoiding costly mistakes, ensuring an interpretable and safe deployment of tree-based models. Recently, Bayesian ensembles of GBDT models have been used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines gradient boosting with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB gets trapped easily in a particular mode of the Bayesian posterior, just like other forms of SG-MCMC. Therefore, a single SGLB model can often fail to produce high-fidelity uncertainty estimates. To address this problem, we present Cyclical SGLB (cSGLB), which incorporates a cyclical gradient schedule into the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore highly multimodal posterior distributions. As a result, cSGLB can efficiently quantify uncertainty in gradient boosting with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the efficiency and effectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.
Keywords
uncertainty quantification, gradient boosting decision trees, Bayesian inference, out-of-domain (OOD) detection
The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), February 13-14, 2023, Washington, D.C., USA
*Corresponding author.
tianta@amazon.com (T. Tan); carlohue@amazon.com (C. Huertas); qqzhao@amazon.com (Q. Zhao)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

With the rapid growth of data and computing power, machine learning (ML) has been gaining many new applications in areas not imagined before. The more ubiquitous ML systems become, the more inevitable it is to see applications in very sensitive and high-risk fields. This extends to numerous areas like criminal recidivism [1], medical follow-ups [2] and autonomous systems [3]. While these systems can be very diverse, they share a common need: a certain degree of confidence in ML predictions. A proven, successful way to build confidence in critical systems is uncertainty estimation. Research has shown that humans are more likely to agree with a system if they get access to the corresponding uncertainty, and this holds true regardless of shape and variance, as the approach itself is model- and task-agnostic [4]. Since the most common data type in real-world ML applications is tabular [5], our work in this paper focuses specifically on uncertainty quantification for the state-of-the-art gradient boosting decision trees (GBDTs) [6, 7], which are known to outperform deep learning (DL) methods on tabular data, both in accuracy and in tuning requirements [5]. Measuring uncertainty effectively and efficiently on GBDT predictions can therefore not only improve model interpretability in production but also ensure a safer deployment of ML systems, especially for high-risk applications.

Uncertainty quantification (UQ) has been widely studied for neural networks under the Bayesian framework [8]; however, it is relatively under-explored for tree-based models. Although calibrated probability estimation trees [9, 10] can be used for UQ, they have not been studied from a Bayesian perspective. Recently, Bayesian ensemble methods were extended to measure uncertainty in GBDTs by leveraging a new algorithm called stochastic gradient Langevin boosting (SGLB) [11]. Specifically, two SGLB-based approaches were introduced for UQ [12]: (1) the SGLB ensemble, which trains multiple SGLB models in parallel, and (2) the SGLB virtual ensemble, which constructs a virtual ensemble using only a single SGLB model where each member in the ensemble is a "truncated" sub-model [12]. Although both approaches are theoretically sound, there is clearly a trade-off between quality and efficiency in practice. The SGLB (real) ensemble is believed to be accurate as it can characterize the Bayesian posterior well by running independent models in parallel. However, it is almost infeasible to deploy such an ensemble in real-world production due to its high computational and maintenance costs. The SGLB virtual ensemble greatly improves the efficiency; however, it often gets stuck in a single mode of the Bayesian posterior and can produce downgraded uncertainty estimates [13, 14]. To better balance quality and efficiency and to facilitate the usage of uncertainty-enabled ML systems, an important question remains: how can we make a single SGLB explore effectively different modes of a posterior given a limited computational budget?

Figure 1: Illustration of the proposed cyclical schedule on gradient scales for the SGLB algorithm.

In this paper, we address the question above by combining the SGLB virtual ensemble with advanced sampling techniques from Bayesian DL [14, 13, 15, 16]. Inspired by the ideas in [13], we propose to use a scaler (or scaling factor) on gradients that follows a cyclical schedule during the course of SGLB training. The cyclical schedule is illustrated in Figure 1, and consequently, we name the resulting algorithm Cyclical SGLB (cSGLB). Similar to [13], each cycle in cSGLB contains two stages. (1) Exploration: when the scaler is large, we treat this stage as a warm restart from the previous cycle, enabling the model/sampler to follow the gradients closely and to escape from the current local mode. (2) Sampling: when the gradient scaler is small, the scale of the injected Gaussian noise in the SGLB procedure becomes relatively large, encouraging the sampler to fully characterize one local mode. We collect one sample (or truncated sub-model) to build the virtual ensemble at the end of each cycle. The cyclical gradient schedule therefore helps cSGLB effectively explore different modes of a posterior while maintaining the same level of efficiency as a virtual ensemble. Moreover, inspired by a recent study [16] showing that a "diversified" posterior may provide a tighter generalization bound, we present another simple approach to encourage diversity in samples obtained from running cSGLB via data bootstrapping. We name this variant Cyclical Bootstrapped SGLB (cbSGLB).

We extensively experiment with our proposed algorithms and compare their performance against the SGLB ensemble and the original SGLB virtual ensemble. In particular, we show that our cyclical gradient schedule can help explore multimodal distributions effectively, that cSGLB is capable of producing uncertainty estimates that are better aligned with the SGLB real ensemble, and that cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on out-of-domain (OOD) data detection, indicating a superior performance in detecting distributional/domain shifts in real-world tabular data streams.

2. Related Work

Bayesian ML and approximate Bayesian inference provide a principled representation of uncertainty. One popular family of approaches to inference in Bayesian ML are stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods [17, 18, 19, 20, 13], which are used to effectively sample models (or model parameters) from the Bayesian posterior. Uncertainty then comes naturally by measuring the "discrepancy" in predictions from the sampled models, which are regarded as posterior samples. Recently, stochastic gradient Langevin boosting (SGLB) [11] was proposed by combining gradient boosting with SG-MCMC. As its name suggests, the Markov chain generated by SGLB obeys a special form of stochastic gradient Langevin dynamics (SGLD) [11, 17], which implies that SGLB is able to generate samples from the true Bayesian posterior asymptotically. Leveraging this property, Malinin et al. [12] proposed to use (1) an SGLB ensemble or (2) an SGLB virtual ensemble to measure uncertainty in GBDTs. Essentially, the SGLB ensemble corresponds to running multiple SG-MCMC chains in parallel, where each chain (or SGLB model) is initialized independently with a different random seed. Since SGLB allows us to sample from the true posterior, the ensemble with multiple samples gives a high-fidelity approximation to the Bayesian posterior. In contrast, the SGLB virtual ensemble only trains a single SGLB model and uses multiple truncated sub-models to form a (virtual) ensemble. The key idea is essentially extracting multiple samples from a single-chain SG-MCMC instead of running multiple chains in parallel.

In theory, SGLB or single-chain SG-MCMC converges asymptotically to the target distribution and should behave similarly to the multi-chain SGLB ensemble in the limit, but it can suffer from a bounded estimation error in limited time [21]. Moreover, it is often believed that the posterior is highly multimodal in the parametric space of modern ML models [13], since there are potentially many different sets of parameters that can describe the training data equally well. The real ensemble can explore different modes of the posterior by running independent chains in parallel, providing a more complete picture of the distribution as the number of chains increases. However, a single-chain SG-MCMC often gets stuck easily in a single mode of the posterior [13, 14], failing to cover the full spectrum of the distribution.

In this paper, we extend the ideas behind Cyclical SG-MCMC (cSG-MCMC) in DL [13] to sampling from a tree-based SGLB model, which promotes new mode discovery during training. Different from cSG-MCMC, which puts a cyclical schedule on the step size, we propose to use a cyclical schedule on the gradient scale. We also point out and justify the difference and our design choice in Appendix A. In addition, we propose a simple strategy to further encourage diversity in samples obtained from a single chain by data bootstrapping. At the beginning of each cycle (see Fig. 1), we construct a bootstrapped dataset that is a random subset of the training data, and use that bootstrapped data consistently during the exploration stage to update the GBDT model. The "bias" induced by data bootstrapping also amounts to posterior tempering [14, 15, 22, 23, 13].
3. Preliminaries

3.1. General Setup

Given a set of $n$ training data points sampled from an unknown distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, i.e., $(x_1, y_1), \ldots, (x_n, y_n) \sim P$, denoted as $D_n$, and a loss function $L(z, y) : \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}$ where $\mathcal{Z}$ denotes the space of predictions, our goal is to minimize the empirical loss $\mathcal{L}(f \mid D_n) := \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$ over functions $f$ belonging to some family $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathcal{Z}\}$. In this paper, we only consider $\mathcal{F}$ corresponding to additive ensembles of decision trees $\mathcal{H} := \{h_T(x, \theta_T) : \mathcal{X} \times \mathbb{R}^{d_T} \to \mathbb{R},\ T \in \mathcal{T}\}$, where $\mathcal{T}$ is an index set and $h_T$ has parameters $\theta_T$. Decision trees are built by recursively partitioning the feature space into disjoint regions (called leaves). Each region is assigned a value that is used to estimate the response $y$ in the corresponding feature subspace. Let us denote these regions by $R_j$; then we have $h(x, \theta) = \sum_j \theta_j \mathbb{1}\{x \in R_j\}$, where $\mathbb{1}\{\cdot\}$ denotes the indicator function. Therefore, given the tree structure, a decision tree $h_T$ is a linear function of its parameters $\theta_T$. It is often assumed that the set $\mathcal{T}$ is finite because the training data is finite [11, 12], e.g., there exists only a finite number of ways to partition the training data. Owing to the linear dependence of $h_T$ on $\theta_T$ and the finiteness assumption on $\mathcal{T}$, we can represent any ensemble of models from $\mathcal{H}$ as a linear model $f_\Theta(x) = \phi(x)^\top \Theta$ for some feature map $\phi(x) : \mathcal{X} \to \mathbb{R}^m$, where $\Theta \in \mathbb{R}^m$ denotes the parameters of the entire ensemble [11]. Hence, in the subsequent discussion, we will simply denote the parameters of the GBDT model obtained at iteration $t$ as $\hat\Theta_t$, and additionally define a linear mapping $H_T : \mathbb{R}^{d_T} \to \mathbb{R}^n$ that converts $\theta_T$ to predictions $(h_T(x_i, \theta_T))_{i=1}^{n}$.

3.2. SGLB

SGLB combines stochastic gradient boosting (SGB) [7] with stochastic gradient Langevin dynamics (SGLD) [17]. Following the notation used in the original paper [11], we characterize the SGB procedure by a tuple $\mathcal{B} := \{\mathcal{H}, p\}$, where $\mathcal{H}$ again is the set of base learners and $p(T \mid g)$ is a distribution over indices $T \in \mathcal{T}$ conditioned on a gradient vector $g \in \mathbb{R}^n$. Simply put, $p(T \mid g)$ defines a distribution over tree structures.

As with other GBDT algorithms, SGLB constructs an ensemble of decision trees iteratively. At each iteration $t$, we compute unbiased gradient estimates $\hat g_t$ such that $\mathbb{E}[\hat g_t] = \big(\frac{\partial}{\partial z} L(f_{\hat\Theta_t}(x_i), y_i)\big)_{i=1}^{n} \in \mathbb{R}^n$ using the current model $f_{\hat\Theta_t}$, and sample independently two normal vectors $\nu, \nu' \sim \mathcal{N}(0_n, I_n)$, where $0_n$ and $I_n$ denote the zero vector and the identity matrix in $\mathbb{R}^n$, respectively. Then, a base learner (or tree structure) $T_t$ is picked by drawing one sample from $p\big(T \mid \hat g_t + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu'\big)$, where $\epsilon > 0$ is a learning rate (or step size) and $\beta > 0$ is a parameter often referred to as the inverse diffusion temperature. Next, we estimate the parameters $\theta^*_{T_t}$ (at the tree leaves) of the sampled base learner by solving the following optimization problem:

$\text{minimize } \|\theta_{T_t}\|_2^2 \quad \text{s.t.} \quad \theta_{T_t} \in \arg\min_{\theta \in \mathbb{R}^{d_{T_t}}} \Big\| -\hat g_t + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu - H_{T_t}\theta \Big\|_2^2, \qquad (1)$

which returns the minimum-norm solution that best fits the perturbed "noisy" version of the negative gradients. The optimization above has the closed-form solution $\theta^*_{T_t} = -\Phi_{T_t}\big(\hat g_t + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu\big)$, where $\Phi_{T_t} := (H_{T_t}^\top H_{T_t})^{+} H_{T_t}^\top$ and $+$ denotes the pseudo-inverse. For decision trees, $\Phi_{T_t} g$ essentially corresponds to averaging the gradient estimates $g$ in each leaf node of the tree. Lastly, the SGLB algorithm updates the ensemble model by

$f_{\hat\Theta_{t+1}}(\cdot) := (1 - \gamma\epsilon)\, f_{\hat\Theta_t}(\cdot) + \epsilon\, h_{T_t}(\cdot, \theta^*_{T_t}), \qquad (2)$

where $\gamma$ is a regularization parameter that "shrinks" the currently built model when updating the ensemble. At a high level, SGLB is a stochastic GB algorithm with Gaussian noise injected into the gradient estimates, which encourages the algorithm to explore a larger area of the functional space to find a better fit for the given data. The independence between the noise $\nu$ (used for parameter learning) and $\nu'$ (used for tree sampling), and the model shrinkage by $\gamma$ in Eqn. (2), are technical details needed for establishing theoretical results and a rigorous analysis of SGLB [11]. All the procedures of SGLB are also present in our proposed cSGLB in Algo. 1 (with our additional modifications highlighted in blue).

One can show that the SGLB parameters $\hat\Theta_t$ at each iteration form a Markov chain that weakly converges to the following stationary distribution:

$p^*_\beta(\Theta) \propto \exp\big(-\beta\, \mathcal{L}(\Theta \mid D_n) - \beta\gamma\, \|\Gamma\Theta\|_2^2\big), \qquad (3)$

where $\Gamma = \Gamma^\top > 0$ is a regularization matrix which depends on the particular tree construction algorithm, i.e., on the choice of the tuple $\mathcal{B} := \{\mathcal{H}, p(T \mid g)\}$ [11]. Note that since the GBDT model is linear and can be fully determined by the parameters $\Theta$, we simply use the notations $\mathcal{L}(f \mid D_n)$ and $\mathcal{L}(\Theta \mid D_n)$ interchangeably.
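To make the update concrete, the following minimal sketch (our illustration, not code from the paper or from the SGLB implementation) runs SGLB-style iterations for squared loss on a toy dataset, assuming the tree structure is already fixed so that $\Phi_{T_t}$ reduces to per-leaf averaging; the helper names (`sglb_step`, `leaf_ids`) are hypothetical.

```python
import numpy as np

def sglb_step(pred, y, leaf_ids, eps=0.1, beta=1e4, gamma=1e-3, rng=np.random.default_rng(0)):
    """One SGLB-style iteration for squared loss L(z, y) = 0.5 * (z - y)^2.

    pred     : current ensemble predictions f_{Theta_t}(x_i), shape (n,)
    leaf_ids : leaf index of each data point under the (fixed) sampled tree, shape (n,)
    Returns the updated predictions (1 - gamma*eps) * pred + eps * h(x).
    """
    n = pred.shape[0]
    g_hat = pred - y                                    # dL/dz for squared loss
    noise = np.sqrt(2.0 / (eps * beta)) * rng.standard_normal(n)
    perturbed = g_hat + noise                           # noisy gradients fed to the leaf fit
    # Phi_{T_t}: average the perturbed gradients within each leaf; theta* = -average
    uniq = np.unique(leaf_ids)
    leaf_values = np.array([-perturbed[leaf_ids == j].mean() for j in uniq])
    h = leaf_values[np.searchsorted(uniq, leaf_ids)]    # h_{T_t}(x_i, theta*)
    return (1.0 - gamma * eps) * pred + eps * h         # Eqn. (2): shrink-and-add update

# toy usage: 8 points, a fixed two-leaf "tree"
y = np.array([1.2, 0.8, 1.1, 0.9, -1.0, -1.3, -0.7, -1.1])
pred = np.zeros_like(y)
leaf_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for _ in range(100):
    pred = sglb_step(pred, y, leaf_ids)
print(np.round(pred, 2))   # predictions drift toward the per-leaf targets, up to noise and shrinkage
```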
3.3. Posterior Sampling

We consider here a standard Bayesian learning framework [8] that treats the parameters $\Theta$ as random variables and places a prior $p(\Theta)$ over $\Theta$. In addition, we consider the GBDT model $f_\Theta$ as a probabilistic model and explicitly denote the model by $P(y \mid x; \Theta)$ with parameters $\Theta$. This is valid naturally for classification, as GBDT models by construction return a distribution over class labels. For regression, one can leverage the NGBoost algorithm [24] to return the mean and variance of a Gaussian distribution over the target $y$ for a given input $x$.

For the purpose of uncertainty estimation, we aim to estimate or obtain an approximation to the Bayesian posterior $p(\Theta \mid D_n)$. To that end, we can choose $\beta = n$ and $\gamma = \frac{1}{2n}$ and use the negative log-likelihood as the loss function, $\mathcal{L}(\Theta \mid D_n) = \mathbb{E}_{D_n}[-\log P(y \mid x, \Theta)] = -\frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, \Theta)$. Then, the limiting distribution of SGLB can be explicitly expressed as:

$p^*_\beta(\Theta) \propto \exp\big(\log P(D_n \mid \Theta) - \tfrac{1}{2}\|\Gamma\Theta\|_2^2\big) \propto P(D_n \mid \Theta)\, p(\Theta), \qquad (4)$

which is proportional to the true posterior $p(\Theta \mid D_n)$ under a Gaussian prior $p(\Theta) = \mathcal{N}(0_m, \Gamma)$ [11].

Now, consider a Bayesian ensemble of probabilistic models $\{P(y \mid x; \Theta^{(k)})\}_{k=1}^{K}$ where each model is trained independently by running SGLB. Since each $\Theta^{(k)}$ is guaranteed to be sampled from $p(\Theta \mid D_n)$ by Eqn. (4), the ensemble $\{\Theta^{(k)}\}_{k=1}^{K}$ with $K$ samples yields a "discrete" approximation to the posterior $p(\Theta \mid D_n)$. This is exactly the idea behind the SGLB ensemble [12], which learns $K$ independent SGLB models in parallel with different random seeds. Although the approximation improves as $K$ increases, the computational cost also increases linearly with $K$. To alleviate the computational burden, the SGLB virtual ensemble [12] builds a Bayesian virtual ensemble by sampling multiple times from a single-chain SGLB model. Because samples from the same chain are highly correlated, the SGLB virtual ensemble samples one member $\Theta^{(k)}$ every $C > 1$ iterations. More specifically, the parameters are sampled by $\{\Theta^{(k)}\}_{k=1}^{K = \lfloor T/(2C) \rfloor} = \{\hat\Theta_{C(k + \lfloor T/(2C) \rfloor)},\ k = 1, \ldots, \lfloor T/(2C) \rfloor\}$, i.e., one member is appended to the ensemble every $C$ iterations while a single SGLB model is constructed using $T$ iterations of gradient boosting. Notice that no sampling is performed during the first half of the iterations ($t < T/2$), since Eqn. (4) holds only asymptotically. For large $C$ and $K$, the virtual ensemble should theoretically behave similarly to the SGLB real ensemble in the limit.

3.4. Uncertainty Estimation

Once the Bayesian (virtual) ensemble $\{P(y \mid x; \Theta^{(k)})\}_{k=1}^{K}$, $\Theta^{(k)} \sim p(\Theta \mid D_n)$, is learned, predictions can be made by taking an average over the ensemble, often known as the predictive posterior or Bayesian model average (BMA):

$P(y \mid x, D_n) = \mathbb{E}_{p(\Theta \mid D_n)}[P(y \mid x; \Theta)] \approx \frac{1}{K}\sum_{k=1}^{K} P(y \mid x; \Theta^{(k)}). \qquad (5)$

The entropy of the predictive posterior estimates the total uncertainty (TU) in predictions, which can be further decomposed into two distinct types of uncertainty: knowledge uncertainty and data uncertainty.¹ (a) Knowledge uncertainty (KU) arises due to the lack of knowledge about the data generation process (or the unknown distribution $P$). KU is expected to be large in regions (of the feature space) where we do not have sufficient training data. (b) Data uncertainty (DU) arises due to the inherent stochasticity within the data generation process, and it is high in regions with class overlap. In applications like active learning [25], reinforcement learning (RL) [26], and OOD detection, it is desirable to measure KU separately from DU (or TU), and the following equation can be used in practice to compute and connect them via mutual information [27]:

$\underbrace{\mathcal{I}(y; \Theta \mid x, D_n)}_{\text{Knowledge Uncertainty}} = \underbrace{\mathcal{H}\big(P(y \mid x, D_n)\big)}_{\text{Total Uncertainty}} - \underbrace{\mathbb{E}_{p(\Theta \mid D_n)}\big[\mathcal{H}\big(P(y \mid x; \Theta)\big)\big]}_{\text{Expected Data Uncertainty}} \approx \mathcal{H}\Big(\frac{1}{K}\sum_{k=1}^{K} P(y \mid x; \Theta^{(k)})\Big) - \frac{1}{K}\sum_{k=1}^{K} \mathcal{H}\big(P(y \mid x; \Theta^{(k)})\big), \qquad (6)$

where $\mathcal{I}(A; B)$ denotes the mutual information between random variables $A$ and $B$, and $\mathcal{H}(\cdot)$ denotes entropy. The difference between TU and DU measures the disagreement among members of the ensemble and estimates the knowledge uncertainty.²

¹ KU is also named epistemic uncertainty and DU is also called aleatoric uncertainty.
² See paper [12] for equations computing KU and DU in regression tasks.
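As an illustration of Eqns. (5)-(6), the following minimal sketch (our own, not from the paper) computes the BMA, total, expected data, and knowledge uncertainty from an array of per-member class probabilities; the function name `decompose_uncertainty` is hypothetical.

```python
import numpy as np

def decompose_uncertainty(member_probs, eps=1e-12):
    """member_probs: array of shape (K, N, C) with class probabilities
    P(y | x_i; Theta^(k)) from K ensemble members for N inputs and C classes.
    Returns (bma, total_u, data_u, knowledge_u), each defined per input."""
    bma = member_probs.mean(axis=0)                                    # Eqn. (5): predictive posterior
    total_u = -(bma * np.log(bma + eps)).sum(axis=-1)                  # entropy of the BMA (TU)
    member_entropy = -(member_probs * np.log(member_probs + eps)).sum(axis=-1)
    data_u = member_entropy.mean(axis=0)                               # expected data uncertainty (DU)
    knowledge_u = total_u - data_u                                     # Eqn. (6): mutual information (KU)
    return bma, total_u, data_u, knowledge_u

# toy usage: 3 members, 2 inputs, 3 classes; members disagree more on the second input
probs = np.array([
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
])
_, tu, du, ku = decompose_uncertainty(probs)
print(np.round(tu, 3), np.round(du, 3), np.round(ku, 3))  # KU is larger for the second input
```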
4. Cyclical Stochastic Gradient Langevin Boosting (cSGLB)

4.1. Promoting Mode Discovery via Cyclical Gradient Scheduling

Instead of building a cumbersome true ensemble of SGLB models, the virtual ensemble of SGLB greatly improves efficiency by training only a single model. However, similar to other types of SG-MCMC in Bayesian DL [13, 14, 15], single-chain SGLB gets trapped easily in a particular mode of the posterior. To efficiently explore different modes of the multimodal posterior and effectively measure uncertainty in GBDT predictions with a single chain, we propose a simple remedy that places a cyclical cosine schedule on the gradient scale during training, as illustrated in Fig. 1.
Specifically, the scaling factor at iteration $t$ is defined as:

$\alpha_t = \max\Big(\frac{\alpha_{\max}}{2}\Big[\cos\Big(\pi\,\frac{\mathrm{mod}(t, C)}{C}\Big) + 1\Big],\ \alpha_{\min}\Big), \qquad (7)$

where $\alpha_{\max} \ge 1$ is the maximum of the scaler (the initial value $\alpha_0$), $C$ is the user-defined cycle length, and $\alpha_{\min}$ defines the minimum of the scaler, e.g., $\alpha_{\min} = 1$ or $0.5$, since decaying the gradients to arbitrarily small values could be harmful for performance. Put together, this amounts to sampling the tree structure and learning the tree leaf parameters with the (re)scaled gradients:

$T_t \sim p\Big(T \,\Big|\, \alpha_t \hat g_t + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu'\Big) \quad \text{and} \quad \theta^*_{T_t} = -\Phi_{T_t}\Big(\alpha_t \hat g_t + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu\Big).$
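A minimal sketch of the schedule in Eqn. (7), written here for illustration (the function name is hypothetical); it also prints which iterations become virtual-ensemble members under cSGLB (the last iteration of each cycle).

```python
import math

def cyclical_gradient_scale(t, cycle_len, alpha_max=10.0, alpha_min=1.0):
    """Eqn. (7): cosine-decayed gradient scaler, restarting at alpha_max every cycle."""
    r = (t % cycle_len) / cycle_len                       # completed portion of the current cycle
    return max(alpha_max / 2.0 * (math.cos(math.pi * r) + 1.0), alpha_min)

T, C = 2000, 200
scales = [cyclical_gradient_scale(t, C) for t in range(T)]
sample_iters = [t for t in range(T) if (t + 1) % C == 0]  # one snapshot at the end of each cycle
print(round(scales[0], 2), round(scales[C // 2], 2), round(scales[C - 1], 2))  # 10.0, ~5.0, 1.0
print(sample_iters)  # [199, 399, ..., 1999] -> a virtual ensemble of T/C = 10 members
```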
Similar to Cyclical SG-MCMC [13], we define two stages within each cycle: (1) Exploration, when the completed portion of a cycle $r(t) = \frac{\mathrm{mod}(t, C)}{C}$ is smaller than a given threshold, $r(t) < \rho$, and (2) Sampling, when $r(t) \ge \rho$, where $\rho \in (0, 1)$ balances the proportion between exploration and sampling. We obtain one sample from the chain at the end of each cycle, i.e., the virtual ensemble is built by $\{\Theta^{(k)}\}_{k=1}^{K = \lfloor T/C \rfloor} = \{\hat\Theta_{Ck - 1},\ k = 1, \ldots, \lfloor T/C \rfloor\}$. Large gradients at the beginning of a cycle provide enough perturbation and encourage the model to escape from the current mode, while decreasing the gradient scale inside a cycle lets the sampler better characterize the density of the local mode. Moreover, many prior works on Bayesian NNs proposed to apply a certain form of preconditioning to compensate for the sampling noise from mini-batch training [15, 14]. Tree-based models can usually digest the full batch (the full dataset $D_n$) per iteration by leveraging modern multi-core processors and multi-threading. Therefore, we directly use full-batch GB in all sampling stages, while leaving the option of random data subsampling in exploration stages to the users if training time is a concern.

Combining cyclical gradient scaling with SGLB, we expect that our new Cyclical SGLB (cSGLB) algorithm could inherit most (if not all) theoretical properties of the original SGLB algorithm. Conceptually, with a proper choice of $\alpha_{\max}$, $\alpha_{\min}$ and cycle length $C$, the sample obtained at $t = C - 1$ from the Markov chain $f_{\hat\Theta_t}$ generated by Algo. 1 (w/o bootstrap) can be approximately seen as a random draw from the limiting distribution with small bounded errors. Also, each next cycle can be viewed as a warm restart from its previous cycle, and thus no errors shall accumulate into the subsequent cycles (at sampling times $t = Ci - 1$). We leave rigorous analysis and proofs of these propositions for future work. Empirically, we show in our experiments that the cyclical gradient scaling achieves similar effects in exploring a multimodal distribution when compared with cSG-MCMC, which places a similar cyclical schedule on the step size within the context of Bayesian DL. In fact, cSGLB extends the idea behind cSG-MCMC to tree-based GBDT models. We also summarize key differences between our design of cSGLB and cSG-MCMC in Appendix A.

4.2. Enhancing Sample Diversity via Bootstrapping

Recent work [16] provided a compelling analysis showing that the Bayesian posterior is not optimal under model misspecification³, where the performance of the true posterior is dominated by an alternative non-Bayesian posterior that explicitly encourages diversity among ensemble member predictions. Inspired by these results, we propose a simple strategy that promotes diversity among samples obtained from cSGLB by data bootstrapping. At the beginning of each cycle, we randomly sample a Bernoulli mask of size $n$, i.e., $m := \{m[i] \sim Bernoulli(p_{inc})\}_{i=1}^{n} \in \{0, 1\}^n$, where $p_{inc} \in (0, 1)$ defines the percentage of data being used. In the following exploration stage, we mask out the gradients $\hat g_t$ by taking an element-wise product with $m$, i.e., $\hat g_t \odot m$. The mask $m$ and the mask-out operation are used consistently throughout the exploration stage $r(t) < \rho$, and $m$ is not updated until the end of the cycle. This design amounts to learning with a bootstrapped subsample of the data in each cycle. Since the model consistently observes less data than the original $D_n$, it also amounts to posterior tempering $(P(D_n \mid \Theta)\, p(\Theta))^{1/\tau}$ with some temperature $\tau > 1$, resulting in a warm posterior that is softer than the Bayesian posterior. By increasing the temperature $\tau$, we expect to see increased density on the paths/corridors connecting different modes of the posterior [28, 29], further facilitating the sampler's escape from the current local mode. By using a relatively large $\rho \in (0.8, 1)$, the tempering effects carry over into the sampling stage. Therefore, the bootstrapping mechanism helps improve the sample diversity of cSGLB, and we name this variant Cyclical Bootstrapped SGLB (cbSGLB).

Lastly, we summarize our proposed cSGLB (plus the bootstrap option) in Algo. 1 and highlight our modifications on top of SGLB in blue.

³ The function class in use does not contain the unknown ground-truth function.
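A minimal sketch of the bootstrap masking step described above (our illustration; the names are hypothetical). It draws one Bernoulli mask per cycle and applies it to the gradients only during the exploration portion of that cycle.

```python
import numpy as np

def masked_gradients(g_hat, t, cycle_len, mask, rho=0.8, p_inc=0.6, rng=None):
    """Return the (possibly re-sampled) mask and the gradients to use at iteration t.

    g_hat : raw per-example gradients, shape (n,)
    mask  : the Bernoulli mask carried over within the current cycle, shape (n,)
    A new mask is drawn at the start of each cycle; during the sampling stage
    (r(t) >= rho) the mask is disabled and the full dataset is used again.
    """
    rng = rng or np.random.default_rng()
    r = (t % cycle_len) / cycle_len
    if t % cycle_len == 0:                        # start of a new cycle: fresh bootstrap subsample
        mask = (rng.random(g_hat.shape[0]) < p_inc).astype(float)
    if r >= rho:                                  # sampling stage: full-batch gradients
        return mask, g_hat
    return mask, g_hat * mask                     # exploration stage: bootstrapped gradients

# toy usage over one cycle of length 4
g, m = np.ones(10), np.ones(10)
for t in range(4):
    m, g_used = masked_gradients(g, t, cycle_len=4, mask=m, rho=0.75, p_inc=0.6,
                                 rng=np.random.default_rng(0))
    print(t, int(g_used.sum()))
```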
Algorithm 1: Cyclical (Bootstrapped) SGLB

Input: dataset $D_n$, learning rate $\epsilon > 0$, inverse temperature $\beta > 0$, regularization $\gamma > 0$, number of iterations $T > 0$, cycle length $C > 1$, scaler limits $\alpha_{\max}, \alpha_{\min} > 0$, stage threshold $\rho > 0$, mask probability $p_{inc} > 0$, boolean indicator $bootstrap$
Initialize $f_{\hat\Theta_0}(\cdot) = 0$ and $m = 1_n \in \mathbb{R}^n$ as the all-ones vector
for $t$ in $[0, 1, \ldots, T - 1]$ do
    if $bootstrap$ then
        if $\mathrm{mod}(t, C) = 0$ then
            Sample $m \in \mathbb{R}^n$ with $m[i] \sim Bernoulli(p_{inc})$
        end
        if $\mathrm{mod}(t, C)/C \ge \rho$ then
            Set $m = 1_n$
        end
    end
    Compute the gradient scaler: $\alpha_t = \max\big(\frac{\alpha_{\max}}{2}\big[\cos\big(\pi\,\frac{\mathrm{mod}(t, C)}{C}\big) + 1\big],\ \alpha_{\min}\big)$
    Estimate gradients $\hat g_t$ using $f_{\hat\Theta_t}(\cdot)$ and $D_n$: $\hat g_t = \big(\frac{\partial}{\partial z} L(f_{\hat\Theta_t}(x_i), y_i)\big)_{i=1}^{n} \in \mathbb{R}^n$
    Sample noise $\nu, \nu' \sim \mathcal{N}(0_n, I_n)$
    Sample the tree structure: $T_t \sim p\big(T \mid \alpha_t(\hat g_t \odot m) + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu'\big)$
    Estimate leaf/parameter values: $\theta^*_{T_t} = -\Phi_{T_t}\big(\alpha_t(\hat g_t \odot m) + \sqrt{\tfrac{2}{\epsilon\beta}}\,\nu\big)$
    Update the GBDT model: $f_{\hat\Theta_{t+1}}(\cdot) = (1 - \gamma\epsilon)\, f_{\hat\Theta_t}(\cdot) + \epsilon\, h_{T_t}(\cdot, \theta^*_{T_t})$
end
Return: $f_{\hat\Theta_T}(\cdot)$

5. Experiments

5.1. Experiments on Synthetic Data

We validate and qualitatively evaluate the proposed gradient scheduling and our cSGLB algorithm on two synthetic problems: (1) a synthetic multimodal dataset from [13], and (2) a multi-class Spiral dataset from [12]. Due to limited space, we include experimental details in Appendix B.

Synthetic Multimodal Data  We first demonstrate the ability of cyclical gradient scaling to sample from a multimodal distribution on a 2D mixture of 25 Gaussians. Specifically, we compare (i) the original SG-MCMC with SGLD (denoted as SGLD) and two SGLD variants: (ii) SGLD with a Cyclical schedule on the Learning Rate (denoted as clr-SGLD) [13] and (iii) SGLD with a Cyclical schedule on the Gradient scale (denoted as cg-SGLD (ours)). We reproduced the results for SGLD and clr-SGLD from paper [13] with the code released by the authors, and built our cg-SGLD on top of it. In addition, we experimented with a "noisy" version of SGLD with a fixed lr and a 10x larger noise scale (denoted as NoisySGLD/N-SGLD). For a fair comparison, each chain runs for 50K iterations and both clr-SGLD and cg-SGLD have 30 cycles. Fig. 2 shows the estimated density using the different sampling strategies. SGLD gets trapped in local modes depending on the random initial position, and increasing the noise scale does not solve the problem. In contrast, clr-SGLD and cg-SGLD can explore and locate roughly 7-8 different modes of the distribution, showing that our cg-SGLD can achieve state-of-the-art performance in exploring multimodal distributions. Moreover, cg-SGLD has implementation benefits over clr-SGLD when combined with SGLB. The SGLB algorithm was made available in the CatBoost library [30], which only supports a fixed lr. All our proposed enhancements can be implemented with a "user-defined loss function" available in CatBoost without touching the source code, making it straightforward to reproduce our algorithms.

Multi-Class Spiral Data  After validating the efficacy of cyclical gradient scheduling for sampling from multimodal distributions, we are now ready to experiment with cSGLB. Specifically, we compare the following algorithms on a 3-class classification task called "Spiral" from [12]: (i) the SGLB ensemble with $K$ models, denoted by ens$_K$, (ii) the SGLB virtual ensemble, simply denoted by SGLB, and (iii) the cSGLB virtual ensemble, denoted by cSGLB (ours). We again reproduced the results in [12] with the code released by the authors, and Fig. 3 shows the estimated KU on the Spiral test set. As noted in [12], we see that knowledge uncertainty due to decision-boundary "jitter" exists in both ens$_{20}$ and cSGLB, and the "jitter" affects cSGLB more, as the estimated KU is "noisy" at the decision boundary. Nevertheless, cSGLB (with only a single model) is significantly more efficient than ens$_{20}$ and is able to greatly improve upon SGLB in capturing high KU in regions with no training data.

5.2. Experiments on Real-World Weather Prediction Data

Lastly, we evaluate our proposed methods on the public Shifts Weather Prediction dataset [31]. We select the classification task, where the ML model is asked to predict the precipitation class at a particular geolocation and timestamp, given heterogeneous tabular features derived from weather station measurements and forecast models. The full dataset is partitioned in a canonical fashion and contains in-domain (ID) training, development and evaluation datasets as well as out-of-domain (OOD) development and evaluation datasets. Importantly, the ID data and the OOD data are separated in time and consist of non-overlapping climate types (ID: Tropical, Dry, Mild; OOD: Snow, Polar), making the Shifts dataset an ideal testbed for gauging the robustness of ML models and the quality of uncertainty estimation.
Figure 2: Sampling from a 5 by 5 mixture of 25 Gaussians. Panels: (a) Target, (b) SGLD, (c) N-SGLD, (d) clr-SGLD, (e) cg-SGLD (ours).
To further facilitate our experimentation, we conducted the following data preprocessing: (1) feature selection to keep only the top 40 features by importance (out of 123 available features), where the feature importance is determined by a CatBoost classifier with 1K trees trained on the entire training set; (2) dropping minority classes to keep only the 3 major precipitation classes, i.e., classes 0, 10, and 20 out of the 9 available classes in the original dataset; and (3) random data sampling to keep 200K (medium-sized) data points in the final training set. Again, the purpose of our data preprocessing is to speed up experimentation, and we believe that the observations and findings in this study generalize to the original full dataset. For model building, 30 independent SGLB models (each of 1K trees) were trained and used to construct real ensembles ens$_K$ for $K \in \{3, 5, 10, 30\}$. SGLB/cSGLB/cbSGLB virtual ensembles were built by sampling 10 members from a single chain with 2K trees. Hence, ens$_{10}$ is 5x more expensive in computation and memory than a virtual ensemble. Additional details regarding our data and models are included in Appendix C.

Figure 3: Spiral dataset and estimated knowledge uncertainties. Each different color in (a) represents a different class. Panels: (a) Spiral, (b) ens20, (c) SGLB, (d) cSGLB.

We compare the various methods on their predictive performance and on uncertainty quantification following [31], and the results are summarized in Table 1. For predictive performance, we report the classification accuracy and macro F1 using BMA on both the ID and the OOD evaluation datasets. We can see the following effects: (1) Virtual ensembles with a longer chain slightly outperform the real ensembles on the ID data. (2) Our proposed cSGLB and cbSGLB perform slightly worse than the rest of the methods on the OOD data. However, this is usually not a concern in practice, since the model is not trained with data from the OOD domains and would not be used to solve the OOD prediction tasks in a practical scenario. As long as the domain shifts can be reliably detected (via uncertainty), proactive decisions can be made to avoid costly mistakes due to model errors. (3) Our proposed data bootstrapping mechanism is capable of improving the performance on the OOD data (cbSGLB > cSGLB). In addition, we include the F1-AUC metric (on the combined ID&OOD evaluation sets) introduced in [31] to jointly assess predictive power and uncertainty quality. The F1-AUC can be increased either by having a stronger predictive model or by improving the correlation between uncertainty and error. Consistent with the findings in [31], total uncertainty (TU) correlates more with errors than knowledge uncertainty (KU), as shown by the F1-AUC scores. More specifically, we see that the F1-AUC is quite similar across the board when measured by TU, although cSGLB/cbSGLB has slightly worse predictive power on the OOD segment. When F1-AUC is measured by KU, our cSGLB/cbSGLB is capable of producing KU estimates that relate more closely to model errors than the KU from the SGLB baseline.

Finally, we present the OOD detection ROC-AUC performance on the evaluation data using the KU estimates. Our cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on the OOD detection task, and even achieves performance comparable to the real ensemble ens$_{10}$, which is 5x more expensive. This highlights that our cSGLB/cbSGLB can produce high-fidelity KU estimates to detect domain (or distributional) shifts with a single model, and that our proposed cyclical gradient scheduling is effective in exploring different modes of a posterior.
| Metric | Data | ens3 | ens5 | ens10 | ens30 | SGLB | cSGLB | cbSGLB |
|---|---|---|---|---|---|---|---|---|
| Accuracy (%) ↑ | eval-ID | 65.24 ± 0.02 | 65.25 ± 0.02 | 65.26 ± 0.01 | 65.26 | 65.63 ± 0.01 | **65.93 ± 0.01** | 65.59 ± 0.01 |
| Accuracy (%) ↑ | eval-OOD | 52.51 ± 0.06 | 52.55 ± 0.06 | 52.55 ± 0.03 | 52.54 | **52.50 ± 0.14** | 50.72 ± 0.17 | 51.49 ± 0.11 |
| Macro F1 (%) ↑ | eval-ID | 63.28 ± 0.02 | 63.29 ± 0.02 | 63.30 ± 0.01 | 63.30 | 63.69 ± 0.01 | **64.06 ± 0.02** | 63.67 ± 0.01 |
| Macro F1 (%) ↑ | eval-OOD | 52.36 ± 0.07 | 52.39 ± 0.06 | 52.38 ± 0.03 | 52.38 | **52.36 ± 0.13** | 50.64 ± 0.16 | 51.41 ± 0.15 |
| F1-AUC (%) ↑ | TU | 57.18 ± 0.01 | 57.19 ± 0.01 | 57.20 ± 0.01 | 57.20 | **57.26 ± 0.01** | 56.82 ± 0.03 | 56.95 ± 0.04 |
| F1-AUC (%) ↑ | KU | 52.94 ± 0.03 | 53.71 ± 0.03 | 54.33 ± 0.02 | 54.83 | 52.55 ± 0.08 | **53.60 ± 0.08** | 53.31 ± 0.14 |
| OOD-AUC (%) ↑ | KU | 65.72 ± 0.31 | 68.60 ± 0.21 | 71.15 ± 0.16 | 73.26 | 67.32 ± 0.70 | 71.45 ± 0.67 | **71.70 ± 0.91** |

Table 1: Comparing predictive performance & uncertainty measures of various methods on the Shifts Weather data. Mean ± Std Dev over 3 seeds. The best performance among the virtual ensembles (SGLB, cSGLB, cbSGLB) is highlighted in bold.
In real-world industrial applications, detecting OOD data or domain shifts in an efficient way is often crucial to ensure a safe deployment and operation of ML systems. Observing consistently high uncertainty (especially KU) in model predictions indicates that the patterns of new incoming data have deviated from the training data. This often provides a strong signal for a model refresh, ensuring that the ML system can be updated in time to avoid errors and operate safely in its "comfort zone" (with relatively low predictive uncertainty).

6. Conclusion

We present cyclical gradient scheduling and Cyclical SGLB for efficiently and effectively quantifying uncertainty in gradient boosting with a single model, and propose a data bootstrapping scheme to enhance diversity in posterior samples. We show empirically that our algorithms have superior performance over the state-of-the-art SGLB, especially in quantifying knowledge uncertainty and for OOD detection.

Accurately quantifying uncertainty in ML predictions can yield many benefits in real-world applications. Uncertainty directly measures the confidence in model predictions, not only improving interpretability but also ensuring that costly mistakes can be avoided proactively. Consistently observing high uncertainty in predictions on an incoming data stream often provides a reliable signal of a data distributional shift and/or of the model becoming obsolete. Uncertainty is also a key concept in optimal control and decision making, where it can be leveraged to experiment with different actions/decisions at controllable cost in search of the best operating point of an ML system. We believe that our work on efficient uncertainty quantification for GBDTs facilitates the adoption of uncertainty-enabled and uncertainty-aware ML systems, and more broadly promotes ML safety and a safe and trustworthy deployment of ML models in production.

References

[1] J. Dressel, H. Farid, The accuracy, fairness, and limits of predicting recidivism, Science Advances 4 (2018) eaao5580. URL: https://www.science.org/doi/abs/10.1126/sciadv.aao5580. doi:10.1126/sciadv.aao5580.
[2] P. Stone, R. Brooks, E. Brynjolfsson, R. Calo, O. Etzioni, G. Hager, J. Hirschberg, S. Kalyanakrishnan, E. Kamar, S. Kraus, et al., Artificial intelligence and life in 2030, One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel 52 (2016).
[3] A. Serban, E. Poll, J. Visser, Towards using probabilistic models to design software systems with inherent uncertainty (2020). URL: https://arxiv.org/abs/2008.03046. doi:10.48550/ARXIV.2008.03046.
[4] S. McGrath, P. Mehta, A. Zytek, I. Lage, H. Lakkaraju, When does uncertainty matter?: Understanding the impact of predictive uncertainty in ML assisted decision making (2020). URL: https://arxiv.org/abs/2011.06167. doi:10.48550/ARXIV.2011.06167.
[5] R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning is not all you need, 2021. URL: https://arxiv.org/abs/2106.03253. doi:10.48550/ARXIV.2106.03253.
[6] J. H. Friedman, Greedy function approximation: a gradient boosting machine, Annals of Statistics (2001) 1189-1232.
[7] J. H. Friedman, Stochastic gradient boosting, Computational Statistics & Data Analysis 38 (2002) 367-378.
[8] A. Malinin, Uncertainty estimation in deep learning with application to spoken language assessment, Ph.D. thesis, University of Cambridge, 2019.
[9] F. Provost, P. Domingos, Tree induction for probability-based ranking, Machine Learning 52 (2003) 199-215.
[10] U. Johansson, T. Löfström, H. Boström, Calibrating probability estimation trees using Venn-Abers predictors, in: Proceedings of the 2019 SIAM International Conference on Data Mining, SIAM, 2019, pp. 28-36.
[11] A. Ustimenko, L. Prokhorenkova, SGLB: Stochastic gradient Langevin boosting, in: International Conference on Machine Learning, PMLR, 2021, pp. 10487-10496.
[12] A. Malinin, L. Prokhorenkova, A. Ustimenko, Uncertainty in gradient boosting via ensembles, arXiv preprint arXiv:2006.10562 (2020).
[13] R. Zhang, C. Li, J. Zhang, C. Chen, A. G. Wilson, Cyclical stochastic gradient MCMC for Bayesian deep learning, arXiv preprint arXiv:1902.03932 (2019).
[14] F. Wenzel, K. Roth, B. S. Veeling, J. Świątkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, S. Nowozin, How good is the Bayes posterior in deep neural networks really?, arXiv preprint arXiv:2002.02405 (2020).
[15] P. Izmailov, S. Vikram, M. D. Hoffman, A. G. G. Wilson, What are Bayesian neural network posteriors really like?, in: International Conference on Machine Learning, PMLR, 2021, pp. 4629-4640.
[16] A. Masegosa, Learning under model misspecification: Applications to variational and ensemble methods, Advances in Neural Information Processing Systems 33 (2020) 5479-5491.
[17] M. Welling, Y. W. Teh, Bayesian learning via stochastic gradient Langevin dynamics, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), Citeseer, 2011, pp. 681-688.
[18] T. Chen, E. Fox, C. Guestrin, Stochastic gradient Hamiltonian Monte Carlo, in: International Conference on Machine Learning, PMLR, 2014, pp. 1683-1691.
[19] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, H. Neven, Bayesian sampling using stochastic gradient thermostats, Advances in Neural Information Processing Systems 27 (2014).
[20] C. Li, C. Chen, D. Carlson, L. Carin, Preconditioned stochastic gradient Langevin dynamics for deep neural networks, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[21] Y. W. Teh, A. H. Thiery, S. J. Vollmer, Consistency and fluctuations for stochastic gradient Langevin dynamics, The Journal of Machine Learning Research 17 (2016) 193-225.
[22] A. G. Wilson, P. Izmailov, Bayesian deep learning and a probabilistic perspective of generalization, Advances in Neural Information Processing Systems 33 (2020) 4697-4708.
[23] A. Ashukha, A. Lyzhov, D. Molchanov, D. Vetrov, Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, arXiv preprint arXiv:2002.06470 (2020).
[24] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, A. Schuler, NGBoost: Natural gradient boosting for probabilistic prediction, in: International Conference on Machine Learning, PMLR, 2020, pp. 2690-2700.
[25] B. Settles, Active learning literature survey (2009).
[26] T. Tan, Z. Xiong, V. R. Dwaracherla, Parameterized indexed value function for efficient exploration in reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 5948-5955.
[27] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, S. Udluft, Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems, stat 1050 (2017) 11.
[28] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, A. G. Wilson, Loss surfaces, mode connectivity, and fast ensembling of DNNs, Advances in Neural Information Processing Systems 31 (2018).
[29] F. Draxler, K. Veschgini, M. Salmhofer, F. Hamprecht, Essentially no barriers in neural network energy landscape, in: International Conference on Machine Learning, PMLR, 2018, pp. 1309-1318.
[30] A. V. Dorogush, V. Ershov, A. Gulin, CatBoost: gradient boosting with categorical features support, arXiv preprint arXiv:1810.11363 (2018).
[31] A. Malinin, N. Band, G. Chesnokov, Y. Gal, M. J. Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova, I. Provilkov, V. Raina, et al., Shifts: A dataset of real distributional shift across multiple large-scale tasks, arXiv preprint arXiv:2107.07455 (2021).

A. Comparison between cSGLB and cSG-MCMC

The proposed Cyclical SGLB algorithm combines SGLB with cSG-MCMC [13] to effectively explore different modes of a highly multimodal posterior distribution. In this section, we summarize some key differences between our design and the original cSG-MCMC algorithm.

(1) cSG-MCMC is a sampling algorithm designed for Bayesian NNs, while cSGLB is built for GBDT models. In deep learning, full-batch gradient descent is usually not feasible, and techniques have been developed to explicitly compensate for mini-batch noise, such as preconditioning [14]. Some also suggested applying an additional correction step called Metropolis-Hastings [15]. Tree-based GB models can easily scale up to large industrial datasets and digest the full training set at each iteration. Therefore, our cSGLB uses full-batch GB in the sampling stage of each cycle to ensure that high-quality samples are generated. (2) cSGLB puts a cyclical schedule on the gradient scale, while cSG-MCMC puts a schedule on the step size. In addition, the original cSG-MCMC completely removed the injected Gaussian noise in the exploration stage, so cSG-MCMC reduces to regular stochastic gradient descent (SGD) during the period of exploration. Although the authors claimed that this amounts to posterior tempering, which is commonly used in the DL domain, the implementation of the cSG-MCMC algorithm does not follow closely/strictly the dynamics of SGLD during the exploration stage. In contrast, we keep the injected noise term unchanged during the course of learning. Our design achieved similar effects compared with the step-size scheduling on a synthetic experiment, and we also ensure that cSGLB follows the dynamics of SGLD (more or less) at every iteration step. (3) Lastly, the gradient scaling (instead of step-size scheduling) has implementation benefits. The SGLB algorithm is made available in the CatBoost library [30], which only supports a constant step size (or learning rate). Our proposed cyclical gradient scaling (and data bootstrapping) can be implemented easily with the "user-defined loss function" available in the CatBoost package, without modifying a single line of the source code.
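To make point (3) concrete, here is a minimal sketch of how cyclical gradient scaling could be wired into a CatBoost user-defined objective; it is our own illustration, not the authors' released code. The sketch assumes a binary log-loss objective expressed through `calc_ders_range` (a documented CatBoost extension point) and assumes the objective can track the boosting iteration with a simple call counter, which may need adjustment if CatBoost evaluates the objective in several chunks per iteration.

```python
import math
import numpy as np
from catboost import CatBoostClassifier

class CyclicalLoglossObjective:
    """Binary log-loss whose first derivative is rescaled by the cyclical
    factor alpha_t from Eqn. (7). Iteration tracking via a call counter and
    the choice to scale only the first derivative are assumptions of this sketch."""

    def __init__(self, cycle_len=200, alpha_max=10.0, alpha_min=1.0):
        self.cycle_len, self.alpha_max, self.alpha_min = cycle_len, alpha_max, alpha_min
        self.t = 0  # assumed: one call to calc_ders_range per boosting iteration

    def _alpha(self):
        r = (self.t % self.cycle_len) / self.cycle_len
        return max(self.alpha_max / 2.0 * (math.cos(math.pi * r) + 1.0), self.alpha_min)

    def calc_ders_range(self, approxes, targets, weights):
        alpha = self._alpha()
        self.t += 1
        ders = []
        for approx, target in zip(approxes, targets):
            p = 1.0 / (1.0 + math.exp(-approx))
            der1 = target - p          # first derivative of the log-likelihood w.r.t. the raw score
            der2 = -p * (1.0 - p)      # second derivative, left unscaled in this sketch
            ders.append((alpha * der1, der2))
        return ders

# hypothetical usage on a small synthetic binary task
X = np.random.RandomState(0).randn(500, 5)
y = (X[:, 0] + 0.3 * np.random.RandomState(1).randn(500) > 0).astype(int)
model = CatBoostClassifier(iterations=400, learning_rate=0.05, depth=6,
                           loss_function=CyclicalLoglossObjective(cycle_len=100),
                           eval_metric="Logloss", verbose=False)
model.fit(X, y)
```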
| Data | Total samples | Tropical | Dry | Mild | Snow | Polar | Class 0 (%) | Class 10 (%) | Class 20 (%) |
|---|---|---|---|---|---|---|---|---|---|
| train-ID | 200,000 | 26,786 | 45,357 | 127,857 | 0 | 0 | 40.92 | 33.71 | 25.37 |
| dev-ID | 46,279 | 6,196 | 10,505 | 29,578 | 0 | 0 | 40.90 | 33.97 | 25.13 |
| dev-OOD | 46,555 | 0 | 0 | 0 | 46,555 | 0 | 38.35 | 35.07 | 26.58 |
| eval-ID | 518,587 | 69,245 | 117,981 | 331,361 | 0 | 0 | 40.98 | 33.71 | 25.30 |
| eval-OOD | 524,048 | 0 | 0 | 0 | 479,952 | 44,096 | 33.05 | 35.70 | 31.24 |

Table 2: Data summary of our partitioning of the Weather Prediction dataset. The climate-type columns report numbers of samples; the class columns report the percentage of each precipitation class.
B. Synthetic Data

B.1. Synthetic Multimodal Distribution

The ground-truth density of the distribution is

$F(x) = \sum_{i=1}^{25} \frac{1}{25}\, \mathcal{N}(x \mid \mu_i, \Sigma),$

where $\mu \in \{-4, -2, 0, 2, 4\} \times \{-4, -2, 0, 2, 4\}$ and $\Sigma = \begin{bmatrix} 0.03 & 0 \\ 0 & 0.03 \end{bmatrix}$. We used the code released by the authors of [13] to generate our results. Specifically, SGLD was trained with a decaying lr $\epsilon_t = 0.05\,(t+1)^{-0.55}$, and clr-SGLD was learned with a cyclical lr schedule with initial value $\epsilon_0 = 0.09$ and exploration proportion $\rho = 0.25$. For our cg-SGLD, we fixed the lr at $\epsilon = 0.01$ and the Gaussian noise scale at 0.4, and set $\alpha_{\max} = 10$, $\alpha_{\min} = 1$. The "noisy" version of SGLD (NoisySGLD/N-SGLD) was trained with a fixed lr $\epsilon = 0.02$ and noise scale 5.0 (roughly 10x larger than the noise scale used in the other methods). Each chain was trained for 50K iterations and both clr-SGLD and cg-SGLD had 30 cycles. The results and findings are robust to random seeds, and similar results were observed with different seeds. We refer interested readers to the original paper [13] for results of SGLD and clr-SGLD in parallel (multi-chain) settings.
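For reference, the following self-contained sketch (written by us; it does not reproduce the released code of [13]) runs an SGLD-style sampler with the cyclical gradient scale of Eqn. (7) on the 25-Gaussian mixture defined above, using illustrative rather than exactly matching hyperparameters.

```python
import numpy as np

# 5x5 grid of Gaussian means with shared covariance 0.03 * I
mus = np.array([[i, j] for i in (-4, -2, 0, 2, 4) for j in (-4, -2, 0, 2, 4)], dtype=float)
var = 0.03

def grad_log_density(x):
    """Gradient of log F(x) for the equally weighted mixture of 25 Gaussians."""
    diffs = mus - x                                      # (25, 2)
    logw = -0.5 * (diffs ** 2).sum(axis=1) / var         # unnormalized log component weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / var        # responsibility-weighted pull toward means

def cg_sgld(n_iters=50_000, n_cycles=30, lr=0.01, noise_scale=0.4,
            alpha_max=10.0, alpha_min=1.0, seed=0):
    rng = np.random.default_rng(seed)
    cycle_len = n_iters // n_cycles
    x = rng.uniform(-5, 5, size=2)
    samples = []
    for t in range(n_iters):
        r = (t % cycle_len) / cycle_len
        alpha = max(alpha_max / 2.0 * (np.cos(np.pi * r) + 1.0), alpha_min)
        # Langevin-style update: cyclically scaled gradient plus fixed injected noise
        x = x + lr * alpha * grad_log_density(x) + noise_scale * np.sqrt(lr) * rng.standard_normal(2)
        if r >= 0.75:                                    # keep draws from the tail of each cycle
            samples.append(x.copy())
    return np.array(samples)

draws = cg_sgld()
print(draws.shape, draws.mean(axis=0))
```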
B.2. Synthetic Spiral Dataset

All experiments were conducted using CatBoost [30], one of the state-of-the-art libraries for GBDTs. The ensemble of SGLB models (ens$_{20}$) contains 20 independent models (with different random seeds) of 1K trees each. The learning rate is $\epsilon = 0.1$, the tree depth is 6, random_strength = 100 and border_count = 128. The SGLB virtual ensemble and the cSGLB virtual ensemble are trained with the same parameters, except that we increase the number of trees to 2K and lower the lr for cSGLB to $\epsilon = 0.05$. Thus, a virtual ensemble is 10x more efficient in computation and memory than the actual SGLB ensemble. For the SGLB virtual ensemble, every 50th model from the interval [1000, 2000] is added to the ensemble, making it a virtual ensemble of 20 members. For the cSGLB virtual ensemble, we set $\epsilon = 0.05$, cycle length $C = 200$, $\alpha_{\max} = 10$, $\alpha_{\min} = 1$, making it a virtual ensemble of 2000/200 = 10 members. For cbSGLB with bootstrapping, we additionally set the exploration proportion $\rho = 0.9$ and the mask probability $p_{inc} = 0.66$.

C. Real-World Shifts Data

Data Summary  A detailed summary of our final partitioning of the Weather Prediction dataset is included in Table 2.

Experimental Details  We used the default parameter settings for SGLB models suggested in the original paper [12] for uncertainty quantification, except that we set the subsample rate to 0.8 for stochastic gradient boosting. The real SGLB ensemble consists of (up to) 30 SGLB models trained with different seeds, each of 1K trees. In order to get more samples from a single chain, the virtual ensembles of SGLB and of our cSGLB/cbSGLB were learned with a single model of 2K trees. We set the learning rate for all models to $\epsilon = 0.05$ and the tree depth to 6. For the SGLB virtual ensemble, every 100th model from the interval [1000, 2000] was added to the ensemble, making it a virtual ensemble of 10 members. cSGLB and cbSGLB shared the same parameters with their SGLB counterpart. In addition, for the cSGLB/cbSGLB virtual ensembles, we set cycle length $C = 200$, $\alpha_{\max} = 10.0$, $\alpha_{\min} = 1.0$, making each a virtual ensemble of 2000/200 = 10 members. For simplicity, cSGLB used full-batch gradient boosting at each iteration step. In contrast, for cbSGLB with bootstrapping, we set the exploration proportion $\rho = 0.8$, i.e., 80% of a cycle was treated as exploration, and set the mask probability $p_{inc} = 0.6$ in the exploration stage. For model and parameter selection, we only used the in-domain (ID) development set and did not use the out-of-domain (OOD) development set. Although this may potentially lower our reported performance on the OOD evaluation set, we believe that it better reflects a real-world learning scenario where the shifted data is often unobserved and unavailable at training time.
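As a minimal end-to-end illustration of the evaluation protocol (our own sketch, not the authors' code), the snippet below trains a small ensemble of CatBoost classifiers with different seeds, computes knowledge uncertainty via Eqn. (6), and scores OOD detection with ROC-AUC. Dataset loading and the SGLB/cSGLB-specific training options are intentionally left out, and the placeholder arrays standing in for the Shifts splits are hypothetical.

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

def knowledge_uncertainty(member_probs, eps=1e-12):
    """Eqn. (6): entropy of the BMA minus the average member entropy."""
    bma = member_probs.mean(axis=0)
    total_u = -(bma * np.log(bma + eps)).sum(axis=-1)
    data_u = -(member_probs * np.log(member_probs + eps)).sum(axis=-1).mean(axis=0)
    return total_u - data_u

# hypothetical placeholders for Shifts-style splits (features / labels)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 10)), rng.integers(0, 3, size=2000)
X_eval_id = rng.normal(size=(500, 10))
X_eval_ood = rng.normal(loc=2.0, size=(500, 10))        # shifted inputs stand in for OOD data

# small "real ensemble": K models with different random seeds
members = []
for seed in range(3):
    clf = CatBoostClassifier(iterations=200, depth=6, learning_rate=0.05,
                             loss_function="MultiClass", random_seed=seed, verbose=False)
    clf.fit(X_train, y_train)
    members.append(clf)

X_all = np.vstack([X_eval_id, X_eval_ood])
probs = np.stack([m.predict_proba(X_all) for m in members])   # (K, N, C)
ku = knowledge_uncertainty(probs)
is_ood = np.concatenate([np.zeros(len(X_eval_id)), np.ones(len(X_eval_ood))])
print("OOD detection ROC-AUC via KU:", round(roc_auc_score(is_ood, ku), 3))
```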