Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC

Tian Tan¹,*, Carlos Huertas¹ and Qi Zhao¹

¹ Buyer Risk Prevention - ML, WW Customer Trust, Amazon, Seattle, WA 98109, USA

Abstract
Gradient boosting decision trees (GBDTs) are widely applied to tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making, for avoiding costly mistakes, and for ensuring an interpretable and safe deployment of tree-based models. Recently, Bayesian ensembles of GBDT models have been used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines gradient boosting with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB easily gets trapped on a particular mode of the Bayesian posterior, just like other forms of SG-MCMC. A single SGLB model can therefore often fail to produce high-fidelity uncertainty estimates. To address this problem, we present Cyclical SGLB (cSGLB), which incorporates a cyclical gradient schedule into the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore highly multimodal posterior distributions. As a result, cSGLB can efficiently quantify uncertainty in gradient boosting with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the efficiency and effectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.

Keywords
uncertainty quantification, gradient boosting decision trees, Bayesian inference, out-of-domain (OOD) detection
1. Introduction

With the rapid growth of data and computing power, machine learning (ML) has been gaining many new applications in areas not imagined before. As ML systems become more ubiquitous, it is inevitable to see applications in very sensitive and high-risk fields, spanning numerous areas such as criminal recidivism [1], medical follow-ups [2] and autonomous systems [3]. While these systems may be very different, they share a common need: a certain degree of confidence in ML predictions. A proven, successful way to build confidence in critical systems is uncertainty estimation. Research has shown that humans are more likely to agree with a system if they have access to the corresponding uncertainty, and this holds true regardless of shape and variance, as the approach itself is model and task agnostic [4]. Since the most common data type in real-world ML applications is tabular [5], our work in this paper focuses specifically on uncertainty quantification for state-of-the-art gradient boosting decision trees (GBDTs) [6, 7], which are known to outperform deep learning (DL) methods on tabular data, both in accuracy and in tuning requirements [5]. Measuring uncertainty effectively and efficiently on GBDT predictions can therefore not only improve model interpretability in production but also ensure a safer deployment of ML systems, especially for high-risk applications.

The AAAI-23 Workshop on Artificial Intelligence Safety (SafeAI 2023), February 13-14, 2023, Washington, D.C., USA
* Corresponding author.
tianta@amazon.com (T. Tan); carlohue@amazon.com (C. Huertas); qqzhao@amazon.com (Q. Zhao)
Β© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

Uncertainty quantification (UQ) has been widely studied for neural networks under the Bayesian framework [8]; however, it remains relatively under-explored for tree-based models. Although calibrated probability estimation trees [9, 10] can be used for UQ, they have not been studied from a Bayesian perspective. Recently, Bayesian ensemble methods were extended to measure uncertainty in GBDTs by leveraging a new algorithm called stochastic gradient Langevin boosting (SGLB) [11]. Specifically, two SGLB-based approaches were introduced for UQ [12]: (1) the SGLB ensemble, which trains multiple SGLB models in parallel, and (2) the SGLB virtual ensemble, which constructs a virtual ensemble from a single SGLB model, where each member of the ensemble is a "truncated" sub-model [12].

Although both approaches are theoretically sound, there is clearly a trade-off between quality and efficiency in practice. The SGLB (real) ensemble is believed to be accurate, as it can characterize the Bayesian posterior well by running independent models in parallel. However, it is almost infeasible to deploy such an ensemble in real-world production due to its high computational and maintenance costs. The SGLB virtual ensemble greatly improves efficiency; however, it often gets stuck on a single mode of the Bayesian posterior and can produce degraded uncertainty estimates [13, 14]. To better balance quality and efficiency and to facilitate the usage of uncertainty-enabled ML systems, an important question remains: how can we make a single SGLB chain explore different modes of a posterior effectively, given a limited computational budget?

In this paper, we address the question above by combining the SGLB virtual ensemble with advanced sampling techniques from Bayesian DL [14, 13, 15, 16]. Inspired by the ideas in [13], we propose to use a scaler (or scaling factor) on gradients that follows a cyclical schedule during the course of SGLB training. The cyclical schedule is illustrated in Figure 1, and consequently we name the resulting algorithm Cyclical SGLB (cSGLB). Similar to [13], each cycle in cSGLB contains two stages. (1) Exploration: when the scaler is large, we treat this stage as a warm restart from the previous cycle, enabling the model/sampler to follow the gradients closely and to escape from the current local mode. (2) Sampling: when the gradient scaler is small, the scale of the injected Gaussian noise in the SGLB procedure becomes relatively large, encouraging the sampler to fully characterize one local mode. We collect one sample (or truncated sub-model) to build the virtual ensemble at the end of each cycle. The cyclical gradient schedule therefore helps cSGLB explore different modes of a posterior effectively, while maintaining the efficiency of a virtual ensemble. Moreover, inspired by a recent study [16] showing that a "diversified" posterior may provide a tighter generalization bound, we present another simple approach that encourages diversity in the samples obtained from running cSGLB via data bootstrapping. We name this variant Cyclical Bootstrapped SGLB (cbSGLB).

We extensively experiment with our proposed algorithms and compare their performance against the SGLB ensemble and the original SGLB virtual ensemble. In particular, we show that our cyclical gradient schedule helps explore multimodal distributions effectively, that cSGLB is capable of producing uncertainty estimates better aligned with the SGLB real ensemble, and that cSGLB/cbSGLB outperform the SGLB baseline by a large margin on out-of-domain (OOD) data detection, indicating superior performance in detecting distributional/domain shifts in real-world tabular data streams.

2. Related Work

Bayesian ML and approximate Bayesian inference provide a principled representation of uncertainty. One popular family of approaches to inference in Bayesian ML is stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods [17, 18, 19, 20, 13], which are used to effectively sample models (or model parameters) from the Bayesian posterior. Uncertainty then comes naturally by measuring the "discrepancy" in predictions from the sampled models, which are regarded as posterior samples. Recently, stochastic gradient Langevin boosting (SGLB) [11] was proposed by combining gradient boosting with SG-MCMC. As its name suggests, the Markov chain generated by SGLB obeys a special form of stochastic gradient Langevin dynamics (SGLD) [11, 17], which implies that SGLB is able to generate samples from the true Bayesian posterior asymptotically. Leveraging this property, Malinin et al. [12] proposed to use (1) an SGLB ensemble or (2) an SGLB virtual ensemble to measure uncertainty in GBDTs. Essentially, the SGLB ensemble corresponds to running multiple SG-MCMC chains in parallel, where each chain (or SGLB model) is initialized independently with a different random seed. Since SGLB allows us to sample from the true posterior, the ensemble with multiple samples gives a high-fidelity approximation to the Bayesian posterior. In contrast, the SGLB virtual ensemble trains only a single SGLB model and uses multiple truncated sub-models to form a (virtual) ensemble. The key idea is essentially to extract multiple samples from a single-chain SG-MCMC instead of running multiple chains in parallel.

Figure 1: Illustration of the proposed cyclical schedule on gradient scales for the SGLB algorithm.

In theory, SGLB (or any single-chain SG-MCMC) converges asymptotically to the target distribution and should behave similarly to the multi-chain SGLB ensemble in the limit, but it can suffer from a bounded estimation error in limited time [21]. Moreover, it is often believed that the posterior is highly multimodal in the parametric space of modern ML models [13], since there are potentially many different sets of parameters that can describe the training data equally well. The real ensemble can explore different modes of the posterior by running independent chains in parallel, providing a more complete picture of the distribution as the number of chains increases. However, a single-chain SG-MCMC often gets stuck on a single mode of the posterior [13, 14], failing to cover the full spectrum of the distribution.

In this paper, we extend the ideas behind Cyclical SG-MCMC (cSG-MCMC) in DL [13] to sampling from a tree-based SGLB model, which promotes new mode discovery during training. Different from cSG-MCMC, which places a cyclical schedule on the step size, we propose to use a cyclical schedule on the gradient scale. We point out and justify this difference and our design choice in Appendix A. In addition, we propose a simple strategy to further encourage diversity in the samples obtained from a single chain by data bootstrapping. At the beginning of each cycle (see Fig. 1), we construct a bootstrapped dataset that is a random subset of the training data, and use that bootstrapped data consistently during the exploration stage to update the GBDT model. The "bias" induced by data bootstrapping also amounts to posterior tempering [14, 15, 22, 23, 13].

3. Preliminaries

3.1. General Setup

Given a set of 𝑁 training data points sampled from an unknown distribution π’Ÿ on 𝒳 Γ— 𝒴, i.e., (π‘₯₁, 𝑦₁), ..., (π‘₯_𝑁, 𝑦_𝑁) ∼ π’Ÿ, denoted as π’Ÿ_𝑁, and a loss function 𝐿(𝑧, 𝑦) : 𝒡 Γ— 𝒴 β†’ R, where 𝒡 denotes the space of predictions, our goal is to minimize the empirical loss β„’(𝑓|π’Ÿ_𝑁) := (1/𝑁) Ξ£_{𝑖=1}^{𝑁} 𝐿(𝑓(π‘₯_𝑖), 𝑦_𝑖) over functions 𝑓 belonging to some family β„± βŠ‚ {𝑓 : 𝒳 β†’ 𝒡}. In this paper, we only consider β„± corresponding to additive ensembles of decision trees β„‹ := {β„Ž_𝑠(π‘₯, πœƒ_𝑠) : 𝒳 Γ— R^{π‘š_𝑠} β†’ R, 𝑠 ∈ 𝑆}, where 𝑆 is an index set and β„Ž_𝑠 has parameters πœƒ_𝑠. Decision trees are built by recursively partitioning the feature space into disjoint regions (called leaves). Each region is assigned a value that is used to estimate the response 𝑦 in the corresponding feature subspace. Denoting these regions by 𝑅_𝑗, we have β„Ž(π‘₯, πœƒ) = Ξ£_𝑗 πœƒ_𝑗 1{π‘₯ ∈ 𝑅_𝑗}, where 1{Β·} denotes the indicator function. Therefore, given the tree structure, a decision tree β„Ž_𝑠 is a linear function of its parameters πœƒ_𝑠. It is often assumed that the set 𝑆 is finite because the training data is finite [11, 12], e.g., there exists only a finite number of ways to partition the training data. Owing to the linear dependence of β„Ž_𝑠 on πœƒ_𝑠 and the finiteness assumption on 𝑆, we can represent any ensemble of models from β„‹ as a linear model 𝑓_Θ(π‘₯) = πœ‘(π‘₯)^𝑇 Θ for some feature map πœ‘(π‘₯) : 𝒳 β†’ R^π‘š, where Θ ∈ R^π‘š denotes the parameters of the entire ensemble [11]. Hence, in the subsequent discussion, we simply denote the parameters of the GBDT model obtained at iteration 𝜏 as Θ̂_𝜏, and additionally define a linear mapping 𝐻_𝑠 : R^{π‘š_𝑠} β†’ R^𝑁 that converts πœƒ_𝑠 to predictions (β„Ž_𝑠(π‘₯_𝑖, πœƒ_𝑠))_{𝑖=1}^{𝑁}.

3.2. SGLB

SGLB combines stochastic gradient boosting (SGB) [7] with stochastic gradient Langevin dynamics (SGLD) [17]. Following the notation of the original paper [11], we characterize the SGB procedure by a tuple ℬ := {β„‹, 𝑝}, where β„‹ again is the set of base learners and 𝑝(𝑠|𝑔) is a distribution over indices 𝑠 ∈ 𝑆 conditioned on a gradient vector 𝑔 ∈ R^𝑁. Simply put, 𝑝(𝑠|𝑔) defines a distribution over tree structures.

As with other GBDT algorithms, SGLB constructs an ensemble of decision trees iteratively. At each iteration 𝜏, we compute unbiased gradient estimates ĝ_𝜏 such that E[ĝ_𝜏] = (βˆ‚πΏ(𝑓_Θ̂_𝜏(π‘₯_𝑖), 𝑦_𝑖)/βˆ‚π‘“)_{𝑖=1}^{𝑁} ∈ R^𝑁 using the current model 𝑓_Θ̂_𝜏, and sample independently two normal vectors 𝜁, 𝜁′ ∼ 𝒩(0_𝑁, 𝐼_𝑁), where 0_𝑁, 𝐼_𝑁 denote the zero vector and the identity matrix in R^𝑁, respectively. Then, a base learner (or tree structure) 𝑠_𝜏 is picked by drawing one sample from 𝑝(𝑠 | ĝ_𝜏 + √(2𝑁/(πœ–π›½)) 𝜁′), where πœ– > 0 is a learning rate (or step size) and 𝛽 > 0 is a parameter often referred to as the inverse diffusion temperature. Next, we estimate the parameters πœƒ*_{𝑠_𝜏} (at the tree leaves) of the sampled base learner by solving the following optimization:

    minimize ||πœƒ_{𝑠_𝜏}||β‚‚Β²   s.t.   πœƒ_{𝑠_𝜏} ∈ argmin_{πœƒ ∈ R^{π‘š_{𝑠_𝜏}}} || βˆ’ĝ_𝜏 + √(2𝑁/(πœ–π›½)) 𝜁 βˆ’ 𝐻_{𝑠_𝜏} πœƒ ||β‚‚Β²,    (1)

which returns the minimum-norm solution that best fits the perturbed "noisy" version of the negative gradients. The optimization above has a closed-form solution πœƒ*_{𝑠_𝜏} = βˆ’Ξ¦_{𝑠_𝜏}(ĝ_𝜏 + √(2𝑁/(πœ–π›½)) 𝜁), where Ξ¦_{𝑠_𝜏} := (𝐻_{𝑠_𝜏}^𝑇 𝐻_{𝑠_𝜏})⁺ 𝐻_{𝑠_𝜏}^𝑇 and ⁺ denotes the pseudo-inverse. For decision trees, Ξ¦_{𝑠_𝜏} 𝑔 essentially corresponds to averaging the gradient estimates 𝑔 in each leaf node of the tree. Lastly, the SGLB algorithm updates the ensemble model by

    𝑓_Θ̂_{𝜏+1}(Β·) := (1 βˆ’ π›Ύπœ–) 𝑓_Θ̂_𝜏(Β·) + πœ– β„Ž^{𝑠_𝜏}(Β·, πœƒ*^{𝑠_𝜏}),    (2)

where 𝛾 is a regularization parameter that "shrinks" the currently built model when updating the ensemble. At a high level, SGLB is a stochastic GB algorithm with Gaussian noise injected into the gradient estimates, which encourages the algorithm to explore a larger area of the functional space to find a better fit for the given data. The independence between the noise 𝜁 (used for parameter learning) and 𝜁′ (used for tree sampling), and the model shrinkage by 𝛾 in Eqn. (2), are technical details needed for establishing theoretical results and a rigorous analysis of SGLB [11]. All the procedures of SGLB are also present in our proposed cSGLB in Algo. 1 (with our additional modifications highlighted in blue).

One can show that the parameters Θ̂_𝜏 of SGLB at each iteration form a Markov chain that weakly converges to the following stationary distribution:

    𝑝*_𝛽(Θ) ∝ exp(βˆ’π›½ β„’(Θ|π’Ÿ_𝑁) βˆ’ 𝛽𝛾 ||Ξ“Ξ˜||β‚‚Β²),    (3)

where Ξ“ = Ξ“^𝑇 > 0 is a regularization matrix which depends on the particular tree construction algorithm, i.e., on the choice of the tuple ℬ := {β„‹, 𝑝(𝑠|𝑔)} [11]. Note that since the GBDT model is linear and fully determined by the parameters Θ, we use the notations β„’(𝑓|π’Ÿ_𝑁) and β„’(Θ|π’Ÿ_𝑁) interchangeably.

3.3. Posterior Sampling

We consider here a standard Bayesian learning framework [8] that treats the parameters Θ as random variables and places a prior 𝑝(Θ) over Θ. In addition, we consider the GBDT model 𝑓_Θ as a probabilistic model and explicitly denote it by 𝑃(𝑦|π‘₯; Θ) with parameters Θ. This holds naturally for classification, as GBDT models by construction return a distribution over class labels. For regression, one can leverage the NGBoost algorithm [24] to return the mean and variance of a Gaussian distribution over the target 𝑦 for a given input π‘₯.

For the purpose of uncertainty estimation, we aim to estimate or approximate the Bayesian posterior 𝑝(Θ|π’Ÿ_𝑁). To that end, we can choose 𝛽 = 𝑁 and 𝛾 = 1/(2𝑁) and use the negative log-likelihood as the loss function, β„’(Θ|π’Ÿ_𝑁) = E_{π’Ÿ_𝑁}[βˆ’log 𝑝(𝑦|π‘₯, Θ)] = βˆ’(1/𝑁) Ξ£_{𝑖=1}^{𝑁} log 𝑝(𝑦_𝑖|π‘₯_𝑖, Θ). Then, the limiting distribution of SGLB can be explicitly expressed as:

    𝑝*_𝛽(Θ) ∝ exp( log 𝑝(π’Ÿ_𝑁|Θ) βˆ’ (1/2) ||Ξ“Ξ˜||β‚‚Β² ) ∝ 𝑝(π’Ÿ_𝑁|Θ) 𝑝(Θ),    (4)

which is proportional to the true posterior 𝑝(Θ|π’Ÿ_𝑁) under the Gaussian prior 𝑝(Θ) = 𝒩(0_π‘š, Ξ“) [11].

Now, consider a Bayesian ensemble of probabilistic models {𝑃(𝑦|π‘₯; Θ^{(π‘˜)})}_{π‘˜=1}^{𝐾}, where each model is trained independently by running SGLB. Since each Θ^{(π‘˜)} is guaranteed to be sampled from 𝑝(Θ|π’Ÿ_𝑁) by Eqn. (4), the ensemble {Θ^{(π‘˜)}}_{π‘˜=1}^{𝐾} with 𝐾 samples yields a "discrete" approximation to the posterior 𝑝(Θ|π’Ÿ_𝑁). This is exactly the idea behind the SGLB ensemble [12], which learns 𝐾 independent SGLB models in parallel with different random seeds. Although the approximation improves as 𝐾 increases, the computational cost also increases linearly with 𝐾.

To alleviate the computational burden, the SGLB virtual ensemble [12] builds a Bayesian virtual ensemble by sampling multiple times from a single-chain SGLB model. Because samples from the same chain are highly correlated, the SGLB virtual ensemble samples one member Θ^{(π‘˜)} every 𝐢 > 1 iterations. More specifically, the parameters are sampled by {Θ^{(π‘˜)}}_{π‘˜=1}^{𝐾=βŒŠπ’―/(2𝐢)βŒ‹} = {Θ̂_{𝐢(π‘˜+βŒŠπ’―/(2𝐢)βŒ‹)}, π‘˜ = 1, ..., βŒŠπ’―/(2𝐢)βŒ‹}, i.e., appending one member to the ensemble every 𝐢 iterations while constructing one SGLB model over 𝒯 iterations of gradient boosting. Notice that no sampling is performed during the first half of the iterations (𝜏 < 𝒯/2), since Eqn. (4) holds only asymptotically. For large 𝐢 and 𝐾, the virtual ensemble should theoretically behave similarly to the SGLB real ensemble in the limit.

3.4. Uncertainty Estimation

Once the Bayesian (virtual) ensemble {𝑃(𝑦|π‘₯; Θ^{(π‘˜)})}_{π‘˜=1}^{𝐾}, Θ^{(π‘˜)} ∼ 𝑝(Θ|π’Ÿ_𝑁), is learned, predictions can be made by taking an average over the ensemble, often known as the predictive posterior or Bayesian model average (BMA):

    𝑝(𝑦|π‘₯, π’Ÿ_𝑁) = E_{𝑝(Θ|π’Ÿ_𝑁)}[𝑃(𝑦|π‘₯; Θ)] β‰ˆ (1/𝐾) Ξ£_{π‘˜=1}^{𝐾} 𝑃(𝑦|π‘₯; Θ^{(π‘˜)}).    (5)

The entropy of the predictive posterior estimates the total uncertainty (TU) in predictions, which can be further decomposed into two distinct types of uncertainty: knowledge uncertainty and data uncertainty.ΒΉ (a) Knowledge uncertainty (KU) arises due to the lack of knowledge about the data generation process (or the unknown distribution π’Ÿ). KU is expected to be large in regions (of the feature space) where we do not have sufficient training data. (b) Data uncertainty (DU) arises due to the inherent stochasticity of the data generation process, and it is high in regions with class overlap. In applications like active learning [25], reinforcement learning (RL) [26], and OOD detection, it is desirable to measure KU separately from DU (or TU), and the following equation can be used in practice to compute and connect them via mutual information [27]:

    I(𝑦; Θ | π‘₯, π’Ÿ_𝑁)   [knowledge uncertainty]
      = H(𝑝(𝑦|π‘₯, π’Ÿ_𝑁))   [total uncertainty]   βˆ’ E_{𝑝(Θ|π’Ÿ_𝑁)}[H(𝑃(𝑦|π‘₯; Θ))]   [expected data uncertainty]
      β‰ˆ H( (1/𝐾) Ξ£_{π‘˜=1}^{𝐾} 𝑃(𝑦|π‘₯; Θ^{(π‘˜)}) ) βˆ’ (1/𝐾) Ξ£_{π‘˜=1}^{𝐾} H( 𝑃(𝑦|π‘₯; Θ^{(π‘˜)}) ),    (6)

where I(𝐴; 𝐡) denotes the mutual information between random variables 𝐴 and 𝐡, and H(Β·) denotes entropy. The difference between TU and DU measures the disagreement among members of the ensemble and estimates the knowledge uncertainty.Β²

ΒΉ KU is also named epistemic uncertainty, and DU is also called aleatoric uncertainty.
Β² See [12] for the equations computing KU and DU in regression tasks.

4. Cyclical Stochastic Gradient Langevin Boosting (cSGLB)

4.1. Promoting Mode Discovery via Cyclical Gradient Scheduling

Instead of building a cumbersome true ensemble of SGLB models, the virtual ensemble of SGLB greatly improves efficiency by training only a single model. However, similar to other types of SG-MCMC in Bayesian DL [13, 14, 15], single-chain SGLB easily gets trapped on a particular mode of the posterior. To efficiently explore different modes of the multimodal posterior and effectively measure uncertainty in GBDT predictions with a single chain, we propose a simple remedy that places a cyclical cosine schedule on the gradient scale during training, as illustrated in Fig. 1. Specifically, the scaling factor at iteration 𝜏 is defined as:

    𝛼_𝜏 = max( (𝛼_max/2) [ cos(πœ‹ mod(𝜏, 𝐢)/𝐢) + 1 ], 𝛼_min ),    (7)

where 𝛼_max β‰₯ 1 is the maximum of the scaler (the initial value 𝛼₀), 𝐢 is the user-defined cycle length, and 𝛼_min defines the minimum of the scaler, e.g., 𝛼_min = 1 or 0.5, since decaying the gradients to arbitrarily small values could harm performance. Put together, this amounts to sampling the tree structure and learning the tree leaf parameters with the (re)scaled gradients: 𝑠_𝜏 ∼ 𝑝(𝑠 | 𝛼_𝜏 ĝ_𝜏 + √(2𝑁/(πœ–π›½)) 𝜁′) and πœƒ*_{𝑠_𝜏} = βˆ’Ξ¦_{𝑠_𝜏}(𝛼_𝜏 ĝ_𝜏 + √(2𝑁/(πœ–π›½)) 𝜁).

Similar to Cyclical SG-MCMC [13], we define two stages within each cycle: (1) Exploration, when the completed portion of a cycle, π’œ(𝜏) = mod(𝜏, 𝐢)/𝐢, is smaller than a given threshold, π’œ(𝜏) < πœ‚; and (2) Sampling, when π’œ(𝜏) β‰₯ πœ‚, where πœ‚ ∈ (0, 1) balances the portions of exploration and sampling. We obtain one sample from the chain at the end of each cycle, i.e., the virtual ensemble is built by {Θ^{(π‘˜)}}_{π‘˜=1}^{𝐾=βŒŠπ’―/πΆβŒ‹} = {Θ̂_{πΆπ‘˜βˆ’1}, π‘˜ = 1, ..., βŒŠπ’―/πΆβŒ‹}. Large gradients at the beginning of a cycle provide enough perturbation to encourage the model to escape from the current mode, while decreasing the gradient scale within a cycle lets the sampler better characterize the density of the local mode. Moreover, many prior works on Bayesian NNs proposed to apply some form of preconditioning to compensate for sampling noise from mini-batch training [15, 14]. Tree-based models can usually digest the full batch (the full dataset π’Ÿ_𝑁) per iteration by leveraging modern multi-core processors and multi-threading. Therefore, we directly use full-batch GB in all sampling stages, while leaving the option of random data subsampling in exploration stages to users for whom training time is a concern.

Combining cyclical gradient scaling with SGLB, we expect our new Cyclical SGLB (cSGLB) algorithm to inherit most (if not all) theoretical properties of the original SGLB algorithm. Conceptually, with a proper choice of 𝛼_max, 𝛼_min and cycle length 𝐢, the sample obtained at 𝜏 = 𝐢 βˆ’ 1 from the Markov chain 𝑓_Θ̂_𝜏 generated by Algo. 1 (w/o bootstrap) can approximately be seen as a random draw from the limiting distribution with small bounded errors. Also, each new cycle can be viewed as a warm restart from its previous cycle, and thus no errors should accumulate into subsequent cycles (at sampling times 𝜏 = πΆπ‘˜ βˆ’ 1). We leave a rigorous analysis and proofs of these propositions for future work. Empirically, we show in our experiments that cyclical gradient scaling achieves effects similar to cSG-MCMC, which places an analogous cyclical schedule on the step size, in exploring a multimodal distribution within the context of Bayesian DL. In fact, cSGLB extends the idea behind cSG-MCMC to tree-based GBDT models. We also summarize the key differences between our design of cSGLB and cSG-MCMC in Appendix A.

4.2. Enhancing Sample Diversity via Bootstrapping

Recent work [16] provided a compelling analysis showing that the Bayesian posterior is not optimal under model misspecificationΒ³, where the performance of the true posterior is dominated by an alternative non-Bayesian posterior that explicitly encourages diversity among ensemble member predictions. Inspired by these results, we propose a simple strategy that promotes diversity among the samples obtained from cSGLB by data bootstrapping. At the beginning of each cycle, we randomly sample a Bernoulli mask of size 𝑁, i.e., 𝜈 := {𝜈[𝑖] ∼ Bernoulli(𝑝_bm)}_{𝑖=1}^{𝑁} ∈ {0, 1}^𝑁, where 𝑝_bm ∈ (0, 1) defines the percentage of data being used. In the following exploration stage, we mask out the gradients ĝ_𝜏 by taking an element-wise product with 𝜈, i.e., ĝ_𝜏 βŠ™ 𝜈. The mask 𝜈 and the mask-out operation are used consistently throughout the exploration stage π’œ(𝜏) < πœ‚, and 𝜈 is only updated at the end of the cycle. This design amounts to learning with a bootstrapped subsample of the data in each cycle. Since the model consistently observes less data than the original π’Ÿ_𝑁, it also amounts to posterior tempering (𝑝(π’Ÿ_𝑁|Θ)𝑝(Θ))^{1/𝑇} with some temperature 𝑇 > 1, resulting in a warm posterior that is softer than the Bayesian posterior. By increasing the temperature 𝑇, we expect to see increased density on the paths/corridors connecting different modes of the posterior [28, 29], further facilitating the sampler's escape from the current local mode. By using a relatively large πœ‚ ∈ (0.8, 1), the tempering effects carry over into the sampling stage. The bootstrapping mechanism therefore helps improve the diversity of the samples from cSGLB, and we name this variant Cyclical Bootstrapped SGLB (cbSGLB).

Β³ The function class in use does not contain the unknown ground-truth function.
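To make the two mechanisms above concrete, the cyclical scaler of Eqn. (7) and the per-cycle Bernoulli mask of cbSGLB can be sketched as follows. This is a minimal illustration in plain NumPy; the function and variable names are ours and are not part of any released implementation:

```python
import numpy as np

def gradient_scaler(tau, C, alpha_max=1.0, alpha_min=0.5):
    """Cyclical cosine schedule on the gradient scale, Eqn. (7).

    Starts each cycle at alpha_max and decays toward alpha_min as the
    cycle progresses, so exploration precedes sampling within a cycle.
    """
    return max(alpha_max / 2.0 * (np.cos(np.pi * (tau % C) / C) + 1.0),
               alpha_min)

def bootstrap_mask(n, p_bm, rng):
    """Bernoulli(p_bm) data mask, redrawn at the start of each cycle (cbSGLB)."""
    return (rng.random(n) < p_bm).astype(float)

# Inside the boosting loop (sketch), the rescaled and masked gradients
# would feed both tree-structure sampling and leaf estimation:
#   g_used = gradient_scaler(tau, C, a_max, a_min) * (g_hat * mask)
```

At the start of a cycle (mod(𝜏, 𝐢) = 0) the scaler equals 𝛼_max, and near the end of the cycle it is clipped at 𝛼_min, matching the exploration/sampling split controlled by πœ‚.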
Lastly, we summarize our proposed cSGLB (plus the bootstrap option) in Algo. 1 and highlight our modifications on top of SGLB in blue.

Algorithm 1: Cyclical (Bootstrapped) SGLB
    Input: dataset π’Ÿ_𝑁, learning rate πœ– > 0, inverse temperature 𝛽 > 0, regularization 𝛾 > 0, number of iterations 𝒯 > 0, cycle length 𝐢 > 1, scaler limits 𝛼_max, 𝛼_min > 0, stage threshold πœ‚ > 0, mask probability 𝑝_bm > 0, boolean indicator bootstrap
    Initialize 𝑓_Θ̂₀(Β·) = 0, 𝜈 = 1_𝑁 ∈ R^𝑁 as an all-ones vector
    for 𝜏 in [0, 1, ..., 𝒯 βˆ’ 1] do
        if bootstrap then
            if mod(𝜏, 𝐢) = 0 then
                Sample 𝜈 ∈ R^𝑁 with 𝜈[𝑖] ∼ Bernoulli(𝑝_bm)
            end
            if mod(𝜏, 𝐢)/𝐢 β‰₯ πœ‚ then
                Set 𝜈 = 1_𝑁
            end
        end
        Compute the gradient scaler: 𝛼_𝜏 = max( (𝛼_max/2) [cos(πœ‹ mod(𝜏, 𝐢)/𝐢) + 1], 𝛼_min )
        Estimate the gradients ĝ_𝜏 using 𝑓_Θ̂_𝜏(Β·) and π’Ÿ_𝑁: ĝ_𝜏 = (βˆ‚πΏ(𝑓_Θ̂_𝜏(π‘₯_𝑖), 𝑦_𝑖)/βˆ‚π‘“)_{𝑖=1}^{𝑁} ∈ R^𝑁
        Sample noise 𝜁, 𝜁′ ∼ 𝒩(0_𝑁, 𝐼_𝑁)
        Sample the tree structure: 𝑠_𝜏 ∼ 𝑝( 𝑠 | 𝛼_𝜏 (ĝ_𝜏 βŠ™ 𝜈) + √(2𝑁/(πœ–π›½)) 𝜁′ )
        Estimate the leaf/parameter values: πœƒ*_{𝑠_𝜏} = βˆ’Ξ¦_{𝑠_𝜏}( 𝛼_𝜏 (ĝ_𝜏 βŠ™ 𝜈) + √(2𝑁/(πœ–π›½)) 𝜁 )
        Update the GBDT model: 𝑓_Θ̂_{𝜏+1}(Β·) = (1 βˆ’ π›Ύπœ–) 𝑓_Θ̂_𝜏(Β·) + πœ– β„Ž^{𝑠_𝜏}(Β·, πœƒ*^{𝑠_𝜏})
    end
    Return: 𝑓_Θ̂_𝒯(Β·)

5. Experiments

5.1. Experiments on Synthetic Data

We validate and qualitatively evaluate the proposed gradient scheduling and our cSGLB algorithm on two synthetic problems: (1) a synthetic multimodal dataset from [13], and (2) a multi-class Spiral dataset from [12]. Due to limited space, we include the experimental details in Appendix B.

Synthetic Multimodal Data. We first demonstrate the ability of cyclical gradient scaling to sample from a multimodal distribution on a 2D mixture of 25 Gaussians. Specifically, we compare (i) the original SG-MCMC with SGLD (denoted SGLD) and two SGLD variants: (ii) SGLD with a cyclical schedule on the learning rate (denoted clr-SGLD) [13] and (iii) SGLD with a cyclical schedule on the gradient scale (denoted cg-SGLD (ours)). We reproduced the results for SGLD and clr-SGLD from [13] with the code released by the authors, and built our cg-SGLD on top of it. In addition, we experimented with a "noisy" version of SGLD with a fixed learning rate and a 10Γ— larger noise scale (denoted NoisySGLD/N-SGLD). For a fair comparison, each chain runs for 50k iterations, and both clr-SGLD and cg-SGLD use 30 cycles. Fig. 2 shows the estimated density under the different sampling strategies. SGLD gets trapped in local modes depending on the random initial position, and increasing the noise scale does not solve the problem. In contrast, clr-SGLD and cg-SGLD can explore and locate roughly 7-8 different modes of the distribution, showing that our cg-SGLD achieves state-of-the-art performance in exploring multimodal distributions. Moreover, cg-SGLD has implementation benefits over clr-SGLD when combined with SGLB. The SGLB algorithm is available in the CatBoost library [30], which only supports a fixed learning rate. All our proposed enhancements can be implemented with a "user-defined loss function" available in CatBoost without touching the source code, making it straightforward to reproduce our algorithms.

Multi-Class Spiral Data. After validating the efficacy of cyclical gradient scheduling for sampling from multimodal distributions, we are now ready to experiment with cSGLB. Specifically, we compare the following algorithms on a 3-class classification task called "Spiral" from [12]: (i) the SGLB ensemble with 𝐾 models, denoted ens𝐾, (ii) the SGLB virtual ensemble, simply denoted SGLB, and (iii) the cSGLB virtual ensemble, denoted cSGLB (ours). We again reproduced the results in [12] with the code released by the authors, and Fig. 3 shows the estimated KU on the Spiral test set. As noted in [12], we see that knowledge uncertainty due to decision-boundary "jitter" exists in both ens20 and cSGLB, and the "jitter" affects cSGLB more, as its estimated KU is "noisy" at the decision boundary. Nevertheless, cSGLB (with only a single model) is significantly more efficient than ens20 and greatly improves upon SGLB in capturing high KU in regions with no training data.

5.2. Experiments on Real-World Weather Prediction Data

Lastly, we evaluate our proposed methods on the public Shifts Weather Prediction dataset [31]. We select the classification task, where the ML model is asked to predict the precipitation class at a particular geolocation and timestamp, given heterogeneous tabular features derived from weather station measurements and forecast models. The full dataset is partitioned in a canonical fashion and contains in-domain (ID) training, development and evaluation datasets as well as out-of-domain (OOD) development and evaluation datasets.
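The uncertainty measures compared throughout these experiments follow Eqns. (5) and (6): TU is the entropy of the BMA predictive distribution, expected DU is the average entropy of the individual members, and KU is their difference. A minimal sketch of this decomposition for one input (our own helper, not the authors' released code):

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """Decompose ensemble uncertainty per Eqn. (6) for a single input.

    member_probs: (K, n_classes) array with one predictive distribution
    per ensemble member. Returns (TU, KU, expected DU) in nats.
    """
    eps = 1e-12  # guards log(0) for degenerate distributions
    bma = member_probs.mean(axis=0)            # Eqn. (5): BMA predictive posterior
    total = -np.sum(bma * np.log(bma + eps))   # TU: entropy of the BMA
    expected_data = -np.mean(
        np.sum(member_probs * np.log(member_probs + eps), axis=1)
    )                                          # average member entropy
    knowledge = total - expected_data          # KU: mutual information
    return total, knowledge, expected_data
```

Identical members give KU β‰ˆ 0, while disagreeing members raise KU even when each member is individually confident; it is this KU score that is thresholded for the OOD detection results reported below.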
Importantly, the ID data and the OOD data are separated in time and consist of non-overlapping climate types (ID: Tropical, Dry, Mild; OOD: Snow, Polar), making the Shifts dataset an ideal testbed for gauging the robustness of ML models and the quality of uncertainty estimation. To further facilitate our experimentation, we conducted the following data preprocessing: (1) feature selection, keeping only the top 40 features by importance (out of the 123 available features), where feature importance is determined by a CatBoost classifier with 1K trees trained on the entire training set; (2) dropping minority classes, keeping only the 3 major precipitation classes, i.e., classes 0, 10 and 20 out of the 9 classes in the original dataset; and (3) random data subsampling, keeping 200K (medium-sized) data points in the final training set. Again, the purpose of our data preprocessing is to speed up experimentation, and we believe that the observations and findings in this study generalize to the original full dataset. For model building, 30 independent SGLB models (each with 1K trees) were trained and used to construct real ensembles ens𝐾 for 𝐾 ∈ {3, 5, 10, 30}. SGLB/cSGLB/cbSGLB virtual ensembles were built by sampling 10 members from a single chain with 2K trees. Hence, ens10 is 5Γ— more expensive in computation and memory than a virtual ensemble. Additional details regarding our data and models are included in Appendix C.

Figure 2: Sampling from a 5 by 5 mixture of 25 Gaussians. Panels: (a) Target, (b) SGLD, (c) N-SGLD, (d) clr-SGLD, (e) cg-SGLD (ours).

Figure 3: Spiral dataset and estimated knowledge uncertainties. Panels: (a) Spiral, (b) ens20, (c) SGLB, (d) cSGLB. Each different color in (a) represents a different class.

Table 1: Comparing the predictive performance and uncertainty measures of the various methods on the Shifts Weather data. Mean Β± std dev over 3 seeds. The best performance among virtual ensembles is highlighted in bold.

Metric        | Data     | ens3         | ens5         | ens10        | ens30 | SGLB         | cSGLB        | cbSGLB
Accuracy (%)↑ | eval-ID  | 65.24 Β± 0.02 | 65.25 Β± 0.02 | 65.26 Β± 0.01 | 65.26 | 65.63 Β± 0.01 | 65.93 Β± 0.01 | 65.59 Β± 0.01
Accuracy (%)↑ | eval-OOD | 52.51 Β± 0.06 | 52.55 Β± 0.06 | 52.55 Β± 0.03 | 52.54 | 52.50 Β± 0.14 | 50.72 Β± 0.17 | 51.49 Β± 0.11
Macro F1 (%)↑ | eval-ID  | 63.28 Β± 0.02 | 63.29 Β± 0.02 | 63.30 Β± 0.01 | 63.30 | 63.69 Β± 0.01 | 64.06 Β± 0.02 | 63.67 Β± 0.01
Macro F1 (%)↑ | eval-OOD | 52.36 Β± 0.07 | 52.39 Β± 0.06 | 52.38 Β± 0.03 | 52.38 | 52.36 Β± 0.13 | 50.64 Β± 0.16 | 51.41 Β± 0.15
F1-AUC (%)↑   | TU       | 57.18 Β± 0.01 | 57.19 Β± 0.01 | 57.20 Β± 0.01 | 57.20 | 57.26 Β± 0.01 | 56.82 Β± 0.03 | 56.95 Β± 0.04
F1-AUC (%)↑   | KU       | 52.94 Β± 0.03 | 53.71 Β± 0.03 | 54.33 Β± 0.02 | 54.83 | 52.55 Β± 0.08 | 53.60 Β± 0.08 | 53.31 Β± 0.14
OOD-AUC (%)↑  | KU       | 65.72 Β± 0.31 | 68.60 Β± 0.21 | 71.15 Β± 0.16 | 73.26 | 67.32 Β± 0.70 | 71.45 Β± 0.67 | 71.70 Β± 0.91

We compare the various methods on their predictive performance and on uncertainty quantification following [31], and the results are summarized in Table 1. For predictive performance, we report the classification accuracy and macro F1 using BMA on both the ID and the OOD evaluation datasets. We can see the following effects: (1) Virtual ensembles with a longer chain slightly outperform the real ensembles on the ID data. (2) Our proposed cSGLB and cbSGLB perform slightly worse than the other methods on the OOD data. However, this is usually not a concern in practice, since the model is not trained with data from the OOD domains and would not be used to solve the OOD prediction tasks in a practical scenario. As long as the domain shifts can be reliably detected (via uncertainty), proactive decisions can be made to avoid costly mistakes due to model errors. (3) Our proposed data bootstrapping mechanism is capable of improving the performance on the OOD data (cbSGLB > cSGLB). In addition, we include the F1-AUC metric (on the combined ID & OOD evaluation sets) introduced in [31] to jointly assess predictive power and uncertainty quality. The F1-AUC can be increased either by having a stronger predictive model or by improving the correlation between uncertainty and error. Consistent with the findings in [31], total uncertainty (TU) correlates more with errors than knowledge uncertainty (KU), as shown by the F1-AUC scores. More specifically, we see that the F1-AUC is quite similar across the board when measured by TU, although cSGLB/cbSGLB have slightly worse predictive power on the OOD segment. When the F1-AUC is measured by KU, our cSGLB/cbSGLB produce KU estimates that relate more closely to model errors than the KU from the SGLB baseline.

At last, we present the OOD detection ROC-AUC performance on the evaluation data using the KU estimates. Our cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on the OOD detection task, and even achieves performance comparable to the real ensemble ens10, which is 5Γ— more expensive. This highlights that our cSGLB/cbSGLB can produce high-fidelity KU estimates to detect domain (or distributional) shifts with a single model, and that our proposed cyclical gradient scheduling is effective in exploring different modes of a posterior. In real-world industrial applications, detecting OOD data or domain shifts in an efficient way is often crucial to ensure a safe deployment and operation of ML systems. Observing consistently high uncertainty (especially KU) from model predictions indicates that the

References

[1] J. Dressel, H. Farid, The accuracy, fairness, and limits of predicting recidivism, Science Advances 4 (2018) eaao5580. URL: https://www.science.org/doi/abs/10.
patterns of new incoming data have deviated from the 1126/sciadv.aao5580. doi:10.1126/sciadv.aao5580. training. This often provides a strong signal for model [2] P. Stone, R. Brooks, E. Brynjolfsson, R. Calo, O. Etzioni, refresh, ensuring that the ML system can be updated in G. Hager, J. Hirschberg, S. Kalyanakrishnan, E. Kamar, time to avoid errors and operate safely in its "comfort S. Kraus, et al., Artificial intelligence and life in 2030, zone" (with relatively low predictive uncertainty). One Hundred Year Study on Artificial Intelligence: Re- port of the 2015-2016 Study Panel 52 (2016). [3] A. Serban, E. Poll, J. Visser, Towards using probabilis- 6. Conclusion tic models to design software systems with inherent un- certainty (2020). URL: https://arxiv.org/abs/2008.03046. We present cyclical gradient scheduling and Cyclical doi:10.48550/ARXIV.2008.03046. SGLB for efficiently and effectively quantifying uncer- [4] S. McGrath, P. Mehta, A. Zytek, I. Lage, H. Lakkaraju, tainty in gradient boosting with a single model, and pro- When does uncertainty matter?: Understanding the im- pose a data bootstrapping scheme to enhance diversity pact of predictive uncertainty in ml assisted decision in posterior samples. We show empirically that our al- making (2020). URL: https://arxiv.org/abs/2011.06167. gorithms have superior performance over the state-of- doi:10.48550/ARXIV.2011.06167. [5] R. Shwartz-Ziv, A. Armon, Tabular data: Deep learning the-art SGLB, especially in quantifying knowledge un- is not all you need, 2021. URL: https://arxiv.org/abs/2106. certainty and for OOD detection. 03253. doi:10.48550/ARXIV.2106.03253. Accurately quantifying uncertainty in ML predictions [6] J. H. Friedman, Greedy function approximation: a gradi- can yield many benefits in real-world applications. Un- ent boosting machine, Annals of statistics (2001) 1189– certainty directly measures the confidence in model pre- 1232. dictions, not only improving interpretability, but also [7] J. H. 
Friedman, Stochastic gradient boosting, Computa- ensuring that costly mistakes can be avoided proactively. tional statistics & data analysis 38 (2002) 367–378. Consistently observing high uncertainty in predictions [8] A. Malinin, Uncertainty estimation in deep learning on incoming data stream often provides a reliable signal with application to spoken language assessment, Ph.D. for data distributional shifts and/or model being obsolete. thesis, University of Cambridge, 2019. [9] F. Provost, P. Domingos, Tree induction for probability- Uncertainty is also a key concept in optimal control and based ranking, Machine learning 52 (2003) 199–215. decision making, where it can be leveraged to experi- [10] U. Johansson, T. LΓΆfstrΓΆm, H. BostrΓΆm, Calibrating prob- ment with different actions/decisions with controllable ability estimation trees using venn-abers predictors, in: costs in search for the best operating point of an ML Proceedings of the 2019 SIAM International Conference system. We believe that our work on efficient uncer- on Data Mining, SIAM, 2019, pp. 28–36. tainty quantification for GBDTs facilitates the adoption [11] A. Ustimenko, L. Prokhorenkova, Sglb: Stochastic gradi- of uncertainty-enabled and uncertainty-aware ML sys- ent langevin boosting, in: International Conference on tems, and more broadly promotes ML safety and a safe Machine Learning, PMLR, 2021, pp. 10487–10496. and trustworthy deployment of ML model in production. [12] A. Malinin, L. Prokhorenkova, A. Ustimenko, Un- certainty in gradient boosting via ensembles, arXiv preprint arXiv:2006.10562 (2020). [13] R. Zhang, C. Li, J. Zhang, C. Chen, A. G. Wilson, Cycli- cal stochastic gradient mcmc for bayesian deep learning, arXiv preprint arXiv:1902.03932 (2019). PMLR, 2018, pp. 1309–1318. [14] F. Wenzel, K. Roth, B. S. Veeling, J. ŚwiΔ…tkowski, L. Tran, [30] A. V. Dorogush, V. Ershov, A. Gulin, Catboost: gradi- S. Mandt, J. Snoek, T. Salimans, R. Jenatton, S. 
Nowozin, ent boosting with categorical features support, arXiv How good is the bayes posterior in deep neural net- preprint arXiv:1810.11363 (2018). works really?, arXiv preprint arXiv:2002.02405 (2020). [31] A. Malinin, N. Band, G. Chesnokov, Y. Gal, M. J. [15] P. Izmailov, S. Vikram, M. D. Hoffman, A. G. G. Wil- Gales, A. Noskov, A. Ploskonosov, L. Prokhorenkova, son, What are bayesian neural network posteriors really I. Provilkov, V. Raina, et al., Shifts: A dataset of real dis- like?, in: International Conference on Machine Learn- tributional shift across multiple large-scale tasks, arXiv ing, PMLR, 2021, pp. 4629–4640. preprint arXiv:2107.07455 (2021). [16] A. Masegosa, Learning under model misspecification: Applications to variational and ensemble methods, Ad- vances in Neural Information Processing Systems 33 (2020) 5479–5491. A. Comparison between cSGLB [17] M. Welling, Y. W. Teh, Bayesian learning via stochastic and cSG-MCMC gradient langevin dynamics, in: Proceedings of the 28th international conference on machine learning (ICML- The proposed Cyclical SGLB algorithm combines SGLB 11), Citeseer, 2011, pp. 681–688. with cSG-MCMC [13] to effectively explore different [18] T. Chen, E. Fox, C. Guestrin, Stochastic gradient hamil- modes of a highly multimodal posterior distribution. In tonian monte carlo, in: International conference on ma- this section, we summarize some key differences between chine learning, PMLR, 2014, pp. 1683–1691. our design and the original cSG-MCMC algorithm. [19] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, H. Neven, Bayesian sampling using stochastic gradient (1) cSG-MCMC is a sampling algorithm designed for thermostats, Advances in neural information process- Bayesian NNs, while cSGLB is built for GBDT models. ing systems 27 (2014). In deep learning, full-batch gradient descent is usually [20] C. Li, C. Chen, D. Carlson, L. 
Carin, Preconditioned not feasible, and techniques have been developed to ex- stochastic gradient langevin dynamics for deep neural plicitly compensate mini-batch noises, such as precondi- networks, in: Thirtieth AAAI Conference on Artificial tioning [14]. Some also suggested applying an additional Intelligence, 2016. correction step called Metropolis-Hastings [15]. Tree- [21] Y. W. Teh, A. H. Thiery, S. J. Vollmer, Consistency and based GB models can easily scale up to large industrial fluctuations for stochastic gradient langevin dynamics, datasets and digest the full training set at each iteration. The Journal of Machine Learning Research 17 (2016) Therefore, our cSGLB uses full-batch GB in the sampling 193–225. [22] A. G. Wilson, P. Izmailov, Bayesian deep learning and a stage of each cycle to ensure high-quality samples being probabilistic perspective of generalization, Advances in generated. (2) cSGLB puts a cyclical schedule on gra- neural information processing systems 33 (2020) 4697– dient scale while cSG-MCMC puts a schedule on step 4708. size. In addition, the original cSG-MCMC completely re- [23] A. Ashukha, A. Lyzhov, D. Molchanov, D. Vetrov, Pitfalls moved the injected Gaussian noises in the exploration of in-domain uncertainty estimation and ensembling in stage, and cSG-MCMC reduced to regular stochastic gra- deep learning, arXiv preprint arXiv:2002.06470 (2020). dient descent (SGD) during the period of exploration. [24] T. Duan, A. Anand, D. Y. Ding, K. K. Thai, S. Basu, A. Ng, Although the authors claimed that this amounts to pos- A. Schuler, Ngboost: Natural gradient boosting for prob- terior temping which is commonly used in DL domain, abilistic prediction, in: International Conference on Ma- the implementation of cSG-MCMC algorithm does not chine Learning, PMLR, 2020, pp. 2690–2700. follow closely/strictly the dynamics of SGLD during the [25] B. Settles, Active learning literature survey (2009). [26] T. Tan, Z. Xiong, V. R. 
Dwaracherla, Parameterized in- exploration stage. In contract, we keep the injected noise dexed value function for efficient exploration in rein- term unchanged during the course of learning. Our de- forcement learning, in: Proceedings of the AAAI Con- sign achieved similar effects compared with the step-size ference on Artificial Intelligence, volume 34, 2020, pp. scheduling on a synthetic experiment, and we also en- 5948–5955. sure that cSGLB follows the dynamics of SGLD (more or [27] S. Depeweg, J. M. HernΓ‘ndez-Lobato, F. Doshi-Velez, less) at every iteration step. (3) Lastly, the gradient scal- S. Udluft, Decomposition of uncertainty for active learn- ing (instead of step-size scheduling) has implementation ing and reliable reinforcement learning in stochastic sys- benefits. The SGLB algorithm is made available in the tems, stat 1050 (2017) 11. CatBoost library [30], which only supports a constant [28] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, step size (or learning rate). Our proposed cyclical gradi- A. G. Wilson, Loss surfaces, mode connectivity, and fast ensembling of dnns, Advances in neural information ent scaling (and data bootstrapping) can be implemented processing systems 31 (2018). easily with the "user-defined loss function" available in [29] F. Draxler, K. Veschgini, M. Salmhofer, F. Hamprecht, the CatBoost package, without modifying a single line of Essentially no barriers in neural network energy land- the source code. scape, in: International conference on machine learning, # of samples % of class Data Total Tropical Dry Mild Snow Polar Class 0 Class 10 Class 20 train-ID 200,000 26,786 45,357 127,857 0 0 40.92 33.71 25.37 dev-ID 46,279 6,196 10,505 29,578 0 0 40.90 33.97 25.13 dev-OOD 46,555 0 0 0 46,555 0 38.35 35.07 26.58 eval-ID 518,587 69,245 117,981 331,361 0 0 40.98 33.71 25.30 eval-OOD 524,048 0 0 0 479,952 44,096 33.05 35.70 31.24 Table 2 Data summary of our partitioning of the Weather Prediction dataset. B. 
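As a concrete illustration of the implementation route described in point (3) of Appendix A, the sketch below attaches a cyclical gradient scale to a user-defined binary logloss objective of the shape CatBoost expects (an object exposing `calc_ders_range`). The cosine-shaped schedule, the class name, and the choice to scale only the first derivative are our illustrative assumptions, not the paper's exact implementation.

```python
import math

def cyclical_scale(tau, cycle_len, a_min=1.0, a_max=10.0):
    """Cosine-annealed gradient scale within each cycle: starts at a_max
    (exploration) and decays toward a_min (sampling)."""
    t = (tau % cycle_len) / cycle_len
    return a_min + 0.5 * (a_max - a_min) * (1.0 + math.cos(math.pi * t))

class CyclicalLoglossObjective:
    """Binary logloss whose first derivative is rescaled by a cyclical
    factor alpha_tau, for use as a CatBoost user-defined objective."""

    def __init__(self, cycle_len, a_min=1.0, a_max=10.0):
        self.cycle_len = cycle_len
        self.a_min = a_min
        self.a_max = a_max
        self.tau = 0  # boosting-iteration counter

    def calc_ders_range(self, approxes, targets, weights):
        alpha = cyclical_scale(self.tau, self.cycle_len, self.a_min, self.a_max)
        self.tau += 1
        ders = []
        for i, a in enumerate(approxes):
            p = 1.0 / (1.0 + math.exp(-a))       # sigmoid of the raw score
            w = 1.0 if weights is None else weights[i]
            der1 = alpha * w * (targets[i] - p)  # cyclically scaled gradient
            der2 = -w * p * (1.0 - p)            # (unscaled) second derivative
            ders.append((der1, der2))
        return ders
```

With CatBoost installed, such an object could be passed as `CatBoostClassifier(loss_function=CyclicalLoglossObjective(cycle_len=200))`. A cbSGLB-style bootstrap mask would additionally zero out each example's derivative with probability p_bm during the exploration portion of each cycle.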
B. Synthetic Data

B.1. Synthetic Multimodal Distribution

The ground-truth density of the distribution is

$F(x) = \sum_{i=1}^{25} \tfrac{1}{25}\,\mathcal{N}(x \mid \mu_i, \Sigma),$

where the means $\mu_i$ form the grid $\{-4,-2,0,2,4\} \times \{-4,-2,0,2,4\}$ and $\Sigma = \mathrm{diag}(0.03, 0.03)$. We used the code released by the authors of [13] to generate our results. Specifically, SGLD was trained with a decaying learning rate $\epsilon_\tau = 0.05(\tau+1)^{-0.55}$, and clr-SGLD was trained with a cyclical learning-rate schedule with initial value $\epsilon_0 = 0.09$ and exploration proportion $\eta = 0.25$. For our cg-SGLD, we fixed the learning rate at $\epsilon = 0.01$ and the Gaussian noise scale at 0.4, and set $\alpha_{max} = 10$, $\alpha_{min} = 1$. The "noisy" version of SGLD (NoisySGLD/N-SGLD) was trained with a fixed learning rate $\epsilon = 0.02$ and noise scale 5.0 (roughly 10x larger than the noise scale used in the other methods). Each chain was trained for 50k iterations, and both clr-SGLD and cg-SGLD used 30 cycles. The results and findings are robust to random seeds; similar results were observed with different seeds. We refer interested readers to the original paper [13] for results of SGLD and clr-SGLD in parallel (multi-chain) settings.

B.2. Synthetic Spiral Dataset

All experiments were conducted using CatBoost [30], one of the state-of-the-art libraries for GBDTs. The ensemble of SGLB (ens20) contains 20 independent models (with different random seeds) of 1K trees each. The learning rate is $\epsilon = 0.1$, tree depth is 6, random_strength = 100, and border_count = 128. The SGLB virtual ensemble and the cSGLB virtual ensemble are trained with the same parameters, except that we increase the number of trees to 2K and lower the learning rate for cSGLB to $\epsilon = 0.05$. Thus, a virtual ensemble is 10x more efficient in computation and memory than the actual SGLB ensemble. For the SGLB virtual ensemble, every 50th model from the interval [1000, 2000] is added to the ensemble, making it a virtual ensemble of 20 members. For the cSGLB virtual ensemble, we set $\epsilon = 0.05$, cycle length $C = 200$, $\alpha_{max} = 10$, $\alpha_{min} = 1$, making it a virtual ensemble of 2000/200 = 10 members. For cbSGLB with bootstrapping, we additionally set the exploration proportion $\eta = 0.9$ and the mask probability $p_{bm} = 0.66$.

C. Real-World Shifts Data

Data Summary  A detailed summary of our final partitioning of the Weather Prediction dataset is given in Table 2.

Experimental Details  We used the default parameter settings for SGLB models suggested in the original paper [12] for uncertainty quantification, except that we set the subsample rate to 0.8 for stochastic gradient boosting. The real SGLB ensemble consists of (up to) 30 SGLB models trained with different seeds, each of 1K trees. In order to get more samples from a single chain, the virtual ensembles of SGLB and of our cSGLB/cbSGLB were learned with a single model of 2K trees. We set the learning rate for all models to $\epsilon = 0.05$ and the tree depth to 6. For the SGLB virtual ensemble, every 100th model from the interval [1000, 2000] was added to the ensemble, making it a virtual ensemble of 10 members. cSGLB and cbSGLB shared the same parameters with the SGLB counterpart. In addition, for the cSGLB/cbSGLB virtual ensembles, we set cycle length $C = 200$, $\alpha_{max} = 10.0$, $\alpha_{min} = 1.0$, making them virtual ensembles of 2000/200 = 10 members. For simplicity, cSGLB used full-batch gradient boosting at each iteration step. In contrast, for cbSGLB with bootstrapping, we set the exploration proportion $\eta = 0.8$ (i.e., 80% of each cycle was treated as exploration) and the mask probability $p_{bm} = 0.6$ in the exploration stage. For model and parameter selection, we only used the in-domain (ID) development set and did not use the out-of-domain (OOD) development set. Although this may potentially lower our reported performance on the OOD evaluation set, we believe it better reflects a real-world learning scenario, where the shifted data is often unobserved and unavailable at training time.
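To make the cg-SGLD setup of Appendix B.1 concrete, here is a minimal, self-contained sketch of SGLD with a cyclical gradient scale targeting the 5-by-5 Gaussian mixture above. The step size and schedule constants below are chosen for numerical stability of this toy sketch and deliberately differ from the exact settings reported in B.1; all function names are our own.

```python
import math
import random

MU = [(a, b) for a in (-4, -2, 0, 2, 4) for b in (-4, -2, 0, 2, 4)]
SIGMA2 = 0.03  # isotropic variance of each mixture component

def grad_log_density(x, y):
    """Gradient of log F at (x, y) via responsibility-weighted component means."""
    logs = [-((x - mx) ** 2 + (y - my) ** 2) / (2 * SIGMA2) for mx, my in MU]
    m = max(logs)
    ws = [math.exp(l - m) for l in logs]  # unnormalized responsibilities
    s = sum(ws)
    gx = sum(w * (mx - x) for w, (mx, _) in zip(ws, MU)) / (s * SIGMA2)
    gy = sum(w * (my - y) for w, (_, my) in zip(ws, MU)) / (s * SIGMA2)
    return gx, gy

def cyclical_scale(tau, cycle_len, a_min=1.0, a_max=10.0):
    """Cosine schedule: a_max at the start of a cycle (exploration), a_min at the end."""
    t = (tau % cycle_len) / cycle_len
    return a_min + 0.5 * (a_max - a_min) * (1.0 + math.cos(math.pi * t))

def cg_sgld(n_iters=6000, n_cycles=30, lr=0.002, noise=0.4, seed=0):
    """SGLD in which the *gradient* (not the step size) is rescaled cyclically."""
    rng = random.Random(seed)
    cycle_len = n_iters // n_cycles
    x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
    samples = []
    for tau in range(n_iters):
        alpha = cyclical_scale(tau, cycle_len)
        gx, gy = grad_log_density(x, y)
        x += 0.5 * lr * alpha * gx + noise * math.sqrt(lr) * rng.gauss(0.0, 1.0)
        y += 0.5 * lr * alpha * gy + noise * math.sqrt(lr) * rng.gauss(0.0, 1.0)
        samples.append((x, y))
    return samples
```

During the high-scale phase of each cycle the chain takes large, mode-hopping steps; as the scale anneals toward $\alpha_{min}$, it settles into a nearby mode while the unchanged noise term keeps the dynamics Langevin-like at every step.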
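For completeness, the ensemble-based decomposition of total versus knowledge uncertainty used in our evaluation is the standard entropy-based decomposition of [8, 12]; a minimal sketch (with function names of our choosing) is:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def uncertainty_decomposition(member_probs):
    """Given per-member predictive distributions for one input
    (a list of probability vectors), return (TU, EU, KU) where
    TU = entropy of the ensemble mean (total uncertainty),
    EU = mean member entropy (expected data uncertainty),
    KU = TU - EU (knowledge uncertainty, a mutual information)."""
    m = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / m for c in range(k)]
    tu = entropy(mean)
    eu = sum(entropy(p) for p in member_probs) / m
    return tu, eu, tu - eu
```

When members agree, KU is near zero even if each prediction is uncertain; when confident members disagree (as on OOD inputs), KU grows, which is why KU is the quantity thresholded for OOD detection in Section 5.2.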