<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tian Tan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Huertas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Buyer Risk Prevention - ML, WW Customer Trust</institution>
          ,
          <addr-line>Amazon, Seattle, WA 98109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Gradient boosting decision trees (GBDTs) are widely applied to tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making and for avoiding costly mistakes, ensuring an interpretable and safe deployment of tree-based models. Recently, Bayesian ensembles of GBDT models have been used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines GB with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB gets trapped easily on a particular mode of the Bayesian posterior, just like other forms of SG-MCMC. Therefore, a single SGLB model can often fail to produce high-fidelity uncertainty estimates. To address this problem, we present Cyclical SGLB (cSGLB), which incorporates a cyclical gradient schedule in the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore highly multimodal posterior distributions. As a result, cSGLB can efficiently quantify uncertainty in GB with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the efficiency and effectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.</p>
      </abstract>
      <kwd-group>
        <kwd>uncertainty quantification</kwd>
        <kwd>gradient boosting decision trees</kwd>
        <kwd>Bayesian inference</kwd>
        <kwd>out-of-domain (OOD) detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rapid growth of data and computing power, machine learning (ML) has been gaining a lot of new applications in areas not imagined before. As ML systems become more ubiquitous, it is inevitable to see applications in very sensitive and high-risk fields. This expands to numerous areas like criminal recidivism [1], medical follow-ups [2] and autonomous systems [3]. While these systems might be very broad, they share a common need, and that is to have a certain degree of confidence in ML predictions. A proven successful way to build confidence in critical systems is uncertainty estimation. Research has shown that humans are more likely to agree with a system if they get access to the corresponding uncertainty, and this holds true regardless of shape and variance, as the approach itself is model and task agnostic [4]. Since the most common data type in real-world ML applications is tabular [5], our work in this paper focuses specifically on uncertainty quantification for the state-of-the-art gradient boosting decision trees (GBDTs) [6, 7], which are known to outperform deep learning (DL) methods on tabular data, both in accuracy and tuning requirements [5]. Measuring uncertainty effectively and efficiently on GBDT predictions can therefore not only improve model interpretability in production but also ensure a safer deployment of ML systems, especially for high-risk applications.</p>
      <p>Uncertainty quantification (UQ) has been widely studied for neural networks under the Bayesian framework [8]; however, it is relatively under-explored for tree-based models. Although calibrated probability estimation trees [9, 10] can be used for UQ, they have not been studied from a Bayesian perspective. Recently, Bayesian ensemble methods were extended to measure uncertainty in GBDTs by leveraging a new algorithm called stochastic gradient Langevin boosting (SGLB) [11]. Specifically, two SGLB-based approaches were introduced for UQ [12]: (1) SGLB ensemble, which trains multiple SGLB models in parallel, and (2) SGLB virtual ensemble, which constructs a virtual ensemble using only a single SGLB model where each member in the ensemble is a "truncated" sub-model [12]. Although both approaches are theoretically sound, there is clearly a trade-off between quality and efficiency in practice. The SGLB (real) ensemble is believed to be accurate as it can characterize the Bayesian posterior well by running independent models in parallel. However, it is almost infeasible to deploy such an ensemble in real-world production due to its high computational and maintenance costs. The SGLB virtual ensemble greatly improves the efficiency; however, it often gets stuck on a single mode of the Bayesian posterior and can produce downgraded uncertainty estimates [13, 14]. To better balance between quality and efficiency and to facilitate the usage of uncertainty-enabled ML systems, an important question remains: how can we make a single SGLB explore effectively different modes of a posterior given a limited computational budget?</p>
      <p>In this paper, we address the question above by combining the SGLB virtual ensemble with advanced sampling techniques from Bayesian DL [14, 13, 15, 16]. Inspired by the ideas in [13], we propose to use a scaler (or scaling factor) on gradients that follows a cyclical schedule during the course of SGLB training. The cyclical schedule is illustrated in Figure 1, and consequently, we name the resulting algorithm Cyclical SGLB (cSGLB). Similar to [13], each cycle in cSGLB contains two stages: (1) Exploration: when the scaler is large, we treat this stage as a warm restart from the previous cycle, enabling the model/sampler to follow the gradients closely and to escape from the current local mode. (2) Sampling: when the gradient scaler is small, the scale of the injected Gaussian noise in the SGLB procedure becomes relatively large, encouraging the sampler to fully characterize one local mode. We collect one sample (or truncated sub-model) to build the virtual ensemble at the end of each cycle. The cyclical gradient schedule therefore helps cSGLB effectively explore different modes of a posterior while maintaining the same level of efficiency as a virtual ensemble. Moreover, inspired by a recent study [16] showing that a "diversified" posterior may provide a tighter generalization bound, we present another simple approach to encourage diversity in samples obtained from running cSGLB via data bootstrapping. We name this variant Cyclical Bootstrapped SGLB (cbSGLB).</p>
      <p>We extensively experiment with our proposed algorithms and compare the performance against the SGLB ensemble and the original SGLB virtual ensemble. Particularly, we show that our cyclical gradient schedule can help explore effectively multimodal distributions, that cSGLB is capable of producing uncertainty estimates that are better aligned with the SGLB real ensemble, and that cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on out-of-domain (OOD) data detection, indicating a superior performance in detecting distributional/domain shifts in real-world tabular data streams.</p>
      <p>Figure 1: Illustration of the proposed cyclical schedule on gradient scales for the SGLB algorithm.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Bayesian ML and approximate Bayesian inference provide a principled representation of uncertainty. One popular family of approaches to inference in Bayesian ML are stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods [17, 18, 19, 20, 13], which are used to effectively sample models (or model parameters) from the Bayesian posterior. Uncertainty then comes naturally by measuring the "discrepancy" in predictions from the sampled models, which are regarded as posterior samples. Recently, stochastic gradient Langevin boosting (SGLB) [11] was proposed by combining gradient boosting with SG-MCMC. As its name suggests, the Markov chain generated by SGLB obeys a special form of the stochastic gradient Langevin dynamics (SGLD) [11, 17], which implies that SGLB is able to generate samples from the true Bayesian posterior asymptotically. Leveraging this property, Malinin et al. [12] proposed to use (1) the SGLB ensemble or (2) the SGLB virtual ensemble to measure uncertainty in GBDTs. Essentially, the SGLB ensemble corresponds to running multiple SG-MCMCs in parallel, and each chain (or SGLB) is initialized independently with a different random seed. Since SGLB allows us to sample from the true posterior, the ensemble with multiple samples gives a high-fidelity approximation to the Bayesian posterior. In contrast, the SGLB virtual ensemble only trains a single SGLB model and uses multiple truncated sub-models to form a (virtual) ensemble. The key idea is essentially extracting multiple samples from a single-chain SG-MCMC instead of running multiple chains in parallel.</p>
      <p>In theory, SGLB or single-chain SG-MCMC converges asymptotically to the target distribution and should behave similarly to the multi-chain SGLB ensemble in the limit, but it can suffer from a bounded estimation error in limited time [21]. Moreover, it is often believed that the posterior is highly multimodal in the parametric space of modern ML models [13], since there are potentially many different sets of parameters that can describe the training data equally well. The real ensemble can explore different modes of the posterior by running independent chains in parallel, providing a more complete picture of the distribution as the number of chains increases. However, a single-chain SG-MCMC often gets stuck easily on a single mode of the posterior [13, 14], failing to cover the full spectrum of the distribution.</p>
      <p>In this paper, we extend the ideas behind Cyclical SG-MCMC (cSG-MCMC) in DL [13] to sampling from a tree-based SGLB model, which promotes new mode discovery during training. Different from cSG-MCMC, which puts a cyclical schedule on the step size, we propose to use a cyclical schedule on the gradient scale. We also point out and justify the difference and our design choice in Appendix A. In addition, we propose a simple strategy to further encourage diversity in samples obtained from a single chain by data bootstrapping. At the beginning of each cycle (see Fig.1), we construct a bootstrapped dataset that is a random subset of the training data, and use that bootstrapped data consistently during the exploration stage to update the GBDT model. The "bias" induced by data bootstrapping also amounts to posterior tempering [14, 15, 22, 23, 13].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <sec id="sec-3-1">
        <title>3.1. General Setup</title>
        <p>Given a set of $N$ training data points sampled from an unknown distribution $P$ on $X \times Y$, i.e., $(x_1, y_1), \ldots, (x_N, y_N) \sim P$, denoted as $D$, and a loss function $L(z, y): Z \times Y \to \mathbb{R}$, where $Z$ denotes the space of predictions, our goal is to minimize the empirical loss $\mathcal{L}(f \mid D) := \sum_{i=1}^{N} L(f(x_i), y_i)$ over a function class $\mathcal{F}$. In this paper, we only consider $\mathcal{F}$ corresponding to additive ensembles of decision trees $\mathcal{H} := \{h(x, \phi_s): X \times \mathbb{R}^{d_s} \to \mathbb{R},\ s \in S\}$, where $S$ is an index set and $h_s$ has parameters $\phi_s$. Decision trees are built by recursively partitioning the feature space into disjoint regions (called leaves). Each region is assigned a value that is used to estimate the response of $y$ in the corresponding feature subspace. Let's denote these regions by $R_{sj}$'s; then we have $h(x, \phi_s) = \sum_{j} \phi_{sj} \mathbb{1}\{x \in R_{sj}\}$, where $\mathbb{1}\{\cdot\}$ denotes the indicator function. Therefore, given the tree structure, a decision tree $h_s$ is a linear function of its parameters $\phi_s$.</p>
        <p>It is often assumed that the set $S$ is finite because the training data is finite [11, 12], e.g., there exists only a finite number of ways to partition the training data. Owing to the linear dependence of $h_s$ on $\phi_s$ and the finiteness assumption on $S$, we can represent any ensemble of models from $\mathcal{H}$ as a linear model $F_\Theta(x) = \psi(x)^\top \Theta$ for some feature map $\psi(x): X \to \mathbb{R}^{d}$, where $\Theta \in \mathbb{R}^{d}$ denotes the parameters of the entire ensemble [11]. Hence, in the subsequent discussion, we will simply denote the parameters of the GBDT model obtained at iteration $t$ as $\hat{\Theta}_t$, and additionally define a linear mapping $A_s: \mathbb{R}^{d_s} \to \mathbb{R}^{N}$ that converts $\phi_s$ to the predictions $(h(x_i, \phi_s))_{i=1}^{N}$.</p>
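        <p>Because a fitted tree is linear in its leaf values, predictions can be written as an indicator-feature product. The sketch below is ours (not from [11]) and makes this concrete; leaf_of is a hypothetical routine mapping inputs to leaf indices.</p>
        <preformat>import numpy as np

# h(x, phi) = sum_j phi[j] * 1{x in R_j}: a tree is linear in its parameters.
def tree_predict(X, phi, leaf_of):
    # leaf_of(X) returns the leaf/region index of every row of X
    return np.asarray([phi[j] for j in leaf_of(X)])

# Equivalently, with one-hot leaf indicators Psi (shape N x num_leaves),
# the whole additive ensemble is linear in its stacked parameters Theta:
# F(X) = Psi @ Theta, which is the linear-model view used throughout Sec. 3.</preformat>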
      </sec>
      <sec id="sec-3-1-1">
        <title>3.2. SGLB</title>
        <p>SGLB combines stochastic gradient boosting (SGB) [7] with stochastic gradient Langevin dynamics (SGLD) [17]. Following the notation used in the original paper [11], we characterize the SGB procedure by a tuple $\mathcal{B} := \{\mathcal{H}, p(s \mid g)\}$, where $\mathcal{H}$ again is the set of base learners and $p(s \mid g)$ is a distribution over indices $s \in S$ conditioned on a gradient vector $g \in \mathbb{R}^{N}$. Simply put, $p(s \mid g)$ defines a distribution over tree structures. As with other GBDT algorithms, SGLB constructs an ensemble of decision trees iteratively. At each iteration $t$, we compute unbiased gradient estimates $\hat{g}_t$ such that $\mathbb{E}[\hat{g}_t] = (\partial_z L(F_{\hat{\Theta}_t}(x_i), y_i))_{i=1}^{N} \in \mathbb{R}^{N}$ using the current model $\hat{\Theta}_t$, and sample independently two normal vectors $\nu, \nu' \sim \mathcal{N}(0_N, I_N)$, where $0_N, I_N$ denote the zero vector and the identity matrix in $\mathbb{R}^{N}$, respectively. Then, a base learner (or tree structure) $s_t$ is picked by drawing one sample from $p(s \mid \hat{g}_t + \sqrt{2/(\beta\epsilon)}\,\nu')$, where $\epsilon > 0$ is a learning rate (or step size) and $\beta > 0$ is a parameter often referred to as the inverse diffusion temperature. Next, we estimate the parameters $\phi^*_t$ (at the tree leaves) of the sampled base learner by solving the following optimization: $$\phi^*_t \in \operatorname{argmin}_{\phi \in \mathbb{R}^{d_{s_t}}} \Big\| -\hat{g}_t + \sqrt{2/(\beta\epsilon)}\,\nu - A_{s_t}\phi \Big\|_2^2, \qquad (1)$$ which returns the minimum-norm solution that best fits the perturbed "noisy" version of the negative gradients. The optimization above has the closed-form solution $\phi^*_t = -\Phi_{s_t}\big(\hat{g}_t + \sqrt{2/(\beta\epsilon)}\,\nu\big)$, where $\Phi_{s_t} := (A_{s_t}^\top A_{s_t})^{+} A_{s_t}^\top$ and $(\cdot)^{+}$ denotes the pseudo-inverse. For decision trees, $\Phi_{s_t}$ essentially corresponds to averaging the gradient estimates in each leaf node of the tree. Lastly, the SGLB algorithm updates the ensemble model by $$\hat{F}_{t+1}(\cdot) := (1 - \gamma\epsilon)\,\hat{F}_t(\cdot) + \epsilon\, h(\cdot, \phi^*_t), \qquad (2)$$ where $\gamma$ is a regularization parameter that "shrinks" the currently built model when updating the ensemble. At a high level, SGLB is a stochastic GB algorithm with Gaussian noise injected into the gradient estimates, which encourages the algorithm to explore a larger area in the functional space to find a better fit for the given data. The independence between the noise $\nu$ (used for parameter learning) and $\nu'$ (used for tree sampling), and the model shrinking by $\gamma$ in Eqn.(2), are technical details needed for establishing theoretical results and a rigorous analysis of SGLB [11]. Note that since the GBDT model is linear and can be fully determined by the parameters $\Theta_t$, we use the notations $\mathcal{L}(F \mid D)$ and $\mathcal{L}(\Theta \mid D)$ interchangeably. All the procedures of SGLB are also present in our proposed cSGLB in Algo. 1 (with our additional modifications highlighted in blue).</p>
        <p>One can show that the parameters $\hat{\Theta}_t$ of SGLB at each iteration form a Markov chain that weakly converges to the stationary distribution $$p^*(\Theta) \propto \exp\Big(-\beta\,\Big(\mathcal{L}(\Theta \mid D) + \tfrac{1}{2}\big\|\Gamma^{1/2}\Theta\big\|_2^2\Big)\Big), \qquad (3)$$ where $\Gamma = \Gamma^\top \succ 0$ is a regularization matrix which depends on the particular tree construction algorithm, i.e., the choice of the tuple $\mathcal{B} := \{\mathcal{H}, p(s \mid g)\}$ [11].</p>
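        <p>For intuition, one SGLB iteration (Eqns. (1)-(2)) can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: grad_fn and build_tree are hypothetical hooks standing in for the loss gradients and the randomized tree-construction step $p(s \mid g)$ of a real GBDT library, and the leaf values are fit by per-leaf averaging of the perturbed negative gradients.</p>
        <preformat>import numpy as np

def sglb_iteration(F, X, y, grad_fn, build_tree, eps=0.1, beta=1.0, gamma=1e-3):
    g_hat = grad_fn(F(X), y)                    # unbiased gradient estimates
    scale = np.sqrt(2.0 / (beta * eps))         # Langevin noise scale
    nu, nu_prime = scale * np.random.randn(2, len(y))
    leaf_of = build_tree(X, g_hat + nu_prime)   # sample tree structure s_t
    # Closed form of Eqn. (1): average the noisy negative gradients per leaf.
    target, leaves = -(g_hat + nu), leaf_of(X)
    phi = {j: target[leaves == j].mean() for j in np.unique(leaves)}
    h = lambda Z: np.array([phi[j] for j in leaf_of(Z)])
    # Eqn. (2): shrink the current model and add the new (scaled) tree.
    return lambda Z: (1.0 - gamma * eps) * F(Z) + eps * h(Z)</preformat>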
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Posterior Sampling</title>
        <p>We consider here a standard Bayesian learning framework [8] that treats the parameters $\Theta$ as random variables and places a prior $p(\Theta)$ over $\Theta$. In addition, we consider the GBDT model $F_\Theta$ as a probabilistic model $p(y \mid x; \Theta)$ and explicitly set the empirical loss to the negative log-likelihood, $\mathcal{L}(\Theta \mid D) = -\sum_{i=1}^{N} \log p(y_i \mid x_i, \Theta)$. Then, the limiting distribution in Eqn.(3) becomes $$p^*(\Theta) \propto \exp\Big(\beta\Big(\log p(D \mid \Theta) - \tfrac{1}{2}\big\|\Gamma^{1/2}\Theta\big\|_2^2\Big)\Big) \propto \big(p(D \mid \Theta)\, p(\Theta)\big)^{\beta}, \qquad (4)$$ which is proportional to the true posterior $p(\Theta \mid D)$ under a Gaussian prior $p(\Theta) = \mathcal{N}(0, \Gamma^{-1})$ [11].</p>
        <p>Now, consider a Bayesian ensemble of probabilistic models $\{p(y \mid x; \Theta^{(m)})\}_{m=1}^{M}$ where each model is trained independently by running SGLB. Since each $\Theta^{(m)}$ is guaranteed to be sampled from $p(\Theta \mid D)$ by Eqn.(4), the ensemble $\{\Theta^{(m)}\}_{m=1}^{M}$ with $M$ samples yields a "discrete" approximation to the posterior $p(\Theta \mid D)$. This is exactly the idea behind the SGLB ensemble [12], which learns $M$ independent SGLB models in parallel with different random seeds. Although the approximation improves as $M$ increases, the computational cost also increases linearly with $M$. To alleviate the computational burden, the SGLB virtual ensemble [12] builds a Bayesian virtual ensemble by sampling multiple times from a single-chain SGLB model. Because samples from the same chain are highly correlated, the SGLB virtual ensemble proposes to sample one member $\Theta^{(m)}$ every $k > 1$ iterations. More specifically, the parameters are sampled by $\{\Theta^{(m)}\}_{m=1}^{\lfloor T/(2k) \rfloor} = \{\hat{\Theta}_{(T/2 + mk)},\ m = 1, \ldots, \lfloor T/(2k) \rfloor\}$, i.e., appending one member to the ensemble every $k$ iterations while constructing one SGLB model using $T$ iterations of gradient boosting. Notice that no sampling is performed during the first half of the iterations ($t$ below $T/2$), since Eqn.(4) holds only asymptotically. For large $T$ and $k$, the virtual ensemble should theoretically behave similarly to the SGLB real ensemble in the limit.</p>
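        <p>As a concrete illustration of the virtual-ensemble construction, the snippet below enumerates the snapshot iterations for one hypothetical configuration ($T = 2000$, $k = 100$, the values we also use in Appendix C); this is a sketch of the indexing rule, not the CatBoost implementation.</p>
        <preformat># One snapshot every k-th iteration, using only the second half of the chain
# (Eqn. (4) holds asymptotically, so early iterates are discarded).
T, k = 2000, 100
snapshot_iters = [T // 2 + m * k for m in range(1, T // (2 * k) + 1)]
# -> [1100, 1200, ..., 2000]: a 10-member virtual ensemble from one model.</preformat>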
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Uncertainty Estimation</title>
        <sec id="sec-3-3-1">
          <title>Once</title>
          <p>the</p>
          <p>
            Bayesian
(virtual)
ensemble
{ (|; Θ ())}=1, Θ ()
∼
(Θ | ) is learned,
tasks.
(
            <xref ref-type="bibr" rid="ref5">5</xref>
            )
(6)
          </p>
        </sec>
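        <p>Equations (5)-(6) translate directly into a few lines of numpy; the function below is a minimal sketch assuming the class probabilities of the $M$ posterior samples are stacked into one array.</p>
        <preformat>import numpy as np

def uncertainty_decomposition(probs, eps=1e-12):
    """probs: (M, N, K) class probabilities from M posterior samples.
    Returns per-input total, expected data, and knowledge uncertainty."""
    bma = probs.mean(axis=0)                                    # Eqn. (6) BMA
    total = -(bma * np.log(bma + eps)).sum(axis=-1)             # H of the BMA
    data = -(probs * np.log(probs + eps)).sum(axis=-1).mean(0)  # expected DU
    return total, data, total - data                            # KU = TU - DU</preformat>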
      </sec>
      <sec id="sec-3-4">
        <title>Cyclical Gradient Scheduling</title>
        <p>Instead of building a cumbersome true ensemble of SGLB
models, the virtual ensemble of SGLB greatly improves
the eficiency by training only a single model.
However, similar to other types of SG-MCMC in Bayesian DL
[13, 14, 15], single-chain SGLB gets trapped easily on a
particular single mode of the posterior. To eficiently
explore diferent modes of the multimodal posterior and
effectively measure uncertainty in GBDT predictions with
1KU is also named epistemic uncertainty and DU is also called
aleatoric uncertainty.
2See paper [12] for equations computing KU and DU in regression
⏞

= H((|,  )) − E(Θ| )[H( (|; Θ))]
⏟ Total Uncertainty
⏞
⏟</p>
        <p>Expected Data Uncertainty

 =1
≈ H
︁( 1 ∑︁  (|; Θ()))︁ −
 =1
1 ∑︁ H(︁  (|; Θ()))︁ ,
where I(; ) denotes the mutual information between
random variables A and B, and H(· ) denotes entropy. The
diference between TU and DU measures the
disagreement among members in the ensemble and estimates the
knowledge uncertainty.2</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Cyclical Stochastic Gradient</title>
    </sec>
    <sec id="sec-5">
      <title>Langevin Boosting (cSGLB)</title>
        <p>Instead of building a cumbersome true ensemble of SGLB models, the virtual ensemble of SGLB greatly improves the efficiency by training only a single model. However, similar to other types of SG-MCMC in Bayesian DL [13, 14, 15], single-chain SGLB gets trapped easily on a particular single mode of the posterior. To efficiently explore different modes of the multimodal posterior and effectively measure uncertainty in GBDT predictions with a single chain, we propose a simple remedy that places a cyclical cosine schedule on the gradient scale during training, as illustrated in Fig.1. Specifically, the scaling factor at iteration $t$ is defined as: $$\alpha_t = \max\Big(\frac{\alpha_{\max}}{2}\Big[\cos\Big(\frac{\pi\, \mathrm{mod}(t, C)}{C}\Big) + 1\Big],\ \alpha_{\min}\Big), \qquad (7)$$ where $\alpha_{\max} \geq 1$ is the maximum of the scaler (the initial value $\alpha_0$), $C$ is the user-defined cycle length, and $\alpha_{\min}$ defines the minimum of the scaler, e.g., $\alpha_{\min} = 1$ or $0.5$, since decaying the gradients to arbitrarily small values could be harmful for performance. Putting it together, this amounts to sampling the tree structure and learning the tree leaf parameters with the (re)scaled gradients: $s_t \sim p\big(s \mid \alpha_t \hat{g}_t + \sqrt{2/(\beta\epsilon)}\,\nu'\big)$ and $\phi^*_t = -\Phi_{s_t}\big(\alpha_t \hat{g}_t + \sqrt{2/(\beta\epsilon)}\,\nu\big)$.</p>
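        <p>The schedule in Eqn. (7) is one line of code; the sketch below is ours, with example values matching Appendix C.</p>
        <preformat>import numpy as np

def gradient_scaler(t, C=200, alpha_max=10.0, alpha_min=1.0):
    """Cyclical gradient scale at boosting iteration t (Eqn. (7))."""
    cyc = alpha_max / 2.0 * (np.cos(np.pi * (t % C) / C) + 1.0)
    return max(cyc, alpha_min)</preformat>
        <p>Within each cycle, the scaler decays from $\alpha_{\max}$ down to $\alpha_{\min}$ and then jumps back to $\alpha_{\max}$ at the start of the next cycle, producing the warm-restart behavior illustrated in Fig.1.</p>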
        <p>Similar to Cyclical SG-MCMC [13], we define two stages within each cycle: (1) Exploration, when the completed portion of a cycle $r(t) = \mathrm{mod}(t, C)/C$ is smaller than a given threshold $\tau$; and (2) Sampling, when $r(t) \geq \tau$. Here $\tau \in (0, 1)$ balances the portion between exploration and sampling. We obtain one sample from the chain at the end of each cycle, i.e., the virtual ensemble is built by $\{\Theta^{(m)}\}_{m=1}^{\lfloor T/C \rfloor} = \{\hat{\Theta}_{mC - 1},\ m = 1, \ldots, \lfloor T/C \rfloor\}$. Large gradients at the beginning of a cycle provide enough perturbation and encourage the model to escape from the current mode, while decreasing the gradient scale inside one cycle makes the sampler better characterize the density of the local mode. Moreover, many prior works in Bayesian NNs proposed to apply a certain form of preconditioning to compensate for the sampling noises from mini-batch training [15, 14]. Tree-based models can usually digest the full batch (the full dataset $D$) per iteration by leveraging modern multi-core processors and multi-threading. Therefore, we directly use full-batch GB in all sampling stages, while leaving the option of random data subsampling in exploration stages to the users if training time is a concern.</p>
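        <p>The stage logic and the per-cycle sample collection reduce to a few lines; the values below ($T = 2000$, $C = 200$, $\tau = 0.8$) are example settings taken from Appendix C, and the snippet is a sketch rather than our exact implementation.</p>
        <preformat>T, C, tau = 2000, 200, 0.8  # iterations, cycle length, stage threshold

def stage(t):
    """Exploration vs. sampling, from the completed cycle portion r(t)."""
    r = (t % C) / C
    return "sampling" if r >= tau else "exploration"

# One posterior sample per cycle: snapshot the model at t = m*C - 1,
# yielding a virtual ensemble of T // C = 10 truncated sub-models.
sample_iters = [m * C - 1 for m in range(1, T // C + 1)]</preformat>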
        <p>Combining cyclical gradient scaling with SGLB, we expect that our new Cyclical SGLB (cSGLB) algorithm could inherit most (if not all) theoretical properties of the original SGLB algorithm. Conceptually, with a proper choice of $\alpha_{\max}$, $\alpha_{\min}$ and cycle length $C$, the sample obtained at $t = mC - 1$ from the Markov chain $\hat{\Theta}_t$ generated by Algo. 1 (w/o bootstrap) can approximately be seen as a random draw from the limiting distribution with small bounded errors. Also, each next cycle can be viewed as a warm restart from its previous cycle, and thus no errors shall be accumulated into the subsequent cycles (at the sampling times $t = mC - 1$). We leave rigorous analysis and proofs of our propositions for future work. Empirically, we show in our experiments that the cyclical gradient scaling achieves similar effects in exploring a multimodal distribution when compared with cSG-MCMC, which places a similar cyclical schedule on the step size within the context of Bayesian DL. In fact, cSGLB extends the idea behind cSG-MCMC to tree-based GBDT models. We also summarize the key differences between our design of cSGLB and cSG-MCMC in Appendix A.</p>
      <sec id="sec-5-1">
        <title>Bootstrapping</title>
        <p>Recent work [16] provided a compelling analysis showing that the Bayesian posterior is not optimal under model misspecification (i.e., when the function class in use does not contain the unknown ground-truth function), where the performance of the true posterior is dominated by an alternative non-Bayesian posterior that explicitly encourages diversity among ensemble member predictions. Inspired by these results, we propose a simple strategy that promotes diversity among samples obtained from cSGLB via data bootstrapping. At the beginning of each cycle, we randomly sample a Bernoulli mask of size $N$, i.e., $b := \{b[i] \sim \mathrm{Bernoulli}(\rho)\}_{i=1}^{N} \in \{0, 1\}^{N}$, where $\rho \in (0, 1)$ defines the percentage of data being used. In the following exploration stage, we mask out the gradients $\hat{g}_t$ by taking an element-wise product with $b$, i.e., $\hat{g}_t \odot b$. The mask $b$ and the mask-out operation are used consistently throughout the exploration stage (while $r(t)$ is below $\tau$), and $b$ only gets updated at the end of the cycle. This design amounts to learning with a bootstrapped subsample of the data in each cycle. Since the model would consistently observe less data than the original $D$, it also amounts to posterior tempering $\big(p(D \mid \Theta)\, p(\Theta)\big)^{1/\lambda}$ with some temperature $\lambda > 1$, resulting in a warm posterior that is softer than the Bayesian posterior. By increasing the temperature $\lambda$, we expect to see increased density on the paths/corridors connecting different modes of the posterior [28, 29], further facilitating the sampler's escape from the current local mode. By using a relatively large $\tau \in (0.8, 1)$, the tempering effects would carry over into the sampling stage. Therefore, the bootstrapping mechanism helps improve the sample diversity of cSGLB, and we name this variant Cyclical Bootstrapped SGLB (cbSGLB).</p>
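        <p>A sketch of the bootstrapping mechanism (ours; the dataset size and mask probability are example values taken from the appendices):</p>
        <preformat>import numpy as np

rng = np.random.default_rng(0)
N, rho = 200_000, 0.6  # training size and mask probability (Appendix C values)

def new_cycle_mask():
    """Drawn once at the start of each cycle: b[i] ~ Bernoulli(rho)."""
    return rng.binomial(1, rho, size=N).astype(float)

def masked_gradients(g_hat, b, exploring):
    """Element-wise mask-out during exploration; full data when sampling."""
    return g_hat * b if exploring else g_hat</preformat>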
        <p>Lastly, we summarize our proposed cSGLB (plus
bootstrap option) in Algo. 1 and highlight our modifications
on top of SGLB in blue.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experiments</title>
      <sec id="sec-6-1">
        <title>5.1. Experiments on Synthetic Data</title>
        <p>We validate and qualitatively evaluate the proposed gradient scheduling and our cSGLB algorithm on two synthetic problems: (1) a synthetic multimodal dataset from [13], and (2) a multi-class Spiral dataset from [12]. Due to limited space, we include the experimental details in Appendix B.</p>
        <sec id="sec-6-1-1">
          <title>Algorithm 1: Cyclical (Bootstrapped) SGLB</title>
          <p>Input: dataset  , learning rate  &gt; 0, inverse
temperature  &gt;
0, regularization  &gt;
0,
number of iterations  &gt; 0, cycle length  &gt; 1,
scaler limits  max,  min &gt; 0, stage threshold
 &gt;</p>
          <p>0, mask probability  &gt; 0, boolean
indicator 
vector
for  in [0, 1, . . . ,  − 1] do</p>
          <p>if  then
Initialize Θ^ 0 (· ) = 0,  = 1 ∈ R as all-ones
if
end
if
end</p>
          <p>mod (, ) = 0 then
Sample  ∈ R with
 [] ∼ ()
mod (, )</p>
          <p>Set  = 1</p>
          <p>≥  then
end
Compute gradient scaler:   =
max(  max [cos( 
2</p>
          <p>mod (, ) ) + 1],  min)
Estimate gradients ˆ using Θ^  (· ) and  :
ˆ = (  (Θ^</p>
          <p>(), ))=1 ∈ R
Sample noise ,  ′ ∼</p>
        </sec>
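        <p>Algorithm 1 can be condensed into a short reference loop. The sketch below is ours and intentionally simplified: grad_fn and fit_tree are placeholder hooks for the loss gradients and the noisy tree construction of a GBDT library, and each returned tree is a callable predictor.</p>
        <preformat>import numpy as np

def csglb_train(X, y, grad_fn, fit_tree, T=2000, C=200, eps=0.05, beta=1.0,
                gamma=1e-3, a_max=10.0, a_min=1.0, tau=0.8, rho=0.6,
                bootstrap=False):
    rng = np.random.default_rng(0)
    N = len(y)
    F, b, trees, sample_sizes = np.zeros(N), np.ones(N), [], []
    for t in range(T):
        if bootstrap and t % C == 0:
            b = rng.binomial(1, rho, N).astype(float)  # fresh mask per cycle
        if bootstrap and (t % C) / C >= tau:
            b = np.ones(N)                             # full data when sampling
        a_t = max(a_max / 2 * (np.cos(np.pi * (t % C) / C) + 1), a_min)
        g = grad_fn(F, y)
        nu, nu2 = np.sqrt(2 / (beta * eps)) * rng.standard_normal((2, N))
        tree = fit_tree(X, a_t * (g * b) + nu2, a_t * (g * b) + nu)
        F = (1 - gamma * eps) * F + eps * tree(X)
        trees.append(tree)
        if t % C == C - 1:                   # end of cycle: collect one sample
            sample_sizes.append(len(trees))  # truncated sub-model of t+1 trees
    return trees, sample_sizes</preformat>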
        <sec id="sec-6-1-2">
          <title>Sample tree structure:</title>
          <p>∼
︀( ⃒⃒   (ˆ ⊙  ) +
(0 ,  )
√︁ 2  ′)︀</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>Estimate leaf/parameter values:</title>
          <p>* = − Φ  (︀   (ˆ ⊙  ) +</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>Update GBDT model:</title>
          <p>√︁ 2  )︀

Θ^  +1 (· ) = (1 −  )Θ^  (· ) + ℎ  (· ,  * )
end</p>
          <p>
            Return: Θ^  (· )
(
            <xref ref-type="bibr" rid="ref3">2</xref>
            ) a multi-class Spiral dataset in [12]. Due to limited
space, we include experimental details in Appendix B.
Synthetic Multimodal Data
We first demonstrate
the ability of cyclical gradient scaling for sampling from a
Specifically, we compare (i) the original SG-MCMC with
SGLD (denoted as SGLD) and two SGLD variants: (ii)
SGLD with Cyclical schedule on Learning Rate (denoted
as clr-SGLD) [13] and (iii) SGLD with Cyclical schedule
on Gradient scale (denoted as cg-SGLD (ours)). We
reproduced the results for SGLD and clr-SGLD in paper
a fair comparison, each chain runs for 50 iterations
and both clr-SGLD and cg-SGLD have 30 cycles. Fig.2
shows the estimated density using diferent sampling
strategies. SGLD gets trapped in local modes depending
on the random initial position, and increasing the noise
scale does not solve the problem. In contrast, clr-SGLD
and cg-SGLD can explore and locate roughly 7 −
ent modes of the distribution, showing that our cg-SGLD
          </p>
        </sec>
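        <p>To make the "user-defined loss function" route concrete, the sketch below follows CatBoost's documented calc_ders_range convention for custom objectives and simply rescales the binary logloss derivatives by the cyclical factor $\alpha_t$. The class and its iteration counter are our own illustration, not part of CatBoost, and the counter assumes one objective evaluation per boosting iteration; a production version should track the iteration more robustly.</p>
        <preformat>import numpy as np

class CyclicalLogloss:
    """Binary logloss whose first/second derivatives are scaled by alpha_t."""

    def __init__(self, C=200, alpha_max=10.0, alpha_min=1.0):
        self.C, self.a_max, self.a_min = C, alpha_max, alpha_min
        self.t = 0  # boosting-iteration counter (simplifying assumption)

    def calc_ders_range(self, approxes, targets, weights):
        a = max(self.a_max / 2 * (np.cos(np.pi * (self.t % self.C) / self.C) + 1),
                self.a_min)
        self.t += 1
        result = []
        for idx in range(len(targets)):
            p = 1.0 / (1.0 + np.exp(-approxes[idx]))
            der1, der2 = a * (targets[idx] - p), a * (-p * (1.0 - p))
            if weights is not None:
                der1, der2 = der1 * weights[idx], der2 * weights[idx]
            result.append((der1, der2))
        return result

# Usage sketch: CatBoostClassifier(loss_function=CyclicalLogloss(), ...)</preformat>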
        <sec id="sec-6-1-5">
          <title>8 difer</title>
          <p>can achieve the state-of-the-art performance in exploring
multimodal distributions. Moreover, cg-SGLD has
benefits in implementation over clr-SGLD when combined
with SGLB. The SGLB algorithm was made available in
the CatBoost library [30], which only supports a fixed lr.
All our proposed enhancements can be implemented with
a "user-defined loss function" available in CatBoost
without touching the source code, making it straightforward
to reproduce our algorithms.</p>
          <p>Multi-Class Spiral Data
After validating the eficacy
of cyclical gradient scheduling on sampling from
multimodal distributions, we are now ready to experiment
with cSGLB. Specifically, we compare the following
algorithms on a 3-class classification task called "Spiral" in
[12]: (i) SGLB ensemble, where we denote by ens with
 models, (ii) SGLB virtual ensemble, simply denoted
by SGLB, and (iii) cSGLB virtual ensemble, denoted by
cSGLB (ours). We again reproduced the results in [12]
with code released by the authors, and Fig.3 shows the
estimated KU on Spiral test set. As noted in [12], we see
that knowledge uncertainty due to decision-boundary
"jitter" exists in both ens20 and cSGLB, and the "jitter"
afects cSGLB more as the estimated KU is "noisy" at the
decision boundary. Nevertheless, cSGLB (with only a
single model) is significantly more eficient than
ens20 and
is able to greatly improve upon SGLB in capturing high</p>
        </sec>
        <sec id="sec-6-1-6">
          <title>KU in regions with no training data.</title>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Experiments on Real-World Weather</title>
      </sec>
      <sec id="sec-6-3">
        <title>Prediction Data</title>
        <p>Lastly, we evaluate our proposed methods on the public Shifts Weather Prediction dataset [31]. We select the classification task where the ML model is asked to predict the precipitation class at a particular geolocation and timestamp, given heterogeneous tabular features derived from weather station measurements and forecast models. The full dataset is partitioned in a canonical fashion and contains in-domain (ID) training, development and evaluation datasets, as well as out-of-domain (OOD) development and evaluation datasets. Importantly, the ID data and the OOD data are separated in time and consist of non-overlapping climate types (ID: Tropical, Dry, Mild; OOD: Snow, Polar), making the Shifts dataset an ideal testbed for gauging the robustness of ML models and the quality of uncertainty estimation. To further facilitate our experimentation, we conducted the following data preprocessing: (1) feature selection to keep only the top 40 features by importance (out of the 123 available features), where the feature importance is determined by a CatBoost classifier with 1K trees trained on the entire training set; (2) dropping minority classes to keep only the 3 major precipitation classes, i.e., classes 0, 10, and 20 out of the 9 available classes from the original dataset; (3) random data sampling to keep 200K (medium-sized) data points in the final training set. Again, the purpose of our data preprocessing is to speed up experimentation, and we believe that the observations and findings in this study are generalizable to the original full dataset. For model building, 30 independent SGLB models (each of 1K trees) were trained and used to construct real ensembles ensM for M in {3, 5, 10, 30}. The SGLB/cSGLB/cbSGLB virtual ensembles were built by sampling 10 members from a single chain with 2K trees. Hence, ens10 is 5x more expensive in computation and memory than a virtual ensemble. Additional details regarding our data and models are included in Appendix C.</p>
        <p>We compare the various methods on their predictive performance and on uncertainty quantification following [31], and the results are summarized in Table 1. For predictive performance, we report the classification accuracy and macro F1 using BMA on both the ID and the OOD evaluation datasets. We observe the following effects: (1) Virtual ensembles with a longer chain slightly outperform the real ensembles on the ID data. (2) Our proposed cSGLB and cbSGLB perform slightly worse than the rest of the methods on the OOD data. However, this is usually not a concern in practice, since the model is not trained with data from the OOD domains and would not be used to solve the OOD prediction tasks in a practical scenario. As long as the domain shifts can be reliably detected (via uncertainty), proactive decisions can be made to avoid costly mistakes due to model errors. (3) Our proposed data bootstrapping mechanism is capable of improving the performance on the OOD data (cbSGLB improves over cSGLB). In addition, we include the F1-AUC metric (on the combined ID and OOD evaluation sets) introduced in [31] to jointly assess the predictive power and the uncertainty quality. The F1-AUC can be increased either by having a stronger predictive model or by improving the correlation between uncertainty and error. Consistent with the findings in [31], total uncertainty (TU) correlates more with errors than knowledge uncertainty (KU), as shown by the F1-AUC scores. More specifically, we see that the F1-AUC is quite similar across the board when measured by TU, although cSGLB/cbSGLB has slightly worse predictive power on the OOD segment. When F1-AUC is measured by KU, our cSGLB/cbSGLB is capable of producing KU estimates that relate more closely to model errors than the KU from the SGLB baseline.</p>
        <p>At last, we present the OOD detection ROC-AUC performance on the evaluation data using the KU estimates. Our cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on the OOD detection task, and even achieves comparable performance to the real ensemble ens10, which is 5x more expensive. This highlights that our cSGLB/cbSGLB can produce high-fidelity KU estimates to detect domain (or distributional) shifts with a single model, and that our proposed cyclical gradient scheduling is effective in exploring different modes of a posterior. In real-world industrial applications, detecting OOD data or domain shifts in an efficient way is often crucial to ensure a safe deployment and operation of ML systems. Observing consistently high uncertainty (especially KU) from model predictions indicates that the patterns of the new incoming data have deviated from the training data. This often provides a strong signal for a model refresh, ensuring that the ML system can be updated in time to avoid errors and operate safely in its "comfort zone" (with relatively low predictive uncertainty).</p>
      </sec>
    </sec>
    <sec id="sec-6-4">
      <title>6. Conclusion</title>
      <p>We present cyclical gradient scheduling and Cyclical SGLB for efficiently and effectively quantifying uncertainty in gradient boosting with a single model, and propose a data bootstrapping scheme to enhance diversity in posterior samples. We show empirically that our algorithms have superior performance over the state-of-the-art SGLB, especially in quantifying knowledge uncertainty and for OOD detection.</p>
    </sec>
    <sec id="sec-6-5">
      <title>A. Comparison between cSGLB and cSG-MCMC</title>
      <p>The proposed Cyclical SGLB algorithm combines SGLB with cSG-MCMC [13] to effectively explore different modes of a highly multimodal posterior distribution. In this section, we summarize some key differences between our design and the original cSG-MCMC algorithm.</p>
      <p>(1) cSG-MCMC is a sampling algorithm designed for Bayesian NNs, while cSGLB is built for GBDT models. In deep learning, full-batch gradient descent is usually not feasible, and techniques have been developed to explicitly compensate for mini-batch noises, such as preconditioning [14]. Some also suggested applying an additional Metropolis-Hastings correction step [15]. Tree-based GB models can easily scale up to large industrial datasets and digest the full training set at each iteration. Therefore, our cSGLB uses full-batch GB in the sampling stage of each cycle to ensure that high-quality samples are generated.</p>
      <p>(2) cSGLB puts a cyclical schedule on the gradient scale, while cSG-MCMC puts a schedule on the step size. In addition, the original cSG-MCMC completely removed the injected Gaussian noises in the exploration stage, so that cSG-MCMC reduces to regular stochastic gradient descent (SGD) during the period of exploration. Although the authors claimed that this amounts to posterior tempering, which is commonly used in the DL domain, the implementation of the cSG-MCMC algorithm does not follow closely/strictly the dynamics of SGLD during the exploration stage. In contrast, we keep the injected noise term unchanged during the course of learning. Our design achieved similar effects compared with step-size scheduling on a synthetic experiment, and we also ensure that cSGLB follows the dynamics of SGLD (more or less) at every iteration step.</p>
      <p>(3) Lastly, the gradient scaling (instead of step-size scheduling) has implementation benefits. The SGLB algorithm is made available in the CatBoost library [30], which only supports a constant step size (or learning rate). Our proposed cyclical gradient scaling (and data bootstrapping) can be implemented easily with the "user-defined loss function" available in the CatBoost package, without modifying a single line of the source code.</p>
    </sec>
    <sec id="sec-7">
      <title>B. Synthetic Data</title>
      <sec id="sec-7-1">
        <title>B.1. Synthetic Multimodal Distribution</title>
        <p>The ground truth density of the distribution is $$p(x) = \sum_{j=1}^{25} \frac{1}{25}\, \mathcal{N}(x \mid \mu_j, \Sigma),$$ where $\mu_j \in \{-4, -2, 0, 2, 4\} \times \{-4, -2, 0, 2, 4\}$ and $\Sigma = 0.03\, I_2$. We used the code released by the authors [13] to generate our results. Specifically, SGLD was trained with a decaying learning rate $\epsilon_t = 0.05(t + 1)^{-0.55}$, and clr-SGLD was learned with a cyclical learning-rate schedule with initial value $\epsilon_0 = 0.09$ and exploration proportion $\tau = 0.25$. For our cg-SGLD, we fixed the learning rate $\epsilon = 0.01$, set the Gaussian noise scale to $0.4$, and set $\alpha_{\max} = 10$, $\alpha_{\min} = 1$. The "noisy" version of SGLD (NoisySGLD/N-SGLD) was trained with a fixed learning rate $\epsilon = 0.02$ and noise scale $5.0$ (roughly 10x larger than the noise scale used in the other methods). Each chain was trained for 50K iterations, and both clr-SGLD and cg-SGLD had 30 cycles. The results and findings are robust to random seeds, and similar results were observed with different seeds. We refer the interested readers to the original paper [13] for results of SGLD and clr-SGLD in parallel (or multi-chain) settings.</p>
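        <p>For reference, the target density can be written in a few lines (a sketch matching the formula above):</p>
        <preformat>import numpy as np

centers = np.array([(i, j) for i in (-4, -2, 0, 2, 4)
                           for j in (-4, -2, 0, 2, 4)])  # the 25 means
var = 0.03  # shared isotropic covariance: Sigma = 0.03 * I

def target_density(x):
    """p(x) = (1/25) * sum_j N(x | mu_j, 0.03 I) for a 2-D point x."""
    sq = ((np.asarray(x) - centers) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * var)).sum() / (25 * 2 * np.pi * var)</preformat>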
      </sec>
      <sec id="sec-7-2">
        <title>B.2. Synthetic Spiral Dataset</title>
        <p>All experiments were conducted using CatBoost [30], one of the state-of-the-art libraries for GBDTs. The ensemble of SGLB (ens20) contains 20 independent models (with different random seeds) of 1K trees each. The learning rate is $\epsilon = 0.1$, the tree depth is 6, random_strength = 100, and border_count = 128. The SGLB virtual ensemble and the cSGLB virtual ensemble are trained with the same parameters, except that we increase the number of trees to 2K and lower the learning rate for cSGLB to $\epsilon = 0.05$. Thus, a virtual ensemble is 10x more efficient in computation and memory than the actual SGLB ensemble. For the SGLB virtual ensemble, each 50th model from the interval [1000, 2000] is added to the ensemble, making it a virtual ensemble of 20 members. For the cSGLB virtual ensemble, we set $\epsilon = 0.05$, cycle length $C = 200$, $\alpha_{\max} = 10$, $\alpha_{\min} = 1$, making it a virtual ensemble of $2000/200 = 10$ members. For cbSGLB with bootstrapping, we additionally set the exploration proportion $\tau = 0.9$ and the mask probability $\rho = 0.66$.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>C. Real-World Shifts Data</title>
      <p>Data Summary. A detailed summary of our final partitioning of the Weather Prediction dataset is included in Table 2.</p>
      <p>Experimental Details. We used the default parameter settings for SGLB models as suggested in the original paper [12] for uncertainty quantification, except that we set the subsample rate to 0.8 for stochastic gradient boosting. The real SGLB ensemble consists of (up to) 30 SGLB models trained with different seeds, each of 1K trees. In order to get more samples from a single chain, the virtual ensembles of SGLB and our cSGLB/cbSGLB were learned with a single model of 2K trees. We set the learning rate for all models to $\epsilon = 0.05$, and the tree depth to 6. For the SGLB virtual ensemble, each 100th model from the interval [1000, 2000] was added to the ensemble, making it a virtual ensemble of 10 members. cSGLB and cbSGLB shared the same parameters with their SGLB counterpart. In addition, for the cSGLB/cbSGLB virtual ensembles, we set cycle length $C = 200$, $\alpha_{\max} = 10.0$, $\alpha_{\min} = 1.0$, making them virtual ensembles of $2000/200 = 10$ members. For simplicity, cSGLB used full-batch gradient boosting at each iteration step. In contrast, for cbSGLB with bootstrapping, we set the exploration proportion $\tau = 0.8$, i.e., 80% of a cycle was treated as exploration, and set the mask probability $\rho = 0.6$ in the exploration stage. For model and parameter selection, we only used the in-domain (ID) development set and did not use the out-of-domain (OOD) development set. Although this may potentially lower our reported performance on the OOD evaluation set, we believe that it better reflects a real-world learning scenario where the shifted data is often unobserved and unavailable at training time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dressel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Farid</surname>
          </string-name>
          ,
          <article-title>The accuracy, fairness, and limits of predicting recidivism</article-title>
          ,
          <source>Science Advances</source>
          <volume>4</volume>
          (
          <year>2018</year>
          )
          <article-title>eaao5580</article-title>
          . URL: https://www.science.org/doi/abs/10.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          1126/sciadv.aao5580. doi:
          <volume>10</volume>
          .1126/sciadv.aao5580.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brooks</surname>
          </string-name>
          , E. Brynjolfsson,
          <string-name>
            <given-names>R.</given-names>
            <surname>Calo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hirschberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kalyanakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kraus</surname>
          </string-name>
          , et al.,
          <source>Artificial intelligence and life in 2030, One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel</source>
          <volume>52</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Serban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Poll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Visser</surname>
          </string-name>
          ,
          <article-title>Towards using probabilis6. Conclusion tic models to design software systems with inherent uncertainty (</article-title>
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2008</year>
          .03046.
          <article-title>We present cyclical gradient scheduling</article-title>
          and Cyclical doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>2008</year>
          .03046.
          <article-title>SGLB for eficiently and efectively quantifying uncer-</article-title>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>McGrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zytek</surname>
          </string-name>
          , I. Lage, H. Lakkaraju,
          <article-title>tainty in gradient boosting with a single model, and pro- When does uncertainty matter?: Understanding the impose a data bootstrapping scheme to enhance diversity pact of predictive uncertainty in ml assisted decision in posterior samples. We show empirically that our al- making (</article-title>
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2011</year>
          .06167.
          <article-title>gorithms have superior performance over the state</article-title>
          -of- doi:10.48550/ARXIV.
          <year>2011</year>
          .
          <volume>06167</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shwartz-Ziv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Armon</surname>
          </string-name>
          ,
          <article-title>Tabular data: Deep learning the-art SGLB, especially in quantifying knowledge un- is not all you need</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.
          <article-title>certainty and for OOD detection</article-title>
          .
          <volume>03253</volume>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2106.03253.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          , A. G. Wilson,
          <article-title>Cyclical stochastic gradient mcmc for bayesian deep learning</article-title>
          ,
          <volume>40</volume>
          .
          <fpage>92</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>