<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tian</forename><surname>Tan</surname></persName>
							<email>tianta@amazon.com</email>
							<affiliation key="aff0">
<orgName type="department">Buyer Risk Prevention - ML</orgName>
								<orgName type="institution">WW Customer Trust, Amazon</orgName>
								<address>
									<postCode>98109</postCode>
									<settlement>Seattle</settlement>
									<region>WA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carlos</forename><surname>Huertas</surname></persName>
							<email>carlohue@amazon.com</email>
							<affiliation key="aff0">
<orgName type="department">Buyer Risk Prevention - ML</orgName>
								<orgName type="institution">WW Customer Trust, Amazon</orgName>
								<address>
									<postCode>98109</postCode>
									<settlement>Seattle</settlement>
									<region>WA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Qi</forename><surname>Zhao</surname></persName>
							<email>qqzhao@amazon.com</email>
							<affiliation key="aff0">
<orgName type="department">Buyer Risk Prevention - ML</orgName>
								<orgName type="institution">WW Customer Trust, Amazon</orgName>
								<address>
									<postCode>98109</postCode>
									<settlement>Seattle</settlement>
									<region>WA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<address>
									<addrLine>February 13 -14</addrLine>
									<postCode>2023</postCode>
									<settlement>Washington</settlement>
									<region>D.C</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A13C96CAB30BA94A6E1DA9A82D6A4254</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T06:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>uncertainty quantification</term>
					<term>gradient boosting decision trees</term>
					<term>Bayesian inference</term>
					<term>out-of-domain (OOD) detection</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Gradient boosting decision trees (GBDTs) are widely applied to tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making and for avoiding costly mistakes, ensuring an interpretable and safe deployment of tree-based models. Recently, Bayesian ensembles of GBDT models have been used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines GB with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB easily gets trapped in a particular mode of the Bayesian posterior, just like other forms of SG-MCMC. Therefore, a single SGLB model can often fail to produce high-fidelity uncertainty estimates. To address this problem, we present Cyclical SGLB (cSGLB), which incorporates a cyclical gradient schedule into the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore highly multimodal posterior distributions. As a result, cSGLB can efficiently quantify uncertainty in GB with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the efficiency and effectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>With the rapid growth of data and computing power, machine learning (ML) has been gaining new applications in areas not imagined before. As ML systems become more ubiquitous, it is inevitable to see applications in very sensitive and high-risk fields. These span numerous areas such as criminal recidivism <ref type="bibr" target="#b0">[1]</ref>, medical follow-ups <ref type="bibr" target="#b1">[2]</ref> and autonomous systems <ref type="bibr" target="#b2">[3]</ref>. While these systems may be very different, they share a common need: a certain degree of confidence in ML predictions. A proven way to build confidence in critical systems is uncertainty estimation. Research has shown that humans are more likely to agree with a system if they get access to the corresponding uncertainty, and this holds regardless of shape and variance as the approach itself is model and task agnostic <ref type="bibr" target="#b3">[4]</ref>. Since the most common data type in real-world ML applications is tabular <ref type="bibr" target="#b4">[5]</ref>, our work in this paper focuses specifically on uncertainty quantification for state-of-the-art gradient boosting decision trees (GBDTs) <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>, which are known to outperform deep learning (DL) methods on tabular data, both in accuracy and tuning requirements <ref type="bibr" target="#b4">[5]</ref>. Measuring uncertainty effectively and efficiently on GBDT predictions can therefore not only improve model interpretability in production but also ensure a safer deployment of ML systems, especially for high-risk applications.</p><p>Uncertainty quantification (UQ) has been widely studied for neural networks under the Bayesian framework <ref type="bibr" target="#b7">[8]</ref>; however, it is relatively under-explored for tree-based models. 
Although calibrated probability estimation trees <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref> can be used for UQ, they have not been studied from a Bayesian perspective. Recently, Bayesian ensemble methods were extended to measure uncertainty in GBDTs by leveraging a new algorithm called stochastic gradient Langevin boosting (SGLB) <ref type="bibr" target="#b10">[11]</ref>. Specifically, two SGLB-based approaches were introduced for UQ <ref type="bibr" target="#b11">[12]</ref>: (1) SGLB ensemble, which trains multiple SGLB models in parallel, and (2) SGLB virtual ensemble, which constructs a virtual ensemble using only a single SGLB model where each member in the ensemble is a "truncated" sub-model <ref type="bibr" target="#b11">[12]</ref>. Although both approaches are theoretically sound, there is clearly a trade-off between quality and efficiency in practice. The SGLB (real) ensemble is believed to be accurate as it can characterize the Bayesian posterior well by running independent models in parallel. However, it is almost infeasible to deploy such an ensemble in real-world production due to its high computational and maintenance costs. The SGLB virtual ensemble greatly improves efficiency; however, it often gets stuck in a single mode of the Bayesian posterior and can produce degraded uncertainty estimates <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>. 
To better balance between quality and efficiency and to facilitate the usage of uncertainty-enabled ML systems, an important question remains: how can we make a single SGLB explore effectively different modes of a posterior given a limited computational budget?</p><p>In this paper, we address the question above by combining SGLB virtual ensemble with advanced sampling techniques from Bayesian DL <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. Inspired by the ideas in <ref type="bibr" target="#b12">[13]</ref>, we propose to use a scaler (or scaling factor) on gradients that follows a cyclical schedule during the course of SGLB training. The cyclical schedule is illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, and consequently, we name the resulting algorithm Cyclical SGLB (cSGLB). Similar to <ref type="bibr" target="#b12">[13]</ref>, each cycle in cSGLB contains two stages: (1) Exploration: when the scaler is large, we treat this stage as a warm restart from the previous cycle, enabling the model/sampler to follow the gradients closely and to escape from the current local mode. (2) Sampling: when the gradient scaler is small, the scale of injected Gaussian noise in the SGLB procedure becomes relatively large, encouraging the sampler to fully characterize one local mode. We collect one sample (or truncated sub-model) to build the virtual ensemble at the end of each cycle. The cyclical gradient schedule therefore helps cSGLB effectively explore different modes of a posterior while maintaining the same level of efficiency of a virtual ensemble. Moreover, inspired by a recent study <ref type="bibr" target="#b15">[16]</ref> showing that "diversified" posterior may provide a tighter generalization bound, we present another simple approach to encourage diversity in samples obtained from running cSGLB via data bootstrapping. 
We name this variant Cyclical Bootstrapped SGLB (cbSGLB).</p><p>We extensively experiment with our proposed algorithms and compare the performance against SGLB ensemble and the original SGLB virtual ensemble. Particularly, we show that our cyclical gradient schedule can help explore multimodal distributions effectively, that cSGLB is capable of producing uncertainty estimates better aligned with the SGLB real ensemble, and that cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on out-of-domain (OOD) data detection, indicating superior performance in detecting distributional/domain shifts in real-world tabular data streams.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Bayesian ML and approximate Bayesian inference provide a principled representation of uncertainty. One popular family of approaches to inference in Bayesian ML is stochastic gradient Markov Chain Monte Carlo (SG-MCMC) <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b12">13]</ref>, which is used to effectively sample models (or model parameters) from the Bayesian posterior. Uncertainty then comes naturally by measuring the "discrepancy" in predictions from the sampled models, which are regarded as posterior samples. Recently, stochastic gradient Langevin boosting (SGLB) <ref type="bibr" target="#b10">[11]</ref> was proposed by combining gradient boosting with SG-MCMC. As its name suggests, the Markov chain generated by SGLB obeys a special form of the stochastic gradient Langevin dynamics (SGLD) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b16">17]</ref>, which implies that SGLB is able to generate samples from the true Bayesian posterior asymptotically. Leveraging this property, Malinin et al. <ref type="bibr" target="#b11">[12]</ref> proposed to use (1) SGLB ensemble or (2) SGLB virtual ensemble to measure uncertainty in GBDTs. Essentially, SGLB ensemble corresponds to running multiple SG-MCMCs in parallel, where each chain (or SGLB) is initialized independently with a different random seed. Since SGLB allows us to sample from the true posterior, the ensemble with multiple samples gives a high-fidelity approximation to the Bayesian posterior. In contrast, SGLB virtual ensemble only trains a single SGLB model and uses multiple truncated sub-models to form a (virtual) ensemble. 
The key idea is essentially extracting multiple samples from a single-chain SG-MCMC instead of running multiple chains in parallel.</p><p>In theory, SGLB or single-chain SG-MCMC converges asymptotically to the target distribution and should behave similarly to the multi-chain SGLB ensemble in the limit, but it can suffer from a bounded estimation error in limited time <ref type="bibr" target="#b20">[21]</ref>. Moreover, it is often believed that the posterior is highly multimodal in the parametric space of modern ML models <ref type="bibr" target="#b12">[13]</ref>, since there are potentially many different sets of parameters that can describe the training data equally well. The real ensemble can explore different modes of the posterior by running independent chains in parallel, providing a complete picture of the distribution as the number of chains increases. However, a single-chain SG-MCMC often gets stuck in a single mode of the posterior <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, failing to cover the full spectrum of the distribution.</p><p>In this paper, we extend the ideas behind Cyclical SG-MCMC (cSG-MCMC) in DL <ref type="bibr" target="#b12">[13]</ref> to sampling from a tree-based SGLB model, which promotes new mode discovery during training. Unlike cSG-MCMC, which places a cyclical schedule on the step size, we propose to use a cyclical schedule on the gradient scale. We also point out and justify the difference and our design choice in Appendix A. In addition, we propose a simple strategy to further encourage diversity in samples obtained from a single chain by data bootstrapping. At the beginning of each cycle (see Fig. <ref type="figure" target="#fig_0">1</ref>), we construct a bootstrapped dataset that is a random subset of the training data, and use that bootstrapped data consistently during the exploration stage to update the GBDT model. 
The "bias" induced by data bootstrapping also amounts to posterior tempering <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b12">13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">General Setup</head><p>Given a set of 𝑁 training data points sampled from an unknown distribution 𝒟 on 𝒳 × 𝒴, i.e., (𝑥1, 𝑦1), . . . , (𝑥𝑁 , 𝑦𝑁 ) ∼ 𝒟 denoted as 𝒟𝑁 , and a loss function 𝐿(𝑧, 𝑦) : 𝒵 × 𝒴 → R where 𝒵 denotes the space of predictions, our goal is to minimize the empirical loss ℒ(𝑓 |𝒟𝑁 ) := 1 𝑁 ∑︀ 𝑁 𝑖=1 𝐿(𝑓 (𝑥𝑖), 𝑦𝑖) over functions 𝑓 belonging to some family ℱ ⊂ {𝑓 : 𝒳 → 𝒵}. In this paper, we only consider ℱ corresponding to additive ensembles of decision trees ℋ := {ℎ 𝑠 (𝑥, 𝜃 𝑠 ) : 𝒳 × R 𝑚𝑠 → R, 𝑠 ∈ 𝑆}, where 𝑆 is an index set and ℎ 𝑠 has parameters 𝜃 𝑠 . Decision trees are built by recursively partitioning the feature space into disjoint regions (called leaves). Each region is assigned a value that is used to estimate the response of 𝑦 in the corresponding feature subspace. Let's denote these regions by 𝑅𝑗's, then we have ℎ(𝑥, 𝜃) = ∑︀ 𝑗 𝜃𝑗1{𝑥 ∈ 𝑅𝑗}, where 1{•} denotes the indicator function. Therefore, given the tree structure, decision tree ℎ 𝑠 is a linear function of its parameters 𝜃 𝑠 . It is often assumed that the set 𝑆 is finite because the training data is finite <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>, e.g., there exists only a finite number of ways to partition the training data. Owing to the linear dependence of ℎ 𝑠 on 𝜃 𝑠 and the finiteness assumption on 𝑆, we can represent any ensemble of models from ℋ as a linear model 𝑓Θ(𝑥) = 𝜑(𝑥) 𝑇 Θ for some feature map 𝜑(𝑥) : 𝒳 → R 𝑚 and Θ ∈ R 𝑚 denotes the parameters of the entire ensemble <ref type="bibr" target="#b10">[11]</ref>. Hence, in the subsequent discussion, we will simply denote the parameters of the GBDT model obtained at iteration 𝜏 as Θ ˆ𝜏 , and additionally define a linear mapping 𝐻𝑠 : R 𝑚𝑠 → R 𝑁 that converts 𝜃 𝑠 to predictions (ℎ 𝑠 (𝑥𝑖, 𝜃 𝑠 )) 𝑁 𝑖=1 .</p></div>
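The linear view of a fixed-structure tree, ℎ(𝑥, 𝜃) = ∑𝑗 𝜃𝑗1{𝑥 ∈ 𝑅𝑗}, can be made concrete with a small sketch (ours, not from the paper): leaf membership defines one-hot features 𝜑(𝑥), so the tree's prediction is the inner product 𝜑(𝑥)ᵀ𝜃. The split points and leaf values below are illustrative.

```python
import numpy as np

def leaf_indicators(x, split_points):
    """Map scalar inputs to one-hot leaf-membership features phi(x)."""
    # np.searchsorted assigns each x to the interval (leaf) it falls in
    leaf_idx = np.searchsorted(split_points, x)
    phi = np.zeros((len(x), len(split_points) + 1))
    phi[np.arange(len(x)), leaf_idx] = 1.0
    return phi

splits = np.array([0.0, 1.0])        # two splits -> three leaves (regions R_j)
theta = np.array([-1.0, 0.5, 2.0])   # leaf values theta_j
x = np.array([-0.5, 0.3, 1.7])

# With the structure fixed, the tree prediction is the linear form phi(x)^T theta
pred = leaf_indicators(x, splits) @ theta
print(pred.tolist())  # [-1.0, 0.5, 2.0]
```

Stacking the feature maps of all trees in an ensemble gives exactly the global linear model 𝑓Θ(𝑥) = 𝜑(𝑥)ᵀΘ used in the rest of the paper.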
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">SGLB</head><p>SGLB combines stochastic gradient boosting (SGB) <ref type="bibr" target="#b6">[7]</ref> with stochastic gradient Langevin dynamics (SGLD) <ref type="bibr" target="#b16">[17]</ref>. Following notations used in the original paper <ref type="bibr" target="#b10">[11]</ref>, we characterize the SGB procedure by a tuple ℬ := {ℋ, 𝑝}, where ℋ again is the set of base learners and 𝑝(𝑠|𝑔) is a distribution over indices 𝑠 ∈ 𝑆 conditioned on a gradient vector 𝑔 ∈ R 𝑁 . Simply put, 𝑝(𝑠|𝑔) defines a distribution over tree structures.</p><p>As with other GBDT algorithms, SGLB constructs an ensemble of decision trees iteratively. At each iteration 𝜏 , we compute unbiased gradient estimates 𝑔 ˆ𝜏 such that E[𝑔 ˆ𝜏 ] = ( 𝜕 𝜕𝑓 𝐿(𝑓 Θ ^𝜏 (𝑥𝑖), 𝑦𝑖)) 𝑁 𝑖=1 ∈ R 𝑁 using the current model 𝑓 Θ ^𝜏 , and sample independently two normal vectors 𝜁, 𝜁 ′ ∼ 𝒩 (0𝑁 , 𝐼𝑁 ), where 0𝑁 , 𝐼𝑁 denote the zero vector and identity matrix in R 𝑁 , respectively. Then, a base learner (or tree structure) 𝑠𝜏 is picked by drawing one sample from 𝑝(𝑠|𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁 ′ ), where 𝜖 &gt; 0 is a learning rate (or step size) and 𝛽 &gt; 0 is a parameter often referred to as the inverse diffusion temperature. Next, we estimate the parameters 𝜃 𝑠𝜏 * (at tree leaves) of the sampled base learner by solving the following optimization:</p><formula xml:id="formula_0">minimize ||𝜃 𝑠𝜏 || 2 2 𝑠.𝑡. 𝜃 𝑠𝜏 ∈ argmin 𝜃∈R 𝑚𝑠 𝜏 || − 𝑔 ˆ𝜏 − √︂ 2𝑁 𝜖𝛽 𝜁 − 𝐻𝑠 𝜏 𝜃|| 2 2 ,<label>(1)</label></formula><p>which returns the minimum norm solution that fits best to the perturbed "noisy" version of negative gradients. The optimization above has a closed-form solution</p><formula xml:id="formula_1">𝜃 𝑠𝜏 * = −Φ𝑠 𝜏 (𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁)</formula><p>, where Φ𝑠 𝜏 := (𝐻 𝑇 𝑠𝜏 𝐻𝑠 𝜏 ) + 𝐻 𝑇 𝑠𝜏 and + denotes the pseudo-inverse. For decision trees, Φ𝑠 𝜏 𝑔 essentially corresponds to averaging the gradient estimates 𝑔 in each leaf node of the tree. 
Lastly, the SGLB algorithm updates the ensemble model by</p><formula xml:id="formula_2">𝑓 Θ ^𝜏+1 (•) := (1 − 𝛾𝜖)𝑓 Θ ^𝜏 (•) + 𝜖ℎ 𝑠𝜏 (•, 𝜃 𝑠𝜏 * ),<label>(2)</label></formula><p>where 𝛾 is a regularization parameter that "shrinks" the currently built model when updating the ensemble. At a high level, SGLB is a stochastic GB algorithm with Gaussian noise injected into gradient estimates, which encourages the algorithm to explore a larger area in the functional space to find a better fit for the given data. The independence between noise 𝜁 (used for parameter learning) and 𝜁 ′ (used for tree sampling), and the model shrinking by 𝛾 in Eqn.(2) are technical details needed for establishing theoretical results and rigorous analysis of SGLB <ref type="bibr" target="#b10">[11]</ref>. All the procedures of SGLB are also present in our proposed cSGLB in Algo. 1 (with our additional modifications highlighted in blue).</p><p>One can show that the parameters of SGLB Θ ˆ𝜏 at each iteration form a Markov chain that weakly converges to the following stationary distribution:</p><formula xml:id="formula_3">𝑝 𝛽 * (Θ) ∝ exp(−𝛽ℒ(Θ|𝒟𝑁 ) − 𝛽𝛾||ΓΘ|| 2 2 ),<label>(3)</label></formula><p>where Γ = Γ 𝑇 &gt; 0 is a regularization matrix which depends on a particular tree construction algorithm or the choice of tuple ℬ := {ℋ, 𝑝(𝑠|𝑔)} <ref type="bibr" target="#b10">[11]</ref>. Note that since the GBDT model is linear and can be fully determined by parameters Θ, we simply use the notations ℒ(𝑓 |𝒟𝑁 ) and ℒ(Θ|𝒟𝑁 ) interchangeably.</p></div>
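One SGLB iteration (Eqns. (1)-(2)) can be illustrated with a self-contained toy sketch of our own (not the CatBoost implementation): a depth-1 median-split stump stands in for sampling 𝑠𝜏 ∼ 𝑝(𝑠|𝑔), leaf averaging plays the role of Φ𝑠𝜏, and the noise scale and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sglb_step(x, y, pred, eps, beta, gamma):
    """One toy SGLB boosting iteration on squared loss L = (f - y)^2 / 2."""
    N = len(y)
    g = pred - y                                    # gradient estimate g_tau
    zeta = rng.standard_normal(N)                   # injected Gaussian noise
    # Leaf targets fit the noise-perturbed negative gradients, -(g + c*zeta)
    target = -(g + np.sqrt(2 * N / (eps * beta)) * zeta)
    split = np.median(x)                            # stand-in for s ~ p(s|g)
    left = x <= split
    # Phi applied to a vector = averaging it within each leaf
    leaf = np.where(left, target[left].mean(), target[~left].mean())
    # Shrunken ensemble update of Eqn. (2)
    return (1 - gamma * eps) * pred + eps * leaf

x = np.linspace(-1, 1, 200)
y = x ** 2
pred = np.zeros_like(y)
for _ in range(300):
    pred = sglb_step(x, y, pred, eps=0.1, beta=1e4, gamma=1e-3)
print(np.mean((pred - y) ** 2))  # well below the ~0.2 loss of the zero model
```

Despite the injected noise, the shrunken updates still reduce the empirical loss; the noise merely makes the trajectory a sampler rather than a pure optimizer.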
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Posterior Sampling</head><p>We consider here a standard Bayesian learning framework <ref type="bibr" target="#b7">[8]</ref> that treats parameters Θ as random variables and places a prior 𝑝(Θ) over Θ. In addition, we consider the GBDT model 𝑓Θ as a probabilistic model and explicitly denote the model by 𝑃 (𝑦|𝑥; Θ) with parameters Θ. This holds naturally for classification as GBDT models by construction return a distribution over class labels. For regression, one can leverage the NGBoost algorithm <ref type="bibr" target="#b23">[24]</ref> to return the mean and variance of a Gaussian distribution over the target 𝑦 for a given input 𝑥.</p><p>For the purpose of uncertainty estimation, we aim to estimate or obtain an approximation to the Bayesian posterior 𝑝(Θ|𝒟𝑁 ). To that end, we can choose 𝛽 = 𝑁 and 𝛾 = 1/(2𝑁)<ref type="foot" target="#foot_0">1</ref> and use the negative log-likelihood as the loss function</p><formula xml:id="formula_4">ℒ(Θ|𝒟𝑁 ) = E𝒟 𝑁 [− log 𝑝(𝑦|𝑥, Θ)] = − 1 𝑁 ∑︀ 𝑁 𝑖=1 log 𝑝(𝑦𝑖|𝑥𝑖, Θ).</formula><p>Then, the limiting distribution of SGLB can be explicitly expressed as:</p><formula xml:id="formula_5">𝑝 𝛽 * (Θ) ∝ exp (︁ log 𝑝(𝒟𝑁 |Θ) − 1 2 ||ΓΘ|| 2 2 )︁ ∝ 𝑝(𝒟𝑁 |Θ)𝑝(Θ),<label>(4)</label></formula><p>which is proportional to the true posterior 𝑝(Θ|𝒟𝑁 ) under Gaussian prior 𝑝(Θ) = 𝒩 (0𝑚, Γ) <ref type="bibr" target="#b10">[11]</ref>. Now, consider a Bayesian ensemble of probabilistic models {𝑃 (𝑦|𝑥; Θ (𝑘) )} 𝐾 𝑘=1 where each model is trained independently by running SGLB. Since each Θ (𝑘) is guaranteed to be sampled from 𝑝(Θ|𝒟𝑁 ) by Eqn.( <ref type="formula" target="#formula_5">4</ref>), the ensemble {Θ (𝑘) } 𝐾 𝑘=1 with 𝐾 samples yields a "discrete" approximation to the posterior 𝑝(Θ|𝒟𝑁 ). This is exactly the idea behind SGLB ensemble <ref type="bibr" target="#b11">[12]</ref>, which learns 𝐾 independent SGLB models in parallel with different random seeds. 
Although the approximation improves as 𝐾 increases, the computational cost also increases linearly with 𝐾. To alleviate the computational burden, SGLB virtual ensemble <ref type="bibr" target="#b11">[12]</ref> builds a Bayesian virtual ensemble by sampling multiple times from a single-chain SGLB model. Because samples from the same chain are highly correlated, SGLB virtual ensemble proposes to sample one member Θ (𝑘) every 𝐶 &gt; 1 iterations. More specifically, the parameters are sampled by</p><formula xml:id="formula_6">{Θ (𝑘) } 𝐾=⌊ 𝒯 2𝐶 ⌋ 𝑘=1 = {Θ ˆ𝐶(𝑘+⌊ 𝒯 2𝐶 ⌋) , 𝑘 = 1, . . . , ⌊ 𝒯 2𝐶 ⌋},</formula><p>i.e., appending one member to the ensemble every 𝐶 iterations while constructing one SGLB model using 𝒯 iterations of gradient boosting. Notice that no sampling is performed during the first half of iterations (𝜏 &lt; 𝒯 /2) since Eqn.(4) holds only asymptotically. For large 𝐶 and 𝐾, the virtual ensemble should theoretically behave similarly to the SGLB real ensemble in the limit.</p></div>
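The snapshot schedule {Θ̂_{𝐶(𝑘+⌊𝒯/2𝐶⌋)}} can be spelled out with a few lines of Python (our own sketch; variable names mirror the paper's 𝒯 and 𝐶):

```python
def virtual_ensemble_indices(T, C):
    """Iterations at which the SGLB virtual ensemble snapshots sub-models:
    one every C iterations, skipping the first half of the T boosting
    iterations because the stationarity of Eqn. (4) holds only asymptotically."""
    K = T // (2 * C)                       # ensemble size K = floor(T / 2C)
    offset = T // (2 * C)                  # shift past the burn-in half
    return [C * (k + offset) for k in range(1, K + 1)]

print(virtual_ensemble_indices(T=1000, C=100))  # [600, 700, 800, 900, 1000]
```

All snapshots land in the second half of training, and the spacing 𝐶 controls how correlated consecutive ensemble members are.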
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Uncertainty Estimation</head><formula xml:id="formula_7">Once the Bayesian (virtual) ensemble {𝑃 (𝑦|𝑥; Θ (𝑘) )} 𝐾 𝑘=1 , Θ (𝑘) ∼ 𝑝(Θ|𝒟𝑁 ) is learned,</formula><p>predictions can be made by taking an average over the ensemble, often known as predictive posterior or Bayesian model average (BMA):</p><formula xml:id="formula_8">𝑝(𝑦|𝑥, 𝒟𝑁 ) = E 𝑝(Θ|𝒟 𝑁 ) [𝑃 (𝑦|𝑥; Θ)] ≈ 1 𝐾 ∑︀ 𝐾 𝑘=1 𝑃 (𝑦|𝑥; Θ (𝑘) ).</formula><p>(5) The entropy of the predictive posterior estimates total uncertainty (TU) in predictions, which can be further decomposed into two distinct types of uncertainty: knowledge uncertainty and data uncertainty. 1 (a) Knowledge uncertainty (KU) arises due to the lack of knowledge about the data generation process (or the unknown distribution 𝒟). KU is expected to be large in regions (in the feature space) where we do not have sufficient training data. (b) Data uncertainty (DU) arises due to the inherent stochasticity within the data generation process, and it is high in regions with class overlaps. In applications like active learning <ref type="bibr" target="#b24">[25]</ref>, reinforcement learning (RL) <ref type="bibr" target="#b25">[26]</ref>, and OOD detection, it is desirable to measure KU separately from DU (or TU), and the following equation can be used in practice to compute and connect them via mutual information <ref type="bibr" target="#b26">[27]</ref>:</p><formula xml:id="formula_9">I(𝑦; Θ|𝑥, 𝒟 𝑁 ) ⏟ ⏞ Knowledge Uncertainty = H(𝑝(𝑦|𝑥, 𝒟 𝑁 )) ⏟ ⏞ Total Uncertainty − E 𝑝(Θ|𝒟 𝑁 ) [H(𝑃 (𝑦|𝑥; Θ))] ⏟ ⏞ Expected Data Uncertainty ≈ H (︁ 1 𝐾 𝐾 ∑︁ 𝑘=1 𝑃 (𝑦|𝑥; Θ (𝑘) ) )︁ − 1 𝐾 𝐾 ∑︁ 𝑘=1 H (︁ 𝑃 (𝑦|𝑥; Θ (𝑘) ) )︁ ,<label>(6)</label></formula><p>where I(𝐴; 𝐵) denotes the mutual information between random variables A and B, and H(•) denotes entropy. 
The difference between TU and DU measures the disagreement among members in the ensemble and estimates the knowledge uncertainty.<ref type="foot" target="#foot_1">2</ref> </p></div>
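The decomposition in Eqn. (6) is easy to compute from an ensemble's class probabilities; the following sketch (ours, with illustrative toy inputs) makes the TU/DU/KU split concrete for a 𝐾-member ensemble.

```python
import numpy as np

def entropy(p):
    """Shannon entropy along the last axis, guarded against log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def decompose_uncertainty(probs):
    """probs: (K, num_classes) array, one categorical distribution per member."""
    bma = probs.mean(axis=0)               # Bayesian model average, Eqn. (5)
    total = entropy(bma)                   # TU: entropy of the predictive posterior
    expected_data = entropy(probs).mean()  # DU: mean per-member entropy
    knowledge = total - expected_data      # KU: mutual information I(y; Theta | x)
    return total, expected_data, knowledge

# Members agree -> knowledge uncertainty ~ 0 (only data uncertainty remains)
agree = np.array([[0.9, 0.1], [0.9, 0.1]])
# Members disagree confidently -> knowledge uncertainty is large
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])

print(decompose_uncertainty(agree)[2])     # ~0.0
print(decompose_uncertainty(disagree)[2])  # close to log(2) ~ 0.69
```

The second case mimics an OOD input: each member is confident but they contradict each other, so TU is high while expected DU stays low.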
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Cyclical Stochastic Gradient Langevin Boosting (cSGLB)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Promoting Mode Discovery via Cyclical Gradient Scheduling</head><p>Instead of building a cumbersome true ensemble of SGLB models, the virtual ensemble of SGLB greatly improves efficiency by training only a single model. However, similar to other types of SG-MCMC in Bayesian DL <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>, single-chain SGLB easily gets trapped in a single mode of the posterior. To efficiently explore different modes of the multimodal posterior and effectively measure uncertainty in GBDT predictions with a single chain, we propose a simple remedy that places a cyclical cosine schedule on gradient scale during training, as illustrated in Fig. <ref type="figure" target="#fig_0">1</ref>. Specifically, the scaling factor at iteration 𝜏 is defined as:</p><formula xml:id="formula_10">𝛼𝜏 = max (︁ 𝛼max 2 [︁ cos( 𝜋 mod (𝜏, 𝐶) 𝐶 )+1 ]︁ , 𝛼min )︁ ,<label>(7)</label></formula><p>where 𝛼max ≥ 1 is the maximum value of the scaler (the initial value 𝛼0), 𝐶 is the user-defined cycle length, and 𝛼min defines the minimum of the scaler, e.g., 𝛼min = 1 or 0.5, since decaying the gradients to arbitrarily small values can harm performance. Putting it together, this amounts to sampling the tree structure and learning the tree leaf parameters with the (re)scaled gradients: 𝑠𝜏 ∼</p><formula xml:id="formula_11">𝑝(𝑠|𝛼𝜏 𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁 ′ ) and 𝜃 𝑠𝜏 * = −Φ𝑠 𝜏 (𝛼𝜏 𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁).</formula><p>Similar to Cyclical SG-MCMC <ref type="bibr" target="#b12">[13]</ref>, we define two stages within each cycle: (1) Exploration when the completed portion of a cycle 𝒜(𝜏 ) = mod (𝜏,𝐶) 𝐶 is smaller than a given threshold: 𝒜(𝜏 ) &lt; 𝜂, and (2) Sampling when 𝒜(𝜏 ) ≥ 𝜂, where 𝜂 ∈ (0, 1) balances the portion between exploration and sampling. 
We obtain one sample from the chain at the end of each cycle, i.e., the virtual ensemble is built by</p><formula xml:id="formula_12">{Θ (𝑘) } 𝐾=⌊ 𝒯 𝐶 ⌋ 𝑘=1 = {Θ ˆ𝐶𝑘−1 , 𝑘 = 1, . . . , ⌊ 𝒯 𝐶 ⌋}.</formula><p>Large gradients at the beginning of a cycle provide enough perturbation and encourage the model to escape from the current mode, while decreasing the gradient scale inside one cycle makes the sampler better characterize the density of the local mode. Moreover, many prior works in Bayesian NNs proposed to apply a certain form of preconditioning to compensate for sampling noise from mini-batch training <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b13">14]</ref>. Tree-based models can usually digest the full batch (full dataset 𝒟𝑁 ) per iteration by leveraging modern multi-core processors and multi-threading. Therefore, we directly use full-batch GB in all sampling stages, while leaving the option of random data subsampling in exploration stages to the users if training time is a concern.</p><p>Combining cyclical gradient scaling with SGLB, we expect that our new Cyclical SGLB (cSGLB) algorithm could inherit most (if not all) theoretical properties of the original SGLB algorithm. Conceptually, with a proper choice of 𝛼max, 𝛼min and cycle length 𝐶, the sample obtained at 𝜏 = 𝐶 − 1 from the Markov chain 𝑓 Θ ^𝜏 generated by Algo. 1 (w/o bootstrap) can be approximately seen as a random draw from the limiting distribution with small bounded errors. Also, each new cycle can be viewed as a warm restart from its previous cycle, and thus no errors shall be accumulated into the subsequent cycles (at sampling time 𝜏 = 𝐶𝑘 − 1). We leave rigorous analysis and proofs of these propositions to future work. Empirically, we show in our experiments that the cyclical gradient scaling achieves similar effects in exploring a multimodal distribution when compared with cSG-MCMC, which places a similar cyclical schedule on step size within the context of Bayesian DL. 
In fact, cSGLB extends the idea behind cSG-MCMC to tree-based GBDT models. We also summarize key differences between our design of cSGLB and cSG-MCMC in Appendix A.</p></div>
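Eqn. (7), the two-stage split, and the end-of-cycle sampling rule can be written directly in code; the sketch below (ours) uses the paper's notation 𝛼max, 𝛼min, 𝐶, and 𝜂, with illustrative values.

```python
import math

def gradient_scale(tau, C, alpha_max=10.0, alpha_min=1.0):
    """Cyclical cosine gradient scaler of Eqn. (7)."""
    return max(alpha_max / 2 * (math.cos(math.pi * (tau % C) / C) + 1), alpha_min)

def stage(tau, C, eta=0.8):
    """Fraction of the cycle completed, A(tau), decides exploration vs. sampling."""
    return "exploration" if (tau % C) / C < eta else "sampling"

C = 100
print(gradient_scale(0, C))        # cycle start: alpha_max = 10.0
print(gradient_scale(99, C))       # cycle end: decays down to alpha_min = 1.0
print(stage(10, C), stage(95, C))  # exploration sampling

# One sub-model is snapshotted at the end of each cycle, tau = C*k - 1
sample_taus = [C * k - 1 for k in range(1, 5)]
print(sample_taus)                 # [99, 199, 299, 399]
```

Note the floor at 𝛼min keeps late-cycle gradients from vanishing entirely, which is the design choice discussed after Eqn. (7).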
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Enhancing Sample Diversity via Bootstrapping</head><p>Recent work <ref type="bibr" target="#b15">[16]</ref> provided a compelling analysis showing that the Bayesian posterior is not optimal under model misspecification <ref type="foot" target="#foot_2">3</ref> , where the performance of the true posterior is dominated by an alternative non-Bayesian posterior that explicitly encourages diversity among ensemble member predictions. Inspired by these results, we propose a simple strategy that promotes diversity among samples obtained from cSGLB by data bootstrapping. At the beginning of each cycle, we randomly sample a Bernoulli mask of size 𝑁 , i.e., 𝜈 := {𝜈[𝑖] ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝 𝑏𝑚 )} 𝑁 𝑖=1 ∈ {0, 1} 𝑁 , where 𝑝 𝑏𝑚 ∈ (0, 1) defines the percentage of data being used. In the following exploration stage, we mask out gradients 𝑔 ˆ𝜏 by taking an element-wise product with 𝜈, i.e., 𝑔 ˆ𝜏 ⊙ 𝜈. The mask 𝜈 and the mask-out operation are used consistently throughout the exploration stage 𝒜(𝜏 ) &lt; 𝜂, and 𝜈 is only updated at the end of the cycle. This design amounts to learning with a bootstrapped subsample of the data in each cycle. Since the model consistently observes less data than the original 𝒟𝑁 , it also amounts to posterior tempering (𝑝(𝒟𝑁 |Θ)𝑝(Θ)) 1/𝑇 with some temperature 𝑇 &gt; 1, resulting in a warm posterior that is softer than the Bayesian posterior. By increasing the temperature 𝑇 , we expect to see increased density on the paths/corridors connecting different modes of the posterior <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>, further helping the sampler escape from the current local mode. By using a relatively large 𝜂 ∈ (0.8, 1), the tempering effects would carry over into the sampling stage. 
Therefore, the bootstrapping mechanism helps improve the sample diversity from cSGLB, and we name this variant Cyclical Bootstrapped SGLB (cbSGLB).</p><p>Lastly, we summarize our proposed cSGLB (plus bootstrap option) in Algo. 1 and highlight our modifications on top of SGLB in blue.</p></div>
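The per-cycle bootstrap amounts to two lines of NumPy; the sketch below (ours, with toy values for 𝑁 and 𝑝_bm) shows the mask draw and the element-wise mask-out 𝑔̂𝜏 ⊙ 𝜈 that is held fixed through an exploration stage.

```python
import numpy as np

rng = np.random.default_rng(42)

N, p_bm = 8, 0.5
# Drawn once at the start of a cycle: nu[i] ~ Bernoulli(p_bm), reused for
# every gradient update of that cycle's exploration stage
nu = rng.binomial(1, p_bm, size=N)

g = np.arange(1.0, N + 1.0)            # toy gradient vector g_tau
g_masked = g * nu                      # element-wise mask-out, g_tau (*) nu

print(nu)
print(g_masked)                        # zero wherever nu[i] = 0
```

Masked-out points contribute nothing to tree sampling or leaf estimation, so each cycle effectively trains on a different bootstrapped subsample of 𝒟𝑁.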
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experiments on Synthetic Data</head><p>We validate and qualitatively evaluate the proposed gradient scheduling and our cSGLB algorithm on two synthetic problems: (1) a synthetic multimodal dataset in <ref type="bibr" target="#b12">[13]</ref> and (2) a multi-class Spiral dataset in <ref type="bibr" target="#b11">[12]</ref>. Due to limited space, we include experimental details in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synthetic Multimodal Data</head><p>We first demonstrate the ability of cyclical gradient scaling to sample from a multimodal distribution on a 2D mixture of 25 Gaussians. Specifically, we compare (i) the original SG-MCMC with SGLD dynamics (denoted SGLD) and two SGLD variants: (ii) SGLD with a Cyclical schedule on the Learning Rate (denoted clr-SGLD) <ref type="bibr" target="#b12">[13]</ref> and (iii) SGLD with a Cyclical schedule on the Gradient scale (denoted cg-SGLD (ours)). We reproduced the results for SGLD and clr-SGLD in <ref type="bibr" target="#b12">[13]</ref> with code released by the authors, and built our cg-SGLD on top of it. In addition, we experimented with a "noisy" version of SGLD with a fixed lr and a 10× larger noise scale (denoted NoisySGLD/N-SGLD). For a fair comparison, each chain runs for 50𝑘 iterations and both clr-SGLD and cg-SGLD use 30 cycles. Fig. <ref type="figure" target="#fig_2">2</ref> shows the estimated density under the different sampling strategies. SGLD gets trapped in local modes depending on its random initial position, and increasing the noise scale does not solve the problem. In contrast, clr-SGLD and cg-SGLD can explore and locate roughly 7 − 8 different modes of the distribution, showing that our cg-SGLD matches the state of the art in exploring multimodal distributions. Moreover, cg-SGLD has implementation benefits over clr-SGLD when combined with SGLB. The SGLB algorithm is available in the CatBoost library <ref type="bibr" target="#b29">[30]</ref>, which only supports a fixed lr. All our proposed enhancements can be implemented with a "user-defined loss function" available in CatBoost without touching the source code, making it straightforward to reproduce our algorithms.</p></div>
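For concreteness, the cosine schedule on the gradient scale used by cg-SGLD (and later by cSGLB) can be written as a small function. This is a sketch in our notation, with the limits 𝛼max = 10 and 𝛼min = 1 taken from the appendix settings:

```python
import math

def gradient_scaler(tau, C, alpha_max=10.0, alpha_min=1.0):
    """alpha_tau = max(alpha_max/2 * (cos(pi * mod(tau, C)/C) + 1), alpha_min):
    the scale starts large at the beginning of each cycle (exploration) and
    decays toward alpha_min for the sampling stage."""
    frac = (tau % C) / C
    return max(alpha_max / 2.0 * (math.cos(math.pi * frac) + 1.0), alpha_min)
```

At iteration 𝜏 the per-example gradients are multiplied by this factor before the boosting update, so the step size itself can stay fixed, which is what makes the scheme compatible with CatBoost's fixed-lr SGLB.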
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multi-Class Spiral Data</head><p>After validating the efficacy of cyclical gradient scheduling for sampling from multimodal distributions, we are now ready to experiment with cSGLB. Specifically, we compare the following algorithms on a 3-class classification task called "Spiral" in <ref type="bibr" target="#b11">[12]</ref>: (i) a real SGLB ensemble of 𝐾 models, denoted ens𝐾, (ii) an SGLB virtual ensemble, simply denoted SGLB, and (iii) a cSGLB virtual ensemble, denoted cSGLB (ours). We again reproduced the results in <ref type="bibr" target="#b11">[12]</ref> with code released by the authors, and Fig. <ref type="figure" target="#fig_4">3</ref> shows the estimated KU on the Spiral test set. As noted in <ref type="bibr" target="#b11">[12]</ref>, knowledge uncertainty due to decision-boundary "jitter" exists in both ens20 and cSGLB, and the "jitter" affects cSGLB more, as its estimated KU is "noisy" at the decision boundary. Nevertheless, cSGLB (with only a single model) is significantly more efficient than ens20 and greatly improves upon SGLB in capturing high KU in regions with no training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Experiments on Real-World Weather Prediction Data</head><p>Lastly, we evaluate our proposed methods on the public Shifts Weather Prediction dataset <ref type="bibr" target="#b30">[31]</ref>. We select the classification task, where the ML model is asked to predict the precipitation class at a particular geolocation and timestamp, given heterogeneous tabular features derived from weather station measurements and forecast models. The full dataset is partitioned in a canonical fashion and contains in-domain (ID) training, development and evaluation datasets as well as out-of-domain (OOD) development and evaluation datasets. Importantly, the ID data and the OOD data are separated in time and consist of non-overlapping climate types (ID: Tropical, Dry, Mild; OOD: Snow, Polar), making the Shifts dataset an ideal testbed for gauging the robustness of ML models and the quality of uncertainty estimation. To further facilitate our experimentation, we conducted a series of data preprocessing steps, including random data sampling to keep 200𝐾 (medium-sized) samples in the final training set. Again, the purpose of our data preprocessing is to speed up experimentation, and we believe that the observations and findings in this study generalize to the original full dataset. For model building, 30 independent SGLB models (each of 1𝐾 trees) were trained and used to construct real ensembles 𝑒𝑛𝑠𝐾 for 𝐾 ∈ {3, 5, 10, 30}. SGLB/cSGLB/cbSGLB virtual ensembles were built by sampling 10 members from a single chain with 2𝐾 trees. Hence, 𝑒𝑛𝑠10 is 5× more expensive in computation and memory than a virtual ensemble. Additional details regarding our data and models are included in Appendix C. We compare the various methods on their predictive performance and on uncertainty quantification following <ref type="bibr" target="#b30">[31]</ref>, and the results are summarized in Table <ref type="table" target="#tab_1">1</ref>. 
For predictive performance, we report the classification accuracy and macro F1 using BMA on both the ID and the OOD evaluation datasets. We can see the following effects: (1) Virtual ensembles with a longer chain slightly outperform the real ensembles on the ID data. (2) Our proposed cSGLB and cbSGLB perform slightly worse than the other methods on the OOD data. However, this is usually not a concern in practice, since the model is not trained with data from OOD domains and would not be used to solve OOD prediction tasks in a practical scenario. As long as domain shifts can be reliably detected (via uncertainty), proactive decisions can be made to avoid costly mistakes due to model errors. (3) Our proposed data bootstrapping mechanism is capable of improving the performance on the OOD data (cbSGLB &gt; cSGLB). In addition, we include the F1-AUC metric (on the combined ID&amp;OOD evaluation sets) introduced in <ref type="bibr" target="#b30">[31]</ref> to jointly assess predictive power and uncertainty quality. The F1-AUC can be increased either by having a stronger predictive model or by improving the correlation between uncertainty and error. Consistent with the findings in <ref type="bibr" target="#b30">[31]</ref>, total uncertainty (TU) correlates more with errors than knowledge uncertainty (KU), as shown by the F1-AUC scores. More specifically, we see that the F1-AUC is quite similar across the board when measured by TU, although cSGLB/cbSGLB has slightly worse predictive power on the OOD segment. When F1-AUC is measured by KU, our cSGLB/cbSGLB produces KU estimates that relate more closely to model errors than the KU from the SGLB baseline.</p><p>Finally, we present the OOD detection ROC-AUC performance on the evaluation data using KU estimates. Our cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on the OOD detection task, and even achieves performance comparable to the real ensemble 𝑒𝑛𝑠10, which is 5× more expensive. 
This highlights that our cSGLB/cbSGLB can produce high-fidelity KU estimates to detect domain (or distributional) shifts with a single model, and that our proposed cyclical gradient scheduling is effective in exploring different modes of a posterior. In real-world industrial applications, detecting OOD data or domain shifts efficiently is often crucial for the safe deployment and operation of ML systems. Observing consistently high uncertainty (especially KU) in model predictions indicates that the patterns of new incoming data have deviated from the training distribution. This often provides a strong signal for a model refresh, ensuring that the ML system can be updated in time to avoid errors and operate safely in its "comfort zone" (with relatively low predictive uncertainty).</p></div>
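The uncertainty measures used in this evaluation follow the standard entropy-based decomposition: total uncertainty (TU) is the entropy of the ensemble-averaged prediction, and knowledge uncertainty (KU) is TU minus the average member entropy (the mutual information); OOD-detection ROC-AUC then ranks OOD against ID points by their KU scores. The NumPy sketch below illustrates these quantities; it is our minimal rendition of the standard decomposition, not the evaluation code from [31]:

```python
import numpy as np

def uncertainty_decomposition(probs):
    """probs: array of shape (members, samples, classes) with member predictions.
    TU = entropy of the mean prediction; DU = mean member entropy;
    KU = TU - DU (mutual information between prediction and member)."""
    mean_p = probs.mean(axis=0)
    tu = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)
    du = -(probs * np.log(probs + 1e-12)).sum(-1).mean(axis=0)
    return tu, tu - du

def roc_auc(scores_id, scores_ood):
    """Probability that a random OOD point scores higher than a random ID point
    (ties count half) -- equivalent to the ROC-AUC of the OOD detector."""
    s_id = np.asarray(scores_id, dtype=float)[:, None]
    s_ood = np.asarray(scores_ood, dtype=float)[None, :]
    return (s_ood > s_id).mean() + 0.5 * (s_ood == s_id).mean()
```

When ensemble members agree, KU collapses to zero even if each member is individually uncertain (high TU, high DU); KU grows only when members disagree, which is why it is the natural score for shift detection.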
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>We present cyclical gradient scheduling and Cyclical SGLB for efficiently and effectively quantifying uncertainty in gradient boosting with a single model, and propose a data bootstrapping scheme to enhance diversity in posterior samples. We show empirically that our algorithms outperform the state-of-the-art SGLB, especially in quantifying knowledge uncertainty and for OOD detection. Accurately quantifying uncertainty in ML predictions can yield many benefits in real-world applications. Uncertainty directly measures the confidence in model predictions, not only improving interpretability but also ensuring that costly mistakes can be avoided proactively. Consistently observing high uncertainty in predictions on an incoming data stream often provides a reliable signal of data distributional shifts and/or the model becoming obsolete. Uncertainty is also a key concept in optimal control and decision making, where it can be leveraged to experiment with different actions/decisions at controllable cost in search of the best operating point of an ML system. We believe that our work on efficient uncertainty quantification for GBDTs facilitates the adoption of uncertainty-enabled and uncertainty-aware ML systems, and more broadly promotes ML safety and the safe and trustworthy deployment of ML models in production.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Comparison between cSGLB and cSG-MCMC</head><p>The proposed Cyclical SGLB algorithm combines SGLB with cSG-MCMC <ref type="bibr" target="#b12">[13]</ref> to effectively explore different modes of a highly multimodal posterior distribution. In this section, we summarize some key differences between our design and the original cSG-MCMC algorithm.</p><p>(1) cSG-MCMC is a sampling algorithm designed for Bayesian NNs, while cSGLB is built for GBDT models. In deep learning, full-batch gradient descent is usually not feasible, and techniques have been developed to explicitly compensate for mini-batch noise, such as preconditioning <ref type="bibr" target="#b13">[14]</ref>. Some also suggested applying an additional Metropolis-Hastings correction step <ref type="bibr" target="#b14">[15]</ref>. Tree-based GB models can easily scale up to large industrial datasets and digest the full training set at each iteration. Therefore, our cSGLB uses full-batch GB in the sampling stage of each cycle to ensure that high-quality samples are generated. (2) cSGLB puts a cyclical schedule on the gradient scale, while cSG-MCMC puts a schedule on the step size. In addition, the original cSG-MCMC completely removes the injected Gaussian noise in the exploration stage, reducing to regular stochastic gradient descent (SGD) during that period. Although the authors claimed that this amounts to posterior tempering, which is commonly used in the DL domain, the cSG-MCMC implementation does not closely follow the dynamics of SGLD during the exploration stage. In contrast, we keep the injected noise term unchanged throughout learning. Our design achieves similar effects to step-size scheduling in a synthetic experiment, and we also ensure that cSGLB follows the dynamics of SGLD (more or less) at every iteration step. 
(3) Lastly, the gradient scaling (instead of step-size scheduling) has implementation benefits. The SGLB algorithm is made available in the CatBoost library <ref type="bibr" target="#b29">[30]</ref>, which only supports a constant step size (or learning rate). Our proposed cyclical gradient scaling (and data bootstrapping) can be implemented easily with the "user-defined loss function" available in the CatBoost package, without modifying a single line of the source code. </p></div>
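As an illustration of point (3), a cyclically scaled logloss objective could look roughly like the sketch below. It follows CatBoost's documented user-defined objective interface (an object exposing `calc_ders_range`, which returns per-example first and second derivatives), but the class name, the internal iteration counter, and the maximization sign convention shown here are our assumptions; consult the CatBoost documentation before relying on them:

```python
import math

class CyclicalScaledLogloss:
    """Sketch of a user-defined CatBoost objective that rescales the first
    derivatives of binary logloss by the cyclical factor alpha_tau.
    Hypothetical illustration, not the authors' released implementation."""

    def __init__(self, C=200, alpha_max=10.0, alpha_min=1.0):
        self.C, self.alpha_max, self.alpha_min = C, alpha_max, alpha_min
        self.tau = 0  # assumption: calc_ders_range is called once per boosting iteration

    def _alpha(self):
        frac = (self.tau % self.C) / self.C
        return max(self.alpha_max / 2.0 * (math.cos(math.pi * frac) + 1.0),
                   self.alpha_min)

    def calc_ders_range(self, approxes, targets, weights):
        a = self._alpha()
        self.tau += 1
        ders = []
        for x, t in zip(approxes, targets):
            p = 1.0 / (1.0 + math.exp(-x))  # sigmoid of the raw approximation
            # logloss derivatives (maximization convention); only the first
            # derivative is scaled by the cyclical factor.
            ders.append((a * (t - p), -p * (1.0 - p)))
        return ders
```

An instance would then be passed to a CatBoost model as its loss function (CPU training only for Python-defined objectives), leaving the library's fixed learning rate untouched.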
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Synthetic Data</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Synthetic Multimodal Distribution</head><p>The ground-truth density is a uniform mixture of 25 Gaussians, 𝐹 (𝑥) = (1/25) ∑ 𝑖=1..25 𝒩 (𝑥|𝜇𝑖, Σ), where the means 𝜇𝑖 lie on the grid {−4, −2, 0, 2, 4} × {−4, −2, 0, 2, 4}. We used the code released by the authors <ref type="bibr" target="#b12">[13]</ref> to generate our results. Specifically, SGLD was trained with a decaying lr 𝜖𝜏 = 0.05(𝜏 + 1) −0.55 , and clr-SGLD was learned with a cyclical lr schedule with initial value 𝜖0 = 0.09 and exploration proportion 𝜂 = 0.25. For our cg-SGLD, we fixed the lr 𝜖 = 0.01 and the Gaussian noise scale to 0.4, and set 𝛼max = 10, 𝛼min = 1. The "noisy" version of SGLD (NoisySGLD/N-SGLD) was trained with a fixed lr 𝜖 = 0.02 and noise scale 5.0 (roughly 10× larger than the noise scale used in the other methods). Each chain was trained for 50𝑘 iterations, and both clr-SGLD and cg-SGLD used 30 cycles. The results and findings are robust to random seeds, and similar results were observed with different seeds. We refer interested readers to the original paper <ref type="bibr" target="#b12">[13]</ref> for results of SGLD and clr-SGLD in parallel (or multi-chain) settings.</p></div>
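Under the settings above, a single cg-SGLD chain on this mixture can be sketched as follows. This is a NumPy illustration, not the authors' released code: we use an isotropic covariance σ²I (σ = 1) as a stand-in for Σ, which is not restated here, and treat the "noise scale" as a multiplier on the √𝜖 Gaussian injection:

```python
import numpy as np

def grad_log_density(x, mus, sigma=1.0):
    """Gradient of log F(x) for a uniform mixture of Gaussians N(mu_i, sigma^2 I)."""
    diffs = mus - x                                  # (25, 2)
    logw = -0.5 * (diffs ** 2).sum(axis=1) / sigma ** 2
    w = np.exp(logw - logw.max())                    # stable responsibilities
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / sigma ** 2

def sgld_chain(steps=50_000, eps=0.01, noise=0.4,
               C=None, a_max=10.0, a_min=1.0, seed=0):
    """SGLD with optional cyclical gradient scaling (cg-SGLD when C is set).
    Hyperparameters mirror the appendix settings; the dynamics are a sketch."""
    rng = np.random.default_rng(seed)
    grid = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
    mus = np.array([[i, j] for i in grid for j in grid])
    x = rng.normal(size=2)
    samples = []
    for tau in range(steps):
        a = 1.0
        if C is not None:  # cyclical gradient scale, as in cg-SGLD
            a = max(a_max / 2 * (np.cos(np.pi * (tau % C) / C) + 1), a_min)
        x = (x + eps * a * grad_log_density(x, mus)
               + noise * np.sqrt(eps) * rng.normal(size=2))
        samples.append(x.copy())
    return np.array(samples)
```

Passing `C=steps // 30` reproduces the 30-cycle setup; with `C=None` the chain falls back to plain SGLD with a fixed learning rate.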
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Synthetic Spiral Dataset</head><p>All experiments were conducted using CatBoost <ref type="bibr" target="#b29">[30]</ref>, one of the state-of-the-art libraries for GBDTs. The ensemble of SGLB (ens20) contains 20 independent models (with different random seeds) of 1K trees each. The learning rate is 𝜖 = 0.1, the tree depth is 6, random_strength = 100, and border_count = 128. The SGLB virtual ensemble and cSGLB virtual ensemble are trained with the same parameters, except that we increase the number of trees to 2K and lower the lr for cSGLB to 𝜖 = 0.05. Thus, a virtual ensemble is 10× more efficient in computation and memory than the actual SGLB ensemble.</p><p>For the SGLB virtual ensemble, every 50th model from the interval [1000, 2000] is added to the ensemble, making it a virtual ensemble of 20 members. For the cSGLB virtual ensemble, we set 𝜖 = 0.05, cycle length 𝐶 = 200, 𝛼max = 10, 𝛼min = 1, making it a virtual ensemble of 2000/200 = 10 members. For cbSGLB with bootstrapping, we additionally set the exploration proportion 𝜂 = 0.9 and mask probability 𝑝 𝑏𝑚 = 0.66.</p></div>
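The setup above maps onto CatBoost's public API roughly as the configuration sketch below. We assume a CatBoost version that supports `posterior_sampling` (SGLB dynamics) and the `VirtEnsembles` prediction type; `X_train`/`y_train`/`X_test` are placeholders:

```python
# Configuration sketch (not executed here); check parameter names against
# the CatBoost documentation for your installed version.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    random_strength=100,
    border_count=128,
    posterior_sampling=True,  # enables SGLB-style Langevin boosting
)
# model.fit(X_train, y_train)
# Per-member probabilities from a single trained chain (a "virtual ensemble"):
# probs = model.predict(X_test, prediction_type="VirtEnsembles",
#                       virtual_ensembles_count=10)
```

The cyclical gradient scaling and the bootstrap mask would then be layered on top via a user-defined loss function, as discussed in Appendix A, without modifying CatBoost itself.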
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Real-World Shifts Data</head><p>Data Summary A detailed summary of our final partitioning of the Weather Prediction dataset is included in Table <ref type="table" target="#tab_2">2</ref>.</p><p>Experimental Details We used the default parameter settings suggested for SGLB models in the original uncertainty-quantification paper <ref type="bibr" target="#b11">[12]</ref>, except that we set the subsample rate to 0.8 for stochastic gradient boosting. The real SGLB ensemble consists of (up to) 30 SGLB models trained with different seeds, each of 1K trees. In order to get more samples from a single chain, the virtual ensembles of SGLB and our cSGLB/cbSGLB were learned with a single model of 2K trees. We set the learning rate for all models to 𝜖 = 0.05, and the tree depth to 6. For the SGLB virtual ensemble, every 100th model from the interval [1000, 2000] was added to the ensemble, making it a virtual ensemble of 10 members. cSGLB and cbSGLB shared the same parameters as their SGLB counterpart. In addition, for the cSGLB/cbSGLB virtual ensembles, we set cycle length 𝐶 = 200, 𝛼max = 10.0, 𝛼min = 1.0, making each a virtual ensemble of 2000/200 = 10 members. For simplicity, cSGLB used full-batch gradient boosting at each iteration step. In contrast, for cbSGLB with bootstrapping, we set the exploration proportion 𝜂 = 0.8, i.e., 80% of a cycle was treated as exploration, and set the mask probability 𝑝 𝑏𝑚 = 0.6 in the exploration stage. For model and parameter selection, we only used the in-domain (ID) development set and did not use the out-of-domain (OOD) development set. 
Although this may potentially lower our reported performance on the OOD evaluation set, we believe that it better reflects a real-world learning scenario, where shifted data is often unobserved and unavailable at training time.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Illustration of proposed cyclical schedule on gradient scales for SGLB algorithm.</figDesc><graphic coords="2,109.63,84.19,162.68,112.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Sampling from a 5 by 5 mixture of 25 Gaussians.</figDesc><graphic coords="7,89.29,222.43,99.65,67.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Spiral dataset and estimated knowledge uncertainties. Each different color in (a) represents a different class.</figDesc><graphic coords="7,89.29,307.71,99.65,66.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Algorithm 1: Cyclical (Bootstrapped) SGLB</head><label></label><figDesc>Reconstructed listing of Algorithm 1 (Cyclical (Bootstrapped) SGLB).</figDesc><table><row><cell>Input: dataset 𝒟𝑁 , learning rate 𝜖 &gt; 0, inverse temperature 𝛽 &gt; 0, regularization 𝛾 &gt; 0, number of iterations 𝒯 &gt; 0, cycle length 𝐶 &gt; 1, scaler limits 𝛼max, 𝛼min &gt; 0, stage threshold 𝜂 &gt; 0, mask probability 𝑝 𝑏𝑚 &gt; 0, boolean indicator 𝑏𝑜𝑜𝑡𝑠𝑡𝑟𝑎𝑝</cell></row><row><cell>Initialize 𝑓 Θ ^0 (•) = 0, 𝜈 = 1𝑁 ∈ R 𝑁 (all-ones vector)</cell></row><row><cell>for 𝜏 in [0, 1, . . . , 𝒯 − 1] do</cell></row><row><cell>  if 𝑏𝑜𝑜𝑡𝑠𝑡𝑟𝑎𝑝 then: if mod(𝜏, 𝐶)/𝐶 = 0, sample 𝜈 ∈ R 𝑁 with 𝜈[𝑖] ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝 𝑏𝑚 ); if mod(𝜏, 𝐶)/𝐶 ≥ 𝜂, set 𝜈 = 1𝑁</cell></row><row><cell>  Compute gradient scaler: 𝛼𝜏 = max(𝛼max/2 [cos(𝜋 mod(𝜏, 𝐶)/𝐶) + 1], 𝛼min)</cell></row><row><cell>  Estimate gradients 𝑔 ˆ𝜏 using 𝑓 Θ ^𝜏 (•) and 𝒟𝑁 : 𝑔 ˆ𝜏 = (𝜕/𝜕𝑓 𝐿(𝑓 Θ ^𝜏 (𝑥𝑖), 𝑦𝑖)) 𝑁 𝑖=1 ∈ R 𝑁</cell></row><row><cell>  Sample noise 𝜁, 𝜁 ′ ∼ 𝒩 (0𝑁 , 𝐼𝑁 )</cell></row><row><cell>  Sample tree structure: 𝑠𝜏 ∼ 𝑝(𝑠 | 𝛼𝜏 (𝑔 ˆ𝜏 ⊙ 𝜈) + √(2𝑁/(𝜖𝛽)) 𝜁 ′)</cell></row><row><cell>  Estimate leaf/parameter values: 𝜃 𝑠𝜏 * = −Φ𝑠𝜏 (𝛼𝜏 (𝑔 ˆ𝜏 ⊙ 𝜈) + √(2𝑁/(𝜖𝛽)) 𝜁)</cell></row><row><cell>  Update GBDT model: 𝑓 Θ ^𝜏+1 (•) = (1 − 𝛾𝜖)𝑓 Θ ^𝜏 (•) + 𝜖 ℎ𝑠𝜏 (•, 𝜃 𝑠𝜏 *)</cell></row><row><cell>end</cell></row><row><cell>Return: 𝑓 Θ ^𝒯 (•)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Comparing predictive performance &amp; uncertainty measures of various methods on Shifts Weather data. Mean ± Std Dev over 3 seeds. The best performance among virtual ensembles is highlighted in bold.</figDesc><table><row><cell>Metric</cell><cell>Data</cell><cell>ens3</cell><cell>ens5</cell><cell>ens10</cell><cell>ens30</cell><cell>SGLB</cell><cell>cSGLB</cell><cell>cbSGLB</cell></row><row><cell>Accuracy (%)↑</cell><cell>eval-ID eval-OOD</cell><cell cols="3">65.24 ± 0.02 65.25 ± 0.02 65.26 ± 0.01 52.51 ± 0.06 52.55 ± 0.06 52.55 ± 0.03</cell><cell>65.26 52.54</cell><cell>65.63 ± 0.01 52.50 ± 0.14</cell><cell>65.93 ± 0.01 50.72 ± 0.17</cell><cell>65.59 ± 0.01 51.49 ± 0.11</cell></row><row><cell>Macro F1 (%)↑</cell><cell>eval-ID eval-OOD</cell><cell cols="3">63.28 ± 0.02 63.29 ± 0.02 63.30 ± 0.01 52.36 ± 0.07 52.39 ± 0.06 52.38 ± 0.03</cell><cell>63.30 52.38</cell><cell>63.69 ± 0.01 52.36 ± 0.13</cell><cell>64.06 ± 0.02 50.64 ± 0.16</cell><cell>63.67 ± 0.01 51.41 ± 0.15</cell></row><row><cell>F1-AUC (%)↑</cell><cell>TU KU</cell><cell cols="3">57.18 ± 0.01 57.19 ± 0.01 57.20 ± 0.01 52.94 ± 0.03 53.71 ± 0.03 54.33 ± 0.02</cell><cell>57.20 54.83</cell><cell>57.26 ± 0.01 52.55 ± 0.08</cell><cell cols="2">56.82 ± 0.03 53.60 ± 0.08 53.31 ± 0.14 56.95 ± 0.04</cell></row><row><cell>OOD-AUC (%)↑</cell><cell>KU</cell><cell cols="3">65.72 ± 0.31 68.60 ± 0.21 71.15 ± 0.16</cell><cell>73.26</cell><cell>67.32 ± 0.70</cell><cell cols="2">71.45 ± 0.67 71.70 ± 0.91</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Data summary of our partitioning of the Weather Prediction dataset.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2"># of samples</cell><cell></cell><cell></cell><cell></cell><cell>% of class</cell><cell></cell></row><row><cell></cell><cell>Total</cell><cell>Tropical</cell><cell>Dry</cell><cell>Mild</cell><cell>Snow</cell><cell>Polar</cell><cell cols="3">Class 0 Class 10 Class 20</cell></row><row><cell>train-ID</cell><cell>200,000</cell><cell>26,786</cell><cell>45,357</cell><cell>127,857</cell><cell>0</cell><cell>0</cell><cell>40.92</cell><cell>33.71</cell><cell>25.37</cell></row><row><cell>dev-ID</cell><cell>46,279</cell><cell>6,196</cell><cell>10,505</cell><cell>29,578</cell><cell>0</cell><cell>0</cell><cell>40.90</cell><cell>33.97</cell><cell>25.13</cell></row><row><cell>dev-OOD</cell><cell>46,555</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>46,555</cell><cell>0</cell><cell>38.35</cell><cell>35.07</cell><cell>26.58</cell></row><row><cell>eval-ID</cell><cell>518,587</cell><cell>69,245</cell><cell cols="2">117,981 331,361</cell><cell>0</cell><cell>0</cell><cell>40.98</cell><cell>33.71</cell><cell>25.30</cell></row><row><cell>eval-OOD</cell><cell>524,048</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell cols="2">479,952 44,096</cell><cell>33.05</cell><cell>35.70</cell><cell>31.24</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">KU is also named epistemic uncertainty and DU is also called aleatoric uncertainty.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See paper<ref type="bibr" target="#b11">[12]</ref> for equations computing KU and DU in regression tasks.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The function class in-use does not contain the unknown groundtruth function.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The accuracy, fairness, and limits of predicting recidivism</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dressel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Farid</surname></persName>
		</author>
		<idno type="DOI">10.1126/sciadv.aao5580</idno>
		<ptr target="https://www.science.org/doi/abs/10.1126/sciadv.aao5580" />
	</analytic>
	<monogr>
		<title level="j">Science Advances</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">5580</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brynjolfsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Calo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hirschberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kalyanakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kraus</surname></persName>
		</author>
		<title level="m">Artificial intelligence and life in 2030, One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">52</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Serban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Poll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Visser</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2008.03046</idno>
		<ptr target="https://arxiv.org/abs/2008.03046" />
		<title level="m">Towards using probabilistic models to design software systems with inherent uncertainty</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">When does uncertainty matter?: Understanding the impact of predictive uncertainty in ML assisted decision making</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mcgrath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zytek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lakkaraju</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2011.06167</idno>
		<ptr target="https://arxiv.org/abs/2011.06167" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Tabular data: Deep learning is not all you need</title>
		<author>
			<persName><forename type="first">R</forename><surname>Shwartz-Ziv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Armon</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2106.03253</idno>
		<ptr target="https://arxiv.org/abs/2106.03253" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Greedy function approximation: a gradient boosting machine</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annals of statistics</title>
		<imprint>
			<biblScope unit="page" from="1189" to="1232" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Stochastic gradient boosting</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational statistics &amp; data analysis</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="367" to="378" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Uncertainty estimation in deep learning with application to spoken language assessment</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
		<respStmt>
			<orgName>University of Cambridge</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Tree induction for probabilitybased ranking</title>
		<author>
			<persName><forename type="first">F</forename><surname>Provost</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine learning</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="199" to="215" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Calibrating probability estimation trees using venn-abers predictors</title>
		<author>
			<persName><forename type="first">U</forename><surname>Johansson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Löfström</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Boström</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 SIAM International Conference on Data Mining</title>
				<meeting>the 2019 SIAM International Conference on Data Mining</meeting>
		<imprint>
			<publisher>SIAM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="28" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sglb: Stochastic gradient langevin boosting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ustimenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="10487" to="10496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ustimenko</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.10562</idno>
		<title level="m">Uncertainty in gradient boosting via ensembles</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1902.03932</idno>
		<title level="m">Cyclical stochastic gradient MCMC for Bayesian deep learning</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Wenzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Veeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Świątkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jenatton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nowozin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2002.02405</idno>
		<title level="m">How good is the Bayes posterior in deep neural networks really?</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">What are Bayesian neural network posteriors really like?</title>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vikram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G G</forename><surname>Wilson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4629" to="4640" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning under model misspecification: Applications to variational and ensemble methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Masegosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="5479" to="5491" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Bayesian learning via stochastic gradient Langevin dynamics</title>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Teh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Machine Learning (ICML-11)</title>
				<meeting>the 28th International Conference on Machine Learning (ICML-11)</meeting>
		<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="681" to="688" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Stochastic gradient Hamiltonian Monte Carlo</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1683" to="1691" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Bayesian sampling using stochastic gradient thermostats</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Babbush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">D</forename><surname>Skeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Neven</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Preconditioned stochastic gradient Langevin dynamics for deep neural networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Carin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirtieth AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Consistency and fluctuations for stochastic gradient Langevin dynamics</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Teh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Thiery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Vollmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="193" to="225" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bayesian deep learning and a probabilistic perspective of generalization</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="4697" to="4708" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Ashukha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lyzhov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Molchanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vetrov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2002.06470</idno>
		<title level="m">Pitfalls of in-domain uncertainty estimation and ensembling in deep learning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">NGBoost: Natural gradient boosting for probabilistic prediction</title>
		<author>
			<persName><forename type="first">T</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Thai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schuler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2690" to="2700" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Settles</surname></persName>
		</author>
		<title level="m">Active learning literature survey</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Parameterized indexed value function for efficient exploration in reinforcement learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">R</forename><surname>Dwaracherla</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="5948" to="5955" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Depeweg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Hernández-Lobato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Doshi-Velez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Udluft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">stat</title>
		<imprint>
			<biblScope unit="volume">1050</biblScope>
			<biblScope unit="page">11</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Loss surfaces, mode connectivity, and fast ensembling of DNNs</title>
		<author>
			<persName><forename type="first">T</forename><surname>Garipov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Podoprikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Vetrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Essentially no barriers in neural network energy landscape</title>
		<author>
			<persName><forename type="first">F</forename><surname>Draxler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Veschgini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salmhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hamprecht</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1309" to="1318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Dorogush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ershov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gulin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.11363</idno>
		<title level="m">CatBoost: Gradient boosting with categorical features support</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Band</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chesnokov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Gales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Noskov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ploskonosov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Provilkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raina</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.07455</idno>
		<title level="m">Shifts: A dataset of real distributional shift across multiple large-scale tasks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
