<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tian</forename><surname>Tan</surname></persName>
							<email>tianta@amazon.com</email>
							<affiliation key="aff0">
<orgName type="department">Buyer Risk Prevention - ML</orgName>
								<orgName type="institution">WW Customer Trust, Amazon</orgName>
								<address>
									<postCode>98109</postCode>
									<settlement>Seattle</settlement>
									<region>WA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carlos</forename><surname>Huertas</surname></persName>
							<email>carlohue@amazon.com</email>
							<affiliation key="aff0">
<orgName type="department">Buyer Risk Prevention - ML</orgName>
								<orgName type="institution">WW Customer Trust, Amazon</orgName>
								<address>
									<postCode>98109</postCode>
									<settlement>Seattle</settlement>
									<region>WA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Qi</forename><surname>Zhao</surname></persName>
							<email>qqzhao@amazon.com</email>
							<affiliation key="aff0">
<orgName type="department">Buyer Risk Prevention - ML</orgName>
								<orgName type="institution">WW Customer Trust, Amazon</orgName>
								<address>
									<postCode>98109</postCode>
									<settlement>Seattle</settlement>
									<region>WA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<address>
									<addrLine>February 13 -14</addrLine>
									<postCode>2023</postCode>
									<settlement>Washington</settlement>
									<region>D.C</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Efficient and Effective Uncertainty Quantification in Gradient Boosting via Cyclical Gradient MCMC</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A13C96CAB30BA94A6E1DA9A82D6A4254</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T06:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>uncertainty quantification</term>
					<term>gradient boosting decision trees</term>
					<term>Bayesian inference</term>
					<term>out-of-domain (OOD) detection</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Gradient boosting decision trees (GBDTs) are widely applied to tabular data in real-world ML systems. Quantifying uncertainty in GBDT models is thus essential for decision making and for avoiding costly mistakes, ensuring an interpretable and safe deployment of tree-based models. Recently, Bayesian ensembles of GBDT models have been used to measure uncertainty by leveraging an algorithm called stochastic gradient Langevin boosting (SGLB), which combines GB with stochastic gradient MCMC (SG-MCMC). Although theoretically sound, SGLB easily gets trapped in a particular mode of the Bayesian posterior, just like other forms of SG-MCMC. Therefore, a single SGLB model can often fail to produce high-fidelity uncertainty estimates. To address this problem, we present Cyclical SGLB (cSGLB), which incorporates a cyclical gradient schedule into the SGLB algorithm. The cyclical gradient mechanism promotes new mode discovery and helps explore highly multimodal posterior distributions. As a result, cSGLB can efficiently quantify uncertainty in GB with only a single model. In addition, we present another cSGLB variant with data bootstrapping to further encourage diversity among posterior samples. We conduct extensive experiments to demonstrate the efficiency and effectiveness of our algorithm, and show that it outperforms the state-of-the-art SGLB on uncertainty quantification, especially when uncertainty is used for detecting out-of-domain (OOD) data or distributional shifts.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>With the rapid growth of data and computing power, machine learning (ML) has been gaining new applications in areas not imagined before. As ML systems become more ubiquitous, it is inevitable to see applications in very sensitive and high-risk fields. These span numerous areas such as criminal recidivism <ref type="bibr" target="#b0">[1]</ref>, medical follow-ups <ref type="bibr" target="#b1">[2]</ref> and autonomous systems <ref type="bibr" target="#b2">[3]</ref>. While these systems may be very different, they share a common need: a certain degree of confidence in ML predictions. A proven way to build confidence in critical systems is uncertainty estimation. Research has shown that humans are more likely to agree with a system if they get access to the corresponding uncertainty, and this holds regardless of shape and variance as the approach itself is model and task agnostic <ref type="bibr" target="#b3">[4]</ref>. Since the most common data type in real-world ML applications is tabular <ref type="bibr" target="#b4">[5]</ref>, our work in this paper focuses specifically on uncertainty quantification for state-of-the-art gradient boosting decision trees (GBDTs) <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>, which are known to outperform deep learning (DL) methods on tabular data, both in accuracy and tuning requirements <ref type="bibr" target="#b4">[5]</ref>. Measuring uncertainty effectively and efficiently on GBDT predictions can therefore not only improve model interpretability in production but also ensure a safer deployment of ML systems, especially for high-risk applications.</p><p>Uncertainty quantification (UQ) has been widely studied for neural networks under the Bayesian framework <ref type="bibr" target="#b7">[8]</ref>; however, it is relatively under-explored for tree-based models. 
Although calibrated probability estimation trees <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref> can be used for UQ, they have not been studied from a Bayesian perspective. Recently, Bayesian ensemble methods were extended to measure uncertainty in GBDTs by leveraging a new algorithm called stochastic gradient Langevin boosting (SGLB) <ref type="bibr" target="#b10">[11]</ref>. Specifically, two SGLB-based approaches were introduced for UQ <ref type="bibr" target="#b11">[12]</ref>: (1) SGLB ensemble, which trains multiple SGLB models in parallel, and (2) SGLB virtual ensemble, which constructs a virtual ensemble using only a single SGLB model where each member in the ensemble is a "truncated" sub-model <ref type="bibr" target="#b11">[12]</ref>. Although both approaches are theoretically sound, there is clearly a trade-off between quality and efficiency in practice. The SGLB (real) ensemble is believed to be accurate as it can characterize the Bayesian posterior well by running independent models in parallel. However, it is almost infeasible to deploy such an ensemble in real-world production due to its high computational and maintenance costs. The SGLB virtual ensemble greatly improves efficiency; however, it often gets stuck in a single mode of the Bayesian posterior and can produce degraded uncertainty estimates <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>. 
To better balance between quality and efficiency and to facilitate the usage of uncertainty-enabled ML systems, an important question remains: how can we make a single SGLB explore effectively different modes of a posterior given a limited computational budget?</p><p>In this paper, we address the question above by combining SGLB virtual ensemble with advanced sampling techniques from Bayesian DL <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. Inspired by the ideas in <ref type="bibr" target="#b12">[13]</ref>, we propose to use a scaler (or scaling factor) on gradients that follows a cyclical schedule during the course of SGLB training. The cyclical schedule is illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, and consequently, we name the resulting algorithm Cyclical SGLB (cSGLB). Similar to <ref type="bibr" target="#b12">[13]</ref>, each cycle in cSGLB contains two stages: (1) Exploration: when the scaler is large, we treat this stage as a warm restart from the previous cycle, enabling the model/sampler to follow the gradients closely and to escape from the current local mode. (2) Sampling: when the gradient scaler is small, the scale of injected Gaussian noise in the SGLB procedure becomes relatively large, encouraging the sampler to fully characterize one local mode. We collect one sample (or truncated sub-model) to build the virtual ensemble at the end of each cycle. The cyclical gradient schedule therefore helps cSGLB effectively explore different modes of a posterior while maintaining the same level of efficiency of a virtual ensemble. Moreover, inspired by a recent study <ref type="bibr" target="#b15">[16]</ref> showing that "diversified" posterior may provide a tighter generalization bound, we present another simple approach to encourage diversity in samples obtained from running cSGLB via data bootstrapping. 
We name this variant Cyclical Bootstrapped SGLB (cbSGLB).</p><p>We extensively experiment with our proposed algorithms and compare the performance against SGLB ensemble and the original SGLB virtual ensemble. Particularly, we show that our cyclical gradient schedule can help explore multimodal distributions effectively, that cSGLB is capable of producing uncertainty estimates better aligned with the SGLB real ensemble, and that cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on out-of-domain (OOD) data detection, indicating superior performance in detecting distributional/domain shifts in real-world tabular data streams.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Bayesian ML and approximate Bayesian inference provide a principled representation of uncertainty. One popular family of approaches to inference in Bayesian ML is stochastic gradient Markov Chain Monte Carlo (SG-MCMC) <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b12">13]</ref>, which is used to effectively sample models (or model parameters) from the Bayesian posterior. Uncertainty then comes naturally by measuring the "discrepancy" in predictions from the sampled models, which are regarded as posterior samples. Recently, stochastic gradient Langevin boosting (SGLB) <ref type="bibr" target="#b10">[11]</ref> was proposed by combining gradient boosting with SG-MCMC. As its name suggests, the Markov chain generated by SGLB obeys a special form of the stochastic gradient Langevin dynamics (SGLD) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b16">17]</ref>, which implies that SGLB is able to generate samples from the true Bayesian posterior asymptotically. Leveraging this property, Malinin et al. <ref type="bibr" target="#b11">[12]</ref> proposed to use (1) SGLB ensemble or (2) SGLB virtual ensemble to measure uncertainty in GBDTs. Essentially, SGLB ensemble corresponds to running multiple SG-MCMCs in parallel, where each chain (or SGLB) is initialized independently with a different random seed. Since SGLB allows us to sample from the true posterior, the ensemble with multiple samples gives a high-fidelity approximation to the Bayesian posterior. In contrast, SGLB virtual ensemble only trains a single SGLB model and uses multiple truncated sub-models to form a (virtual) ensemble. 
The key idea is essentially extracting multiple samples from a single-chain SG-MCMC instead of running multiple chains in parallel.</p><p>In theory, SGLB or single-chain SG-MCMC converges asymptotically to the target distribution and should behave similarly to the multi-chain SGLB ensemble in the limit, but it can suffer from a bounded estimation error in limited time <ref type="bibr" target="#b20">[21]</ref>. Moreover, it is often believed that the posterior is highly multimodal in the parametric space of modern ML models <ref type="bibr" target="#b12">[13]</ref>, since there are potentially many different sets of parameters that can describe the training data equally well. The real ensemble can explore different modes of the posterior by running independent chains in parallel, providing a complete picture of the distribution as the number of chains increases. However, a single-chain SG-MCMC often gets stuck in a single mode of the posterior <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, failing to cover the full spectrum of the distribution.</p><p>In this paper, we extend the ideas behind Cyclical SG-MCMC (cSG-MCMC) in DL <ref type="bibr" target="#b12">[13]</ref> to sampling from a tree-based SGLB model, which promotes new mode discovery during training. Unlike cSG-MCMC, which places a cyclical schedule on the step size, we propose to use a cyclical schedule on the gradient scale. We also point out and justify the difference and our design choice in Appendix A. In addition, we propose a simple strategy to further encourage diversity in samples obtained from a single chain by data bootstrapping. At the beginning of each cycle (see Fig. <ref type="figure" target="#fig_0">1</ref>), we construct a bootstrapped dataset that is a random subset of the training data, and use that bootstrapped data consistently during the exploration stage to update the GBDT model. 
The "bias" induced by data bootstrapping also amounts to posterior tempering <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b12">13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">General Setup</head><p>Given a set of 𝑁 training data points sampled from an unknown distribution 𝒟 on 𝒳 × 𝒴, i.e., (𝑥1, 𝑦1), . . . , (𝑥𝑁 , 𝑦𝑁 ) ∼ 𝒟 denoted as 𝒟𝑁 , and a loss function 𝐿(𝑧, 𝑦) : 𝒵 × 𝒴 → R where 𝒵 denotes the space of predictions, our goal is to minimize the empirical loss ℒ(𝑓 |𝒟𝑁 ) := 1 𝑁 ∑︀ 𝑁 𝑖=1 𝐿(𝑓 (𝑥𝑖), 𝑦𝑖) over functions 𝑓 belonging to some family ℱ ⊂ {𝑓 : 𝒳 → 𝒵}. In this paper, we only consider ℱ corresponding to additive ensembles of decision trees ℋ := {ℎ 𝑠 (𝑥, 𝜃 𝑠 ) : 𝒳 × R 𝑚𝑠 → R, 𝑠 ∈ 𝑆}, where 𝑆 is an index set and ℎ 𝑠 has parameters 𝜃 𝑠 . Decision trees are built by recursively partitioning the feature space into disjoint regions (called leaves). Each region is assigned a value that is used to estimate the response of 𝑦 in the corresponding feature subspace. Let's denote these regions by 𝑅𝑗's, then we have ℎ(𝑥, 𝜃) = ∑︀ 𝑗 𝜃𝑗1{𝑥 ∈ 𝑅𝑗}, where 1{•} denotes the indicator function. Therefore, given the tree structure, decision tree ℎ 𝑠 is a linear function of its parameters 𝜃 𝑠 . It is often assumed that the set 𝑆 is finite because the training data is finite <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>, e.g., there exists only a finite number of ways to partition the training data. Owing to the linear dependence of ℎ 𝑠 on 𝜃 𝑠 and the finiteness assumption on 𝑆, we can represent any ensemble of models from ℋ as a linear model 𝑓Θ(𝑥) = 𝜑(𝑥) 𝑇 Θ for some feature map 𝜑(𝑥) : 𝒳 → R 𝑚 and Θ ∈ R 𝑚 denotes the parameters of the entire ensemble <ref type="bibr" target="#b10">[11]</ref>. Hence, in the subsequent discussion, we will simply denote the parameters of the GBDT model obtained at iteration 𝜏 as Θ ˆ𝜏 , and additionally define a linear mapping 𝐻𝑠 : R 𝑚𝑠 → R 𝑁 that converts 𝜃 𝑠 to predictions (ℎ 𝑠 (𝑥𝑖, 𝜃 𝑠 )) 𝑁 𝑖=1 .</p></div>
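The linear view of a fixed-structure tree, ℎ(𝑥, 𝜃) = ∑𝑗 𝜃𝑗1{𝑥 ∈ 𝑅𝑗}, can be made concrete with a small sketch (ours, not from the paper): leaf membership defines one-hot features 𝜑(𝑥), so the tree's prediction is the inner product 𝜑(𝑥)ᵀ𝜃. The split points and leaf values below are illustrative.

```python
import numpy as np

def leaf_indicators(x, split_points):
    """Map scalar inputs to one-hot leaf-membership features phi(x)."""
    # np.searchsorted assigns each x to the interval (leaf) it falls in
    leaf_idx = np.searchsorted(split_points, x)
    phi = np.zeros((len(x), len(split_points) + 1))
    phi[np.arange(len(x)), leaf_idx] = 1.0
    return phi

splits = np.array([0.0, 1.0])        # two splits -> three leaves (regions R_j)
theta = np.array([-1.0, 0.5, 2.0])   # leaf values theta_j
x = np.array([-0.5, 0.3, 1.7])

# With the structure fixed, the tree prediction is the linear form phi(x)^T theta
pred = leaf_indicators(x, splits) @ theta
print(pred.tolist())  # [-1.0, 0.5, 2.0]
```

Stacking the feature maps of all trees in an ensemble gives exactly the global linear model 𝑓Θ(𝑥) = 𝜑(𝑥)ᵀΘ used in the rest of the paper.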
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">SGLB</head><p>SGLB combines stochastic gradient boosting (SGB) <ref type="bibr" target="#b6">[7]</ref> with stochastic gradient Langevin dynamics (SGLD) <ref type="bibr" target="#b16">[17]</ref>. Following notations used in the original paper <ref type="bibr" target="#b10">[11]</ref>, we characterize the SGB procedure by a tuple ℬ := {ℋ, 𝑝}, where ℋ again is the set of base learners and 𝑝(𝑠|𝑔) is a distribution over indices 𝑠 ∈ 𝑆 conditioned on a gradient vector 𝑔 ∈ R 𝑁 . Simply put, 𝑝(𝑠|𝑔) defines a distribution over tree structures.</p><p>As with other GBDT algorithms, SGLB constructs an ensemble of decision trees iteratively. At each iteration 𝜏 , we compute unbiased gradient estimates 𝑔 ˆ𝜏 such that E[𝑔 ˆ𝜏 ] = ( 𝜕 𝜕𝑓 𝐿(𝑓 Θ ^𝜏 (𝑥𝑖), 𝑦𝑖)) 𝑁 𝑖=1 ∈ R 𝑁 using the current model 𝑓 Θ ^𝜏 , and sample independently two normal vectors 𝜁, 𝜁 ′ ∼ 𝒩 (0𝑁 , 𝐼𝑁 ), where 0𝑁 , 𝐼𝑁 denote the zero vector and identity matrix in R 𝑁 , respectively. Then, a base learner (or tree structure) 𝑠𝜏 is picked by drawing one sample from 𝑝(𝑠|𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁 ′ ), where 𝜖 &gt; 0 is a learning rate (or step size) and 𝛽 &gt; 0 is a parameter often referred to as the inverse diffusion temperature. Next, we estimate the parameters 𝜃 𝑠𝜏 * (at tree leaves) of the sampled base learner by solving the following optimization:</p><formula xml:id="formula_0">minimize ||𝜃 𝑠𝜏 || 2 2 𝑠.𝑡. 𝜃 𝑠𝜏 ∈ argmin 𝜃∈R 𝑚𝑠 𝜏 || − 𝑔 ˆ𝜏 − √︂ 2𝑁 𝜖𝛽 𝜁 − 𝐻𝑠 𝜏 𝜃|| 2 2 ,<label>(1)</label></formula><p>which returns the minimum norm solution that fits best to the perturbed "noisy" version of negative gradients. The optimization above has a closed-form solution</p><formula xml:id="formula_1">𝜃 𝑠𝜏 * = −Φ𝑠 𝜏 (𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁)</formula><p>, where Φ𝑠 𝜏 := (𝐻 𝑇 𝑠𝜏 𝐻𝑠 𝜏 ) + 𝐻 𝑇 𝑠𝜏 and + denotes the pseudo-inverse. For decision trees, Φ𝑠 𝜏 𝑔 essentially corresponds to averaging the gradient estimates 𝑔 in each leaf node of the tree. 
Lastly, the SGLB algorithm updates the ensemble model by</p><formula xml:id="formula_2">𝑓 Θ ^𝜏+1 (•) := (1 − 𝛾𝜖)𝑓 Θ ^𝜏 (•) + 𝜖ℎ 𝑠𝜏 (•, 𝜃 𝑠𝜏 * ),<label>(2)</label></formula><p>where 𝛾 is a regularization parameter that "shrinks" the currently built model when updating the ensemble. At a high level, SGLB is a stochastic GB algorithm with Gaussian noise injected into gradient estimates, which encourages the algorithm to explore a larger area in the functional space to find a better fit for the given data. The independence between noise 𝜁 (used for parameter learning) and 𝜁 ′ (used for tree sampling), and the model shrinking by 𝛾 in Eqn.(2) are technical details needed for establishing theoretical results and rigorous analysis of SGLB <ref type="bibr" target="#b10">[11]</ref>. All the procedures of SGLB are also present in our proposed cSGLB in Algo. 1 (with our additional modifications highlighted in blue).</p><p>One can show that the parameters of SGLB Θ ˆ𝜏 at each iteration form a Markov chain that weakly converges to the following stationary distribution:</p><formula xml:id="formula_3">𝑝 𝛽 * (Θ) ∝ exp(−𝛽ℒ(Θ|𝒟𝑁 ) − 𝛽𝛾||ΓΘ|| 2 2 ),<label>(3)</label></formula><p>where Γ = Γ 𝑇 &gt; 0 is a regularization matrix which depends on a particular tree construction algorithm or the choice of tuple ℬ := {ℋ, 𝑝(𝑠|𝑔)} <ref type="bibr" target="#b10">[11]</ref>. Note that since the GBDT model is linear and can be fully determined by parameters Θ, we simply use the notations ℒ(𝑓 |𝒟𝑁 ) and ℒ(Θ|𝒟𝑁 ) interchangeably.</p></div>
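One SGLB iteration (Eqns. (1)-(2)) can be illustrated with a self-contained toy sketch of our own (not the CatBoost implementation): a depth-1 median-split stump stands in for sampling 𝑠𝜏 ∼ 𝑝(𝑠|𝑔), leaf averaging plays the role of Φ𝑠𝜏, and the noise scale and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sglb_step(x, y, pred, eps, beta, gamma):
    """One toy SGLB boosting iteration on squared loss L = (f - y)^2 / 2."""
    N = len(y)
    g = pred - y                                    # gradient estimate g_tau
    zeta = rng.standard_normal(N)                   # injected Gaussian noise
    # Leaf targets fit the noise-perturbed negative gradients, -(g + c*zeta)
    target = -(g + np.sqrt(2 * N / (eps * beta)) * zeta)
    split = np.median(x)                            # stand-in for s ~ p(s|g)
    left = x <= split
    # Phi applied to a vector = averaging it within each leaf
    leaf = np.where(left, target[left].mean(), target[~left].mean())
    # Shrunken ensemble update of Eqn. (2)
    return (1 - gamma * eps) * pred + eps * leaf

x = np.linspace(-1, 1, 200)
y = x ** 2
pred = np.zeros_like(y)
for _ in range(300):
    pred = sglb_step(x, y, pred, eps=0.1, beta=1e4, gamma=1e-3)
print(np.mean((pred - y) ** 2))  # well below the ~0.2 loss of the zero model
```

Despite the injected noise, the shrunken updates still reduce the empirical loss; the noise merely makes the trajectory a sampler rather than a pure optimizer.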
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Posterior Sampling</head><p>We consider here a standard Bayesian learning framework <ref type="bibr" target="#b7">[8]</ref> that treats parameters Θ as random variables and places a prior 𝑝(Θ) over Θ. In addition, we consider the GBDT model 𝑓Θ as a probabilistic model and explicitly denote the model by 𝑃 (𝑦|𝑥; Θ) with parameters Θ. This holds naturally for classification as GBDT models by construction return a distribution over class labels. For regression, one can leverage the NGBoost algorithm <ref type="bibr" target="#b23">[24]</ref> to return the mean and variance of a Gaussian distribution over the target 𝑦 for a given input 𝑥.</p><p>For the purpose of uncertainty estimation, we aim to estimate or obtain an approximation to the Bayesian posterior 𝑝(Θ|𝒟𝑁 ). To that end, we can choose 𝛽 = 𝑁 and 𝛾 = 1/(2𝑁)<ref type="foot" target="#foot_0">1</ref> and use the negative log-likelihood as the loss function</p><formula xml:id="formula_4">ℒ(Θ|𝒟𝑁 ) = E𝒟 𝑁 [− log 𝑝(𝑦|𝑥, Θ)] = − 1 𝑁 ∑︀ 𝑁 𝑖=1 log 𝑝(𝑦𝑖|𝑥𝑖, Θ).</formula><p>Then, the limiting distribution of SGLB can be explicitly expressed as:</p><formula xml:id="formula_5">𝑝 𝛽 * (Θ) ∝ exp (︁ log 𝑝(𝒟𝑁 |Θ) − 1 2 ||ΓΘ|| 2 2 )︁ ∝ 𝑝(𝒟𝑁 |Θ)𝑝(Θ),<label>(4)</label></formula><p>which is proportional to the true posterior 𝑝(Θ|𝒟𝑁 ) under Gaussian prior 𝑝(Θ) = 𝒩 (0𝑚, Γ) <ref type="bibr" target="#b10">[11]</ref>. Now, consider a Bayesian ensemble of probabilistic models {𝑃 (𝑦|𝑥; Θ (𝑘) )} 𝐾 𝑘=1 where each model is trained independently by running SGLB. Since each Θ (𝑘) is guaranteed to be sampled from 𝑝(Θ|𝒟𝑁 ) by Eqn.( <ref type="formula" target="#formula_5">4</ref>), the ensemble {Θ (𝑘) } 𝐾 𝑘=1 with 𝐾 samples yields a "discrete" approximation to the posterior 𝑝(Θ|𝒟𝑁 ). This is exactly the idea behind SGLB ensemble <ref type="bibr" target="#b11">[12]</ref>, which learns 𝐾 independent SGLB models in parallel with different random seeds. 
Although the approximation improves as 𝐾 increases, the computational cost also increases linearly with 𝐾. To alleviate the computational burden, SGLB virtual ensemble <ref type="bibr" target="#b11">[12]</ref> builds a Bayesian virtual ensemble by sampling multiple times from a single-chain SGLB model. Because samples from the same chain are highly correlated, SGLB virtual ensemble proposes to sample one member Θ (𝑘) every 𝐶 &gt; 1 iterations. More specifically, the parameters are sampled by</p><formula xml:id="formula_6">{Θ (𝑘) } 𝐾=⌊ 𝒯 2𝐶 ⌋ 𝑘=1 = {Θ ˆ𝐶(𝑘+⌊ 𝒯 2𝐶 ⌋) , 𝑘 = 1, . . . , ⌊ 𝒯 2𝐶 ⌋},</formula><p>i.e., appending one member to the ensemble every 𝐶 iterations while constructing one SGLB model using 𝒯 iterations of gradient boosting. Notice that no sampling is performed during the first half of iterations (𝜏 &lt; 𝒯 /2) since Eqn.(4) holds only asymptotically. For large 𝐶 and 𝐾, the virtual ensemble should theoretically behave similarly to the SGLB real ensemble in the limit.</p></div>
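The snapshot schedule {Θ̂_{𝐶(𝑘+⌊𝒯/2𝐶⌋)}} can be spelled out with a few lines of Python (our own sketch; variable names mirror the paper's 𝒯 and 𝐶):

```python
def virtual_ensemble_indices(T, C):
    """Iterations at which the SGLB virtual ensemble snapshots sub-models:
    one every C iterations, skipping the first half of the T boosting
    iterations because the stationarity of Eqn. (4) holds only asymptotically."""
    K = T // (2 * C)                       # ensemble size K = floor(T / 2C)
    offset = T // (2 * C)                  # shift past the burn-in half
    return [C * (k + offset) for k in range(1, K + 1)]

print(virtual_ensemble_indices(T=1000, C=100))  # [600, 700, 800, 900, 1000]
```

All snapshots land in the second half of training, and the spacing 𝐶 controls how correlated consecutive ensemble members are.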
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Uncertainty Estimation</head><formula xml:id="formula_7">Once the Bayesian (virtual) ensemble {𝑃 (𝑦|𝑥; Θ (𝑘) )} 𝐾 𝑘=1 , Θ (𝑘) ∼ 𝑝(Θ|𝒟𝑁 ) is learned,</formula><p>predictions can be made by taking an average over the ensemble, often known as predictive posterior or Bayesian model average (BMA):</p><formula xml:id="formula_8">𝑝(𝑦|𝑥, 𝒟𝑁 ) = E 𝑝(Θ|𝒟 𝑁 ) [𝑃 (𝑦|𝑥; Θ)] ≈ 1 𝐾 ∑︀ 𝐾 𝑘=1 𝑃 (𝑦|𝑥; Θ (𝑘) ).</formula><p>(5) The entropy of the predictive posterior estimates total uncertainty (TU) in predictions, which can be further decomposed into two distinct types of uncertainty: knowledge uncertainty and data uncertainty. 1 (a) Knowledge uncertainty (KU) arises due to the lack of knowledge about the data generation process (or the unknown distribution 𝒟). KU is expected to be large in regions (in the feature space) where we do not have sufficient training data. (b) Data uncertainty (DU) arises due to the inherent stochasticity within the data generation process, and it is high in regions with class overlaps. In applications like active learning <ref type="bibr" target="#b24">[25]</ref>, reinforcement learning (RL) <ref type="bibr" target="#b25">[26]</ref>, and OOD detection, it is desirable to measure KU separately from DU (or TU), and the following equation can be used in practice to compute and connect them via mutual information <ref type="bibr" target="#b26">[27]</ref>:</p><formula xml:id="formula_9">I(𝑦; Θ|𝑥, 𝒟 𝑁 ) ⏟ ⏞ Knowledge Uncertainty = H(𝑝(𝑦|𝑥, 𝒟 𝑁 )) ⏟ ⏞ Total Uncertainty − E 𝑝(Θ|𝒟 𝑁 ) [H(𝑃 (𝑦|𝑥; Θ))] ⏟ ⏞ Expected Data Uncertainty ≈ H (︁ 1 𝐾 𝐾 ∑︁ 𝑘=1 𝑃 (𝑦|𝑥; Θ (𝑘) ) )︁ − 1 𝐾 𝐾 ∑︁ 𝑘=1 H (︁ 𝑃 (𝑦|𝑥; Θ (𝑘) ) )︁ ,<label>(6)</label></formula><p>where I(𝐴; 𝐵) denotes the mutual information between random variables A and B, and H(•) denotes entropy. 
The difference between TU and DU measures the disagreement among members in the ensemble and estimates the knowledge uncertainty.<ref type="foot" target="#foot_1">2</ref> </p></div>
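The decomposition in Eqn. (6) is easy to compute from an ensemble's class probabilities; the following sketch (ours, with illustrative toy inputs) makes the TU/DU/KU split concrete for a 𝐾-member ensemble.

```python
import numpy as np

def entropy(p):
    """Shannon entropy along the last axis, guarded against log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def decompose_uncertainty(probs):
    """probs: (K, num_classes) array, one categorical distribution per member."""
    bma = probs.mean(axis=0)               # Bayesian model average, Eqn. (5)
    total = entropy(bma)                   # TU: entropy of the predictive posterior
    expected_data = entropy(probs).mean()  # DU: mean per-member entropy
    knowledge = total - expected_data      # KU: mutual information I(y; Theta | x)
    return total, expected_data, knowledge

# Members agree -> knowledge uncertainty ~ 0 (only data uncertainty remains)
agree = np.array([[0.9, 0.1], [0.9, 0.1]])
# Members disagree confidently -> knowledge uncertainty is large
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])

print(decompose_uncertainty(agree)[2])     # ~0.0
print(decompose_uncertainty(disagree)[2])  # close to log(2) ~ 0.69
```

The second case mimics an OOD input: each member is confident but they contradict each other, so TU is high while expected DU stays low.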
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Cyclical Stochastic Gradient Langevin Boosting (cSGLB)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Promoting Mode Discovery via Cyclical Gradient Scheduling</head><p>Instead of building a cumbersome true ensemble of SGLB models, the virtual ensemble of SGLB greatly improves efficiency by training only a single model. However, similar to other types of SG-MCMC in Bayesian DL <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>, single-chain SGLB easily gets trapped in a single mode of the posterior. To efficiently explore different modes of the multimodal posterior and effectively measure uncertainty in GBDT predictions with a single chain, we propose a simple remedy that places a cyclical cosine schedule on gradient scale during training, as illustrated in Fig. <ref type="figure" target="#fig_0">1</ref>. Specifically, the scaling factor at iteration 𝜏 is defined as:</p><formula xml:id="formula_10">𝛼𝜏 = max (︁ 𝛼max 2 [︁ cos( 𝜋 mod (𝜏, 𝐶) 𝐶 )+1 ]︁ , 𝛼min )︁ ,<label>(7)</label></formula><p>where 𝛼max ≥ 1 is the maximum value of the scaler (the initial value 𝛼0), 𝐶 is the user-defined cycle length, and 𝛼min defines the minimum of the scaler, e.g., 𝛼min = 1 or 0.5, since decaying the gradients to arbitrarily small values can harm performance. Putting it together, this amounts to sampling the tree structure and learning the tree leaf parameters with the (re)scaled gradients: 𝑠𝜏 ∼</p><formula xml:id="formula_11">𝑝(𝑠|𝛼𝜏 𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁 ′ ) and 𝜃 𝑠𝜏 * = −Φ𝑠 𝜏 (𝛼𝜏 𝑔 ˆ𝜏 + √︁ 2𝑁 𝜖𝛽 𝜁).</formula><p>Similar to Cyclical SG-MCMC <ref type="bibr" target="#b12">[13]</ref>, we define two stages within each cycle: (1) Exploration when the completed portion of a cycle 𝒜(𝜏 ) = mod (𝜏,𝐶) 𝐶 is smaller than a given threshold: 𝒜(𝜏 ) &lt; 𝜂, and (2) Sampling when 𝒜(𝜏 ) ≥ 𝜂, where 𝜂 ∈ (0, 1) balances the portion between exploration and sampling. 
We obtain one sample from the chain at the end of each cycle, i.e., the virtual ensemble is built by</p><formula xml:id="formula_12">{Θ (𝑘) } 𝐾=⌊ 𝒯 𝐶 ⌋ 𝑘=1 = {Θ ˆ𝐶𝑘−1 , 𝑘 = 1, . . . , ⌊ 𝒯 𝐶 ⌋}.</formula><p>Large gradients at the beginning of a cycle provide enough perturbation and encourage the model to escape from the current mode, while decreasing the gradient scale inside one cycle makes the sampler better characterize the density of the local mode. Moreover, many prior works in Bayesian NNs proposed to apply a certain form of preconditioning to compensate for sampling noise from mini-batch training <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b13">14]</ref>. Tree-based models can usually digest the full batch (full dataset 𝒟𝑁 ) per iteration by leveraging modern multi-core processors and multi-threading. Therefore, we directly use full-batch GB in all sampling stages, while leaving the option of random data subsampling in exploration stages to the users if training time is a concern.</p><p>Combining cyclical gradient scaling with SGLB, we expect that our new Cyclical SGLB (cSGLB) algorithm could inherit most (if not all) theoretical properties of the original SGLB algorithm. Conceptually, with a proper choice of 𝛼max, 𝛼min and cycle length 𝐶, the sample obtained at 𝜏 = 𝐶 − 1 from the Markov chain 𝑓 Θ ^𝜏 generated by Algo. 1 (w/o bootstrap) can be approximately seen as a random draw from the limiting distribution with small bounded errors. Also, each new cycle can be viewed as a warm restart from its previous cycle, and thus no errors shall be accumulated into the subsequent cycles (at sampling time 𝜏 = 𝐶𝑘 − 1). We leave rigorous analysis and proofs of these propositions to future work. Empirically, we show in our experiments that the cyclical gradient scaling achieves similar effects in exploring a multimodal distribution when compared with cSG-MCMC, which places a similar cyclical schedule on step size within the context of Bayesian DL. 
In fact, cSGLB extends the idea behind cSG-MCMC to tree-based GBDT models. We also summarize key differences between our design of cSGLB and cSG-MCMC in Appendix A.</p></div>
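Eqn. (7), the two-stage split, and the end-of-cycle sampling rule can be written directly in code; the sketch below (ours) uses the paper's notation 𝛼max, 𝛼min, 𝐶, and 𝜂, with illustrative values.

```python
import math

def gradient_scale(tau, C, alpha_max=10.0, alpha_min=1.0):
    """Cyclical cosine gradient scaler of Eqn. (7)."""
    return max(alpha_max / 2 * (math.cos(math.pi * (tau % C) / C) + 1), alpha_min)

def stage(tau, C, eta=0.8):
    """Fraction of the cycle completed, A(tau), decides exploration vs. sampling."""
    return "exploration" if (tau % C) / C < eta else "sampling"

C = 100
print(gradient_scale(0, C))        # cycle start: alpha_max = 10.0
print(gradient_scale(99, C))       # cycle end: decays down to alpha_min = 1.0
print(stage(10, C), stage(95, C))  # exploration sampling

# One sub-model is snapshotted at the end of each cycle, tau = C*k - 1
sample_taus = [C * k - 1 for k in range(1, 5)]
print(sample_taus)                 # [99, 199, 299, 399]
```

Note the floor at 𝛼min keeps late-cycle gradients from vanishing entirely, which is the design choice discussed after Eqn. (7).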
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Enhancing Sample Diversity via Bootstrapping</head><p>Recent work <ref type="bibr" target="#b15">[16]</ref> provided a compelling analysis showing that the Bayesian posterior is not optimal under model misspecification <ref type="foot" target="#foot_2">3</ref> , where the performance of the true posterior is dominated by an alternative non-Bayesian posterior that explicitly encourages diversity among ensemble member predictions. Inspired by these results, we propose a simple strategy that promotes diversity among samples obtained from cSGLB by data bootstrapping. At the beginning of each cycle, we randomly sample a Bernoulli mask of size 𝑁 , i.e., 𝜈 := {𝜈[𝑖] ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝 𝑏𝑚 )} 𝑁 𝑖=1 ∈ {0, 1} 𝑁 , where 𝑝 𝑏𝑚 ∈ (0, 1) defines the percentage of data being used. In the following exploration stage, we mask out gradients 𝑔 ˆ𝜏 by taking an element-wise product with 𝜈, i.e., 𝑔 ˆ𝜏 ⊙ 𝜈. The mask 𝜈 and the mask-out operation are used consistently throughout the exploration stage 𝒜(𝜏 ) &lt; 𝜂, and 𝜈 is only updated at the end of the cycle. This design amounts to learning with a bootstrapped subsample of the data in each cycle. Since the model consistently observes less data than the original 𝒟𝑁 , it also amounts to posterior tempering (𝑝(𝒟𝑁 |Θ)𝑝(Θ)) 1/𝑇 with some temperature 𝑇 &gt; 1, resulting in a warm posterior that is softer than the Bayesian posterior. By increasing the temperature 𝑇 , we expect to see increased density on the paths/corridors connecting different modes of the posterior <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>, further helping the sampler escape from the current local mode. By using a relatively large 𝜂 ∈ (0.8, 1), the tempering effects would carry over into the sampling stage. 
Therefore, the bootstrapping mechanism helps improve the sample diversity from cSGLB, and we name this variant Cyclical Bootstrapped SGLB (cbSGLB).</p><p>Lastly, we summarize our proposed cSGLB (plus bootstrap option) in Algo. 1 and highlight our modifications on top of SGLB in blue.</p></div>
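The per-cycle bootstrap amounts to two lines of NumPy; the sketch below (ours, with toy values for 𝑁 and 𝑝_bm) shows the mask draw and the element-wise mask-out 𝑔̂𝜏 ⊙ 𝜈 that is held fixed through an exploration stage.

```python
import numpy as np

rng = np.random.default_rng(42)

N, p_bm = 8, 0.5
# Drawn once at the start of a cycle: nu[i] ~ Bernoulli(p_bm), reused for
# every gradient update of that cycle's exploration stage
nu = rng.binomial(1, p_bm, size=N)

g = np.arange(1.0, N + 1.0)            # toy gradient vector g_tau
g_masked = g * nu                      # element-wise mask-out, g_tau (*) nu

print(nu)
print(g_masked)                        # zero wherever nu[i] = 0
```

Masked-out points contribute nothing to tree sampling or leaf estimation, so each cycle effectively trains on a different bootstrapped subsample of 𝒟𝑁.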
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experiments on Synthetic Data</head><p>We validate and qualitatively evaluate the proposed gradient scheduling and our cSGLB algorithm on two synthetic problems: (1) a synthetic multimodal dataset in <ref type="bibr" target="#b12">[13]</ref> and (2) a multi-class Spiral dataset in <ref type="bibr" target="#b11">[12]</ref>. Due to limited space, we include experimental details in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synthetic Multimodal Data</head><p>We first demonstrate the ability of cyclical gradient scaling to sample from a multimodal distribution on a 2D mixture of 25 Gaussians. Specifically, we compare (i) the original SG-MCMC with SGLD dynamics (denoted SGLD) and two SGLD variants: (ii) SGLD with a Cyclical schedule on the Learning Rate (denoted clr-SGLD) <ref type="bibr" target="#b12">[13]</ref> and (iii) SGLD with a Cyclical schedule on the Gradient scale (denoted cg-SGLD (ours)). We reproduced the results for SGLD and clr-SGLD in <ref type="bibr" target="#b12">[13]</ref> with code released by the authors, and built our cg-SGLD on top of it. In addition, we experimented with a "noisy" version of SGLD with a fixed lr and a 10× larger noise scale (denoted NoisySGLD/N-SGLD). For a fair comparison, each chain runs for 50𝑘 iterations and both clr-SGLD and cg-SGLD use 30 cycles. Fig. <ref type="figure" target="#fig_2">2</ref> shows the estimated density under the different sampling strategies. SGLD gets trapped in local modes depending on its random initial position, and increasing the noise scale does not solve the problem. In contrast, clr-SGLD and cg-SGLD can explore and locate roughly 7 − 8 different modes of the distribution, showing that our cg-SGLD matches the state of the art in exploring multimodal distributions. Moreover, cg-SGLD has implementation benefits over clr-SGLD when combined with SGLB. The SGLB algorithm is available in the CatBoost library <ref type="bibr" target="#b29">[30]</ref>, which only supports a fixed lr. All our proposed enhancements can be implemented with a "user-defined loss function" available in CatBoost without touching the source code, making it straightforward to reproduce our algorithms.</p></div>
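For concreteness, the cosine schedule on the gradient scale used by cg-SGLD (and later by cSGLB) can be written as a small function. This is a sketch in our notation, with the limits 𝛼max = 10 and 𝛼min = 1 taken from the appendix settings:

```python
import math

def gradient_scaler(tau, C, alpha_max=10.0, alpha_min=1.0):
    """alpha_tau = max(alpha_max/2 * (cos(pi * mod(tau, C)/C) + 1), alpha_min):
    the scale starts large at the beginning of each cycle (exploration) and
    decays toward alpha_min for the sampling stage."""
    frac = (tau % C) / C
    return max(alpha_max / 2.0 * (math.cos(math.pi * frac) + 1.0), alpha_min)
```

At iteration 𝜏 the per-example gradients are multiplied by this factor before the boosting update, so the step size itself can stay fixed, which is what makes the scheme compatible with CatBoost's fixed-lr SGLB.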
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multi-Class Spiral Data</head><p>After validating the efficacy of cyclical gradient scheduling for sampling from multimodal distributions, we are now ready to experiment with cSGLB. Specifically, we compare the following algorithms on a 3-class classification task called "Spiral" in <ref type="bibr" target="#b11">[12]</ref>: (i) a real SGLB ensemble of 𝐾 models, denoted ens𝐾, (ii) an SGLB virtual ensemble, simply denoted SGLB, and (iii) a cSGLB virtual ensemble, denoted cSGLB (ours). We again reproduced the results in <ref type="bibr" target="#b11">[12]</ref> with code released by the authors, and Fig. <ref type="figure" target="#fig_4">3</ref> shows the estimated KU on the Spiral test set. As noted in <ref type="bibr" target="#b11">[12]</ref>, knowledge uncertainty due to decision-boundary "jitter" exists in both ens20 and cSGLB, and the "jitter" affects cSGLB more, as its estimated KU is "noisy" at the decision boundary. Nevertheless, cSGLB (with only a single model) is significantly more efficient than ens20 and greatly improves upon SGLB in capturing high KU in regions with no training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Experiments on Real-World Weather Prediction Data</head><p>Lastly, we evaluate our proposed methods on the public Shifts Weather Prediction dataset <ref type="bibr" target="#b30">[31]</ref>. We select the classification task, where the ML model is asked to predict the precipitation class at a particular geolocation and timestamp, given heterogeneous tabular features derived from weather station measurements and forecast models. The full dataset is partitioned in a canonical fashion and contains in-domain (ID) training, development and evaluation datasets as well as out-of-domain (OOD) development and evaluation datasets. Importantly, the ID data and the OOD data are separated in time and consist of non-overlapping climate types (ID: Tropical, Dry, Mild; OOD: Snow, Polar), making the Shifts dataset an ideal testbed for gauging the robustness of ML models and the quality of uncertainty estimation. To further facilitate our experimentation, we conducted a series of data preprocessing steps, including random data sampling to keep 200𝐾 (medium-sized) samples in the final training set. Again, the purpose of our data preprocessing is to speed up experimentation, and we believe that the observations and findings in this study generalize to the original full dataset. For model building, 30 independent SGLB models (each of 1𝐾 trees) were trained and used to construct real ensembles 𝑒𝑛𝑠𝐾 for 𝐾 ∈ {3, 5, 10, 30}. SGLB/cSGLB/cbSGLB virtual ensembles were built by sampling 10 members from a single chain with 2𝐾 trees. Hence, 𝑒𝑛𝑠10 is 5× more expensive in computation and memory than a virtual ensemble. Additional details regarding our data and models are included in Appendix C. We compare the various methods on their predictive performance and on uncertainty quantification following <ref type="bibr" target="#b30">[31]</ref>, and the results are summarized in Table <ref type="table" target="#tab_1">1</ref>. 
For predictive performance, we report the classification accuracy and macro F1 using BMA on both the ID and the OOD evaluation datasets. We can see the following effects: (1) Virtual ensembles with a longer chain slightly outperform the real ensembles on the ID data. (2) Our proposed cSGLB and cbSGLB perform slightly worse than the other methods on the OOD data. However, this is usually not a concern in practice, since the model is not trained with data from OOD domains and would not be used to solve OOD prediction tasks in a practical scenario. As long as domain shifts can be reliably detected (via uncertainty), proactive decisions can be made to avoid costly mistakes due to model errors. (3) Our proposed data bootstrapping mechanism is capable of improving the performance on the OOD data (cbSGLB &gt; cSGLB). In addition, we include the F1-AUC metric (on the combined ID&amp;OOD evaluation sets) introduced in <ref type="bibr" target="#b30">[31]</ref> to jointly assess predictive power and uncertainty quality. The F1-AUC can be increased either by having a stronger predictive model or by improving the correlation between uncertainty and error. Consistent with the findings in <ref type="bibr" target="#b30">[31]</ref>, total uncertainty (TU) correlates more with errors than knowledge uncertainty (KU), as shown by the F1-AUC scores. More specifically, we see that the F1-AUC is quite similar across the board when measured by TU, although cSGLB/cbSGLB has slightly worse predictive power on the OOD segment. When F1-AUC is measured by KU, our cSGLB/cbSGLB produces KU estimates that relate more closely to model errors than the KU from the SGLB baseline.</p><p>Finally, we present the OOD detection ROC-AUC performance on the evaluation data using KU estimates. Our cSGLB/cbSGLB outperforms the SGLB baseline by a large margin on the OOD detection task, and even achieves performance comparable to the real ensemble 𝑒𝑛𝑠10, which is 5× more expensive. 
This highlights that our cSGLB/cbSGLB can produce high-fidelity KU estimates to detect domain (or distributional) shifts with a single model, and that our proposed cyclical gradient scheduling is effective in exploring different modes of a posterior. In real-world industrial applications, detecting OOD data or domain shifts efficiently is often crucial for the safe deployment and operation of ML systems. Observing consistently high uncertainty (especially KU) in model predictions indicates that the patterns of new incoming data have deviated from the training distribution. This often provides a strong signal for a model refresh, ensuring that the ML system can be updated in time to avoid errors and operate safely in its "comfort zone" (with relatively low predictive uncertainty).</p></div>
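The uncertainty measures used in this evaluation follow the standard entropy-based decomposition: total uncertainty (TU) is the entropy of the ensemble-averaged prediction, and knowledge uncertainty (KU) is TU minus the average member entropy (the mutual information); OOD-detection ROC-AUC then ranks OOD against ID points by their KU scores. The NumPy sketch below illustrates these quantities; it is our minimal rendition of the standard decomposition, not the evaluation code from [31]:

```python
import numpy as np

def uncertainty_decomposition(probs):
    """probs: array of shape (members, samples, classes) with member predictions.
    TU = entropy of the mean prediction; DU = mean member entropy;
    KU = TU - DU (mutual information between prediction and member)."""
    mean_p = probs.mean(axis=0)
    tu = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)
    du = -(probs * np.log(probs + 1e-12)).sum(-1).mean(axis=0)
    return tu, tu - du

def roc_auc(scores_id, scores_ood):
    """Probability that a random OOD point scores higher than a random ID point
    (ties count half) -- equivalent to the ROC-AUC of the OOD detector."""
    s_id = np.asarray(scores_id, dtype=float)[:, None]
    s_ood = np.asarray(scores_ood, dtype=float)[None, :]
    return (s_ood > s_id).mean() + 0.5 * (s_ood == s_id).mean()
```

When ensemble members agree, KU collapses to zero even if each member is individually uncertain (high TU, high DU); KU grows only when members disagree, which is why it is the natural score for shift detection.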
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>We present cyclical gradient scheduling and Cyclical SGLB for efficiently and effectively quantifying uncertainty in gradient boosting with a single model, and propose a data bootstrapping scheme to enhance diversity in posterior samples. We show empirically that our algorithms outperform the state-of-the-art SGLB, especially in quantifying knowledge uncertainty and for OOD detection. Accurately quantifying uncertainty in ML predictions can yield many benefits in real-world applications. Uncertainty directly measures the confidence in model predictions, not only improving interpretability but also ensuring that costly mistakes can be avoided proactively. Consistently observing high uncertainty in predictions on an incoming data stream often provides a reliable signal of data distributional shifts and/or the model becoming obsolete. Uncertainty is also a key concept in optimal control and decision making, where it can be leveraged to experiment with different actions/decisions at controllable cost in search of the best operating point of an ML system. We believe that our work on efficient uncertainty quantification for GBDTs facilitates the adoption of uncertainty-enabled and uncertainty-aware ML systems, and more broadly promotes ML safety and the safe and trustworthy deployment of ML models in production.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Comparison between cSGLB and cSG-MCMC</head><p>The proposed Cyclical SGLB algorithm combines SGLB with cSG-MCMC <ref type="bibr" target="#b12">[13]</ref> to effectively explore different modes of a highly multimodal posterior distribution. In this section, we summarize some key differences between our design and the original cSG-MCMC algorithm.</p><p>(1) cSG-MCMC is a sampling algorithm designed for Bayesian NNs, while cSGLB is built for GBDT models. In deep learning, full-batch gradient descent is usually not feasible, and techniques have been developed to explicitly compensate for mini-batch noise, such as preconditioning <ref type="bibr" target="#b13">[14]</ref>. Some also suggested applying an additional Metropolis-Hastings correction step <ref type="bibr" target="#b14">[15]</ref>. Tree-based GB models can easily scale up to large industrial datasets and digest the full training set at each iteration. Therefore, our cSGLB uses full-batch GB in the sampling stage of each cycle to ensure that high-quality samples are generated. (2) cSGLB puts a cyclical schedule on the gradient scale, while cSG-MCMC puts a schedule on the step size. In addition, the original cSG-MCMC completely removes the injected Gaussian noise in the exploration stage, reducing to regular stochastic gradient descent (SGD) during that period. Although the authors claimed that this amounts to posterior tempering, which is commonly used in the DL domain, the cSG-MCMC implementation does not closely follow the dynamics of SGLD during the exploration stage. In contrast, we keep the injected noise term unchanged throughout learning. Our design achieves similar effects to step-size scheduling in a synthetic experiment, and we also ensure that cSGLB follows the dynamics of SGLD (more or less) at every iteration step. 
(3) Lastly, the gradient scaling (instead of step-size scheduling) has implementation benefits. The SGLB algorithm is made available in the CatBoost library <ref type="bibr" target="#b29">[30]</ref>, which only supports a constant step size (or learning rate). Our proposed cyclical gradient scaling (and data bootstrapping) can be implemented easily with the "user-defined loss function" available in the CatBoost package, without modifying a single line of the source code. </p></div>
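As an illustration of point (3), a cyclically scaled logloss objective could look roughly like the sketch below. It follows CatBoost's documented user-defined objective interface (an object exposing `calc_ders_range`, which returns per-example first and second derivatives), but the class name, the internal iteration counter, and the maximization sign convention shown here are our assumptions; consult the CatBoost documentation before relying on them:

```python
import math

class CyclicalScaledLogloss:
    """Sketch of a user-defined CatBoost objective that rescales the first
    derivatives of binary logloss by the cyclical factor alpha_tau.
    Hypothetical illustration, not the authors' released implementation."""

    def __init__(self, C=200, alpha_max=10.0, alpha_min=1.0):
        self.C, self.alpha_max, self.alpha_min = C, alpha_max, alpha_min
        self.tau = 0  # assumption: calc_ders_range is called once per boosting iteration

    def _alpha(self):
        frac = (self.tau % self.C) / self.C
        return max(self.alpha_max / 2.0 * (math.cos(math.pi * frac) + 1.0),
                   self.alpha_min)

    def calc_ders_range(self, approxes, targets, weights):
        a = self._alpha()
        self.tau += 1
        ders = []
        for x, t in zip(approxes, targets):
            p = 1.0 / (1.0 + math.exp(-x))  # sigmoid of the raw approximation
            # logloss derivatives (maximization convention); only the first
            # derivative is scaled by the cyclical factor.
            ders.append((a * (t - p), -p * (1.0 - p)))
        return ders
```

An instance would then be passed to a CatBoost model as its loss function (CPU training only for Python-defined objectives), leaving the library's fixed learning rate untouched.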
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Synthetic Data</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Synthetic Multimodal Distribution</head><p>The ground-truth density is a uniform mixture of 25 Gaussians, 𝐹 (𝑥) = (1/25) ∑ 𝑖=1..25 𝒩 (𝑥|𝜇𝑖, Σ), where the means 𝜇𝑖 lie on the grid {−4, −2, 0, 2, 4} × {−4, −2, 0, 2, 4}. We used the code released by the authors <ref type="bibr" target="#b12">[13]</ref> to generate our results. Specifically, SGLD was trained with a decaying lr 𝜖𝜏 = 0.05(𝜏 + 1) −0.55 , and clr-SGLD was learned with a cyclical lr schedule with initial value 𝜖0 = 0.09 and exploration proportion 𝜂 = 0.25. For our cg-SGLD, we fixed the lr 𝜖 = 0.01 and the Gaussian noise scale to 0.4, and set 𝛼max = 10, 𝛼min = 1. The "noisy" version of SGLD (NoisySGLD/N-SGLD) was trained with a fixed lr 𝜖 = 0.02 and noise scale 5.0 (roughly 10× larger than the noise scale used in the other methods). Each chain was trained for 50𝑘 iterations, and both clr-SGLD and cg-SGLD used 30 cycles. The results and findings are robust to random seeds, and similar results were observed with different seeds. We refer interested readers to the original paper <ref type="bibr" target="#b12">[13]</ref> for results of SGLD and clr-SGLD in parallel (or multi-chain) settings.</p></div>
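Under the settings above, a single cg-SGLD chain on this mixture can be sketched as follows. This is a NumPy illustration, not the authors' released code: we use an isotropic covariance σ²I (σ = 1) as a stand-in for Σ, which is not restated here, and treat the "noise scale" as a multiplier on the √𝜖 Gaussian injection:

```python
import numpy as np

def grad_log_density(x, mus, sigma=1.0):
    """Gradient of log F(x) for a uniform mixture of Gaussians N(mu_i, sigma^2 I)."""
    diffs = mus - x                                  # (25, 2)
    logw = -0.5 * (diffs ** 2).sum(axis=1) / sigma ** 2
    w = np.exp(logw - logw.max())                    # stable responsibilities
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / sigma ** 2

def sgld_chain(steps=50_000, eps=0.01, noise=0.4,
               C=None, a_max=10.0, a_min=1.0, seed=0):
    """SGLD with optional cyclical gradient scaling (cg-SGLD when C is set).
    Hyperparameters mirror the appendix settings; the dynamics are a sketch."""
    rng = np.random.default_rng(seed)
    grid = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
    mus = np.array([[i, j] for i in grid for j in grid])
    x = rng.normal(size=2)
    samples = []
    for tau in range(steps):
        a = 1.0
        if C is not None:  # cyclical gradient scale, as in cg-SGLD
            a = max(a_max / 2 * (np.cos(np.pi * (tau % C) / C) + 1), a_min)
        x = (x + eps * a * grad_log_density(x, mus)
               + noise * np.sqrt(eps) * rng.normal(size=2))
        samples.append(x.copy())
    return np.array(samples)
```

Passing `C=steps // 30` reproduces the 30-cycle setup; with `C=None` the chain falls back to plain SGLD with a fixed learning rate.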
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Synthetic Spiral Dataset</head><p>All experiments were conducted using CatBoost <ref type="bibr" target="#b29">[30]</ref>, one of the state-of-the-art libraries for GBDTs. The ensemble of SGLB (ens20) contains 20 independent models (with different random seeds) of 1K trees each. The learning rate is 𝜖 = 0.1, the tree depth is 6, random_strength = 100, and border_count = 128. The SGLB virtual ensemble and cSGLB virtual ensemble are trained with the same parameters, except that we increase the number of trees to 2K and lower the lr for cSGLB to 𝜖 = 0.05. Thus, a virtual ensemble is 10× more efficient in computation and memory than the actual SGLB ensemble.</p><p>For the SGLB virtual ensemble, every 50th model from the interval [1000, 2000] is added to the ensemble, making it a virtual ensemble of 20 members. For the cSGLB virtual ensemble, we set 𝜖 = 0.05, cycle length 𝐶 = 200, 𝛼max = 10, 𝛼min = 1, making it a virtual ensemble of 2000/200 = 10 members. For cbSGLB with bootstrapping, we additionally set the exploration proportion 𝜂 = 0.9 and mask probability 𝑝 𝑏𝑚 = 0.66.</p></div>
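The setup above maps onto CatBoost's public API roughly as the configuration sketch below. We assume a CatBoost version that supports `posterior_sampling` (SGLB dynamics) and the `VirtEnsembles` prediction type; `X_train`/`y_train`/`X_test` are placeholders:

```python
# Configuration sketch (not executed here); check parameter names against
# the CatBoost documentation for your installed version.
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    random_strength=100,
    border_count=128,
    posterior_sampling=True,  # enables SGLB-style Langevin boosting
)
# model.fit(X_train, y_train)
# Per-member probabilities from a single trained chain (a "virtual ensemble"):
# probs = model.predict(X_test, prediction_type="VirtEnsembles",
#                       virtual_ensembles_count=10)
```

The cyclical gradient scaling and the bootstrap mask would then be layered on top via a user-defined loss function, as discussed in Appendix A, without modifying CatBoost itself.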
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Real-World Shifts Data</head><p>Data Summary A detailed summary of our final partitioning of the Weather Prediction dataset is included in Table <ref type="table" target="#tab_2">2</ref>.</p><p>Experimental Details We used the default parameter settings suggested for SGLB models in the original uncertainty-quantification paper <ref type="bibr" target="#b11">[12]</ref>, except that we set the subsample rate to 0.8 for stochastic gradient boosting. The real SGLB ensemble consists of (up to) 30 SGLB models trained with different seeds, each of 1K trees. In order to get more samples from a single chain, the virtual ensembles of SGLB and our cSGLB/cbSGLB were learned with a single model of 2K trees. We set the learning rate for all models to 𝜖 = 0.05, and the tree depth to 6. For the SGLB virtual ensemble, every 100th model from the interval [1000, 2000] was added to the ensemble, making it a virtual ensemble of 10 members. cSGLB and cbSGLB shared the same parameters as their SGLB counterpart. In addition, for the cSGLB/cbSGLB virtual ensembles, we set cycle length 𝐶 = 200, 𝛼max = 10.0, 𝛼min = 1.0, making each a virtual ensemble of 2000/200 = 10 members. For simplicity, cSGLB used full-batch gradient boosting at each iteration step. In contrast, for cbSGLB with bootstrapping, we set the exploration proportion 𝜂 = 0.8, i.e., 80% of a cycle was treated as exploration, and set the mask probability 𝑝 𝑏𝑚 = 0.6 in the exploration stage. For model and parameter selection, we only used the in-domain (ID) development set and did not use the out-of-domain (OOD) development set. 
Although this may potentially lower our reported performance on the OOD evaluation set, we believe that it better reflects a real-world learning scenario, where shifted data is often unobserved and unavailable at training time.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Illustration of proposed cyclical schedule on gradient scales for SGLB algorithm.</figDesc><graphic coords="2,109.63,84.19,162.68,112.67" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Sampling from a 5 by 5 mixture of 25 Gaussians.</figDesc><graphic coords="7,89.29,222.43,99.65,67.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Spiral dataset and estimated knowledge uncertainties. Each different color in (a) represents a different class.</figDesc><graphic coords="7,89.29,307.71,99.65,66.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Algorithm 1: Cyclical (Bootstrapped) SGLB</head><label></label><figDesc>Reconstructed listing of Algorithm 1 (Cyclical (Bootstrapped) SGLB).</figDesc><table><row><cell>Input: dataset 𝒟𝑁 , learning rate 𝜖 &gt; 0, inverse temperature 𝛽 &gt; 0, regularization 𝛾 &gt; 0, number of iterations 𝒯 &gt; 0, cycle length 𝐶 &gt; 1, scaler limits 𝛼max, 𝛼min &gt; 0, stage threshold 𝜂 &gt; 0, mask probability 𝑝 𝑏𝑚 &gt; 0, boolean indicator 𝑏𝑜𝑜𝑡𝑠𝑡𝑟𝑎𝑝</cell></row><row><cell>Initialize 𝑓 Θ ^0 (•) = 0, 𝜈 = 1𝑁 ∈ R 𝑁 (all-ones vector)</cell></row><row><cell>for 𝜏 in [0, 1, . . . , 𝒯 − 1] do</cell></row><row><cell>  if 𝑏𝑜𝑜𝑡𝑠𝑡𝑟𝑎𝑝 then: if mod(𝜏, 𝐶)/𝐶 = 0, sample 𝜈 ∈ R 𝑁 with 𝜈[𝑖] ∼ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝 𝑏𝑚 ); if mod(𝜏, 𝐶)/𝐶 ≥ 𝜂, set 𝜈 = 1𝑁</cell></row><row><cell>  Compute gradient scaler: 𝛼𝜏 = max(𝛼max/2 [cos(𝜋 mod(𝜏, 𝐶)/𝐶) + 1], 𝛼min)</cell></row><row><cell>  Estimate gradients 𝑔 ˆ𝜏 using 𝑓 Θ ^𝜏 (•) and 𝒟𝑁 : 𝑔 ˆ𝜏 = (𝜕/𝜕𝑓 𝐿(𝑓 Θ ^𝜏 (𝑥𝑖), 𝑦𝑖)) 𝑁 𝑖=1 ∈ R 𝑁</cell></row><row><cell>  Sample noise 𝜁, 𝜁 ′ ∼ 𝒩 (0𝑁 , 𝐼𝑁 )</cell></row><row><cell>  Sample tree structure: 𝑠𝜏 ∼ 𝑝(𝑠 | 𝛼𝜏 (𝑔 ˆ𝜏 ⊙ 𝜈) + √(2𝑁/(𝜖𝛽)) 𝜁 ′)</cell></row><row><cell>  Estimate leaf/parameter values: 𝜃 𝑠𝜏 * = −Φ𝑠𝜏 (𝛼𝜏 (𝑔 ˆ𝜏 ⊙ 𝜈) + √(2𝑁/(𝜖𝛽)) 𝜁)</cell></row><row><cell>  Update GBDT model: 𝑓 Θ ^𝜏+1 (•) = (1 − 𝛾𝜖)𝑓 Θ ^𝜏 (•) + 𝜖 ℎ𝑠𝜏 (•, 𝜃 𝑠𝜏 *)</cell></row><row><cell>end</cell></row><row><cell>Return: 𝑓 Θ ^𝒯 (•)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Comparing predictive performance &amp; uncertainty measures of various methods on Shifts Weather data. Mean ± Std Dev over 3 seeds. The best performance among virtual ensembles is highlighted in bold.</figDesc><table><row><cell>Metric</cell><cell>Data</cell><cell>ens3</cell><cell>ens5</cell><cell>ens10</cell><cell>ens30</cell><cell>SGLB</cell><cell>cSGLB</cell><cell>cbSGLB</cell></row><row><cell>Accuracy (%)↑</cell><cell>eval-ID eval-OOD</cell><cell cols="3">65.24 ± 0.02 65.25 ± 0.02 65.26 ± 0.01 52.51 ± 0.06 52.55 ± 0.06 52.55 ± 0.03</cell><cell>65.26 52.54</cell><cell>65.63 ± 0.01 52.50 ± 0.14</cell><cell>65.93 ± 0.01 50.72 ± 0.17</cell><cell>65.59 ± 0.01 51.49 ± 0.11</cell></row><row><cell>Macro F1 (%)↑</cell><cell>eval-ID eval-OOD</cell><cell cols="3">63.28 ± 0.02 63.29 ± 0.02 63.30 ± 0.01 52.36 ± 0.07 52.39 ± 0.06 52.38 ± 0.03</cell><cell>63.30 52.38</cell><cell>63.69 ± 0.01 52.36 ± 0.13</cell><cell>64.06 ± 0.02 50.64 ± 0.16</cell><cell>63.67 ± 0.01 51.41 ± 0.15</cell></row><row><cell>F1-AUC (%)↑</cell><cell>TU KU</cell><cell cols="3">57.18 ± 0.01 57.19 ± 0.01 57.20 ± 0.01 52.94 ± 0.03 53.71 ± 0.03 54.33 ± 0.02</cell><cell>57.20 54.83</cell><cell>57.26 ± 0.01 52.55 ± 0.08</cell><cell cols="2">56.82 ± 0.03 53.60 ± 0.08 53.31 ± 0.14 56.95 ± 0.04</cell></row><row><cell>OOD-AUC (%)↑</cell><cell>KU</cell><cell cols="3">65.72 ± 0.31 68.60 ± 0.21 71.15 ± 0.16</cell><cell>73.26</cell><cell>67.32 ± 0.70</cell><cell cols="2">71.45 ± 0.67 71.70 ± 0.91</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Data summary of our partitioning of the Weather Prediction dataset.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="2"># of samples</cell><cell></cell><cell></cell><cell></cell><cell>% of class</cell><cell></cell></row><row><cell></cell><cell>Total</cell><cell>Tropical</cell><cell>Dry</cell><cell>Mild</cell><cell>Snow</cell><cell>Polar</cell><cell cols="3">Class 0 Class 10 Class 20</cell></row><row><cell>train-ID</cell><cell>200,000</cell><cell>26,786</cell><cell>45,357</cell><cell>127,857</cell><cell>0</cell><cell>0</cell><cell>40.92</cell><cell>33.71</cell><cell>25.37</cell></row><row><cell>dev-ID</cell><cell>46,279</cell><cell>6,196</cell><cell>10,505</cell><cell>29,578</cell><cell>0</cell><cell>0</cell><cell>40.90</cell><cell>33.97</cell><cell>25.13</cell></row><row><cell>dev-OOD</cell><cell>46,555</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>46,555</cell><cell>0</cell><cell>38.35</cell><cell>35.07</cell><cell>26.58</cell></row><row><cell>eval-ID</cell><cell>518,587</cell><cell>69,245</cell><cell cols="2">117,981 331,361</cell><cell>0</cell><cell>0</cell><cell>40.98</cell><cell>33.71</cell><cell>25.30</cell></row><row><cell>eval-OOD</cell><cell>524,048</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell cols="2">479,952 44,096</cell><cell>33.05</cell><cell>35.70</cell><cell>31.24</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">KU is also named epistemic uncertainty and DU is also called aleatoric uncertainty.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See paper<ref type="bibr" target="#b11">[12]</ref> for equations computing KU and DU in regression tasks.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The function class in-use does not contain the unknown groundtruth function.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The accuracy, fairness, and limits of predicting recidivism</title>
		<author>
			<persName><forename type="first">J</forename><surname>Dressel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Farid</surname></persName>
		</author>
		<idno type="DOI">10.1126/sciadv.aao5580</idno>
		<ptr target="https://www.science.org/doi/abs/10.1126/sciadv.aao5580" />
	</analytic>
	<monogr>
		<title level="j">Science Advances</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">5580</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brynjolfsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Calo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hirschberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kalyanakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kraus</surname></persName>
		</author>
		<title level="m">Artificial intelligence and life in 2030, One Hundred Year Study on Artificial Intelligence: Report of the 2015-2016 Study Panel</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">52</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Serban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Poll</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Visser</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2008.03046</idno>
		<ptr target="https://arxiv.org/abs/2008.03046" />
		<title level="m">Towards using probabilistic models to design software systems with inherent uncertainty</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">When does uncertainty matter?: Understanding the impact of predictive uncertainty in ML assisted decision making</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mcgrath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zytek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lakkaraju</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2011.06167</idno>
		<ptr target="https://arxiv.org/abs/2011.06167" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Tabular data: Deep learning is not all you need</title>
		<author>
			<persName><forename type="first">R</forename><surname>Shwartz-Ziv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Armon</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2106.03253</idno>
		<ptr target="https://arxiv.org/abs/2106.03253" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Greedy function approximation: a gradient boosting machine</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annals of statistics</title>
		<imprint>
			<biblScope unit="page" from="1189" to="1232" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Stochastic gradient boosting</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational statistics &amp; data analysis</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="367" to="378" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Uncertainty estimation in deep learning with application to spoken language assessment</title>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
		<respStmt>
			<orgName>University of Cambridge</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Tree induction for probabilitybased ranking</title>
		<author>
			<persName><forename type="first">F</forename><surname>Provost</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine learning</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="199" to="215" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Calibrating probability estimation trees using venn-abers predictors</title>
		<author>
			<persName><forename type="first">U</forename><surname>Johansson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Löfström</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Boström</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 SIAM International Conference on Data Mining</title>
				<meeting>the 2019 SIAM International Conference on Data Mining</meeting>
		<imprint>
			<publisher>SIAM</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="28" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sglb: Stochastic gradient langevin boosting</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ustimenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="10487" to="10496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ustimenko</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.10562</idno>
		<title level="m">Uncertainty in gradient boosting via ensembles</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1902.03932</idno>
		<title level="m">Cyclical stochastic gradient MCMC for Bayesian deep learning</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Wenzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Veeling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Świątkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Snoek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jenatton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nowozin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2002.02405</idno>
		<title level="m">How good is the Bayes posterior in deep neural networks really?</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">What are Bayesian neural network posteriors really like?</title>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vikram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G G</forename><surname>Wilson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4629" to="4640" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Learning under model misspecification: Applications to variational and ensemble methods</title>
		<author>
			<persName><forename type="first">A</forename><surname>Masegosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="5479" to="5491" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Bayesian learning via stochastic gradient Langevin dynamics</title>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Teh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Machine Learning (ICML-11)</title>
				<meeting>the 28th International Conference on Machine Learning (ICML-11)</meeting>
		<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="681" to="688" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Stochastic gradient Hamiltonian Monte Carlo</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1683" to="1691" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Bayesian sampling using stochastic gradient thermostats</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Babbush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">D</forename><surname>Skeel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Neven</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Preconditioned stochastic gradient Langevin dynamics for deep neural networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carlson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Carin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirtieth AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Consistency and fluctuations for stochastic gradient Langevin dynamics</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Teh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">H</forename><surname>Thiery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Vollmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="193" to="225" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Bayesian deep learning and a probabilistic perspective of generalization</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="4697" to="4708" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Ashukha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lyzhov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Molchanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vetrov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2002.06470</idno>
		<title level="m">Pitfalls of in-domain uncertainty estimation and ensembling in deep learning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">NGBoost: Natural gradient boosting for probabilistic prediction</title>
		<author>
			<persName><forename type="first">T</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Thai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schuler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2690" to="2700" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Settles</surname></persName>
		</author>
		<title level="m">Active learning literature survey</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Parameterized indexed value function for efficient exploration in reinforcement learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">R</forename><surname>Dwaracherla</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="5948" to="5955" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Depeweg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Hernández-Lobato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Doshi-Velez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Udluft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">stat</title>
		<imprint>
			<biblScope unit="volume">1050</biblScope>
			<biblScope unit="page">11</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Loss surfaces, mode connectivity, and fast ensembling of DNNs</title>
		<author>
			<persName><forename type="first">T</forename><surname>Garipov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Izmailov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Podoprikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Vetrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Wilson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Essentially no barriers in neural network energy landscape</title>
		<author>
			<persName><forename type="first">F</forename><surname>Draxler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Veschgini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salmhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hamprecht</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1309" to="1318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Dorogush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ershov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gulin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.11363</idno>
		<title level="m">CatBoost: Gradient boosting with categorical features support</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Malinin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Band</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chesnokov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Gales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Noskov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ploskonosov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Prokhorenkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Provilkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raina</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.07455</idno>
		<title level="m">Shifts: A dataset of real distributional shift across multiple large-scale tasks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
