Amortized Active Learning for Nonparametric Functions

Cen-You Li

Marc Toussaint

Barbara Rakitsch

Christoph Zimmer

Bosch Center for Artificial Intelligence, Germany; Technical University of Berlin, Germany

Active learning (AL) is a sequential learning scheme aiming to select the most informative data. AL reduces data consumption and avoids the cost of labeling large amounts of data. However, AL trains the model and solves an acquisition optimization for each selection. It becomes expensive when the model training or acquisition optimization is challenging. In this paper, we focus on active nonparametric function learning, where the gold standard Gaussian process (GP) approaches suffer from cubic time complexity. We propose an amortized AL method, where new data are suggested by a neural network which is trained up-front without any real data (Figure 1). Our method avoids repeated model training and requires no acquisition optimization during the AL deployment. We (i) utilize GPs as function priors to construct an AL simulator, (ii) train an AL policy that can zero-shot generalize from simulation to real learning problems of nonparametric functions and (iii) achieve real-time data selection and learning performance comparable to time-consuming baseline methods.

1. Introduction

Active learning (AL) is a sequential learning scheme aiming to reduce the effort and cost of labeling data [1–3]. The goal is to maximize the information given by each data point, so that the quantity of labeled data can be reduced. An AL method starts with a small amount of labeled data. The model is first trained on the labeled data, and then the trained model is used to evaluate acquisition scores for the unlabeled data. The acquisition function measures the expected knowledge gained from labeling a data point. Labels are then requested for the data points with the highest acquisition scores, and the labeled dataset is updated for the next AL iteration. AL can be run for several iterations until the budget is exhausted or until a training goal is achieved. To perform AL, however, one faces multiple challenges: (i) training models for every query can be nontrivial, especially when the learning time is constrained [4–6]; (ii) acquisition criteria need to be selected a priori but none of them clearly outperforms the others in all cases, which makes the selection difficult [7, 8]; (iii) optimizing an acquisition function can be difficult (e.g. a sophisticated discrete search space [9]).

In this paper, we propose an AL method that suggests new data points for labeling based on a neural network (NN) evaluation instead of the costly model training and acquisition function optimization (Figure 1). To this end, we decouple model training and acquisition function optimization from the AL loop. This is beneficial when we face the aforementioned challenges (i) and (iii), i.e. scenarios where either the querying time (training time plus optimization time) is precious [4–6] or it is difficult to optimize an acquisition function [9]. In these settings, making a high-quality data selection is too expensive, such that one would rather accept a faster and easier active learner even with a potential tradeoff of slightly worse acquisition quality. Notably, as AL tackles the data scarcity problem, such an NN policy function should be obtained with no additional real data.

We further focus our problem on actively learning regression tasks. The idea is to (i) generate a rich distribution of functions, (ii) simulate AL experiments on those functions, (iii) train the NN policy in simulation, and then (iv) zero-shot generalize to real AL problems. For low data learning problems (up to thousands of data points), Gaussian processes (GPs, [10]) are a powerful model family that naturally fits our approach. A GP is a distribution of nonparametric functions that, if used as the model in an AL loop (Figure 1, left), provides well-calibrated predictive distributions suitable for acquisition functions [11–14]. This paper utilizes GPs to sample functions and simulate AL of regression problems (Figure 1, right). In other words, we perform amortized inference [15] of an active learner from GP simulations.

Please note the difference between the model and the NN policy. In this paper, model always refers to the model one wishes to actively learn on a specific task, while the NN policy proposes AL queries and the queries are then used to fit the model.

Contributions:

We summarize our contributions:
• we formulate a training pipeline for an active nonparametric function learning policy which requires no real data;
• we propose differentiable AL objectives in closed form for the training;
• we provide an empirical analysis on common benchmark problems.

Related works: AL [1–3] is prominent in various applications such as image classification [4, 5] or physical system modeling [16]. In regression tasks, GPs demonstrate great advantage in AL acquisitions [16–21]. An acquisition function plays a major role in AL methods (Algorithm 1). Entropy, which selects the most uncertain points in the space, is a popular acquisition function due to its effectiveness and computational simplicity [22]. Mutual information is another well-known option. A mutual information criterion can focus on the information gain in the space [11, 13] or take model improvement [14] into account, which is often considered superior to entropy. However, depending on the settings, mutual information is often intractable and creates computational overhead. A closely related field is Bayesian optimization (BO, [23]) which aims to find the global optima of functions with limited evaluations. The same algorithm (Algorithm 1) can be applied to BO problems by exchanging the acquisition function. BO also suffers from repeated model training and acquisition optimization.

Algorithm 1: Classical AL
Require: $\mathcal{D}_0 \subseteq \mathcal{X} \times \mathcal{Y}$, acquisition function $a$
1: for $t = 1, ..., T$ do
2: Model $\mathcal{M}_{t-1}$ with $\mathcal{D}_{t-1}$
3: $x_t = \arg\max_{x \in \mathcal{X}} a(x | \mathcal{M}_{t-1}, \mathcal{D}_{t-1})$
4: Evaluate $y_t$ at $x_t$
5: $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \{x_t, y_t\}$
6: end for

Recently, meta learning and amortized inference have been explored to tackle challenges of sequential learning methods. Konyushkova et al. proposed to meta learn an acquisition function for AL, avoiding a priori selection [ 8 ]. Given an acquisition function, Swersky et al. proposed to do an amortized inference on acquisition optimization [ 9 ]. On GP learning problems, Rothfuss et al. proposed to meta learn GP hyperparameters [ 24 ] while Bitzer et al. performed amortized inference to select GP kernels and hyperparameters [ 25 ], both of which simplify the model fitting which is a bottleneck in real time applications [ 6 ].

To the best of our knowledge, very few works automate the entire data selection process, i.e. decouple model updates and automate acquisition evaluations and optimizations. In [26, 27], the authors proposed RNN optimizers which query points by a simple forward pass. Foster et al. proposed the deep adaptive design (DAD), an amortized Bayesian experimental design, which likewise reduces sequential data selection to simple NN forwarding [28]. While DAD provides an AL deployment procedure as we aim for, they collect data to learn parametric models. The data selection criterion does not necessarily fit nonparametric functions. Ivanova et al. further extended DAD to learn intractable models, which is however a different direction from our goal [29].

None of the literature we found considers amortized inference of active nonparametric function learning. Interestingly, Krause et al. discussed theoretical perspectives of an a priori acquisition policy for active GP learning [ 12 ]. This provides key insight into our AL simulation. We take inspiration from [ 12, 27, 28 ] to develop our amortized AL method.

2. Problem Statement

We are interested in a regression task of an unknown function $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^D$ is the input space. We assume $\mathcal{X}$ is a bounded space, which usually holds true in reality, as one normally focuses only on a domain of interest. The observations we access are always noisy. That is, a labeled data point comprises an input $x \in \mathcal{X}$ and its corresponding output observation $y(x) = f(x) + \epsilon$, where $f(x)$ is a functional value and $\epsilon$ is an unknown noise value. For brevity, we write $y := y(x)$ and $f := f(x)$. For clarity later, we let $\mathcal{Y} \subseteq \mathbb{R}$ denote the output space, i.e. $y \in \mathcal{Y}$, $\mathcal{D} \subseteq \mathcal{X} \times \mathcal{Y}$ denote a dataset, and $\mathcal{P}(\mathcal{X} \times \mathcal{Y}) := \{\mathcal{D} \mid \mathcal{D} \subseteq \mathcal{X} \times \mathcal{Y}\}$ denote the space of datasets.

Algorithm 2: AL with NN Policy
Require: $\mathcal{D}_0 \subseteq \mathcal{X} \times \mathcal{Y}$, AL policy $\phi$
1: for $t = 1, ..., T$ do
2: $x_t = \phi(\mathcal{D}_{t-1})$
3: Evaluate $y_t$ at $x_t$
4: $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \{x_t, y_t\}$
5: end for
6: Model $f$ with $\mathcal{D}_T$

We follow an AL setting: a small labeled dataset $\mathcal{D}_0 := \{x_{init,i}, y_{init,i}\}_{i=1}^{N_{init}}$ is given, and we have budget to label $T$ more data points, denoted by $(x_1, y_1), ..., (x_T, y_T)$. The high level goal is to conduct AL to select informative $x_1, ..., x_T$ such that $\mathcal{D}_T = \mathcal{D}_0 \cup \{x_1, y_1, ..., x_T, y_T\}$ helps us construct a good model of $f$. In a conventional AL method (Figure 1, left and Algorithm 1), the data are selected iteratively by optimizing the acquisition criteria. In this paper, we aim to have a policy function $\phi: \mathcal{P}(\mathcal{X} \times \mathcal{Y}) \to \mathcal{X}$ up front, which sees the current observations and directly provides the next query proposal (Figure 1, right and Algorithm 2). We assume no additional real data are available for the policy training. Nevertheless, we make the assumptions that $f$ has a GP prior and that our observation data are normalized to zero mean and unit variance. In the following, we will sometimes write $X_{init} := (x_{init,1}, ..., x_{init,N_{init}})$, $Y_{init} := (y_{init,1}, ..., y_{init,N_{init}})$, $X_t := (x_{init,1}, ..., x_{init,N_{init}}, x_1, ..., x_t)$, $Y_t := (y_{init,1}, ..., y_{init,N_{init}}, y_1, ..., y_t)$, for $t = 1, ..., T$.
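The policy-based loop of Algorithm 2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `farthest_point_policy` is a hypothetical stand-in for the trained NN policy (it is not the network used in this paper), and the oracle is the noisy Sin benchmark from Section 4.

```python
import math
import random


def run_policy_al(policy, oracle, initial_data, num_queries):
    """Policy-based AL loop: each query is one policy call, no model fitting."""
    data = list(initial_data)
    for _ in range(num_queries):
        x_next = policy(data)        # query proposal from the current dataset
        y_next = oracle(x_next)      # noisy label from the (real or simulated) system
        data.append((x_next, y_next))
    return data


def farthest_point_policy(data, grid_size=201):
    """Hypothetical stand-in for the trained NN policy (not the paper's network):
    propose the grid point in [0, 1] farthest from all inputs seen so far."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    seen = [x for x, _ in data]
    return max(grid, key=lambda g: min(abs(g - x) for x in seen))


rng = random.Random(0)


def noisy_sin(x):
    """Sin benchmark with Gaussian noise of standard deviation 0.1."""
    return math.sin(20.0 * x) + rng.gauss(0.0, 0.1)


initial = [(0.5, noisy_sin(0.5))]
dataset = run_policy_al(farthest_point_policy, noisy_sin, initial, num_queries=10)
```

Any callable that maps a list of `(x, y)` pairs to a point in the input space can be passed as `policy`; the model of $f$ is fitted only once, after the loop has finished.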

Assumptions: We assume $f$ has a GP prior. A GP is a distribution over functions, characterized by the mean ($\mathbb{E}[f(x)]$) and kernel (the covariance between $f(x)$ and $f(x')$, for two input points $x, x'$). Without loss of generality, one usually assumes the mean is a zero function, which holds true when the observation values are normalized. The kernel function is typically parameterized, and it encodes the amplitude and smoothness of the function $f$. We refer the readers to [10] for details. The assumption is formally described below.

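Assumption 2.1. The unknown function $f$ has a GP prior $\mathcal{GP}(0, k_\theta)$. Any observation at $x$ is $y(x) = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is i.i.d. Gaussian noise [10]. Here, $k_\theta: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel function parameterized by $\theta$. We further assume $k_\theta(x, x') \le 1$.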

Bounding the kernel scale by one is not restrictive, as we assume the observations are normalized to unit variance. Due to a GP prior, any finite number of functional values are jointly Gaussian. GP distributions are provided in closed form in Appendix A.

We want to emphasize that the GP assumption is mainly for policy training. On a test function, failing this assumption (we however would not know a priori) may result in bad data selection, but our AL method can still be deployed as the data selection is decoupled from GP modeling.

3. AL with a priori trained policy

Our goal here is to train a policy $\phi$ to run Algorithm 2. Here we take key inspiration from [28, 29]. The idea is to exploit the GP prior (Assumption 2.1) before AL experiments. We use the GP prior $p(f)$ and the Gaussian likelihood $p(y | x, f) = \mathcal{N}(y | f(x), \sigma^2)$ to construct a simulator. This allows us to sample functions, simulate policy-based AL (Algorithm 2) and then meta optimize an objective function which encodes the acquisition criterion (Algorithm 3). The key is to ensure that the policy experiences AL on diverse functions; then, during a real AL experiment, the policy makes a zero-shot amortized inference from the simulation. Note that the training is performed by simulating active GP learning, while, in a real AL experiment, the policy only collects data, and we are not forced to use a GP to model the collected data.

Algorithm 3: Nonmyopic AL training

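Require: prior $\mathcal{GP}(0, k_\theta)$, $p(\epsilon) = \mathcal{N}(0, \sigma^2)$
1: sample a batch of $\theta, \sigma^2$
2: sample a batch of $f \sim \mathcal{GP}(0, k_\theta)$
3: sample $\mathcal{D}_0 \subseteq \mathcal{X} \times \mathcal{Y}$, given $f$ and $p(\epsilon)$
4: for $t = 1, ..., T$ do
5: $x_t = \phi(\mathcal{D}_{t-1})$
6: sample $\epsilon_t \sim p(\epsilon)$, $y_t = f(x_t) + \epsilon_t$
7: $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \{x_t, y_t\}$
8: end for
9: if entropy loss then
10: compute loss per Eq. (4)
11: else if regularized entropy loss then
12: sample $X_{grid} \subseteq \mathcal{X}$
13: sample $Y_{grid} = f(X_{grid}) + \epsilon_{grid}$
14: compute loss per Eq. (5)
15: end if
16: update $\phi$

Training objectives: We first discuss the training objectives, as they provide insight into what exact data we generate. Similar to [27, 28], the idea is to turn the acquisition criteria we would have optimized in a conventional AL setting into loss objectives where the learner gradient is available (Figure 1).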

Imagine we are doing AL with Algorithm 2 on synthetic functions. The first remark is that in a simulation, functions are always sampled from a known GP prior, i.e. parameters $\theta, \sigma^2$ are known before we start the simulated AL procedure. Thus, given a sequence of queries provided by a learner, the joint GP distribution is available in closed form. Therefore, an intuitive approach is to apply common entropy or (approximated) mutual information criteria on the policy selected points. We take the definition from [12], where the authors discuss policy-based AL which naturally applies to NN policies as well:

$$\mathcal{H}(\phi) := \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T}) \right], \tag{1}$$

$$\mathcal{I}(\phi) := \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T}) + \log p(y_{\phi,1}, ..., y_{\phi,T} | y(\mathcal{X} \setminus X_\phi)) \right], \tag{2}$$

where $f(\cdot)$ and $\epsilon_{t=1,...,T}$ are GP and noise realizations, $y_{\phi,1}, ..., y_{\phi,T}$ correspond to the policy selected queries $x_{\phi,1}, ..., x_{\phi,T}$, and $y(\mathcal{X} \setminus X_\phi)$, with $X_\phi := \{x_{\phi,1}, ..., x_{\phi,T}\}$, means the realization over the space $\mathcal{X} \setminus X_\phi$. In [12], the input space $\mathcal{X}$ is a discrete space with a finite number of elements, which makes $y(\mathcal{X} \setminus X_\phi)$ a computable set of values. We will describe $y(\mathcal{X} \setminus X_\phi)$ in more detail later. Note here that stochasticity arises from the function sampling, while the AL policy deals with each function deterministically.

Maximizing the entropy objective (Eq. (1)) would favor a set of uncorrelated points and naturally encourage points at the border, which are the most scattered [11]. In our initial experiments, we noticed that this entropy objective needed more careful tuning, as it often overemphasized the boundary and neglected to explore the interior of the space. The mutual information criterion is known to tackle this problem, at least in conventional AL settings [11], but, on the other hand, the aforementioned objective $\mathcal{I}(\phi)$ in its original form conditions on $y(\mathcal{X} \setminus X_\phi)$. This is not well-defined when $\mathcal{X}$ is a continuous space. Even if $\mathcal{X}$ is discrete, conditioning on a large pool (fine discretization) is computationally heavy, i.e. the GP cubic complexity $\mathcal{O}(|\mathcal{X}|^3)$ (Appendix A). A discrete pool also enforces a classifier-like policy (select points from a pool), which prohibits us from utilizing the existing NN structure developed by [28, 29].

We thus wish to modify $\mathcal{I}(\phi)$. Note that $\mathcal{I}(\phi)$ is a regularized entropy objective, and $\mathcal{H}(\phi)$, although not always well performing, can already be used for training. Therefore, we propose a simple yet effective approach: compute the regularization term only on a sparse set of $N_{grid}$ samples $(X_{grid}, Y_{grid})$:

$$\mathcal{I}(\phi) \approx \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T}) + \log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{grid}) \right]. \tag{3}$$

$N_{grid}$ should be much larger than $T$. Maximizing this objective encourages $\{x_{\phi,1}, ..., x_{\phi,T}\}$ to track subsets of $X_{grid}$. To keep the policy from selecting only those sparse grid samples, which are not necessarily optimal points, we re-sample $X_{grid}$ in each training step. The intuition of this objective is two-fold: (i) it can be viewed as an entropy objective regularized by an additional search space indicator, or (ii) it can be viewed as an imitation objective, because a subset of grid points, if it happens to have large joint entropy, maximizes the objective.

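The above losses consider a fixed set of GP hyperparameters, which encodes only certain function features. To generalize to diverse functions, we take the GP hyperparameters into account, and note that a real AL is initiated with initial data points. Our policy objectives become

$$\mathcal{H}(\phi) = \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T}, Y_{init}) \right] \propto \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}) \right], \tag{4}$$

$$\mathcal{I}(\phi) \approx \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T}, Y_{init}) + \log p(y_{\phi,1}, ..., y_{\phi,T}, Y_{init} | Y_{grid}) \right] \propto \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}) + \log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}, Y_{grid}) \right]. \tag{5}$$

The proportion symbol here indicates equivalency, and this holds by applying Bayes rule and removing the part that is not relevant to the policy gradient. In this paper, we sample the input points, $\theta$ and $\sigma^2$ uniformly. Please see the appendix for numerical details.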
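For intuition, the quantity inside the expectation of the entropy objective can be evaluated in closed form once the GP hyperparameters are fixed. The following is a minimal sketch in plain Python, assuming a 1D RBF kernel and a zero-mean GP; all function names are illustrative and not taken from the paper's implementation, which uses PyTorch so that the gradient can propagate into the policy.

```python
import math


def rbf(x1, x2, variance, lengthscale):
    """1D RBF kernel."""
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)


def noisy_gram(xs, variance, lengthscale, noise_var):
    """Gram matrix of the inputs plus noise variance on the diagonal."""
    return [[rbf(a, b, variance, lengthscale) + (noise_var if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]


def gaussian_logpdf(y, cov):
    """log N(y | 0, cov), computed with a Cholesky factorization."""
    n = len(y)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(cov[i][i] - s) if i == j else (cov[i][j] - s) / low[j][j]
    z = []
    for i in range(n):  # forward substitution: low @ z = y
        z.append((y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + logdet + sum(v * v for v in z))


def neg_log_conditional_density(x_init, y_init, x_query, y_query,
                                variance, lengthscale, noise_var):
    """One-sample estimate of -log p(y_query | y_init), using
    log p(y_query | y_init) = log p(y_query, y_init) - log p(y_init)."""
    joint = gaussian_logpdf(y_init + y_query,
                            noisy_gram(x_init + x_query, variance, lengthscale, noise_var))
    marginal = gaussian_logpdf(y_init, noisy_gram(x_init, variance, lengthscale, noise_var))
    return -(joint - marginal)
```

Averaging this value over many sampled hyperparameters, functions and noise realizations gives a Monte Carlo estimate of the entropy objective; the regularized variant additionally subtracts the same quantity conditioned on the grid observations.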

To summarize, we are given priors $p(f) \sim \mathcal{GP}(0, k_\theta)$, $p(\epsilon) \sim \mathcal{N}(0, \sigma^2)$ with uniformly random hyperparameters $\theta$ and $\sigma^2$. We may then sample GP function and noise realizations, and a policy returns sequences of data by actively learning those functions. Then the data are plugged into the meta AL objectives (Eqs. (4) and (5)), where the gradient propagates from the queries backward into the policy. We see here that data are generated where Eqs. (4) and (5) require, and this allows one to easily sample thousands or millions of functions in the training. In the next section, we zoom into the sampling of the policy-queried data.

Simulated AL:

The objective functions provide insight into the simulation procedure: sample a GP function realization, sample initial data, perform AL cycles by forwarding with the policy, and maximize either the policy entropy (Eq. (4)) or the regularized policy entropy (i.e. the modified mutual information Eq. (5)). The training procedure is summarized in Algorithm 3. One can see that lines 4-8 simulate AL cycles in the same way the policy will be deployed (Algorithm 2).

The only remaining challenge here is to ensure that $y_{\phi,1}, ..., y_{\phi,T}$ are from the same GP function. This is not trivial because the observations are sampled iteratively, i.e. $\forall t = 1, ..., T$, $x_t = \phi(\mathcal{D}_{t-1})$, which means $y_{\phi,1}, ..., y_{\phi,t-1}$ need to be sampled before $x_t, ..., x_T$ are known. One way is to make a standard GP posterior sampling $y_t \sim p(y(x_t) | \mathcal{D}_{t-1}, \theta, \sigma^2)$ instead of lines 2 and 6 of Algorithm 3. However, this results in $\mathcal{O}(N_{init}^3 + (N_{init} + 1)^3 + ... + (N_{init} + T - 1)^3)$ complexity in time, i.e. the notorious GP cubic complexity (Appendix A). Sampling $Y_{grid}$ (line 13 of Algorithm 3) would also take tremendous time.

We address this issue by applying a decoupled function sampling technique [30, 31]. The idea is to sample Fourier features to approximate a GP function. As a result, an approximated function is a linear combination of cosine functions (line 2 of Algorithm 3), and we can later compute the function value at any point $x \in \mathcal{X}$ in linear time (lines 6 and 13 of Algorithm 3). One limitation, however, is that the kernel needs to have a Fourier transform (e.g. stationary kernels, see Bochner's theorem in [10]).
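A minimal sketch of such a function sampler is given below, assuming the standard random Fourier feature construction for a 1D RBF kernel (Gaussian frequencies with standard deviation $1/\text{lengthscale}$, uniform phases, Gaussian weights). The exact sampling distributions of [30] may differ in detail, and the names are illustrative only.

```python
import math
import random


def sample_rff_function(variance, lengthscale, num_features=100, rng=None):
    """Draw one approximate GP prior sample (1D RBF kernel) as a sum of cosines."""
    if rng is None:
        rng = random.Random()
    omegas = [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    weights = [rng.gauss(0.0, math.sqrt(variance)) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def f(x):
        # Evaluation cost is linear in the number of features,
        # independent of how many points were queried before.
        return sum(w * scale * math.cos(om * x + b)
                   for w, om, b in zip(weights, omegas, phases))

    return f
```

Because the sampled features are fixed once, all later evaluations (policy queries and grid points alike) come from the same function realization, which is what the simulated AL loop requires.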

Notice that this training procedure simulates a nonmyopic AL. That is, the queries are optimized jointly but are not necessarily stepwise optimal. We additionally provide a myopic AL training algorithm detailed in the appendix (Algorithm 4), which optimizes stepwise data selection. The idea is simple: the initial dataset has a size randomly sampled from $\{N_{init}, ..., N_{init} + T - 1\}$, the policy queries one point, and then we compute the same loss objectives with the altered sequential structure. A myopic policy is not expected to have better AL performance but can avoid making recursive NN inference during the training. This might be beneficial if we want to scale the training up to larger $N_{init}$ or larger $T$.

NN structure: We take the NN described in [29]. Each data pair $(x, y)$ is first mapped by an MLP (multilayer perceptron) to a 32-dimensional embedding, then two layers of transformer encoders ([32], without positional encoding) are applied to the sequence of data pair embeddings, and finally the attended sequence is summed (to ensure permutation invariance over the observed dataset) before being mapped by another MLP to a new data query. The details are described in [28, 29]. We only add another Tanh layer with rescaling constants to refine the decoder output (i.e. to restrict the output to $\mathcal{X}$, which is bounded in our case). The query is in continuous space and this is how we train the policy. If an AL problem is considered over a discrete $\mathcal{X}$, one simple approach is to select the point closest to the NN query (line 2 of Algorithm 2).

Complexity: The training complexities are in Appendix C. The AL deployment complexities are as follows:
• amortized AL (Algorithm 2): NN forwarding takes $\mathcal{O}((N_{init} + t - 1)^2)$ at each $t$;
• conventional GP AL (Algorithm 1): at each $t$, GP modeling takes $\mathcal{O}((N_{init} + t - 1)^3)$ in time, while the complexity of acquisition optimization depends on the exact AL problem.

4. Experiments

In this section, we test our methods on several benchmark tasks.

4.1. NN training

We prepare the experiments by running Algorithm 3, which corresponds to the up-front preparation block in Figure 1. The entire Algorithm 3 is one NN training step. We implement the training pipeline with PyTorch. In the following experiments, we train one NN policy for 1D benchmark tasks and one NN for 2D tasks. The training time and the hardware are described in Table 1. The state dict (PyTorch model parameters) takes around 200 KB disk space for both NNs. The training of each setting is repeated five times with different random seeds. Among the five training jobs of each NN, we select the NN with the best training loss for the following experiments. See Appendix E for details.

4.2. Benchmark tasks

We deploy AL over the following benchmark problems. Our NNs are trained with $\mathcal{X} = [0, 1]^D$.

Sin function (1D): This is a one-dimensional problem $x \in [0, 1]$, $f(x) = \sin(20x)$. In the experiments, we sample Gaussian noise $\epsilon \sim \mathcal{N}(0, 0.1^2)$.

Branin function (2D): This function is defined over $(x_1, x_2) \in [-5, 10] \times [0, 15]$, which requires a rescaling mapping $\forall x \in \mathcal{X}$, $(x_1, x_2) = (-5 + 15[x]_1, 15[x]_2)$. The function is

$$f_{a,b,c,r,s,t}((x_1, x_2)) = a(x_2 - b x_1^2 + c x_1 - r)^2 + s(1 - t)\cos(x_1) + s,$$

where $(a, b, c, r, s, t) = (1, \frac{5.1}{4\pi^2}, \frac{5}{\pi}, 6, 10, \frac{1}{8\pi})$ are constants. We sample noise free data points and use the samples to normalize our output:

$$\tilde{f}_{a,b,c,r,s,t}((x_1, x_2)) = \frac{f_{a,b,c,r,s,t}((x_1, x_2)) - \mathrm{mean}(f_{a,b,c,r,s,t})}{\mathrm{std}(f_{a,b,c,r,s,t})}.$$

In the experiments, Gaussian noise $\epsilon$ is added to the observations.

Unconstrained Simionescu function (2D): This is originally a constrained problem [33] defined over $(x_1, x_2) \in [-1.25, 1.25]^2$ (which again requires a rescaling mapping $\mathcal{X} \to [-1.25, 1.25]^2$). We remove the constraint, resulting in $f(x_1, x_2) = 0.1 x_1 x_2$. As with the Branin function, we sample noise free data points and use the samples to normalize our output. In the experiments, the noise is $\epsilon \sim \mathcal{N}(0, 0.1^2)$.

Unconstrained Townsend function (2D): This is originally a constrained problem [34] defined over $(x_1, x_2) \in [-2.25, 2.25] \times [-2.5, 1.75]$ (a rescaling mapping from $\mathcal{X}$ is required). We remove the constraint, resulting in $f(x_1, x_2) = -[\cos((x_1 - 0.1) x_2)]^2 - x_1 \sin(3 x_1 + x_2)$. As with the Branin function, we sample noise free data points and use the samples to normalize our output. In the experiments, the noise is $\epsilon \sim \mathcal{N}(0, 0.1^2)$.
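The three 2D benchmark functions can be written on the unit square as follows. This is a sketch under assumptions: the Branin rescaling is the one stated above, while the linear rescalings for the Simionescu and Townsend domains are our own choice of mapping onto the stated boxes, and `normalize` uses the population standard deviation.

```python
import math


def branin(u1, u2):
    """Branin on the unit square, rescaled to [-5, 10] x [0, 15]."""
    x1, x2 = -5.0 + 15.0 * u1, 15.0 * u2
    a, b, c = 1.0, 5.1 / (4.0 * math.pi ** 2), 5.0 / math.pi
    r, s, t = 6.0, 10.0, 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s


def simionescu(u1, u2):
    """Unconstrained Simionescu; assumed linear rescaling to [-1.25, 1.25]^2."""
    x1, x2 = -1.25 + 2.5 * u1, -1.25 + 2.5 * u2
    return 0.1 * x1 * x2


def townsend(u1, u2):
    """Unconstrained Townsend; assumed linear rescaling to [-2.25, 2.25] x [-2.5, 1.75]."""
    x1, x2 = -2.25 + 4.5 * u1, -2.5 + 4.25 * u2
    return -math.cos((x1 - 0.1) * x2) ** 2 - x1 * math.sin(3.0 * x1 + x2)


def normalize(values):
    """Shift and scale noise-free samples to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]
```

In practice the mean and standard deviation would be estimated once from noise-free samples and then reused to normalize every later observation.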

Airline passenger dataset (1D): This is a publicly available time series dataset. Each data point has a date input (year and month) and a number of passengers as output. We convert the input into a real number as $\mathrm{year} + (\mathrm{month} - 1)/12$, and then rescale the entire input space to $[0, 1]$ (the earliest date becomes 0 while the latest becomes 1). The output data are again normalized to zero mean and unit variance.
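The date conversion can be illustrated with a short helper. The function name and the sample dates in the usage note are hypothetical and serve only to show the mapping.

```python
def dates_to_unit_interval(dates):
    """Map (year, month) pairs to [0, 1]: earliest date -> 0, latest date -> 1."""
    raw = [year + (month - 1) / 12.0 for year, month in dates]
    lo, hi = min(raw), max(raw)
    return [(value - lo) / (hi - lo) for value in raw]
```

For instance, `dates_to_unit_interval([(1949, 1), (1949, 7), (1950, 1)])` places the middle date halfway between the two end points.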

Langley Glide-Back Booster (LGBB) dataset (2D): This is a two-dimensional dataset described in [35]. The dataset has multiple outputs and we take the "lift" to run our experiments (after normalizing it to zero mean and unit variance). The inputs are $x_1$ (mach) and $x_2$ (alpha), which are normalized by $x_1 = \mathrm{mach}/6$, $x_2 = (\mathrm{alpha} + 5)/35$. After doing this, the input space is $[0, 1]^2$.

4.3. AL deployment

We compare our methods with (i) standard GP AL (Algorithm 1) with entropy acquisition (Appendix D), (ii) a random selection criterion and (iii) DAD, i.e. the amortized Bayesian experimental design proposed by [28]. In this section, we report the modeling performance and AL deployment time. Since the high-level goal is to model a regression task, we use the collected datasets to train models and evaluate the RMSE as the modeling performance. Although DAD and our amortized AL methods are not restricted to GP modeling, we still evaluate the data on GP models, as GPs are powerful modeling tools for such amounts of data and as this is a fair comparison to baseline (i).

We run experiments over the aforementioned benchmark problems. Our NN policy returns points in the continuous space $\mathcal{X} \subseteq \mathbb{R}^D$. On benchmark functions, a query is taken as it is (line 2 of Algorithm 2), while on the testing datasets (airline passenger and LGBB), we take the nearest point in 2-norm from the pool. Notice that the single pre-trained 1D NN policy is used for all the 1D tasks and the 2D NN policy for all the 2D tasks.
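Projecting a continuous query onto a finite pool amounts to a nearest-neighbor lookup in Euclidean distance. A minimal sketch, with an illustrative function name:

```python
import math


def nearest_pool_point(query, pool):
    """Return the pool input closest to the policy query in 2-norm."""
    return min(pool, key=lambda candidate: math.dist(candidate, query))
```

For example, with the pool `[(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]` a query at `(0.6, 0.4)` is mapped to `(0.5, 0.5)`. In a pool-based deployment one would typically also remove the selected point from the pool before the next query.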

For each method, we repeat the AL experiments (Algorithms 1 and 2) five times and report the mean and standard error. Each experiment is executed with its own seed. Note here that the initial datasets (and the noise of the function problems) are randomly sampled, which is where the seed plays a role.

The results are shown in Figure 2. The RMSEs are evaluated after the AL deployments. For example, with 1D problems (Sin function and airline passenger dataset), we start with 1 initial point and query for 10 iterations, resulting in 11 data points in the end. Then the RMSEs are evaluated with GPs trained on these 11 data points. The query time is the data selection time of all iterations. We can see that, on all the presented benchmark problems except for the Sin function, data selected by our nonmyopic amortized AL approaches achieve modeling performance as good as conventional GP AL, while the querying is significantly faster. Some of the RMSE improvements of our nonmyopic approaches (and the GP AL baseline) over Random are statistically significant (Wilcoxon signed-rank test, p-value smaller than 0.05). With the myopic training scheme, the policy can perform well in some tasks such as the LGBB but badly in others. In Appendix E, we present a few more trained policies that are good at different tasks. The DAD baseline sometimes performs well on 1D problems but not on any 2D problems.

In general, we consider this result a clear success. The tens-of-milliseconds decision-making time per query allows the amortized AL method to be applied to systems where output responses arrive at a few dozen Hz. In such systems, waiting for GP modeling and entropy optimization is too expensive.

Acknowledgments

This work was supported by Bosch Center for Artificial Intelligence, which provided financial support, computers and GPU clusters. The Bosch Group is carbon neutral. Administration, manufacturing and research activities no longer leave a carbon footprint. This also includes GPU clusters on which the experiments have been performed.

Appendix.

A. Gaussian process and entropy

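We first write down the GP predictive distribution. Given a set of $N$ data points $\mathcal{D} = \{X, Y\} \subseteq \mathcal{X} \times \mathcal{Y}$, we wish to make inference at points $X_* = \{x_{*,1}, ..., x_{*,n}\}$. We write $Y_* = (y(x_{*,1}), ..., y(x_{*,n}))$ for brevity. The joint distribution of $Y$ and predictive $Y_*$ is Gaussian:

$$p(Y, Y_*) = \mathcal{N}\left(0, k_\theta(X \cup X_*, X \cup X_*) + \sigma^2 I_{N+n}\right), \tag{6}$$

where $k_\theta(X \cup X_*, X \cup X_*)$ is a gram matrix with $[k_\theta(X \cup X_*, X \cup X_*)]_{i,j} = k_\theta([X \cup X_*]_i, [X \cup X_*]_j)$. This leads to the following predictive distribution (or GP posterior distribution)

$$p(Y_* | \mathcal{D}) = \mathcal{N}(Y_* | \mu(X_*), \Sigma(X_*)),$$
$$\mu(X_*) = k_\theta(X_*, X) \left[k_\theta(X, X) + \sigma^2 I_N\right]^{-1} Y,$$
$$\Sigma(X_*) = k_\theta(X_*, X_*) + \sigma^2 I_n - k_\theta(X_*, X) \left[k_\theta(X, X) + \sigma^2 I_N\right]^{-1} k_\theta(X, X_*). \tag{7}$$

Elements of the predictive mean vector $\mu(X_*)$ are the noise-free predictive function values.

Note that the log probability density function is

$$\log p(Y_* | \mathcal{D}) = -\frac{1}{2} \log\left((2\pi)^n \det(\Sigma(X_*))\right) - \frac{1}{2} (Y_* - \mu(X_*))^T [\Sigma(X_*)]^{-1} (Y_* - \mu(X_*)), \tag{8}$$

and, if we consider $Y_*$ as a vector of random variables, the entropy is

$$H(Y_* | \mathcal{D}) = \frac{n}{2} \log(2\pi e) + \frac{1}{2} \log\det(\Sigma(X_*)). \tag{9}$$

Inverting an $N \times N$ matrix $\left[k_\theta(X, X) + \sigma^2 I_N\right]$ has complexity $\mathcal{O}(N^3)$ in time. Computing the determinant also has cubic time complexity.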
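As a worked instance of the predictive distribution and the entropy above, the following sketch computes both for a single test point with a 1D RBF kernel. It uses a hand-written Cholesky solve to stay dependency-free; the helper names and default hyperparameter values are illustrative assumptions, not the paper's settings.

```python
import math


def rbf(a, b, variance=1.0, lengthscale=0.2):
    """1D RBF kernel."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve_spd(matrix, rhs):
    """Solve matrix @ x = rhs for a symmetric positive definite matrix via Cholesky."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(matrix[i][i] - s) if i == j else (matrix[i][j] - s) / low[j][j]
    z = []
    for i in range(n):  # forward substitution
        z.append((rhs[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def predictive(x_star, xs, ys, noise_var=0.01, variance=1.0, lengthscale=0.2):
    """Predictive mean and (noisy) variance of y(x_star) given data (xs, ys)."""
    gram = [[rbf(a, b, variance, lengthscale) + (noise_var if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [rbf(x_star, x, variance, lengthscale) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, solve_spd(gram, ys)))
    var = variance + noise_var - sum(k * v for k, v in zip(k_star, solve_spd(gram, k_star)))
    return mean, var


def gaussian_entropy(var):
    """Entropy of a univariate Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)
```

With one observation at $x = 0$, the predictive entropy at $x = 1$ is larger than at $x = 0$, which is why an entropy acquisition moves away from already observed inputs.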

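B. Additional losses and myopic training algorithm

In our main paper, we introduce

$$\mathcal{H}(\phi) \propto \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}) \right],$$

$$\mathcal{I}(\phi) \propto \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ -\log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}) + \log p(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}, Y_{grid}) \right].$$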

We additionally look into two more similar loss objectives. We treat policy selected points as random variables, compute the entropy directly, and then take the expectation over different priors and functions:

$$\mathcal{H}_2(\phi) = \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ H(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}) \right], \tag{10}$$

$$\mathcal{I}_2(\phi) \approx \mathbb{E}_{p(\theta, \sigma^2)} \mathbb{E}_{p(f(\cdot), \epsilon_{t=1,...,T})} \left[ H(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}) - H(y_{\phi,1}, ..., y_{\phi,T} | Y_{init}, Y_{grid}) \right]. \tag{11}$$

Substituting Eqs. (8) and (9) into the losses, we see that the key difference is whether the observation values $y_{\phi,1}, ..., y_{\phi,T}$ are taken into account. We suspect that having $y_{\phi,1}, ..., y_{\phi,T}$ in the loss (main losses) may help the policy adapt in AL deployment.

Our ablation study below compares the losses.

Myopic policy training: As described in Section 3, we propose another policy training method which does not require recursive NN forwarding. The idea is simple: as the policy is intended to run AL with $N_{init}$ initial points for $T$ iterations, we sample the size of the initial dataset from $N_{init}, ..., N_{init} + T - 1$ during the training, and then we simulate one-step AL. This allows the policy to experience all sizes of datasets that it will be tackling during an AL deployment. The training procedure is shown in Algorithm 4. All the loss functions are still the same: we condition on the initial dataset with altered sizes, consider a one-step query (compute as if $T = 1$), and propagate the gradient.

Algorithm 4: Myopic AL training
Require: prior $\mathcal{GP}(0, k_\theta)$, $p(\epsilon) = \mathcal{N}(0, \sigma^2)$
1: sample $\theta, \sigma^2$
2: sample $f \sim \mathcal{GP}(0, k_\theta)$
3: sample $t \in \{1, ..., T\}$
4: sample $\mathcal{D}_{t-1} \subseteq \mathcal{X} \times \mathcal{Y}$
5: $x_t = \phi(\mathcal{D}_{t-1})$
6: sample $\epsilon_t \sim p(\epsilon)$, $y_t = f(x_t) + \epsilon_t$
7: $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} \cup \{x_t, y_t\}$
8: if entropy loss then
9: compute loss per Eq. (4)
10: else if regularized entropy loss then
11: sample $X_{grid} \subseteq \mathcal{X}$
12: sample $Y_{grid} = f(X_{grid}) + \epsilon_{grid}$
13: compute loss per Eq. (5)
14: end if
15: update $\phi$

C. Training complexity

Overall, the training complexities are listed below.
• computing loss: $\mathcal{O}(N_{init}^3 + T^3)$ in time and $\mathcal{O}(N_{init}^2 + T^2)$ in space for the entropy objective (Eq. (4)), where the $N_{init}$ terms are the time and space costs of computing the GP predictive distribution (see Appendix A), while the $T$ terms are those of computing the log-likelihood;
• computing loss: $\mathcal{O}((N_{init} + N_{grid})^3 + T^3)$ in time and $\mathcal{O}((N_{init} + N_{grid})^2 + T^2)$ in space for our regularized entropy objective (Eq. (5));
• computing loss: $\mathcal{O}(N_{init}^3 + T^3)$ in time and $\mathcal{O}(N_{init}^2 + T^2)$ in space for the entropy version 2 objective (appendix Eq. (10));
• computing loss: $\mathcal{O}((N_{init} + N_{grid})^3 + T^3)$ in time and $\mathcal{O}((N_{init} + N_{grid})^2 + T^2)$ in space for our regularized entropy version 2 objective (appendix Eq. (11));
• NN forwarding: $\mathcal{O}\left(\sum_{t=1}^{T} (N_{init} + t - 1)^2\right) = \mathcal{O}\left((N_{init} + T - 1)^3 - (N_{init} - 1)^3\right)$ with the nonmyopic AL training (Algorithm 3), as self attention has square complexity [32];
• NN forwarding: $\mathcal{O}((N_{init} + T - 1)^2)$ with our myopic AL training (Algorithm 4) (this algorithm does not make recursive NN forwarding but the performance is worse).
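The NN forwarding terms can be checked with a small calculation. The sketch below counts attention cost in abstract units (sequence length squared) and compares the nonmyopic sum with the cubic expression stated above; the function names are ours.

```python
def nonmyopic_forward_cost(n_init, horizon):
    """Sum of squared sequence lengths over the T recursive policy calls."""
    return sum((n_init + t - 1) ** 2 for t in range(1, horizon + 1))


def myopic_forward_cost(n_init, horizon):
    """Worst case of a single policy call on the largest sampled dataset."""
    return (n_init + horizon - 1) ** 2


def cubic_bound(n_init, horizon):
    """The cubic expression that upper-bounds the nonmyopic sum."""
    return (n_init + horizon - 1) ** 3 - (n_init - 1) ** 3
```

For $N_{init} = 1$ and $T = 10$, the nonmyopic sum is $1^2 + 2^2 + ... + 10^2 = 385$ units, against a bound of $10^3 - 0^3 = 1000$, while a single myopic forward pass costs at most $10^2 = 100$ units.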

Note that we train only once to get a policy for various AL problems.

D. Numerical details

Policy training: In our current implementation, the data dimension and input bound need to be predefined. We fix $\mathcal{X} = [0, 1]^D$, and rescale all test problems to this region. The last layer of the NN policy is $z \mapsto (\tanh(z) + 1)/2$, which ensures that the policy proposes points in $[0, 1]^D$. The remaining structure is described in [29].

In Algorithms 3 and 4, the kernel we use is an RBF kernel, which has $D + 1$ variables: the variance $\sigma_f^2$ and the $D$-dimensional lengthscale vector $l$, i.e. $\theta = (\sigma_f^2, l)$. We sample $\sigma_f^2 \sim \mathcal{U}(0.505, 1.0)$, $\sigma^2 = 1.01 - \sigma_f^2$ (function and noise variances sum to 1.01) and each lengthscale component from $\mathcal{U}(0.05, 1.0)$. The sampling hyperparameters should be tuned according to the application; our setting only uses general assumptions. The variance parameters utilize the assumptions that (i) data are normalized to unit variance and (ii) the signal-to-noise ratio is at least one. The lengthscale is kept general, but one has to make sure that it is numerically stable (e.g. too small a lengthscale is bad because each lengthscale component is a divisor in the kernel).
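The hyperparameter sampling described above can be sketched as follows. The function name is illustrative, and independent sampling of each lengthscale component is assumed.

```python
import random


def sample_hyperparameters(dim, rng):
    """Sample RBF signal variance, noise variance and per-dimension lengthscales."""
    signal_var = rng.uniform(0.505, 1.0)
    noise_var = 1.01 - signal_var          # the two variances sum to 1.01
    lengthscales = [rng.uniform(0.05, 1.0) for _ in range(dim)]
    return signal_var, noise_var, lengthscales
```

Since the signal variance is at least 0.505, the noise variance never exceeds it, which matches the assumption of a signal-to-noise ratio of at least one.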

The GP functions are approximated by Fourier features, which means each function sample is a linear combination of cosine functions [30, 31]:

$$f(x) = \sum_{l=1}^{L} w_l \sqrt{2/L} \cos\left(\omega_l^\top x + b_l\right),$$

where $w_l$, $\omega_l$ and $b_l$ are sampled from the distributions described in [30]. A larger $L$ leads to better approximations. We set $L = 100$. The analytical mean over the window $[0, 1]^D$ is computed such that all functions can be shifted to zero mean (a GP function has zero mean and unit variance over the entire real space, but not necessarily in a specific bounded window). The analytical mean is the integral of $f(x)$ over $[0, 1]^D$ divided by the volume of $[0, 1]^D$. The integral is

$$\sum_{l=1}^{L} \frac{1}{\omega_l} w_l \sqrt{2/L} \sin\left(\omega_l x + b_l\right) \Big|_{x=0}^{1}, \quad \text{for } D = 1,$$

$$\sum_{l=1}^{L} \frac{-1}{[\omega_l]_1 [\omega_l]_2} w_l \sqrt{2/L} \cos\left([\omega_l]_1 x_1 + [\omega_l]_2 x_2 + b_l\right) \Big|_{x_1=0}^{1} \Big|_{x_2=0}^{1}, \quad \text{for } D = 2,$$

and so on.

It may happen that at least one component of $\omega_l$ is zero or close to zero, which causes a problem in the division. In this case, we replace $\left|1 / \prod_{d=1}^{D} [\omega_l]_d\right|$ by 100000. The error is negligible, i.e. much smaller than the noise level.
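The one-dimensional case can be sketched in a few lines. The code below is an illustration under assumptions, not the original implementation: it uses the standard random Fourier feature construction for an RBF kernel (Gaussian frequencies with standard deviation equal to the inverse lengthscale, uniform phases, Gaussian weights), and all function names are hypothetical. The cap on the inverse frequency mirrors the replacement by 100000 described above.

```python
import numpy as np


def sample_fourier_function(lengthscale, variance, num_features=100, rng=None):
    """Draw the random parameters of one approximate GP sample (1-D input)."""
    rng = np.random.default_rng() if rng is None else rng
    omega = rng.normal(0.0, 1.0 / lengthscale, size=num_features)   # frequencies
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)        # phase shifts
    weight = rng.normal(0.0, np.sqrt(variance), size=num_features)  # linear weights
    return omega, phase, weight


def evaluate(x, omega, phase, weight):
    """Evaluate the sampled function at a 1-D array of inputs x."""
    scale = np.sqrt(2.0 / omega.size)
    return scale * np.cos(np.outer(x, omega) + phase) @ weight


def window_mean(omega, phase, weight, cap=1e5):
    """Analytical mean over [0, 1]; the window has volume one."""
    scale = np.sqrt(2.0 / omega.size)
    with np.errstate(divide="ignore"):
        # Guard against frequencies at or near zero by capping 1 / omega.
        inv_omega = np.clip(1.0 / omega, -cap, cap)
    antiderivative = np.sin(omega + phase) - np.sin(phase)
    return scale * np.sum(weight * inv_omega * antiderivative)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = sample_fourier_function(lengthscale=0.3, variance=0.8, rng=rng)
    x = np.linspace(0.0, 1.0, 5)
    centered = evaluate(x, *params) - window_mean(*params)  # zero mean on [0, 1]
    print(centered)
```

Subtracting `window_mean` from every evaluation yields a function whose average over the unit interval is zero, which is the shift described in the text.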

The batch sizes are as follows. For nonmyopic training (Algorithm 3), we sample 25 kernels (25 sets of $\theta$, $\sigma^2$), 10 sets of noise realizations $\epsilon_{t=1,\dots,T}$ and 25 functions per prior, resulting in 25 × 10 × 25 = 6250 AL experiments per loss computation (an expectation over 6250 sequences of queries). For myopic training (Algorithm 4), we sample 250 kernels and split them into 20 different batches, each with its own initial dataset size; the remaining settings are the same. The grid samples of the regularized entropy objective have $N_{grid} = 100$. Note that whenever we need to sample input points (e.g. for the initial datasets or the grid), we sample uniformly from $\mathcal{X} = [0, 1]^D$.

Experiments: In our experiments, we always model with an RBF kernel. Given a dataset, the GP hyperparameters are optimized with Type II maximum likelihood.

In our GP AL baseline, the acquisition function is the predictive entropy $H(y(x) \mid \mathcal{D}_{t-1})$ (see Algorithm 1 and Eq. (9)). For the airline passenger and LGBB datasets, the acquisition score can be computed on the entire pool of unseen data, and the optimization is then solved by selecting the point with the largest score. For function problems, at each step $t$, we randomly sample 5000 input points, select the maximizer of the acquisition function among them, make the query and proceed to the next iteration.
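The candidate-based acquisition step of the baseline can be sketched with plain NumPy. This is a simplified illustration under assumptions: the GP hyperparameters are passed in as fixed values (in the experiments they are re-optimized with Type II maximum likelihood), and all function names are hypothetical. For a Gaussian predictive distribution the entropy depends only on the predictive variance, so the observed outputs are not needed for the selection.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale, variance):
    """RBF kernel matrix between the rows of a and the rows of b."""
    diff = (a[:, None, :] - b[None, :, :]) / lengthscale
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))


def predictive_entropy(candidates, x_train, lengthscale, variance, noise_var):
    """Entropy of the Gaussian predictive distribution of y at each candidate."""
    k_train = rbf_kernel(x_train, x_train, lengthscale, variance)
    k_train += noise_var * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, candidates, lengthscale, variance)
    chol = np.linalg.cholesky(k_train)
    solved = np.linalg.solve(chol, k_cross)     # shape (n_train, n_candidates)
    pred_var = variance - np.sum(solved ** 2, axis=0) + noise_var
    return 0.5 * np.log(2.0 * np.pi * np.e * pred_var)


def select_query(x_train, lengthscale, variance, noise_var, rng, num_candidates=5000):
    """Sample random candidates in the unit box and return the entropy maximizer."""
    dim = x_train.shape[1]
    candidates = rng.uniform(0.0, 1.0, size=(num_candidates, dim))
    scores = predictive_entropy(candidates, x_train, lengthscale, variance, noise_var)
    return candidates[np.argmax(scores)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_train = rng.uniform(0.0, 1.0, size=(5, 2))
    print(select_query(x_train, lengthscale=0.2, variance=1.0, noise_var=0.01, rng=rng))
```

The Cholesky factorization of the training kernel matrix is the cubic-cost step that has to be repeated after every query, which is the overhead an amortized policy avoids at deployment.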

E. Ablation study

Trained policy selection: For each training pipeline, we train with five different seeds. The optimizer is RAdam [36], and we try a few different initial learning rates (lrs). We set an lr scheduler to discount the lr by 2% every 50 training steps. With the DAD objective, $\mathcal{H}$ and $\mathcal{H}_2$, we train for 400 × 50 = 20000 steps, and with $\mathcal{I}$ and $\mathcal{I}_2$, we train for 200 × 50 = 10000 steps.
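The effect of the step-wise discount can be worked out directly. The sketch below assumes a multiplicative decay of 0.98 applied once every 50 steps; the initial learning rate in the usage example is a hypothetical value, since the text does not state the rates that were tried.

```python
def learning_rate(step, initial_lr, decay=0.98, interval=50):
    """Learning rate after `step` training steps with a step-wise discount."""
    return initial_lr * decay ** (step // interval)


if __name__ == "__main__":
    initial_lr = 1e-3  # hypothetical value for illustration only
    for step in (0, 50, 10000, 20000):
        print(step, learning_rate(step, initial_lr))
```

After 20000 steps the discount has been applied 400 times, and after 10000 steps 200 times, so the final learning rate is the initial one multiplied by 0.98 raised to that power.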

The training results of the one-dimensional policies (our implementation predefines the input dimension) are shown in Figure 3 and those of the two-dimensional policies in Figure 4. With our main nonmyopic training, the training loss appears to be a good indicator of AL deployment performance: policies with the lowest loss (i.e. the minimized negative objective) seem to perform best on the test problems. With myopic training, each trained policy seems to perform well on certain problems but badly on others. In our main paper, for each training objective, we present the policy with the best mean loss over the last ten epochs.
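The selection rule can be sketched as follows, assuming that "best" means the lowest mean loss and that a per-epoch loss history is available for every trained policy. The policy names and loss values are hypothetical.

```python
def select_policy(loss_histories, window=10):
    """Return the policy with the lowest mean loss over the last `window` epochs."""
    scores = {}
    for name, losses in loss_histories.items():
        tail = losses[-window:]
        scores[name] = sum(tail) / len(tail)
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical per-epoch training losses for two seeds.
    histories = {
        "seed_0": [1.0] * 5 + [0.5] * 10,
        "seed_1": [0.1] * 5 + [0.6] * 10,
    }
    print(select_policy(histories))
```

Averaging over the final epochs rather than taking the last value alone makes the choice less sensitive to noise in individual loss estimates.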

[1] Settles, Active learning literature survey, University of Wisconsin-Madison (2010).

[2] Kumar, Gupta, Active learning query strategies for classification, regression, and clustering: A survey, Journal of Computer Science and Technology (2020).

[3] Tharwat, Schenck, A survey on active learning: State-of-the-art, practical challenges and research directions, Mathematics (2023).

[4] Gal, Islam, Ghahramani, Deep Bayesian active learning with image data, International Conference on Machine Learning (2017).

[5] Kirsch, J. van Amersfoort, Gal, Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning, Advances in Neural Information Processing Systems (2019).

[6] Lederer, A. J. O. Conejo, K. A. Maier, Xiao, Umlauft, Hirche, Gaussian process-based real-time learning for safety critical applications, 2021.

[7] Baram, El-Yaniv, Luz, Online choice of active learning algorithms, Journal of Machine Learning Research (2004).

[8] Konyushkova, Sznitman, Fua, Learning active learning from data, Advances in Neural Information Processing Systems (2017).

[9] Swersky, Rubanova, Dohan, Murphy, Amortized bayesian optimization over discrete spaces, Conference on Uncertainty in Artificial Intelligence (2020).

[10] Rasmussen, Williams, Gaussian processes for machine learning, MIT Press (2006).

[11] Guestrin, Krause, A. P. Singh, Near-optimal sensor placements in gaussian processes, International Conference on Machine Learning (2005).

[12] Krause, Guestrin, Nonmyopic active learning of gaussian processes: An exploration-exploitation approach, International Conference on Machine Learning (2007).

[13] Krause, Singh, Guestrin, Near-optimal sensor placements in gaussian processes: Theory, efficient algorithms and empirical studies, Journal of Machine Learning Research (2008).

[14] Houlsby, Huszar, Ghahramani, Lengyel, Bayesian active learning for classification and preference learning, Computing Research Repository (2011).

[15] S. J. Gershman, N. D. Goodman, Amortized inference in probabilistic reasoning, Annual Meeting of the Cognitive Science Society (2014).

[16] Zimmer, Meister, D. Nguyen-Tuong, Safe active learning for time-series modeling with gaussian processes, Advances in Neural Information Processing Systems (2018).

[17] Garnett, Osborne, Hennig, Active learning of linear embeddings for gaussian processes, Conference on Uncertainty in Artificial Intelligence (2014).

[18] Schreiter, Nguyen-Tuong, Eberts, Bischoff, Markert, Toussaint, Safe exploration for active learning with gaussian processes, Machine Learning and Knowledge Discovery in Databases (2015).

[19] Yue, Wen, J. H. Hunt, Shi, Active learning for gaussian process considering uncertainties with application to shape control of composite fuselage, IEEE Transactions on Automation Science and Engineering (2021).

[20] C.-Y. Li, B. Rakitsch, C. Zimmer, Safe active learning for multi-output gaussian processes, International Conference on Artificial Intelligence and Statistics (2022).

[21] Bitzer, Meister, Zimmer, Hierarchical-hyperplane kernels for actively learning gaussian process models of nonstationary systems, 2023.

[22] Seo, Wallat, Graepel, Obermayer, Gaussian process regression: active data selection and test point rejection, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium 3 (2000) 241-246 vol. 3. URL: https://api.semanticscholar.org/CorpusID:18551791.

[23] Brochu, V. M. Cora, N. de Freitas, A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, arXiv (2010).

[24] Rothfuss, Fortuin, Josifoski, Krause, Pacoh: Bayes-optimal meta-learning with PAC-guarantees, International Conference on Machine Learning (2021).

[25] Bitzer, Meister, Zimmer, Amortized inference for gaussian process hyperparameters of structured kernels, Conference on Uncertainty in Artificial Intelligence (2023).

[26] Andrychowicz, Denil, Gómez, M. W. Hoffman, Pfau, Schaul, Shillingford, N. de Freitas, Learning to learn by gradient descent by gradient descent, Advances in Neural Information Processing Systems (2016).

[27] Chen, M. W. Hoffman, S. G. Colmenarejo, Denil, T. P. Lillicrap, Botvinick, N. de Freitas, Learning to learn without gradient descent by gradient descent, International Conference on Machine Learning (2017).

[28] Foster, D. R. Ivanova, I. Malik, T. Rainforth, Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design, International Conference on Machine Learning (2021).

[29] D. R. Ivanova, Foster, Kleinegesse, M. U. Gutmann, T. Rainforth, Implicit Deep Adaptive Design: Policy-Based Experimental Design without Likelihoods (2021).

[30] Rahimi, Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems (2007).

[31] J. T. Wilson, Borovitskiy, Terenin, Mostowsky, M. P. Deisenroth, Efficiently sampling functions from gaussian process posteriors, International Conference on Machine Learning (2020).

[32] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017).

[33] Simionescu, Computer-aided graphing and simulation tools for autocad users, Computer-Aided Graphing and Simulation Tools for AutoCAD Users (2014).

[34] Townsend, Constrained optimization in chebfun, chebfun.org (2017).

[35] S. E. Rogers, M. J. Aftosmis, S. A. Pandya, N. M. Chaderjian, E. T. T., J. U. Ahmad, Automated cfd parameter studies on distributed parallel computers, AIAA Computational Fluid Dynamics Conference (2003).

[36] Liu, Jiang, He, Chen, Liu, Gao, J. Han, On the variance of the adaptive learning rate and beyond, International Conference on Learning Representations (2020).