Amortized Active Learning for Nonparametric Functions

Cen-You Li¹·², Marc Toussaint¹, Barbara Rakitsch²·† and Christoph Zimmer²·†

¹ Technical University of Berlin, Germany
² Bosch Center for Artificial Intelligence, Germany

IAL@ECML-PKDD'24: 8th Intl. Worksh. & Tutorial on Interactive Adaptive Learning, Sep. 9th, 2024, Vilnius, Lithuania
* Corresponding author. † Equal contribution.
cen-you.li@campus.tu-berlin.de (C. Li); toussaint@tu-berlin.de (M. Toussaint); barbara.rakitsch@de.bosch.com (B. Rakitsch); christoph.zimmer@de.bosch.com (C. Zimmer)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract

Active learning (AL) is a sequential learning scheme aiming to select the most informative data. AL reduces data consumption and avoids the cost of labeling large amounts of data. However, AL trains the model and solves an acquisition optimization for each selection. This becomes expensive when the model training or the acquisition optimization is challenging. In this paper, we focus on active nonparametric function learning, where the gold-standard Gaussian process (GP) approaches suffer from cubic time complexity. We propose an amortized AL method in which new data are suggested by a neural network that is trained up-front without any real data (Figure 1). Our method avoids repeated model training and requires no acquisition optimization during the AL deployment. We (i) utilize GPs as function priors to construct an AL simulator, (ii) train an AL policy that zero-shot generalizes from simulation to real learning problems of nonparametric functions, and (iii) achieve real-time data selection with learning performance comparable to time-consuming baseline methods.

1. Introduction

Active learning (AL) is a sequential learning scheme aiming to reduce the effort and cost of labeling data [1–3]. The goal is to maximize the information given by each data point, so that the quantity can be reduced. An AL method starts with a small amount of labeled data. The model is first trained on the labeled data, and the trained model is then used to evaluate acquisition scores for the unlabeled data. The acquisition function measures the expected knowledge gained from labeling a data point. Labels are then requested for the data points with the highest acquisition scores, and the labeled dataset is updated for the next AL iteration. AL can be run for several iterations until the budget is exhausted or until a training goal is achieved. To perform AL, however, one faces multiple challenges: (i) training models for every query can be nontrivial, especially when the learning time is constrained [4–6]; (ii) acquisition criteria need to be selected a priori, but none of them clearly outperforms the others in all cases, which makes the selection difficult [7, 8]; (iii) optimizing an acquisition function can be difficult (e.g. over a sophisticated discrete search space [9]).

In this paper, we propose an AL method that suggests new data points for labeling based on a neural network (NN) evaluation instead of the costly model training and acquisition function optimization (Figure 1). To this end, we decouple model training and acquisition function optimization from the AL loop. This is beneficial when we face the aforementioned challenges (i) and (iii), i.e. scenarios where either the querying time (training time plus optimization time) is precious [4–6] or it is difficult to optimize an acquisition function [9]. In these settings, making a high-quality data selection is too expensive, such that one would rather accept a faster and easier active learner, even with a potential tradeoff of slightly worse acquisition quality. Notably, as AL tackles a data scarcity problem, such a NN policy function should be obtained with no additional real data. We further focus our problem on actively learning regression tasks.
The idea is to (i) generate a rich distribution of functions, (ii) simulate AL experiments on those functions, (iii) train the NN policy in simulation, and then (iv) zero-shot generalize to real AL problems. For low-data learning problems (up to thousands of data points), Gaussian processes (GPs, [10]) are a powerful model family that naturally fits our approach. A GP is a distribution over nonparametric functions that, if used as the model in an AL loop (Figure 1, left), provides well-calibrated predictive distributions suitable for acquisition functions [11–14]. This paper utilizes GPs to sample functions and simulate AL of regression problems (Figure 1, right). In other words, we perform amortized inference [15] of an active learner from GP simulations. Please note the difference between the model and the NN policy: in this paper, the model always refers to the model one wishes to actively learn on a specific task, while the NN policy proposes AL queries, and the queries are then used to fit the model.

Figure 1: The conventional AL of a GP regression relies on computationally expensive (orange) GP fitting and acquisition optimization. Our amortized AL approach meta-trains a NN active learner up-front, purely from synthetic data, allowing a fast, easy and real-time applicable (green) AL deployment.

Contributions: We summarize our contributions:
• we formulate a training pipeline for an active nonparametric function learning policy which requires no real data;
• we propose differentiable AL objectives in closed form for the training;
• we demonstrate empirical analysis on common benchmark problems.

Algorithm 1: Classical AL
Require: 𝒟_0 ⊆ 𝒳 × 𝒴, acquisition function a
1: for t = 1, ..., T do
2:   Model ℳ_{t−1} with 𝒟_{t−1}
3:   x_t = argmax_{x ∈ 𝒳} a(x | ℳ_{t−1}, 𝒟_{t−1})
4:   Evaluate y_t at x_t
5:   𝒟_t ← 𝒟_{t−1} ∪ {x_t, y_t}
6: end for

Related works: AL [1–3] is prominent in various applications such as image classification [4, 5] or physical system modeling [16]. In regression tasks, GPs demonstrate great advantages in AL acquisitions [16–21]. An acquisition function plays a major role in AL methods (Algorithm 1). Entropy, which selects the most uncertain points in the space, is a popular acquisition function due to its effectiveness and computational simplicity [22]. Mutual information is another well-known option. A mutual information criterion can focus on the information gain in the space [11, 13] or take model improvement into account [14], and it is often considered superior to entropy. However, depending on the setting, mutual information is often intractable and creates computational overhead. A closely related field is Bayesian optimization (BO, [23]), which aims to find the global optima of functions with limited evaluations. The same algorithm (Algorithm 1) can be applied to BO problems by exchanging the acquisition function. BO likewise suffers from repeated model training and acquisition optimization.
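To make the cost structure of Figure 1 (left) concrete, the following is a minimal sketch of Algorithm 1 for GP regression with the entropy acquisition, written with scikit-learn. Because the GP predictive distribution is Gaussian, maximizing the predictive entropy reduces to picking the candidate with the largest predictive standard deviation. The toy target, the candidate count and the kernel initialization are illustrative assumptions, not settings from this paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def label(x):
    # noisy 1D toy target standing in for the expensive labeling process
    return np.sin(20 * x[:, 0]) + 0.1 * np.random.randn(x.shape[0])

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(1, 1))      # D_0: one initial labeled point
Y = label(X)
for t in range(10):                         # T = 10 AL iterations
    # refit the GP at every iteration: the costly part of the conventional loop
    gp = GaussianProcessRegressor(kernel=RBF(0.2) + WhiteKernel(0.01), normalize_y=True)
    gp.fit(X, Y)
    candidates = rng.uniform(0.0, 1.0, size=(5000, 1))
    _, std = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(std)]     # entropy acquisition = most uncertain candidate
    X = np.vstack([X, x_next[None, :]])
    Y = np.concatenate([Y, label(x_next[None, :])])
```

Every one of the T iterations pays for a model fit (cubic in the number of observations for a GP) plus an acquisition sweep; the amortized policy proposed in this paper removes both from the deployment loop.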
Recently, meta learning and amortized inference have been explored to tackle challenges of sequential learning methods. Konyushkova et al. proposed to meta-learn an acquisition function for AL, avoiding a priori selection [8]. Given an acquisition function, Swersky et al. proposed to do amortized inference on the acquisition optimization [9]. On GP learning problems, Rothfuss et al. proposed to meta-learn GP hyperparameters [24], while Bitzer et al. performed amortized inference to select GP kernels and hyperparameters [25]; both simplify the model fitting, which is a bottleneck in real-time applications [6]. To the best of our knowledge, very few works automate the entire data selection process, i.e. decouple model updates and automate acquisition evaluations and optimizations. In [26, 27], the authors proposed RNN optimizers which query points by simple forwarding. Foster et al. proposed the deep adaptive design (DAD), an amortized Bayesian experimental design method, which likewise reduces sequential data selection to simple NN forwarding [28]. While DAD provides an AL deployment procedure like the one we aim for, it collects data to learn parametric models, and its data selection criterion does not necessarily fit nonparametric functions. Ivanova et al. further extended DAD to learn intractable models, which is however a different direction from our goal [29]. None of the literature we found considers amortized inference of active nonparametric function learning. Interestingly, Krause et al. discussed theoretical perspectives of an a priori acquisition policy for active GP learning [12]. This provides key insight into our AL simulation. We take inspiration from [12, 27, 28] to develop our amortized AL method.

2. Problem Statement

We are interested in a regression task of an unknown function f : 𝒳 → R, where 𝒳 ⊆ R^D is the input space. We assume 𝒳 is a bounded space, which usually holds true in reality, as one normally focuses only on a domain of interest. The observations we access are always noisy. That is, a labeled data point comprises an input x ∈ 𝒳 and its corresponding output observation y(x) = f(x) + ε, where f(x) is a functional value and ε is an unknown noise value. For brevity, we write y := y(x) and y_subscript := y(x_subscript). For clarity later, we let 𝒴 ⊆ R denote the output space, i.e. y ∈ 𝒴, 𝒟 ⊆ 𝒳 × 𝒴 denote a dataset, and space(𝒳 × 𝒴) := {𝒟 | 𝒟 ⊆ 𝒳 × 𝒴} denote the space of datasets.

Algorithm 2: AL with NN Policy
Require: 𝒟_0 ⊆ 𝒳 × 𝒴, AL policy φ
1: for t = 1, ..., T do
2:   x_t = φ(𝒟_{t−1})
3:   Evaluate y_t at x_t
4:   𝒟_t ← 𝒟_{t−1} ∪ {x_t, y_t}
5: end for
6: Model f with 𝒟_T

We follow an AL setting: a small labeled dataset 𝒟_0 := {x_init,i, y_init,i}_{i=1}^{N_init} is given, and we have a budget to label T more data points, denoted by (x_1, y_1), ..., (x_T, y_T). The high-level goal is to conduct AL to select informative x_1, ..., x_T such that 𝒟_T = 𝒟_0 ∪ {x_1, y_1, ..., x_T, y_T} helps us construct a good model of f. In a conventional AL method (Figure 1, left and Algorithm 1), the data are selected iteratively by optimizing the acquisition criterion. In this paper, we instead aim to have a policy function φ : space(𝒳 × 𝒴) → 𝒳 up-front, which sees the current observations and directly provides the next query proposal (Figure 1, right and Algorithm 2). We assume no additional real data are available for the policy training. Nevertheless, we make the assumptions that f has a GP prior and that our observation data are normalized to zero mean and unit variance.
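As a concrete illustration of Algorithm 2, the sketch below runs the deployment loop with an already-trained policy. Here `policy` and `label` are placeholders for the trained NN of Section 3 and the user's expensive labeling process; the names and the tensor layout are our assumptions, not fixed interfaces of this paper.

```python
import torch

def amortized_al(policy, label, X0, Y0, T):
    """Algorithm 2: no model is fitted and no acquisition is optimized inside the loop."""
    X, Y = X0.clone(), Y0.clone()                     # D_0: initial labeled set, shapes (N, D) and (N, 1)
    for t in range(T):
        with torch.no_grad():
            x_t = policy(torch.cat([X, Y], dim=-1))   # line 2: one NN forward pass proposes the query
        y_t = label(x_t)                              # line 3: evaluate the real system at x_t
        X = torch.cat([X, x_t.reshape(1, -1)])        # line 4: update the dataset
        Y = torch.cat([Y, y_t.reshape(1, 1)])
    return X, Y                                       # line 6: fit any regression model with D_T afterwards
```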
In the following, we will sometimes write X_init := (x_init,1, ..., x_init,N_init), Y_init := (y_init,1, ..., y_init,N_init), X_t := (x_init,1, ..., x_init,N_init, x_1, ..., x_t), and Y_t := (y_init,1, ..., y_init,N_init, y_1, ..., y_t), for t = 1, ..., T.

Assumptions: We assume f has a GP prior. A GP is a distribution over functions, characterized by the mean (E[f(x)]) and the kernel (the covariance of f(x) and f(x′) for two input points x, x′). Without loss of generality, one usually assumes the mean is a zero function, which holds true when the observation values are normalized. The kernel function is typically parameterized, and it encodes the amplitude and smoothness of the function f. We refer the readers to [10] for details. The assumption is formally described below.

Assumption 2.1. The unknown function f has a GP prior 𝒢𝒫(0, k_θ). Any observation at x is y = f(x) + ε, where ε ∼ 𝒩(0, σ²) is i.i.d. Gaussian noise [10]. Here, k_θ : 𝒳 × 𝒳 → R is a kernel function parameterized by θ. We further assume k_θ(x, x′) ≤ 1.

Bounding the kernel scale by one is not restrictive, as we assume the observations are normalized to unit variance. Due to the GP prior, any finite number of functional values are jointly Gaussian. The GP distributions are provided in closed form in Appendix A. We want to emphasize that the GP assumption is mainly for policy training. On a test function, failing this assumption (which we would not know a priori) may result in bad data selection, but our AL method can still be deployed, as the data selection is decoupled from GP modeling.

3. AL with a priori trained policy φ

Algorithm 3: Nonmyopic AL training
Require: prior 𝒢𝒫(0, k_θ), p(ε) = 𝒩(0, σ²), T
1: sample a batch of θ, σ²
2: sample a batch of f ∼ 𝒢𝒫(0, k_θ)
3: sample 𝒟_0 ⊆ 𝒳 × 𝒴, given f and p(ε)
4: for t = 1, ..., T do
5:   x_t = φ(𝒟_{t−1})
6:   sample ε_t ∼ p(ε), y_t = f(x_t) + ε_t
7:   𝒟_t ← 𝒟_{t−1} ∪ {x_t, y_t}
8: end for
9: if entropy loss then
10:   compute loss per Eq. (4)
11: else if regularized entropy loss then
12:   sample X_grid ⊆ 𝒳
13:   sample Y_grid = f(X_grid) + noise
14:   compute loss per Eq. (5)
15: end if
16: update φ

Our goal here is to train a policy φ to run Algorithm 2. Here we take key inspiration from [28, 29]. The idea is to exploit the GP prior (Assumption 2.1) before the AL experiments. We use the GP prior p(f(·)) and the Gaussian likelihood p(y | x, f) = 𝒩(y | f(x), σ²) to construct a simulator. This allows us to sample functions, simulate policy-based AL (Algorithm 2) and then meta-optimize an objective function which encodes the acquisition criterion (Algorithm 3). The key is to ensure that the policy experiences AL on diverse functions; during a real AL experiment, the policy then makes a zero-shot amortized inference from the simulation. Note that the training is performed by simulating active GP learning, while, in a real AL experiment, the policy only collects data, and we are not forced to do GP modeling with the collected data.
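A condensed sketch of one training step of Algorithm 3 is given below for a single sampled prior and function; the actual implementation batches over many hyperparameters and functions (Appendix D). `sample_hyperparameters`, `sample_rff_function`, `entropy_loss` and `regularized_entropy_loss` are hypothetical helper names standing in for the components introduced in the remainder of this section, and the optimizer is assumed to be built over the policy parameters.

```python
import torch

def nonmyopic_training_step(policy, optimizer, d, T, n_init=1, n_grid=100, use_grid=True):
    theta, sigma2 = sample_hyperparameters(d)                  # line 1: kernel parameters and noise variance
    f = sample_rff_function(theta, d)                          # line 2: decoupled sample of f ~ GP(0, k_theta)
    X = torch.rand(n_init, d)                                  # line 3: initial inputs, uniform on [0, 1]^D
    Y = f(X) + sigma2 ** 0.5 * torch.randn(n_init, 1)
    for t in range(T):                                         # lines 4-8: simulated AL cycle
        x_t = policy(torch.cat([X, Y], dim=-1)).reshape(1, d)  # line 5: policy proposes the next query
        y_t = f(x_t) + sigma2 ** 0.5 * torch.randn(1, 1)       # line 6: noisy label from the sampled function
        X, Y = torch.cat([X, x_t]), torch.cat([Y, y_t])        # line 7
    if not use_grid:
        loss = entropy_loss(X, Y, n_init, theta, sigma2)                              # Eq. (4), lines 9-10
    else:
        X_grid = torch.rand(n_grid, d)                                                # line 12
        Y_grid = f(X_grid) + sigma2 ** 0.5 * torch.randn(n_grid, 1)                   # line 13
        loss = regularized_entropy_loss(X, Y, n_init, theta, sigma2, X_grid, Y_grid)  # Eq. (5), line 14
    optimizer.zero_grad()
    loss.backward()                      # the gradient flows through f(x_t) into the policy
    optimizer.step()                     # line 16: update the policy parameters
    return loss.detach()
```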
Training objectives: We first discuss the training objectives, as they provide insight into what exact data we generate. Similar to [27, 28], the idea is to turn the acquisition criteria we would have optimized in a conventional AL setting into loss objectives for which the gradient with respect to the learner is available (Figure 1). Imagine we are doing AL with Algorithm 2 on synthetic functions. The first remark is that, in a simulation, functions are always sampled from a known GP prior, i.e. the parameters θ, σ² are known before we start the simulated AL procedure. Thus, given a sequence of queries provided by a learner, the joint GP distribution is available in closed form. Therefore, an intuitive approach is to apply common entropy or (approximated) mutual information criteria to the policy-selected points. We take the definitions from [12], where the authors discuss policy-based AL, which naturally applies to NN policies as well:

ℋ(φ) := E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T}) ] ,   (1)

ℐ(φ) := E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T}) + log p(y_{φ,1}, ..., y_{φ,T} | y(𝒳 ∖ X_φ)) ] ,   (2)

where f(·) and ε_{t=1,...,T} are GP and noise realizations, y_{φ,1}, ..., y_{φ,T} correspond to the policy-selected queries x_{φ,1}, ..., x_{φ,T}, and y(𝒳 ∖ X_φ) denotes the realization over the space 𝒳 ∖ {x_{φ,1}, ..., x_{φ,T}}. In [12], the input space 𝒳 is a discrete space with a finite number of elements, which makes y(𝒳 ∖ X_φ) a computable set of values. We describe y(𝒳 ∖ X_φ) in more detail later. Note that the stochasticity arises from the function sampling, while the AL policy handles each function deterministically.

Maximizing the entropy objective (Eq. (1)) favors a set of uncorrelated points and naturally encourages points at the border, which are the most scattered [11]. In our initial experiments, we noticed that this entropy objective needed more careful tuning, as it often overemphasized the boundary and neglected to explore the interior of the space. The mutual information criterion is known to tackle this problem, at least in conventional AL settings [11]; on the other hand, the objective ℐ(φ) in its original form conditions on y(𝒳 ∖ X_φ). This is not well defined when 𝒳 is a continuous space. Even if 𝒳 is discrete, conditioning on a large pool (a fine discretization) is computationally heavy, i.e. the GP cubic complexity 𝒪(|𝒳|³) (Appendix A). A discrete pool also enforces a classifier-like policy φ (selecting points from a pool), which prevents us from utilizing the existing NN structure developed by [28, 29]. We thus wish to modify ℐ(φ). Note that ℐ(φ) is a regularized entropy objective, and that ℋ(φ), although not always well performing, can already be used for training. Therefore, we propose a simple yet effective approach: compute the regularization term only on a sparse set of N_grid samples (X_grid, Y_grid) ∈ space(𝒳 × 𝒴), i.e.

ℐ(φ) ≈ E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T}) + log p(y_{φ,1}, ..., y_{φ,T} | Y_grid) ] .   (3)

N_grid should be much larger than T. Maximizing this objective encourages {x_{φ,1}, ..., x_{φ,T}} to track subsets of X_grid. To keep the policy from selecting only those sparse grid samples, which are not necessarily optimal points, we re-sample X_grid in each training step. The intuition behind this objective is two-fold: (i) it can be viewed as an entropy objective regularized by an additional search-space indicator, or (ii) it can be viewed as an imitation objective, because a subset of grid points, if it happens to have large joint entropy, maximizes the objective.
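For a single simulated function with known θ and σ², the objectives above are ordinary Gaussian log-densities and can be computed, and differentiated, directly from the kernel. The sketch below assumes an RBF kernel and matches the helper names assumed in the training-step sketch above: it evaluates the (conditioned) entropy term and the grid-regularized term, covering Eq. (1)/(3) when no initial data are present and the conditioned forms of Eqs. (4)/(5) introduced next when they are. The function names and signatures are our assumptions.

```python
import torch

def rbf_kernel(x1, x2, variance, lengthscale):
    # k_theta(x, x') = v * exp(-0.5 * sum_d ((x_d - x'_d) / l_d)^2)
    diff = (x1.unsqueeze(-2) - x2.unsqueeze(-3)) / lengthscale
    return variance * torch.exp(-0.5 * (diff ** 2).sum(-1))

def log_p(y_q, x_q, theta, sigma2, x_c=None, y_c=None):
    """log p(y_q | y_c) under the known GP prior; with x_c=None this is the marginal log p(y_q)."""
    variance, lengthscale = theta
    K_qq = rbf_kernel(x_q, x_q, variance, lengthscale) + sigma2 * torch.eye(x_q.shape[0])
    if x_c is None:
        mean, cov = torch.zeros(x_q.shape[0]), K_qq
    else:
        K_cc = rbf_kernel(x_c, x_c, variance, lengthscale) + sigma2 * torch.eye(x_c.shape[0])
        K_qc = rbf_kernel(x_q, x_c, variance, lengthscale)
        mean = (K_qc @ torch.linalg.solve(K_cc, y_c)).squeeze(-1)
        cov = K_qq - K_qc @ torch.linalg.solve(K_cc, K_qc.T)   # conditional covariance, cf. Eq. (7)
    return torch.distributions.MultivariateNormal(mean, cov).log_prob(y_q.squeeze(-1))

def entropy_loss(X, Y, n_init, theta, sigma2):
    # per-function term of Eq. (1)/(4): minimizing log p(y_query | Y_init) maximizes the entropy objective
    return log_p(Y[n_init:], X[n_init:], theta, sigma2, X[:n_init], Y[:n_init])

def regularized_entropy_loss(X, Y, n_init, theta, sigma2, X_grid, Y_grid):
    # per-function term of Eq. (3)/(5): log p(y_query | Y_init) - log p(y_query | Y_init, Y_grid)
    x_c = torch.cat([X[:n_init], X_grid])
    y_c = torch.cat([Y[:n_init], Y_grid])
    return (log_p(Y[n_init:], X[n_init:], theta, sigma2, X[:n_init], Y[:n_init])
            - log_p(Y[n_init:], X[n_init:], theta, sigma2, x_c, y_c))
```

These per-function terms are averaged over the sampled hyperparameters and functions to form the outer expectations of the objectives.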
The above losses consider a fixed set of GP hyperparameters, which encodes only certain function features. To generalize to diverse functions, we take the GP hyperparameters into account, and we note that a real AL experiment is initiated with initial data points. Our policy objectives become

ℋ(φ) = E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T}, Y_init) ]   (4)
      ∝ E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T} | Y_init) ] ,

ℐ(φ) ≈ E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T}, Y_init) + log p(y_{φ,1}, ..., y_{φ,T}, Y_init | Y_grid) ]   (5)
      ∝ E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T} | Y_init) + log p(y_{φ,1}, ..., y_{φ,T} | Y_init, Y_grid) ] .

The proportionality symbol here indicates equivalence, which holds by applying Bayes' rule and removing the part that is irrelevant to the policy gradient. In this paper, we sample X_grid, θ and σ² uniformly. Please see the appendix for numerical details. To summarize, we are given priors p(f) = 𝒢𝒫(0, k_θ) and p(ε) = 𝒩(0, σ²) with uniformly random hyperparameters θ and σ²; we then sample GP function and noise realizations, and the policy returns sequences of data by actively learning those functions. The data are plugged into the meta AL objectives (Eqs. (4) and (5)), where the gradient propagates from the queries backward into the policy. Data are generated exactly where Eqs. (4) and (5) require them, which allows one to easily sample thousands or millions of functions during training. In the next section, we zoom into the sampling of the policy-queried data.

Simulated AL: The objective functions provide insight into the simulation procedure: sample a GP function realization, sample initial data, perform AL cycles by forwarding through the policy, and maximize either the policy entropy (Eq. (4)) or the regularized policy entropy (i.e. the modified mutual information, Eq. (5)). The training procedure is summarized in Algorithm 3. One can see that lines 4-8 simulate AL cycles exactly as the policy will be deployed (Algorithm 2).

The only remaining challenge is to ensure that y_{φ,1}, ..., y_{φ,T} come from the same GP function. This is not trivial because the observations are sampled iteratively, i.e. ∀t = 1, ..., T, x_t = φ(𝒟_{t−1}), which means y_{φ,1}, ..., y_{φ,t−1} need to be sampled before x_t, ..., x_T are known. One way is to perform a standard GP posterior sampling y_t ∼ p(y(x_t) | 𝒟_{t−1}, k_θ, σ²) instead of lines 2 and 6 of Algorithm 3. However, this results in 𝒪(N_init³ + (N_init + 1)³ + ... + (N_init + T − 1)³) time complexity, i.e. the notorious GP cubic complexity (Appendix A). Sampling Y_grid (line 13 of Algorithm 3) would also take a tremendous amount of time. We address this issue by applying a decoupled function sampling technique [30, 31]. The idea is to sample Fourier features to approximate a GP function. As a result, an approximated function is a linear combination of cosine functions (line 2 of Algorithm 3), and we can later compute the function value at any point x ∈ 𝒳 in linear time (lines 6 and 13 of Algorithm 3). One limitation, however, is that the kernel k_θ needs to have a Fourier transform (e.g. stationary kernels; see Bochner's theorem in [10]).
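The following is a sketch of the decoupled function sampling used in lines 2, 6 and 13 of Algorithm 3, assuming an RBF kernel whose spectral density is Gaussian [30]; the weight, frequency and phase distributions follow the standard random-Fourier-feature construction, and the function name matches the hypothetical helper in the training-step sketch above.

```python
import torch

def sample_rff_function(theta, d, L=100):
    variance, lengthscale = theta                  # RBF kernel parameters v and l (l: scalar or (d,) tensor)
    a = torch.randn(L, d) / lengthscale            # frequencies a_l ~ N(0, diag(1/l^2)): the RBF spectral density
    b = 2 * torch.pi * torch.rand(L)               # phases b_l ~ Uniform(0, 2*pi)
    w = variance ** 0.5 * torch.randn(L)           # weights omega_l ~ N(0, v)

    def f(x):
        # f(x) = sum_l omega_l * sqrt(2/L) * cos(a_l^T x + b_l); O(L) per evaluation point
        feats = (2.0 / L) ** 0.5 * torch.cos(x @ a.T + b)
        return (feats * w).sum(-1, keepdim=True)   # shape (..., 1)
    return f
```

With this representation, y_t = f(x_t) + ε_t in line 6 stays differentiable with respect to x_t, so the policy gradient can flow through the sampled function; the window-mean correction of Appendix D can additionally be subtracted to center each sample on [0, 1]^D.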
Notice that this training procedure simulates a nonmyopic AL. That is, the T queries are optimized jointly but are not necessarily stepwise optimal. We additionally provide a myopic AL training algorithm, detailed in the appendix (Algorithm 4), which optimizes the stepwise data selection. The idea is simple: the size of the initial dataset is randomly sampled from {N_init, ..., N_init + T − 1}, the policy queries one point, and then we compute the same loss objectives with the altered sequential structure. A myopic policy is not expected to have better AL performance, but it avoids recursive NN inference during the training. This might be beneficial if we want to scale the training up to larger N_init or larger T.

NN structure: We take the NN described in [29]. Each data pair (x, y) is first mapped by an MLP (multilayer perceptron) to a 32-dimensional embedding, then two layers of transformer encoders ([32], without positional encoding) are applied to the sequence of data-pair embeddings, and finally the attended sequence is summed (to ensure permutation invariance with respect to the observed dataset) before being mapped by another MLP to a new data query. The details are described in [28, 29]. We only add another Tanh layer with rescaling constants to refine the decoder output (i.e. to map the output into 𝒳, which is bounded in our case). The query lies in the continuous space 𝒳, and this is how we train the policy. If an AL problem is considered over a discrete 𝒳, one simple approach is to select the point closest to the NN query (line 2 of Algorithm 2). A sketch of this architecture is given after the complexity summary below.

Complexity: The training complexities are given in Appendix C. The AL deployment complexities are as follows:
• amortized AL (Algorithm 2): the NN forwarding takes 𝒪((N_init + t − 1)²) at each t;
• conventional GP AL (Algorithm 1): at each t, the GP modeling takes 𝒪((N_init + t − 1)³) in time, while the complexity of the acquisition optimization depends on the exact AL problem.
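A sketch of the policy architecture described above is given below; it follows the attention-based design of [28, 29], but the exact layer widths, the number of attention heads and the head MLP are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class ALPolicy(nn.Module):
    def __init__(self, d, emb=32, nhead=4, nlayers=2):
        super().__init__()
        self.pair_embed = nn.Sequential(nn.Linear(d + 1, emb), nn.ReLU(), nn.Linear(emb, emb))
        enc_layer = nn.TransformerEncoderLayer(d_model=emb, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=nlayers)   # no positional encoding
        self.head = nn.Sequential(nn.Linear(emb, emb), nn.ReLU(), nn.Linear(emb, d))

    def forward(self, history):
        # history: (N, d+1) stacked (x, y) pairs for one dataset, or (batch, N, d+1) for a batch
        single = history.dim() == 2
        if single:
            history = history.unsqueeze(0)
        h = self.pair_embed(history)                  # per-pair embedding
        h = self.encoder(h)                           # self-attention over the observed set
        h = h.sum(dim=1)                              # sum pooling: permutation invariance w.r.t. the dataset
        out = (torch.tanh(self.head(h)) + 1) / 2      # next query, rescaled into X = [0, 1]^D
        return out.squeeze(0) if single else out
```

For example, `ALPolicy(d=2)(torch.cat([X, Y], dim=-1))` maps a stacked dataset of shape (N, 3) to a single 2D query; forwarding one dataset of N pairs costs 𝒪(N²) through the self-attention, matching the deployment complexity above.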
4. Experiments

In this section, we test our methods on a couple of benchmark tasks.

Table 1: Policy training time
loss function   | entropy (Eq. (4)) | regularized entropy (Eq. (5)) | DAD baseline
training steps  | 20k               | 10k                           | 20k
NN 1D, T = 10   | 100 to 120 mins   | around 70 mins                | around 70 mins
NN 2D, T = 20   | 230 to 240 mins   | around 160 mins               | around 130 mins
The batch size is described in Appendix D. The regularized entropy method splits the batch into two and computes the sub-batches in sequence. The training is performed on one NVIDIA V100 (32 GB GPU memory).

Figure 2: A policy is trained with Algorithm 3 together with the entropy or the regularized entropy objective. See Table 1 for the training time. A myopic policy is trained with Algorithm 4, where details are given in Appendix B. N_init = 1. For the 1D problems (Sin & airline passenger), T = 10, which means a total of 11 observations. For the 2D problems, T = 20, which means a total of 21 observations. For each benchmark problem, a star is marked if the RMSE of the method is significantly smaller than Random (Wilcoxon signed-rank test, p-value threshold 0.05). The time only takes the querying time into account. For the GP AL method, GP predictive distributions are necessary for the acquisition function, and thus the GP training time is part of the querying time (Figure 1 and Algorithm 1). Output labeling time is excluded.

4.1. NN training

We prepare the experiments by running Algorithm 3, which corresponds to the up-front preparation block in Figure 1. The entire Algorithm 3 is one NN training step. We implement the training pipeline with PyTorch. In the following experiments, we train one NN policy for the 1D benchmark tasks and one NN for the 2D tasks. The training time and the hardware are described in Table 1. The state dict (PyTorch model parameters) takes around 200 KB of disk space for both NNs. The training of each setting is repeated five times with different random seeds. Among the five training jobs of each NN, we select the NN with the best training loss for the following experiments. See Appendix E for details.

4.2. Benchmark tasks

We deploy AL over the following benchmark problems. Our NNs are trained with 𝒳 = [0, 1]^D.

Sin function (1D): This is a one-dimensional problem: x ∈ [0, 1], f(x) = sin(20x). In the experiments, we sample Gaussian noise ε ∼ 𝒩(0, 0.1²).

Branin function (2D): This function is defined over (x1, x2) ∈ [−5, 10] × [0, 15], which requires a rescaling mapping ∀x ∈ 𝒳, (x1, x2) = (−5 + 15[x]_1, 15[x]_2). The function is f_{a,b,c,r,s,t}((x1, x2)) = a(x2 − b·x1² + c·x1 − r)² + s(1 − t)·cos(x1) + s, where (a, b, c, r, s, t) = (1, 5.1/(4π²), 5/π, 6, 10, 1/(8π)) are constants. We sample noise-free data points and use the samples to normalize our output: f_{a,b,c,r,s,t}((x1, x2))_normalized = (f_{a,b,c,r,s,t}((x1, x2)) − mean(f_{a,b,c,r,s,t})) / std(f_{a,b,c,r,s,t}). In the experiments, the noise is ε ∼ 𝒩(0, 0.1²).

Unconstrained Simionescu function (2D): This is originally a constrained problem [33] defined over (x1, x2) ∈ [−1.25, 1.25]² (which again requires a rescaling mapping 𝒳 → [−1.25, 1.25]²). We remove the constraint, resulting in f(x1, x2) = 0.1·x1·x2. As for the Branin function, we sample noise-free data points and use the samples to normalize our output. In the experiments, the noise is ε ∼ 𝒩(0, 0.1²).

Unconstrained Townsend function (2D): This is originally a constrained problem [34] (https://www.chebfun.org/examples/opt/ConstrainedOptimization.html) defined over (x1, x2) ∈ [−2.25, 2.25] × [−2.5, 1.75] (a rescaling mapping from 𝒳 is required). We remove the constraint, resulting in f(x1, x2) = −[cos((x1 − 0.1)·x2)]² − x1·sin(3·x1 + x2). As for the Branin function, we sample noise-free data points and use the samples to normalize our output. In the experiments, the noise is ε ∼ 𝒩(0, 0.1²).

Airline passenger dataset (1D): This is a publicly available time series dataset (https://github.com/jbrownlee/Datasets/blob/master/airline-passengers.csv). Each data point has a date input (year and month) and a number of passengers as output. We convert the input into a real number as year + (month − 1)/12, and then rescale the entire input space to [0, 1] (the earliest date becomes 0 while the latest becomes 1). The output data are again normalized to zero mean and unit variance.

Langley Glide-Back Booster (LGBB) dataset (2D): This is a two-dimensional dataset described in [35] (https://bobby.gramacy.com/surrogates/lgbb.tar.gz, lgbb_original.txt). The dataset has multiple outputs, and we take the "lift" output to run our experiments (after normalizing it to zero mean and unit variance). The inputs are x1 (mach) and x2 (alpha), which are normalized by x1 = mach/6, x2 = (alpha + 5)/35. After doing this, the input space is [0, 1]².

4.3. AL deployment

We compare our methods with (i) standard GP AL (Algorithm 1) with the entropy acquisition (Appendix D), (ii) a random selection criterion, and (iii) DAD, i.e. the amortized Bayesian experimental design proposed by [28]. In this section, we report the modeling performance and the AL deployment time. Since the high-level goal is to model a regression task, we use the collected datasets to train models and evaluate the RMSE as the modeling performance. Although DAD and our amortized AL methods are not restricted to GP modeling, we still evaluate the data on GP models, as GPs are powerful modeling tools for this amount of data and as this is a fair comparison to baseline (i). We run experiments over the aforementioned benchmark problems. Our NN policy returns points in the continuous space 𝒳 ⊆ R^D. On the benchmark functions, a query is taken as it is (line 2 of Algorithm 2), while on the testing datasets (airline passenger and LGBB), we take the nearest point in L2-norm from the pool.
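The two small deployment conventions used here, rescaling a unit-cube query to a problem's native domain and snapping a continuous query to the nearest pool point, can be sketched as follows; the Branin bounds are taken from the benchmark description above, while the function names are ours.

```python
import torch

def to_native_domain(x_unit, lower, upper):
    # map a policy query x in [0, 1]^D onto the problem's native box [lower, upper]
    return lower + (upper - lower) * x_unit

def snap_to_pool(x_query, pool_X):
    # pooled datasets (airline passenger, LGBB): take the unlabeled point nearest in L2-norm
    idx = torch.cdist(x_query.reshape(1, -1), pool_X).argmin()
    return idx, pool_X[idx]

# example: the Branin domain [-5, 10] x [0, 15]
lower, upper = torch.tensor([-5.0, 0.0]), torch.tensor([10.0, 15.0])
x_branin = to_native_domain(torch.tensor([0.3, 0.7]), lower, upper)   # -> tensor([-0.5, 10.5])
```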
Notice that the single pre-trained 1D NN policy is used for all the 1D tasks and the 2D NN policy for all the 2D tasks. For each method, we repeat the AL experiments (Algorithms 1 and 2) five times and report the mean and standard error. Each experiment is executed with an individual seed. Note that the initial datasets (and the noise of the function problems) are randomly sampled, where the seed plays a role.

The results are shown in Figure 2. The RMSEs are evaluated after the AL deployments. For example, for the 1D problems (Sin & airline passenger dataset), we start with 1 initial point and query for 10 iterations, resulting in 11 data points in the end. The RMSEs are then evaluated with GPs trained on these 11 data points. The query time is the data selection time of all iterations. We can see that, on all the presented benchmark problems except for the Sin function, the data selected by our nonmyopic amortized AL approaches achieve modeling performances as good as conventional GP AL, while the querying time is significantly shorter. Some of the RMSE out-performance of our nonmyopic approaches (and of the GP AL baseline) over Random is statistically significant (Wilcoxon signed-rank test, p-value smaller than 0.05). With the myopic training scheme, the policy can perform well on some tasks, such as LGBB, but badly on others. In Appendix E, we present a few more trained policies that are good at different tasks. The DAD baseline sometimes performs well on 1D problems but not on any 2D problems.

In general, we consider this result a huge success. The tens-of-milliseconds decision-making time per query allows the amortized AL method to be applied to systems where output responses are given at a few dozen Hz. In such systems, it is obviously expensive to wait for GP modeling and entropy optimization.

Acknowledgments

This work was supported by Bosch Center for Artificial Intelligence, which provided financial support, computers and GPU clusters. The Bosch Group is carbon neutral. Administration, manufacturing and research activities no longer leave a carbon footprint. This also includes the GPU clusters on which the experiments have been performed.

References

[1] B. Settles, Active learning literature survey, University of Wisconsin-Madison (2010).
[2] P. Kumar, A. Gupta, Active learning query strategies for classification, regression, and clustering: A survey, Journal of Computer Science and Technology (2020).
[3] A. Tharwat, W. Schenck, A survey on active learning: State-of-the-art, practical challenges and research directions, Mathematics (2023).
[4] Y. Gal, R. Islam, Z. Ghahramani, Deep Bayesian active learning with image data, International Conference on Machine Learning (2017).
[5] A. Kirsch, J. van Amersfoort, Y. Gal, BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning, Advances in Neural Information Processing Systems (2019).
[6] A. Lederer, A. J. O. Conejo, K. A. Maier, W. Xiao, J. Umlauft, S. Hirche, Gaussian process-based real-time learning for safety critical applications, 2021.
[7] Y. Baram, R. El-Yaniv, K. Luz, Online choice of active learning algorithms, Journal of Machine Learning Research (2004).
[8] K. Konyushkova, R. Sznitman, P. Fua, Learning active learning from data, Advances in Neural Information Processing Systems (2017).
[9] K. Swersky, Y. Rubanova, D. Dohan, K. Murphy, Amortized Bayesian optimization over discrete spaces, Conference on Uncertainty in Artificial Intelligence (2020).
[10] C. Rasmussen, C. Williams, Gaussian processes for machine learning, MIT Press (2006).
[11] C. Guestrin, A. Krause, A. P. Singh, Near-optimal sensor placements in Gaussian processes, International Conference on Machine Learning (2005).
[12] A. Krause, C. Guestrin, Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach, International Conference on Machine Learning (2007).
[13] A. Krause, A. Singh, C. Guestrin, Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies, Journal of Machine Learning Research (2008).
[14] N. Houlsby, F. Huszar, Z. Ghahramani, M. Lengyel, Bayesian active learning for classification and preference learning, Computing Research Repository (2011).
[15] S. J. Gershman, N. D. Goodman, Amortized inference in probabilistic reasoning, Annual Meeting of the Cognitive Science Society (2014).
[16] C. Zimmer, M. Meister, D. Nguyen-Tuong, Safe active learning for time-series modeling with Gaussian processes, Advances in Neural Information Processing Systems (2018).
[17] R. Garnett, M. Osborne, P. Hennig, Active learning of linear embeddings for Gaussian processes, Conference on Uncertainty in Artificial Intelligence (2014).
[18] J. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, M. Toussaint, Safe exploration for active learning with Gaussian processes, Machine Learning and Knowledge Discovery in Databases (2015).
[19] X. Yue, Y. Wen, J. H. Hunt, J. Shi, Active learning for Gaussian process considering uncertainties with application to shape control of composite fuselage, IEEE Transactions on Automation Science and Engineering (2021).
[20] C.-Y. Li, B. Rakitsch, C. Zimmer, Safe active learning for multi-output Gaussian processes, International Conference on Artificial Intelligence and Statistics (2022).
[21] M. Bitzer, M. Meister, C. Zimmer, Hierarchical-hyperplane kernels for actively learning Gaussian process models of nonstationary systems, 2023.
[22] S. Seo, M. Wallat, T. Graepel, K. Obermayer, Gaussian process regression: active data selection and test point rejection, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3 (2000) 241–246. URL: https://api.semanticscholar.org/CorpusID:18551791.
[23] E. Brochu, V. M. Cora, N. de Freitas, A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, arXiv (2010).
[24] J. Rothfuss, V. Fortuin, M. Josifoski, A. Krause, PACOH: Bayes-optimal meta-learning with PAC-guarantees, International Conference on Machine Learning (2021).
[25] M. Bitzer, M. Meister, C. Zimmer, Amortized inference for Gaussian process hyperparameters of structured kernels, Conference on Uncertainty in Artificial Intelligence (2023).
[26] M. Andrychowicz, M. Denil, S. Gómez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, N. de Freitas, Learning to learn by gradient descent by gradient descent, Advances in Neural Information Processing Systems (2016).
[27] Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, N. de Freitas, Learning to learn without gradient descent by gradient descent, International Conference on Machine Learning (2017).
[28] A. Foster, D. R. Ivanova, I. Malik, T. Rainforth, Deep Adaptive Design: Amortizing sequential Bayesian experimental design, International Conference on Machine Learning (2021).
[29] D. R. Ivanova, A. Foster, S. Kleinegesse, M. U. Gutmann, T. Rainforth, Implicit Deep Adaptive Design: Policy-based experimental design without likelihoods (2021).
[30] A. Rahimi, B. Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems (2007).
[31] J. T. Wilson, V. Borovitskiy, A. Terenin, P. Mostowsky, M. P. Deisenroth, Efficiently sampling functions from Gaussian process posteriors, International Conference on Machine Learning (2020).
[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[33] P. Simionescu, Computer-Aided Graphing and Simulation Tools for AutoCAD Users (2014).
[34] A. Townsend, Constrained optimization in Chebfun, chebfun.org (2017).
[35] S. E. Rogers, M. J. Aftosmis, S. A. Pandya, N. M. Chaderjian, E. T. T., J. U. Ahmad, Automated CFD parameter studies on distributed parallel computers, AIAA Computational Fluid Dynamics Conference (2003).
[36] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, J. Han, On the variance of the adaptive learning rate and beyond, International Conference on Learning Representations (2020).

Appendix

A. Gaussian process and entropy

We first write down the GP predictive distribution. Given a set of N_observe data points 𝒟 = {X, Y} ⊆ 𝒳 × 𝒴, we wish to make inference at points X_test = {x_test,1, ..., x_test,n}. We write Y_test = (y(x_test,1), ..., y(x_test,n)) for brevity. The joint distribution of Y and the predictive Y_test is Gaussian:

p(Y, Y_test) = 𝒩(0, k_θ(X ∪ X_test, X ∪ X_test) + σ² I_{N_observe + n}) ,   (6)

where k_θ(X ∪ X_test, X ∪ X_test) is a Gram matrix with [k_θ(X ∪ X_test, X ∪ X_test)]_{i,j} = k_θ([X ∪ X_test]_i, [X ∪ X_test]_j). This leads to the following predictive distribution (or GP posterior distribution)

p(Y_test | 𝒟) = 𝒩(Y_test | μ_𝒟(X_test), cov_𝒟(X_test)) ,
μ_𝒟(X_test) = k_θ(X_test, X) [k_θ(X, X) + σ² I_{N_observe}]⁻¹ Y ,   (7)
cov_𝒟(X_test) = k_θ(X_test, X_test) + σ² I_n − k_θ(X_test, X) [k_θ(X, X) + σ² I_{N_observe}]⁻¹ k_θ(X, X_test) .

Elements of the predictive mean vector μ_𝒟(X_test) are the noise-free predictive function values. Note that the log probability density function is

log p(Y_test | 𝒟) = − 1/2 log((2π)^n det(cov_𝒟(X_test))) − 1/2 (Y_test − μ_𝒟(X_test))^T [cov_𝒟(X_test)]⁻¹ (Y_test − μ_𝒟(X_test)) ,   (8)

and, if we consider Y_test as a vector of n random variables, the entropy is

H(Y_test | 𝒟) = n/2 log(2πe) + 1/2 log det(cov_𝒟(X_test)) .   (9)

Inverting the N_observe × N_observe matrix [k_θ(X, X) + σ² I_{N_observe}] has 𝒪(N_observe³) time complexity. Computing the determinant also has cubic time complexity.
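A direct transcription of Eqs. (7) and (9) is given below, assuming an RBF kernel k_θ; variable names are ours, and a Cholesky-based implementation would be preferred in practice for numerical stability.

```python
import math
import torch

def rbf(x1, x2, v, l):
    d2 = (((x1.unsqueeze(-2) - x2.unsqueeze(-3)) / l) ** 2).sum(-1)
    return v * torch.exp(-0.5 * d2)

def gp_predict(X, Y, X_test, v, l, sigma2):
    # Eq. (7): predictive mean and covariance of the (noisy) outputs at X_test
    K = rbf(X, X, v, l) + sigma2 * torch.eye(X.shape[0])          # inverting this is the O(N^3) step
    K_star = rbf(X_test, X, v, l)
    mean = K_star @ torch.linalg.solve(K, Y)
    cov = (rbf(X_test, X_test, v, l) + sigma2 * torch.eye(X_test.shape[0])
           - K_star @ torch.linalg.solve(K, K_star.T))
    return mean, cov

def gaussian_entropy(cov):
    # Eq. (9): H(Y_test | D) = n/2 * log(2*pi*e) + 1/2 * log det(cov)
    n = cov.shape[0]
    return 0.5 * n * math.log(2 * math.pi * math.e) + 0.5 * torch.logdet(cov)
```

The GP AL baseline of Appendix D evaluates this entropy on single candidate points (n = 1), where it is monotone in the predictive variance.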
B. Additional losses and myopic training algorithm

In our main paper, we introduce

ℋ(φ) ∝ E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T} | Y_init) ] ,
ℐ(φ) ∝ E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ − log p(y_{φ,1}, ..., y_{φ,T} | Y_init) + log p(y_{φ,1}, ..., y_{φ,T} | Y_init, Y_grid) ] .

We additionally look into two more similar loss objectives. We treat the policy-selected points as random variables, compute the entropy directly, and then take the expectation over different priors and functions:

ℋ_version2(φ) = E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ H(y_{φ,1}, ..., y_{φ,T} | Y_init) ] ,   (10)
ℐ_version2(φ) ≈ E_{p(θ,σ²)} E_{p(f(·), ε_{t=1,...,T})} [ H(y_{φ,1}, ..., y_{φ,T} | Y_init) − H(y_{φ,1}, ..., y_{φ,T} | Y_init, Y_grid) ] .   (11)

Substituting Eqs. (8) and (9) into the losses, we see that the key difference is whether the observation values y_{φ,1}, ..., y_{φ,T} are taken into account. We suspect that having y_{φ,1}, ..., y_{φ,T} in the loss (main losses) may help the policy adapt during the AL deployment. Our ablation study below compares the losses.

Algorithm 4: Myopic AL training
Require: prior 𝒢𝒫(0, k_θ), p(ε) = 𝒩(0, σ²), T
1: sample θ, σ²
2: sample f ∼ 𝒢𝒫(0, k_θ)
3: sample t = 1, ..., T
4: sample 𝒟_{t−1} ⊆ 𝒳 × 𝒴
5: x_t = φ(𝒟_{t−1})
6: sample ε_t ∼ p(ε), y_t = f(x_t) + ε_t
7: 𝒟_t ← 𝒟_{t−1} ∪ {x_t, y_t}
8: if entropy loss then
9:   compute loss per Eq. (4)
10: else if regularized entropy loss then
11:   sample X_grid ⊆ 𝒳
12:   sample Y_grid = f(X_grid) + noise
13:   compute loss per Eq. (5)
14: end if
15: update φ

Myopic policy training: As described in Section 3, we propose another policy training method which does not require recursive NN forwarding. The idea is simple: as the policy is intended to run AL with N_init initial points for T iterations, we sample the size of the initial dataset from {N_init, ..., N_init + T − 1} during the training, and then we simulate one-step AL. This allows the policy to experience all sizes of datasets that it will be tackling during an AL deployment. The training procedure is shown in Algorithm 4. All the loss functions stay the same: we condition on an initial dataset with an altered size, consider a one-step query (compute as if T = 1), and propagate the gradient.

C. Training complexity

Overall, the training complexities are listed below.
• Computing the loss: 𝒪(N_init³ + T³) in time and 𝒪(N_init² + T²) in space for the entropy objective (Eq. (4)), where the N_init terms are the time and space cost of computing the GP predictive distribution (see Appendix A), while the T terms are those of computing the log probability likelihood;
• computing the loss: 𝒪((N_init + N_grid)³ + T³) in time and 𝒪((N_init + N_grid)² + T²) in space for our regularized entropy objective (Eq. (5));
• computing the loss: 𝒪(N_init³ + T³) in time and 𝒪(N_init² + T²) in space for the entropy version 2 objective (appendix Eq. (10));
• computing the loss: 𝒪((N_init + N_grid)³ + T³) in time and 𝒪((N_init + N_grid)² + T²) in space for our regularized entropy version 2 objective (appendix Eq. (11));
• NN forwarding: 𝒪(Σ_{t=1}^{T} (N_init + t − 1)²) = 𝒪((N_init + T − 1)³ − (N_init − 1)³) with the nonmyopic AL training (Algorithm 3), as self-attention has square complexity [32];
• NN forwarding: 𝒪((N_init + T − 1)²) with our myopic AL training (Algorithm 4) (this algorithm does not make recursive NN forwarding, but the performance is worse).

Note that we train only once to get a policy for various AL problems.

D. Numerical details

Policy training: In our current implementation, the data dimension and the input bound need to be predefined. We fix 𝒳 = [0, 1]^D and rescale all test problems to this region. The last layer of the NN policy is layer(x) = (tanh(x) + 1)/2, which ensures that the policy proposes points in [0, 1]^D. The remaining structure is described in [29].
In Algorithms 3 and 4, the kernel we use is an RBF kernel, which has D + 1 parameters: the variance v and the D-dimensional lengthscale vector l, i.e. θ = (v, l). We sample v ∼ Uniform(0.505, 1.0), set σ² = 1.01 − v (function and noise variances sum to 1.01), and sample l ∼ Uniform(0.05, 1.0). The sampling hyperparameters should be tuned according to the application; our setting only uses general assumptions. The variance parameters utilize the assumptions that (i) the data are normalized to unit variance and (ii) the signal-to-noise ratio is at least one. The lengthscale range is kept general, but one has to make sure that it is numerically stable (e.g. a lengthscale that is too small is bad, because each lengthscale component is a divisor in the kernel).

The GP functions are approximated by Fourier features, which means each function sample is a linear combination of cosine functions [30, 31]:

f(x) = Σ_{l=1}^{L} ω_l √(2/L) cos(a_l^T x + b_l) ,

where ω_l, b_l and a_l are sampled from the distributions described in [30]. A larger L leads to better approximations; we set L = 100. The analytical mean over the window [0, 1]^D is computed such that all functions can be shifted to zero mean (a GP function has zero mean and unit variance over the entire real space, but not necessarily within a specific bounded window). The analytical mean is the integral of f(x) divided by the volume of [0, 1]^D. The integral is

Σ_{l=1}^{L} ω_l √(2/L) (1/a_l) sin(a_l x + b_l) |_{x=0}^{1}   for D = 1,

Σ_{l=1}^{L} ω_l √(2/L) (−1/([a_l]_1 [a_l]_2)) cos([a_l]_1 x_1 + [a_l]_2 x_2 + b_l) |_{x_1=0}^{1} |_{x_2=0}^{1}   for D = 2,

and so on. It may happen that at least one component of a_l is zero or close to zero, which causes a problem in the division. In this case, we replace |1/∏_{d=1}^{D} [a_l]_d| by 100000. The error is negligible, i.e. much smaller than the noise level.
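A sketch of this window-mean correction for D = 1 and D = 2 is given below; w, a and b denote the Fourier-feature weights, frequencies and phases of the sampler sketched in Section 3, the names are ours, and the clamp plays the role of the 100000 cap on |1/∏_d [a_l]_d| described above.

```python
import torch

def window_mean(w, a, b, cap=1e5):
    # analytical mean of f(x) = sum_l w_l * sqrt(2/L) * cos(a_l^T x + b_l) over [0, 1]^D
    # (the volume of [0, 1]^D is 1, so the mean equals the integral)
    L, d = a.shape
    scale = (2.0 / L) ** 0.5
    if d == 1:
        a1 = a[:, 0]
        inv = (1.0 / a1).clamp(-cap, cap)
        integrals = inv * (torch.sin(a1 + b) - torch.sin(b))
    elif d == 2:
        a1, a2 = a[:, 0], a[:, 1]
        inv = (1.0 / (a1 * a2)).clamp(-cap, cap)
        integrals = -inv * (torch.cos(a1 + a2 + b) - torch.cos(a1 + b)
                            - torch.cos(a2 + b) + torch.cos(b))
    else:
        raise NotImplementedError("higher D follows the same iterated-integration pattern")
    return (w * scale * integrals).sum()
```

Subtracting window_mean(w, a, b) from every evaluation of the sampled function shifts it to (approximately) zero mean on the training window.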
The batch sizes are as follows: for nonmyopic training (Algorithm 3), we sample 25 kernels (25 sets of θ), 10 sets of noise realizations ε_{t=1,...,T} and 25 functions per prior, resulting in overall 6250 AL experiments per loss computation (an expectation over 6250 sequences of T queries); for myopic training (Algorithm 4), we sample 250 kernels, chunk them into 20 batches, each with its own size of initial dataset, and keep the remaining settings the same. The grid samples of the regularized entropy objective have N_grid = 100. Note that whenever we need to sample input points, e.g. X_init or X_grid, we sample uniformly from 𝒳 = [0, 1]^D.

Experiments: In our experiments, we always model with an RBF kernel. Given a dataset, the GP hyperparameters are optimized with Type-II maximum likelihood. In our GP AL baseline, the acquisition function is the predictive entropy H(y(x) | 𝒟_{t−1}) (see Algorithm 1 and Eq. (9)). For the airline passenger and LGBB datasets, the acquisition score can be computed on the entire pool of unseen data, and the optimization is then solved by selecting the point with the largest score. For the function problems, at each t, we randomly sample 5000 input points, optimize over these points, make the query and go to the next iteration.

E. Ablation study

Trained policy selection: For each training pipeline, we train with five different seeds. The optimizer is RAdam [36], and we try a few different initial learning rates (lrs). We set a lr scheduler that discounts the lr by 2% every 50 training steps. With the DAD objective, ℋ and ℋ_version2, we train for 400 × 50 = 20000 steps, and with ℐ and ℐ_version2, we train for 200 × 50 = 10000 steps.

The training results of the 1-dimensional policy (our implementation pre-defines the dimension) are shown in Figure 3, and those of the 2-dimensional policy in Figure 4. With our main nonmyopic training, the training loss appears to be a good indicator of the AL deployment performance: policies with the minimized negative objectives seem to perform the best on the test problems. With myopic training, it seems that each trained policy may perform well on certain problems but badly on others. In our main paper, for each training objective, we present the policy with the best last-ten-epoch mean loss.

Figure 3: A policy (rows 1, 2, 3, 5) is trained with Algorithm 3 together with the entropy or the regularized entropy objective. A myopic policy (row 4) is trained with Algorithm 4, detailed in Appendix B. This plot shows the training of the NN for the 1D problems. The first column shows the negative loss values. In the second and third columns, we sort the trained policies by the mean training losses of the last 10 epochs (last 500 training steps). We set N_init = 1 and T = 10, which means a total of 11 observations after AL is done. For each benchmark problem, a star is marked if the RMSE of the method is significantly smaller than Random (Wilcoxon signed-rank test, p-value threshold 0.05). The time only takes the querying time into account. Output labeling time is excluded.

Figure 4: A policy (rows 1, 2, 3, 5) is trained with Algorithm 3 together with the entropy or the regularized entropy objective. A myopic policy (row 4) is trained with Algorithm 4, detailed in Appendix B. This plot shows the training of the NN for the 2D problems. The first column shows the negative loss values. In the second and third columns, we sort the trained policies by the mean training losses of the last 10 epochs (last 500 training steps). We set N_init = 1 and T = 20, which means a total of 21 observations after AL is done. For each benchmark problem, a star is marked if the RMSE of the method is significantly smaller than Random (Wilcoxon signed-rank test, p-value threshold 0.05). The time only takes the querying time into account. Output labeling time is excluded.