    Adaptive Selection of Gaussian Process Model
    for Active Learning in Expensive Optimization

           Jakub Repický 1,2, Zbyněk Pitra 1,3, and Martin Holeňa 1

    1 Institute of Computer Science, Czech Academy of Sciences, repicky@cs.cas.cz
    2 Faculty of Mathematics and Physics, Charles University
    3 Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University



1       Introduction
Black-box optimization denotes the optimization of objective functions whose
values are available only through empirical measurements or experiments. Such
tasks are most often tackled with evolutionary algorithms and other metaheuristics
(e.g., [12]), which need to evaluate the objective function at many points. This
becomes a serious problem when the evaluation is expensive with respect to some
resource, e.g., the cost of the required experiments.
    A standard way to circumvent this problem is to evaluate the original
objective function only at a small fraction of those points and to evaluate a
surrogate model of the original function [5] at the remaining ones. Once a model
has been trained, the success of the optimization in the remaining iterations
depends on a resource-aware selection of the points at which the original function
will be evaluated, which is a typical active learning task.
    The surrogate model used in the reported research is a Gaussian process (GP)
[11], which treats the values of an unknown function as jointly Gaussian random
variables. The advantage of a GP over other kinds of surrogate models is its
capability to quantify the uncertainty of its predictions by means of the variance
of the posterior distribution of function values.
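
For completeness, the standard GP predictive equations [11] are recalled below.
Given training inputs x_1, ..., x_n with observed values y and a covariance
function k, the prediction at a new point x_* is Gaussian with

\[
  \mu(x_*) = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad
  \sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*,
\]

where K_{ij} = k(x_i, x_j), (\mathbf{k}_*)_i = k(x_i, x_*), and \sigma_n^2 is the
noise variance; the predictive variance \sigma^2(x_*) is the uncertainty measure
referred to above.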


2       Novel Approach to GP-Based Active Learning
So far, active learning of the points at which the original objective function
will be evaluated during GP-based surrogate modelling has nearly always used a
fixed covariance function of the surrogate model and a fixed active learning
algorithm (e.g., [9, 15]).
    In the reported research, an adaptive selection of the covariance function
according to the available data is investigated.
    To this end, a pool of 8 kinds of kernels has been established, including
non-stationary kernels and composite kernels of depth one (Figure 1).
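
As an illustration, the standard forms of two of the pooled kernels, the squared
exponential and the quadratic one, read as follows (following [11]; the exact
parameterizations used in our experiments may differ):

\[
  k_{\mathrm{SE}}(x, x') = \sigma_f^2 \exp\!\Bigl(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\Bigr), \qquad
  k_{\mathrm{QUAD}}(x, x') = \bigl(\sigma_0^2 + x^\top x'\bigr)^2 .
\]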
    Adaptive selection of GP covariance functions is known from the GP regression
literature [3, 6, 8]. In contrast to our approach, the number of available kinds
of simple kernels considered there is smaller; on the other hand, they can be
recursively composed to an arbitrary depth through algebraic operations. An
additional difference concerns the criteria according to which the models are
selected: whereas in [6] only the likelihood is used, and in [3, 8] only the
Bayesian information criterion (BIC) [13], our research investigates the
suitability of different criteria for the selection of GP covariance functions in
surrogate modelling. Since maximum likelihood estimation is prone to favouring
overfitted models, we consider only criteria that account for model complexity.
Apart from the BIC used in [3, 8], we also consider the Akaike information
criterion (AIC) [1], the deviance information criterion (DIC) [14], and the widely
applicable information criterion (WAIC) proposed by Watanabe [16].
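
For reference, with \hat{L} denoting the maximized likelihood, k the number of
free GP hyperparameters, and n the number of training points, the two criteria
for which we report results are

\[
  \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L},
\]

and in both cases the covariance function yielding the lowest criterion value is
selected.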


2.1    Algorithm of Active Learning

The adaptive selection of the covariance function is integrated into the GP-based
Doubly Trained Surrogate Covariance Matrix Adaptation Evolution Strategy
(DTS-CMA-ES) [2, 10]. A GP model is built in each generation of the evolution
strategy. Once the model is trained, the fraction of the population maximizing
the probability of improvement [7] is selected for evaluation with the expensive
objective function. The model is then retrained with the newly evaluated points
and used to rank the whole population. The fraction of points selected for
evaluation is adapted according to the recent performance of the surrogate model,
more precisely, to its ranking difference error in the last iteration. The novel
contribution of the reported research is the selection of the covariance function
according to one of the aforementioned criteria, which takes place in the training
phase (a sketch of this step follows below).
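
The following is a minimal sketch of the kernel-selection step in Python, using
scikit-learn GP regression instead of the GP implementation actually employed in
DTS-CMA-ES; the kernel pool shown is only an illustrative subset of the 8 kinds,
the criterion is fixed to BIC, and all names are hypothetical.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, DotProduct, ExpSineSquared

    # Illustrative subset of the covariance-function pool (the parameterizations
    # used in the reported experiments may differ).
    KERNEL_POOL = {
        'SE': RBF(),
        'LIN': DotProduct(),
        'QUAD': DotProduct() ** 2,
        'PER': ExpSineSquared(),
        'SE+QUAD': RBF() + DotProduct() ** 2,
    }

    def bic(gp, n):
        # BIC = k * ln(n) - 2 * ln(L), with k the number of fitted hyperparameters.
        k = len(gp.kernel_.theta)
        return k * np.log(n) - 2.0 * gp.log_marginal_likelihood_value_

    def select_covariance(X, y):
        # Fit one GP per kernel in the pool and keep the model with the lowest BIC.
        best_name, best_gp, best_score = None, None, np.inf
        for name, kern in KERNEL_POOL.items():
            gp = GaussianProcessRegressor(kernel=kern, normalize_y=True).fit(X, y)
            score = bic(gp, len(y))
            if score < best_score:
                best_name, best_gp, best_score = name, gp, score
        return best_name, best_gp

The selected GP would then play the role of the surrogate model that ranks the
population in the current generation.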


3     Results and Conclusion

The best performing model selection criteria in our experiments are those based
on AIC and BIC, with no clear difference between them. Results for DIC and WAIC
are not shown because we have not yet succeeded in implementing them.
    On average, the optimization results do not show an improvement over the
DTS-CMA-ES. Nevertheless, a promising result has been obtained on the “Step
ellipsoidal” function, characterized by plateaus lying on a quadratic structure,
especially in its multi-dimensional variants (Figure 2). The most frequently
selected kernel for this function under both AIC and BIC has been the sum of the
squared exponential and the quadratic kernel, which provides an intuitive
interpretation of the function.
    This result suggests that composite kernels can capture global features of
the objective function landscape and provide better predictive performance, but
extending the pool of covariance functions might be needed, e.g., with composite
kernels of arbitrary depth. This is challenging due to the computational cost of
searching through an open-ended space of kernel expressions. A possible solution
might be found in a co-evolution of the kernel of the surrogate model and the
candidate solutions of the objective function.



Acknowledgment
The research reported in this paper has been supported by the SVV project
number 260 453 and the Czech Science Foundation (GAČR) grant 17-01251.

References
 1. Akaike, H.: Information Theory and an Extension of the Maximum Likelihood
    Principle, pp. 199–213. Springer New York (1973)
 2. Bajer, L.: Model-based evolutionary optimization methods. Ph.D. thesis, Faculty
    of Mathematics and Physics, Charles University in Prague, Prague (2018)
 3. Duvenaud, D., Lloyd, J., Grosse, R., Tenenbaum, J., Ghahramani, Z.: Structure
    discovery in nonparametric regression through compositional kernel search. In:
    Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International
    Conference on Machine Learning. vol. 28, pp. 1166–1174. PMLR, Atlanta, Georgia,
    USA (Jun 2013)
 4. Hansen, N., Finck, S., Ros, R., Auger, A.: Real-parameter Black-Box Optimization
    Benchmarking 2012: Experimental setup. Tech. rep., INRIA (2012)
 5. Jin, Y.: Surrogate-assisted evolutionary computation: Recent advances and future
    challenges. Swarm and Evolutionary Computation 1(2), 61–70 (Jun 2011)
 6. Kronberger, G., Kommenda, M.: Evolution of covariance functions for Gaussian
    process regression using genetic programming. In: Moreno-Díaz, R., Pichler, F.,
    Quesada-Arencibia, A. (eds.) Computer Aided Systems Theory – EUROCAST 2013.
    pp. 308–315. Springer Berlin Heidelberg, Berlin, Heidelberg (2013)
 7. Kushner, H.J.: A new method of locating the maximum point of an arbitrary
    multipeak curve in the presence of noise. J. Basic Eng. 86(1), 97–106 (1964)
 8. Lloyd, J.R., Duvenaud, D., Grosse, R., Tenenbaum, J.B., Ghahramani, Z.: Auto-
    matic construction and natural-language description of nonparametric regression
    models. CoRR abs/1402.4304 (Apr 2014)
 9. Lu, J., Li, B., Jin, Y.: An evolution strategy assisted by an ensemble of local
    Gaussian process models. In: Proceedings of the 15th Annual Conference on
    Genetic and Evolutionary Computation (GECCO '13). ACM (2013)
10. Pitra, Z., Bajer, L., Holeňa, M.: Doubly trained evolution control for the surro-
    gate CMA-ES. In: Parallel Problem Solving from Nature – PPSN XIV, pp. 59–68.
    Springer International Publishing (2016)
11. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning.
    Adaptive Computation and Machine Learning series, MIT Press (2006)
12. Schaefer, R.: Foundations of Global Genetic Optimization. Springer Publishing
    Company, Incorporated, 1st edn. (2007)
13. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics 6(2),
    461–464 (1978)
14. Spiegelhalter, D., Best, N., Carlin, B., Van Der Linde, A.: Bayesian measures of
    model complexity and fit. Journal of the Royal Statistical Society, Series B:
    Statistical Methodology 64(4), 583–616 (2002)
15. Volz, V., Rudolph, G., Naujoks, B.: Investigating uncertainty propagation in
    surrogate-assisted evolutionary algorithms. In: Proceedings of the Genetic and
    Evolutionary Computation Conference (GECCO '17). pp. 881–888. ACM, New York
    (2017)
16. Watanabe, S.: Algebraic Geometry and Statistical Learning Theory. Cambridge
    Monographs on Applied and Computational Mathematics, Cambridge University
    Press (2009)




A     Appendix
A.1    Covariance Functions


[Figure 1: 8 × 5 grid of plots of the covariance functions and corresponding GP samples; rows SE, NN, LIN, QUAD, PER, ADD, SE+NN, SE+QUAD. See caption below.]




Fig. 1. Rows: covariance functions. SE: squared exponential. NN: neural network. LIN:
linear. QUAD: quadratic. PER: periodic. ADD: additive. SE+NN: sum of squared ex-
ponential and neural network. SE+QUAD: sum of squared exponential and quadratic.
Columns 1–2: the covariance function on R centered at the point 2 (Col. 1) and three
independent samples from the GP (Col. 2). Columns 3–5: the covariance function on R^2
centered at [2, 2]^T (Col. 3) and two independent samples from the GP (Cols. 4 and 5).




A.2    Experimental Setup
The ongoing experiments are performed within the Comparing Continuous Optimisers
(COCO) framework [4], in particular on the 24 benchmark functions forming its
noiseless testbed. Each function is defined everywhere on R^D and has its global
optimum in [−5, 5]^D for all dimensionalities D ≥ 2. For every function and two
dimensionalities, 2D and 10D, 15 independent trials of each algorithm are
conducted on multiple function instances, defined by transformations
(translations, rotations, and shifts) of both the search space and the f-values.
A trial terminates when the optimum is reached within a small tolerance or when
a given budget of function evaluations, 250D in our case, is exhausted (a sketch
of this termination rule follows below).
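
A minimal sketch of the termination rule under the assumptions stated above; the
tolerance value of 1e-8 is an assumption (the value actually used by the framework
may differ), and the function and argument names are hypothetical.

    def trial_finished(f_best, f_opt, n_evals, dim, f_tol=1e-8):
        # Stop when the optimum is reached within f_tol or the budget of 250*D
        # objective function evaluations is exhausted.
        return f_best - f_opt <= f_tol or n_evals >= 250 * dim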


A.3    Experimental Results



[Figure 2: convergence plot for the f7 Step Ellipsoidal function in 10D; curves for DTS, AIC-DTS, BIC-DTS, and CMA-ES. See caption below.]

Fig. 2. Medians (solid) and 1st/3rd quartiles (dotted) of the distances to the optima
against the number of function evaluations in 10D for all satisfactorily implemented
algorithms. CMA-ES: Covariance matrix adaptation evolution strategy. DTS-CMA-
ES: Doubly trained surrogate-CMA-ES. AIC-DTS, BIC-DTS: DTS-CMA-ES with the
adaptive covariance selection according to AIC and BIC, respectively. The medians
and quartiles were calculated from 15 independent runs on different function instances.
Distances to optima are shown in the log10 scale.



