<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-540-28650-9_4</article-id>
      <title-group>
        <article-title>Self-Explaining Variational Gaussian Processes for Transparency and Modelling of Prior Knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarem Seitz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bamberg</institution>
          ,
          <addr-line>An der Weberei 5, 96049 Bamberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Bayesian methods have become a popular way to incorporate prior knowledge and a notion of uncertainty into machine learning models. At the same time, the complexity of modern machine learning makes it challenging to comprehend a model's reasoning process, let alone express specific prior assumptions in a rigorous manner. While primarily interested in the former issue, recent developments in transparent machine learning could also broaden the range of prior information that we can provide to complex Bayesian models. Inspired by the idea of self-explaining models, this paper introduces a corresponding concept for variational Gaussian Processes. While the proposed method is inherently transparent, the Bayesian nature of the underlying Gaussian Process allows us to incorporate prior knowledge about the underlying problem. In one sentence, the goal is to let the human expert explain how to solve a supervised learning problem in a language that both the model and the user understand. For now, we evaluate these capabilities on simple problems.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable Machine Learning</kwd>
        <kwd>Bayesian Machine Learning</kwd>
        <kwd>Gaussian Processes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Linear models lack the complexity characteristic of modern machine learning, but their regression coefficients can be used by humans to understand the underlying decision process. On the other hand, we can encode existing knowledge about the modelling task in such models. This is usually done either via Bayesian priors over the coefficients or constrained optimization.</p>
      <p>Problem. For complex machine learning models, it is usually not as straightforward to
encode existing knowledge as in the linear example. Our goal for this paper is therefore twofold.
First, we want to derive an approach that can model complex problems in a transparent manner.
Subsequently, we want to be able to exploit the transparent representation to encode existing
prior knowledge and use it in the model’s training procedure.</p>
      <p>Let us split these goals further into three concrete requirements: Transparency - the solution needs to provide insights into its decision process that can be understood by a sufficiently trained domain expert. Flexibility - in order to be useful for complex problems, the proposed approach needs to be flexible, i.e. be able to handle a broad range of functional relations between input and target variables. Teachability - finally, we need to be able to use an interpretable representation of existing knowledge and align the model's decision process with that knowledge.</p>
      <p>Apart from that, a practically relevant solution should also be able to handle real-world problems. This implies, in particular, that scalability to reasonably large datasets has to be possible.</p>
      <p>Contribution. To achieve the above desiderata, this paper proposes self-explaining variational GPs (SEVGPs). The self-explanatory component aims to satisfy the transparency requirement. By using the right kernel functions, GPs can handle complex functional relations as demanded under the flexibility specification. Since GPs are part of the family of Bayesian models, they are naturally able to incorporate prior knowledge, i.e. they also fall under the idea of teachability. The primary limitation in this regard lies in the representations in which we are able to express our prior knowledge.</p>
      <p>While GP models in their original form are unable to deal with large datasets, there exist
many scalable solutions nowadays. Our approach will apply the concept of sparse variational
GPs (SVGPs) in order to achieve scalability to big data problems as well.</p>
      <p>Related work. The results of [2, 3, 4] directly inspired this approach from an explainability
and transparency point of view. In fact, the approach [3] relates to this work in a similar way as
GPs relate to SVGPs. However, as will be seen, this paper does not merely provide a scalable
variant of the former work via SVGPs.</p>
      <p>In addition to the transparency component, our aim is to also create a tool that can be used
to provide human expert knowledge via transparent representations. [5, 6, 7, 8] all discuss the
potentially beneficial role of expert and domain knowledge in machine learning, yet either
mention Bayesian methods only briefly or not at all. Nevertheless, Bayesian non-parametrics
have already been applied successfully in countless classical statistical modeling problems with
an emphasis on incorporating prior knowledge - see [9] for a variety of examples.</p>
      <p>Recent work on functional variational inference as discussed particularly in [10, 11] could be
a fruitful step towards a synthesis of meaningful prior models and modern Machine Learning
architectures.</p>
      <p>Outline. In the next section we give a brief recap of transparent machine learning with a focus on varying-coefficient and self-explaining methods. Thereafter, we proceed similarly for GPs and SVGPs. The fourth section marks the main contribution of this paper, where the primary formulas of our approach are presented and discussed. Experimental validation of the approach is conducted in section five. Finally, we discuss limitations and potential extensions of our methodology in the last section. Proofs and derivations, as well as additional details, can be found in the appendix.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Transparent Machine Learning</title>
      <p>In regards to transparency in machine learning, terms like interpretable machine learning or
explainable artificial intelligence (XAI) have become quite widespread and popular. However,
up to this date, there is still no uniquely accepted definition for many terms in this field. In our
context, where we consider supervised learning problems, we will use the following definitions
of interpretation and explanation from [12]:
Definition 1. An interpretation is the mapping of an abstract concept into a domain that the
human can make sense of. An explanation is the collection of features of the interpretable domain,
that have contributed for a given example to produce a decision.</p>
      <p>The corresponding authors particularly name images and text as interpretable domains.
Explanations, on the other hand, could be visualizations that highlight image regions or certain words
that contributed in favour of or against a given decision.</p>
      <p>As we will see, it makes sense to allow for explanations to also quantify the strength of contribution per interpretable feature. For example, consider a fixed grey-scale image and denote the corresponding vector of the image's pixels, encoded in the range [0, 1], as x ∈ [0, 1]^P. By introducing a coefficient vector β ∈ R^P with the same dimensionality as x, we can derive the usual linear model for a single example

y = β^T x.   (1)

The outcome scalar y ∈ R could then be mapped to a valid probability via some monotone, increasing function σ : R ↦→ (0, 1). This obviously results in a binary classification problem. Notice that we can equally write (1) as the sum of pixel-coefficient products, i.e.

y = Σ_{p=1}^{P} β^(p) x^(p).   (2)</p>
      <p>With respect to the mentioned classification problem, (2) now implies the following logic for quantifiable explanations: image pixels where β^(p) x^(p) &gt; 0 contribute towards a positive classification, whereas pixels where β^(p) x^(p) &lt; 0 contribute towards a negative classification.² Also, pixels where |β^(p) x^(p)| is close to zero provide almost no contribution to the outcome, and pixels where |β^(p) x^(p)| is large provide a large contribution. From now on, let us explicitly name the product β^(p) x^(p) the contribution of the p-th feature.

²Notice that we might have to add a constant term to this representation in order to account for cases where x^(p) = 0. Otherwise, the contribution of those features will always be zero. For simplicity though, we will only consider the model as in (2).</p>
      <p>
        Obviously, the contribution of each pixel must be able to difer for diferent images. Even
under a mere translation of some baseline image, the corresponding contributions must also shift
accordingly. As a result, the static coeficients as implied in (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) are unrealistic when considering
multiple, diferent images. Rather, the  should vary with the given input image, i.e.
      </p>
      <p>=   ()</p>
      <p>Equation (3) now implies that the coefficient vector is a function of the input vector; in the context of the above grey-scale input: β : [0, 1]^P ↦→ R^P. At this point, we should reiterate that this formulation is not restricted to image classification but can easily be extended to other domains that permit a similar representation of their input features. In fact, models like (3) were proposed as early as in [13] for classical statistical regression problems with tabular data.</p>
      <p>More recent work on these 'varying coefficient' models has been done in [2], who considered them, under the umbrella term self-explaining models, for modern machine learning problems like image or text classification. The most important novelty is the replacement of regression splines to model β(·) with a feedforward neural network with P output neurons.</p>
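      <p>The varying-coefficient idea is easy to sketch in code. In the following Python toy example, beta is a stand-in for the coefficient model of [2] (our own illustrative choice, not the actual architecture); the explanation is simply the vector of per-feature contributions, which sums to the raw prediction:

```python
import numpy as np

def beta(x):
    # Stand-in varying-coefficient function; in [2] this would be a
    # feedforward neural network with P output neurons.
    return np.tanh(x - 0.5)

def predict_with_contributions(x):
    # The contribution of the p-th feature is beta(x)[p] * x[p];
    # the raw model output is their sum, as in (2) and (3).
    contributions = beta(x) * x
    return contributions.sum(), contributions

x = np.array([0.2, 0.9, 0.0])   # grey-scale pixels in [0, 1]
y, contrib = predict_with_contributions(x)
```

By construction, the explanation is faithful: the contributions add up exactly to the model output, and features with zero value receive zero contribution.</p>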
    </sec>
    <sec id="sec-3">
      <title>3. Gaussian Processes</title>
      <p>The building blocks of GPs, see [14], are a prior distribution over functions, p(f), and a likelihood p(y | f). Using Bayes' law, we are interested in a posterior distribution p(f | y), obtained as

p(f | y) = p(y | f) p(f) / p(y).   (4)</p>
      <p>The prior distribution is a Gaussian Process, fully specified by a mean function m(·) : X ↦→ R, typically m(x) = 0, and a covariance kernel function k(·, ·) : X × X ↦→ R0+:

p(f) = GP(f | m(·), k(·, ·)).   (5)</p>
      <p>
        We assume the input domain for  to be a bounded subset of the real numbers,  ⊂ R.
Technically, this invalidates (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) as  then becomes an infinite-dimensional object for which
a probability density does not exist. Since we are dealing with finite-dimensional datasets
only, this techincal inaccuracy does not pose a problem in our further treatment. To exemplify
our focus on finite dimensional marginals, we will make heavy use of subscripts to match
inter-related objects.
      </p>
      <p>Most importantly, we denote the N × D matrix of input data-points as X and the corresponding marginal GP output as f_X = f(X). This allows us to discuss GPs either at their multivariate Gaussian marginal output or as actual random functions. We will switch between both concepts depending on the situation.</p>
      <p>A common choice for k(·, ·) is the ARD (Automatic Relevance Determination) kernel

k(x, x') = θ · exp(−0.5 (x − x')^T Σ (x − x'))   (6)

where Σ = diag(l_1^2, ..., l_D^2) is a diagonal matrix with entries in R0+ and θ &gt; 0. For θ = 1, (6) is equivalent to an SE (Squared Exponential) kernel. We denote by K_XX the positive semi-definite Gram matrix, obtained as (K_XX)_{ij} = k(x_i, x_j), x_i the i-th row of the training input matrix X. As before, we denote the kernel Gram matrix belonging to Z as K_ZZ and a potential mean vector as m_Z = m(Z).</p>
      <p>Provided that p(y | f_X) = ∏_{i=1}^{N} N(y_i | f_i, σ²), i.e. training observations y_i are i.i.d. univariate Gaussian conditioned on f_X, it is possible to directly calculate a corresponding posterior distribution for new inputs X_* as

p(f_* | y) = N(f_* | Λ_{*X} y, K_{**} − Λ_{*X} (K_{XX} + σ² I) Λ_{*X}^T)   (7)

where Λ_{*X} = K_{*X} (K_{XX} + σ² I)^{−1}, (K_{*X})_{ij} = k(x_{*,i}, x_j), (K_{**})_{ij} = k(x_{*,i}, x_{*,j}); I is the identity matrix with according dimension.</p>
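      <p>For concreteness, the posterior predictive above can be computed directly in NumPy. The sketch below (function names are ours) uses one common parameterization of the ARD kernel with θ = 1 and unit lengthscales:

```python
import numpy as np

def ard_kernel(A, B, theta=1.0, lengthscales=None):
    # One common parameterization of the ARD kernel (6):
    # k(x, x') = theta * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)
    D = A.shape[1]
    ls = np.ones(D) if lengthscales is None else np.asarray(lengthscales, float)
    diff = (A[:, None, :] - B[None, :, :]) / ls
    return theta * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior(X, y, X_star, sigma2=0.1):
    # Exact GP posterior predictive with Lambda = K_*X (K_XX + sigma^2 I)^{-1};
    # a Cholesky solve would be preferred over an explicit inverse in practice.
    K_XX = ard_kernel(X, X)
    K_sX = ard_kernel(X_star, X)
    K_ss = ard_kernel(X_star, X_star)
    Lam = K_sX @ np.linalg.inv(K_XX + sigma2 * np.eye(len(X)))
    return Lam @ y, K_ss - Lam @ K_sX.T   # posterior mean and covariance
```

For small observation noise, the posterior mean interpolates the training targets, as expected from the formula.</p>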
      <p>In order to make GPs feasible for large datasets, the work of [15, 16, 17] developed and refined Sparse Variational Gaussian Processes (SVGPs). SVGPs introduce a set of M so-called inducing locations Z ⊂ X and corresponding inducing variables u = f(Z). The resulting posterior distribution, p(f, u | y), is then approximated through a variational distribution q(f, u) = p(f | u) q(u), often q(u) = N(u | m, S), by maximizing the evidence lower bound (ELBO).</p>
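      <p>The KL term of the ELBO is available in closed form, since both q(u) = N(m, S) and the prior p(u) = N(m_Z, K_ZZ) are multivariate Gaussians. A minimal sketch (our own helper, not a specific library API):

```python
import numpy as np

def gauss_kl(m_q, S_q, m_p, S_p):
    # KL( N(m_q, S_q) || N(m_p, S_p) ) for multivariate normals;
    # this is the regularization term of the SVGP ELBO.
    M = len(m_q)
    S_p_inv = np.linalg.inv(S_p)
    diff = m_p - m_q
    return 0.5 * (np.trace(S_p_inv @ S_q) + diff @ S_p_inv @ diff
                  - M + np.log(np.linalg.det(S_p) / np.linalg.det(S_q)))
```

The divergence vanishes exactly when the variational distribution equals the prior, which is what anchors q(u) to p(u) during training.</p>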
    </sec>
    <sec id="sec-4">
      <title>4. Self-explaining variational posterior distributions</title>
      <p>The preceding two sections easily motivate the replacement of the feedforward neural network in self-explaining models by a GP model. For a given matrix of training data X and target vector y, we obtain the likelihood model given in (10) below. Recall first the standard SVGP ELBO:</p>
      <p>ELBO = Σ_{i=1}^{N} E_{q(f_i | u) q(u)}[log p(y_i | f_i)] − KL(q(u) || p(u))   (8)

where KL(q(·) || p(·)) denotes the KL-divergence between two (multivariate) Normal distributions. Finally, let us recall the following distributional properties of the marginal variational posterior process q̃(f_*) = ∫ p(f_* | u) q(u) du:</p>
      <p>q̃(f_*) = N(f_* | Λ̃_{*Z} m, K_{**} − Λ̃_{*Z} (K_{ZZ} − S) Λ̃_{*Z}^T)   (9)

where Λ̃_{*Z} = K_{*Z} K_{ZZ}^{−1}. Also, we will write μ̃_* := Λ̃_{*Z} m and Σ̃_* := K_{**} − Λ̃_{*Z} (K_{ZZ} − S) Λ̃_{*Z}^T. If two input matrices A and B each consist of a single datapoint, μ̃ and Σ̃ can be viewed as the mean and kernel functions of the variational GP, evaluated at A and B. We then denote the implicit GP mean and kernel functions as μ̃(·) = Λ̃_{·Z} m and Σ̃(·, ·) = K_{··} − Λ̃_{·Z} (K_{ZZ} − S) Λ̃_{·Z}^T.</p>
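      <p>Equation (9) can be evaluated directly once the relevant Gram matrices are available. A minimal NumPy sketch (names are ours; again, solves should replace the explicit inverse in practice):

```python
import numpy as np

def svgp_predict(K_sZ, K_ZZ, K_ss, m, S):
    # Marginal variational posterior, cf. (9):
    # Lambda = K_*Z K_ZZ^{-1}, mean = Lambda m,
    # cov = K_** - Lambda (K_ZZ - S) Lambda^T
    Lam = K_sZ @ np.linalg.inv(K_ZZ)
    return Lam @ m, K_ss - Lam @ (K_ZZ - S) @ Lam.T
```

Setting S = K_ZZ and m = 0 recovers the prior, which is a useful sanity check of an implementation.</p>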
      <p>This allows us to hide the underlying dependencies on Z and u in our notation and treat the variational GP as a separate entity from the original GP whose posterior distribution we are trying to approximate. The likelihood model reads

p(y | f_1, ..., f_D; X) = p(y | X · f_{1:D}(X))   (10)

where "·" means matrix multiplication for clarity (we will omit the "·" from now on) and f_{1:D}(X) = [f_1(X); ...; f_D(X)] stacks the D coefficient GP outputs.</p>
      <p>We explicitly included the input matrix X in (10) to exemplify the relation to self-explaining models. Also, let us require independence between the individual GPs. Now, we are dealing with a linear combination of D independent GPs instead of a single one. Combining (10) and the concept of SVGPs, we can introduce D variational processes and approximate the respective varying-coefficient GPs.</p>
      <p>This directly implies the following ELBO:

L = E_{q̃(f_{1,X}, ..., f_{D,X})}[log p(y | f_{1,X}, ..., f_{D,X})] − Σ_{d=1}^{D} KL(q(u_d) || p(u_d))   (11)

where f_{d,X} denotes the d-th coefficient GP evaluated at the rows x_i of X. The derivation of (11) can be found in Appendix A. Notice that we now have D sets of inducing variables, u_d. Obtaining a posterior predictive distribution for a Gaussian likelihood is also straightforward under this model:

p(y_* | x_*) = ∫ p(y_* | x_*, f_{1,*}, ..., f_{D,*}) q̃(f_{1,*}, ..., f_{D,*}) df_{1,*} ... df_{D,*}
            = N(y_* | Σ_{d=1}^{D} x_*^(d) ⊙ μ̃_{d,*}, Σ_{d=1}^{D} (x_*^(d) (x_*^(d))^T) ⊙ Σ̃_{d,*} + σ² · I)   (12)</p>
      <p>Here, ⊙ denotes element-wise multiplication, I is a unit-diagonal matrix of according dimension and σ² is the variance hyperparameter of the Gaussian likelihood. Finally, we can calculate a posterior distribution of the contribution of the d-th feature for a given input vector x:

x^(d) f_d(x) ∼ N(x^(d) · μ̃_d(x), (x^(d))² · Σ̃_d(x, x))   (13)</p>
      <p>Now, let us introduce D GPs, f̃_1, ..., f̃_D, with the following finite-dimensional marginal distributions:

f̃_{d,*} ∼ N(x_*^(d) ⊙ μ̃_{d,*}, (x_*^(d) (x_*^(d))^T) ⊙ Σ̃_{d,*})   (14)

with μ̃_d, Σ̃_d the mean vector and kernel Gram-matrix per GP as defined in (9). For a given set of inputs and the underlying mean and kernel functions m(·), k(·, ·) fixed, the behavior of the f̃_1, ..., f̃_D can be manipulated by adjusting m_d, S_d, the variational parameters of the underlying inducing variables. Clearly, (14) can be interpreted as the attribution corresponding to the respective marginal SVGP.</p>
      <p>By summing up the f̃_d, we obtain yet another GP, f̃, with trivial marginal distribution:

f̃_* ∼ N(Σ_{d=1}^{D} x_*^(d) ⊙ μ̃_{d,*}, Σ_{d=1}^{D} (x_*^(d) (x_*^(d))^T) ⊙ Σ̃_{d,*})   (15)

Notice that f̃ yields a self-explaining GP whose d-th attribution can easily be queried via the corresponding summand GP, f̃_{d,*}.</p>
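      <p>The decomposition just described can be sketched numerically. The helper below (our own naming, not from the paper) takes the variational posterior means of the D coefficient GPs as given and returns the predictive mean together with the per-feature attributions:

```python
import numpy as np

def sevgp_mean_and_attributions(X_star, mus):
    # mus[d]: variational posterior mean of the d-th coefficient GP at X_star.
    # The attribution of feature d is X_star[:, d] * mus[d] (cf. (14)),
    # and the overall predictive mean is their sum over d (cf. (15)).
    attributions = [X_star[:, d] * mus[d] for d in range(X_star.shape[1])]
    return np.sum(attributions, axis=0), attributions
```

Querying the d-th attribution is then just indexing into the returned list.</p>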
      <p>Our goal now is to use f̃ as a variational posterior distribution for an arbitrary GP f by finding a set of parameters, namely m_d, S_d (and potential hyperparameters for m(·), k(·, ·)), for f̃ that minimize

KL(q̃(f) || p(f | y))   (16)

with q̃(·) the GP distribution as defined in (14). Unfortunately, the usual route for SVGP inference is not possible since q̃(f_X) = ∫ q̃(f_X | u) q̃(u) du and p(f_X) = ∫ p(f_X | u) p(u) du with q̃(f_X | u) ≠ p(f_X | u), and therefore the conditional distributions do not cancel in the derivation of the ELBO. To solve the resulting infinite-dimensional variational problem between the two respective GPs, we apply functional variational inference as proposed by [10]. The authors show that there exists a functional evidence lower bound (fELBO) which can be maximized in order to solve the optimization problem in (16):</p>
      <p>L = E_{q̃(f)}[log p(y | f)] − E_{c(C)}[KL(q̃(f_{(X,C)}) || p(f_{(X,C)}))]   (17)

where C is a so-called measurement set, obtained by sampling uniformly from the space of all possible inputs, X; (X, C) then denotes the union of X and C via row-wise stacking. By applying the fELBO to our prior and variational processes, we obtain:</p>
      <p>L_1 = E_{q̃(f_X)}[log p(y | f_X)] − E_{c(C)}[KL(N(μ̃_{(X,C)}, Σ̃_{(X,C)}) || N(m_{(X,C)}, K_{(X,C)}))]   (18)

where μ̃_{(X,C)}, Σ̃_{(X,C)} denote the evaluation of the mean vector and kernel Gram-matrix from (15) at (X, C), and m_{(X,C)}, K_{(X,C)} denote the mean vector and kernel Gram-matrix of the prior GP, evaluated accordingly.</p>
      <p>In essence, this approach allows us to encode functional prior knowledge via the prior GP as usual. By decomposing the variational posterior GP after optimizing (18) into its summand attribution GPs, we obtain a transparent approximation of the true posterior distribution in the tradition of varying-coefficient or self-explaining models.</p>
      <p>Another promising use-case arises when we place prior distributions on the attribution GPs themselves, e.g. for arbitrary input x:

a_d := x^(d) f_d ∼ GP(m_d(·), k_d(·, ·))⁵   (19)

⁵a_d can be seen as a linear operator on f_d that transforms all finite-dimensional marginals of f_d via x^(d) ⊙ f_d.</p>
      <p>If the respective mean and kernel functions can be decomposed as x^(d) · m_d(x) and x^(d) x'^(d) · k_d(x, x'), (19) is a GP problem of the form discussed before. If this is not the case, however, and if we want to retain transparency of the respective posterior distribution, we can approximate the attribution GPs by f̃_1, ..., f̃_D. As in (16), we want to minimize

KL(q̃(f̃_1, ..., f̃_D) || p(a_1, ..., a_D | y)).   (20)</p>
      <p>By invoking (17) again, and by the fact that the KL-divergence of the joint distribution between prior and variational GPs decomposes as the sum of the KL divergences for mutually independent GPs, we get:

L_2 = E_{q̃(f̃_1, ..., f̃_D)}[log p(y | f̃_1, ..., f̃_D)] − E_{c(C)}[Σ_{d=1}^{D} KL(N(μ̃_{d,(X,C)}, Σ̃_{d,(X,C)}) || N(m_{d,(X,C)}, K_{d,(X,C)}))]   (21)</p>
      <p>As a brief example, we could choose m_d(·) &lt;&lt; 0 to exemplify the prior belief that the attribution of the d-th feature is negative with high probability. Obviously, potential priors could be much more complex. In fact, it might be fruitful to consider implicit processes as introduced in [18] as a prior and use our self-explaining posterior as an approximation.</p>
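      <p>A prior of the kind m_d(·) &lt;&lt; 0 is straightforward to visualize by sampling. The sketch below (an illustrative choice of kernel and scale, not the paper's exact prior) draws attribution functions that are negative with high probability:

```python
import numpy as np

def sample_attribution_prior(x, mean_level=-3.0, n_samples=3, seed=0):
    # Prior draws for an attribution GP whose constant mean lies well
    # below zero, encoding the belief that the attribution is negative
    # with high probability.
    rng = np.random.default_rng(seed)
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * sq) + 1e-8 * np.eye(len(x))   # SE kernel plus jitter
    m = np.full(len(x), mean_level)
    return rng.multivariate_normal(m, K, size=n_samples)
```

With unit kernel variance and a mean of −3, a draw exceeds zero only with very small probability, matching the intended prior belief.</p>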
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>In this section, we evaluate the proposed method on several experimental tasks. In particular, we are interested in the explanations generated by our method, its ability to incorporate prior assumptions, and its predictive performance. All experiments were conducted on regression problems, where the likelihood could be assumed to be Gaussian.</p>
      <p>Extended implementation details can be found in Appendix B.</p>
      <sec id="sec-5-1">
        <title>5.1. Evaluation of explanations</title>
        <p>In addition to point values for the varying coefficients, the SVGP components also allow evaluating the variance of the varying coefficients. In accordance with the typical interpretation of posterior variance in Bayesian models, this can be interpreted as a measure of coefficient uncertainty or explanation uncertainty.</p>
        <p>To evaluate these measures, the coefficient means and variances of a trained SEVGP model (via (11)) were calculated for two datapoints from the Boston housing dataset. Figures 1 and 2 show the results. While the coefficient means are relatively stable for both examples, the variances differ visibly. Interestingly, the coefficients of the left example show high uncertainty for the most influential coefficient (feature CHAS). The respective outputs can be used to check for hidden biases or erroneous reasoning in the respective model.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation of inclusion of prior knowledge</title>
        <p>To verify the model's capability to incorporate existing prior knowledge, a random sample from a quadratic function with Gaussian noise was created in the interval [−2, 2]. A model that is able to handle knowledge about the underlying quadratic function should be able to extrapolate accordingly beyond the range of the observed data (often termed the out-of-distribution problem).</p>
        <p>In order to validate this claim for our approach, the three models implied in (11), (18) and (21) were compared. For (18) (= prior knowledge about f) a GP prior with a second-order polynomial kernel was used. For (21) (= prior knowledge about the feature-wise effects) a GP prior with a linear kernel was placed on a_d, which is technically equivalent to placing a polynomial kernel on f.</p>
        <p>The results in Figure 3 indicate that the model is able to correctly handle the functional prior knowledge about the underlying quadratic function. It can be seen that both models that were trained with additional prior knowledge (middle and right) were able to correctly extrapolate the quadratic function. Without such prior knowledge (left model), the resulting posterior predictive distribution only fits the in-sample data but is unable to extrapolate out of distribution.</p>
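        <p>The spirit of this experiment is easy to reproduce. The NumPy sketch below (a hand-rolled illustration, not the paper's implementation) computes the GP posterior mean under a second-order polynomial kernel and extrapolates well beyond the training interval [−2, 2]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=50)
y = X ** 2 + rng.normal(0, 0.1, size=50)

def poly2_kernel(a, b, sigma0=1.0):
    # Second-order polynomial kernel k(x, x') = (x * x' + sigma0^2)^2;
    # its samples are quadratic functions, encoding the prior belief
    # that the target is quadratic in the input.
    return (np.outer(a, b) + sigma0 ** 2) ** 2

K = poly2_kernel(X, X) + 0.1 ** 2 * np.eye(50)   # add noise variance
alpha = np.linalg.solve(K, y)

X_ood = np.array([4.0, -4.0])            # outside the training interval
pred = poly2_kernel(X_ood, X) @ alpha    # GP posterior mean, extrapolated
```

Because every posterior mean under this kernel is itself a quadratic, the prediction at ±4 lands close to the true value of 16, mirroring the behavior of the middle and right models in Figure 3.</p>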
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation of predictive performance</title>
        <p>To validate the predictive performance of the proposed method, it was evaluated over four regression datasets (Boston housing, concrete, wine red and wine white) via five-fold cross-validation. For comparison, a standard SVGP was also trained and evaluated on the same folds. Table 1 shows the average MSE and its standard deviation over the folds. All GP models used an ARD covariance kernel and zero-mean prior functions.</p>
        <p>Since the SEVGP uses one SVGP per coefficient, the number of inducing points in the SVGP was increased accordingly to account for the increased model capacity of the SEVGP. See Appendix B for more details.</p>
        <p>It can be seen that our proposed method achieves comparable performance to the SVGP. This implies that problems where the latter performs well allow the SVGP to be replaced by a SEVGP in case the discussed benefits are deemed advantageous.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and discussion</title>
      <p>This paper presented a method that combines GPs with recent developments in varying-coefficient/self-explaining methods for machine learning. By taking advantage of the Bayesian properties of GPs, it is also possible to inject prior knowledge into the respective models. One area where both the transparency and the teachability aspects can be helpful is the field of fair and unbiased machine learning. On the one hand, transparency allows detecting biased or discriminating results on a per-instance basis. On the other hand, teachability could help prevent or eliminate potential biases by carefully encoding non-biasing prior knowledge into the model. While this would certainly not be a silver bullet, there might nevertheless be considerable, general potential at the intersection of explainable and human-in-the-loop machine learning.</p>
      <p>A clear limitation is the fact that the notion of explainability we considered in this paper is a statistical one, with a focus on local, per-pixel explanations. In complex problems like image classification, this might not suffice if a class is inferred from multiple symbolic relations of different objects that are present in a given image instance. Nevertheless, statistical approaches have recently been shown to be quite successful on such complex problems despite possessing no inherent capabilities for logic deduction.</p>
      <p>Future work on the proposed method should try to find a way to make it scalable to other, potentially high-dimensional, supervised learning problems. Particularly problems with image inputs, like image classification or reinforcement learning, might greatly benefit from external prior knowledge when training data is only sparsely available.</p>
      <p>[15] M. K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: D. A. V. Dyk, M. Welling (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, volume 5 of JMLR Proceedings, JMLR.org, 2009, pp. 567-574.
[16] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: A. Nicholson, P. Smyth (Eds.), Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11-15, 2013, AUAI Press, 2013.
[17] J. Hensman, A. G. de G. Matthews, Z. Ghahramani, Scalable variational Gaussian process classification, in: G. Lebanon, S. V. N. Vishwanathan (Eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, volume 38 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015. URL: http://proceedings.mlr.press/v38/hensman15.html.
[18] C. Ma, Y. Li, J. M. Hernández-Lobato, Variational implicit processes, in: International Conference on Machine Learning, PMLR, 2019, pp. 4222-4233. URL: https://proceedings.mlr.press/v97/ma19b.html.</p>
      <p>A. Derivation of (11)

We write p(f_{1:D}) = p(f_1, ..., f_D) and p(u_{1:D}) = p(u_1, ..., u_D). Notice that p(f_{1:D}) does technically not exist, as it involves the infinite-dimensional stochastic processes for which densities don't exist. As these objects will cancel out anyway, and since such notation is commonly seen in the GP literature, we will keep it here for simplicity. Otherwise, to be notationally exact, we would have to work with KL divergences over probability measures, which would make the results much less convenient to derive.</p>
      <p>Minimizing KL(q(f_{1:D}, u_{1:D}) || p(f_{1:D}, u_{1:D} | y)) is equivalent to maximizing

E_{q(f_{1:D}, u_{1:D})}[log p(y | f_{1:D}) + log p(f_{1:D} | u_{1:D}) + log p(u_{1:D}) − log q(f_{1:D} | u_{1:D}) − log q(u_{1:D})].

Since q(f_{1:D} | u_{1:D}) = p(f_{1:D} | u_{1:D}), the conditional densities cancel. Since y depends on f_{1:D} only via f_{1:D,X}, and only on the marginals given by X, the first term reduces to E_{q̃(f_{1:D,X})}[log p(y | f_{1:D,X})], by marginalizing out u_{1:D} and writing q̃ for the marginal variational posterior as explained before. By independence of the prior and variational GPs, and by the standard i.i.d. assumption about the observed datapoints, the remaining terms decompose, yielding

L = E_{q̃(f_{1:D,X})}[log p(y | f_{1:D,X})] − Σ_{d=1}^{D} KL(q(u_d) || p(u_d))

which is (11).</p>
      <p>The derivation of (21) proceeds analogously: y depends on f̃_{1:D} only via the marginals given by X, the prior and variational GPs are mutually independent, and the observed datapoints are i.i.d.; marginalizing out the inducing variables and writing q̃ as before, the KL term again decomposes as Σ_{d=1}^{D} KL(q(u_d) || p(u_d)).</p>
    </sec>
  </body>
  <back>
  </back>
</article>