<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-540-28650-9_4</article-id>
      <title-group>
        <article-title>Self-Explaining Variational Gaussian Processes for Transparency and Modelling of Prior Knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarem Seitz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bamberg</institution>
          ,
          <addr-line>An der Weberei 5, 96049 Bamberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Bayesian methods have become a popular way to incorporate prior knowledge and a notion of uncertainty into machine learning models. At the same time, the complexity of modern machine learning makes it challenging to comprehend a model's reasoning process, let alone express specific prior assumptions in a rigorous manner. While primarily interested in the former issue, recent developments in transparent machine learning could also broaden the range of prior information that we can provide to complex Bayesian models. Inspired by the idea of self-explaining models, this paper introduces a corresponding concept for variational Gaussian Processes. While the proposed method is inherently transparent, the Bayesian nature of the underlying Gaussian Process allows us to incorporate prior knowledge about the underlying problem. In one sentence, the goal is to let the human expert explain how to solve a supervised learning problem in a language that both the model and the user understand. For now, we evaluate these capabilities on simple problems.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable Machine Learning</kwd>
        <kwd>Bayesian Machine Learning</kwd>
        <kwd>Gaussian Processes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Linear models lack the complexity characteristic of modern machine learning, but their regression coefficients can be used by humans to understand the underlying decision process. On the other hand, we can encode existing knowledge about the modelling task in such models. This is usually done either via Bayesian priors over the coefficients or constrained optimization.</p>
      <p>Problem. For complex machine learning models, it is usually not as straightforward to
encode existing knowledge as in the linear example. Our goal for this paper is therefore twofold.
First, we want to derive an approach that can model complex problems in a transparent manner.
Subsequently, we want to be able to exploit the transparent representation to encode existing
prior knowledge and use it in the model’s training procedure.</p>
      <p>Let us split these goals further into three concrete requirements: Transparency - the solution needs to provide insights into its decision process that can be understood by a sufficiently trained domain expert. Flexibility - in order to be useful for complex problems, the proposed approach needs to be flexible, i.e. be able to handle a broad range of functional relations between input and target variables. Teachability - finally, we need to be able to use an interpretable representation of existing knowledge and align the model's decision process with that knowledge.</p>
      <p>Apart from that, a practically relevant solution should also be able to handle real-world problems. This implies, in particular, that scalability to reasonably large datasets has to be possible.</p>
      <p>Contribution. To achieve the above desiderata, this paper proposes self-explaining variational GPs (SEVGPs). The self-explanatory component aims to satisfy the transparency requirement. By using the right kernel functions, GPs can handle complex functional relations as demanded under the flexibility specification. Since GPs are part of the family of Bayesian models, they are naturally able to incorporate prior knowledge, i.e. they also fall under the idea of teachability. The primary limitation in this regard lies in the representations in which we are able to express our prior knowledge.</p>
      <p>While GP models in their original form are unable to deal with large datasets, there exist
many scalable solutions nowadays. Our approach will apply the concept of sparse variational
GPs (SVGPs) in order to achieve scalability to big data problems as well.</p>
      <p>Related work. The results of [2, 3, 4] directly inspired this approach from an explainability
and transparency point of view. In fact, the approach [3] relates to this work in a similar way as
GPs relate to SVGPs. However, as will be seen, this paper does not merely provide a scalable
variant of the former work via SVGPs.</p>
      <p>In addition to the transparency component, our aim is to also create a tool that can be used
to provide human expert knowledge via transparent representations. [5, 6, 7, 8] all discuss the
potentially beneficial role of expert and domain knowledge in machine learning, yet either
mention Bayesian methods only briefly or not at all. Nevertheless, Bayesian non-parametrics
have already been applied successfully in countless classical statistical modeling problems with
an emphasis on incorporating prior knowledge - see [9] for a variety of examples.</p>
      <p>Recent work on functional variational inference as discussed particularly in [10, 11] could be
a fruitful step towards a synthesis of meaningful prior models and modern Machine Learning
architectures.</p>
      <p>Outline. In the next section we give a brief recap of transparent machine learning with a focus on varying-coefficient and self-explaining methods. Thereafter, we proceed similarly for GPs and SVGPs. The fourth section marks the main contribution of this paper, where the primary formulas of our approach are presented and discussed. Experimental validation of the approach is conducted in section five. Finally, we discuss limitations and potential extensions of our methodology in the last section. Proofs and derivations, as well as additional details, can be found in the appendix.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Transparent Machine Learning</title>
      <p>In regards to transparency in machine learning, terms like interpretable machine learning or
explainable artificial intelligence (XAI) have become quite widespread and popular. However,
up to this date, there is still no uniquely accepted definition for many terms in this field. In our
context, where we consider supervised learning problems, we will use the following definitions
of interpretation and explanation from [12]:
Definition 1. An interpretation is the mapping of an abstract concept into a domain that the
human can make sense of. An explanation is the collection of features of the interpretable domain,
that have contributed for a given example to produce a decision.</p>
      <p>The corresponding authors particularly name images and text as interpretable domains.
Explanations, on the other hand, could be visualizations that highlight image regions or certain words
that contributed in favour of or against a given decision.</p>
      <p>As we will see, it makes sense to allow for explanations to also quantify the strength of contribution per interpretable feature. For example, consider a fixed grey-scale image and denote the corresponding vector of the image's pixels, encoded in the range [0, 1], as x ∈ [0, 1]^P. By introducing a coefficient vector β ∈ R^P with the same dimensionality as x, we can derive the usual linear model for a single example

y = β^T x.   (1)

The outcome scalar y ∈ R could then be mapped to a valid probability via some monotone, increasing function σ : R ↦→ (0, 1). This obviously results in a binary classification problem. Notice that we can equally write (1) as the sum of pixel-coefficient products, i.e.

y = Σ_{p=1}^{P} β^(p) x^(p).   (2)</p>
      <p>With respect to the mentioned classification problem, (2) now implies the following logic for quantifiable explanations: image pixels where β^(p) x^(p) &gt; 0 contribute towards a positive classification, whereas pixels where β^(p) x^(p) &lt; 0 contribute towards a negative classification.² Also, pixels where |β^(p) x^(p)| is close to zero provide almost no contribution to the outcome, and pixels where |β^(p) x^(p)| is large provide a large contribution. From now on, let us explicitly name the product β^(p) x^(p) the contribution of the p-th feature.

²Notice that we might have to add a constant term to this representation in order to account for cases where x^(p) = 0. Otherwise, the contribution of those features will always be zero. For simplicity though, we will only consider the model as in (2).</p>
      <p>
        Obviously, the contribution of each pixel must be able to difer for diferent images. Even
under a mere translation of some baseline image, the corresponding contributions must also shift
accordingly. As a result, the static coeficients as implied in (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) are unrealistic when considering
multiple, diferent images. Rather, the  should vary with the given input image, i.e.
      </p>
      <p>=   ()</p>
      <p>Equation (3) now implies that the coefficient vector is a function of the input vector; in the context of the above grey-scale input: β : [0, 1]^P ↦→ R^P. At this point, we should reiterate that this formulation is not restricted to image classification but can easily be extended to other domains that permit a similar representation of their input features. In fact, models like (3) were proposed as early as in [13] for classical statistical regression problems with tabular data.</p>
      <p>More recent work on these 'varying coefficient' models has been done in [2], who considered them, under the umbrella term self-explaining models, for modern machine learning problems like image or text classification. The most important novelty is the replacement of regression splines to model β(·) with a feedforward neural network with P output neurons.</p>
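      <p>The varying-coefficient idea is easy to sketch in code. In the following Python toy example, beta is a stand-in for the coefficient model of [2] (our own illustrative choice, not the actual architecture); the explanation is simply the vector of per-feature contributions, which sums to the raw prediction:

```python
import numpy as np

def beta(x):
    # Stand-in varying-coefficient function; in [2] this would be a
    # feedforward neural network with P output neurons.
    return np.tanh(x - 0.5)

def predict_with_contributions(x):
    # The contribution of the p-th feature is beta(x)[p] * x[p];
    # the raw model output is their sum, as in (2) and (3).
    contributions = beta(x) * x
    return contributions.sum(), contributions

x = np.array([0.2, 0.9, 0.0])   # grey-scale pixels in [0, 1]
y, contrib = predict_with_contributions(x)
```

By construction, the explanation is faithful: the contributions add up exactly to the model output, and features with zero value receive zero contribution.</p>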
    </sec>
    <sec id="sec-3">
      <title>3. Gaussian Processes</title>
      <p>The building blocks of GPs, see [14], are a prior distribution over functions, p(f), and a likelihood p(y | f). Using Bayes' law, we are interested in a posterior distribution p(f | y), obtained as

p(f | y) = p(y | f) p(f) / p(y).   (4)</p>
      <p>The prior distribution is a Gaussian Process, fully specified by a mean function m(·) : X ↦→ R, typically m(x) = 0, and a covariance kernel function k(·, ·) : X × X ↦→ R0+:

p(f) = GP(f | m(·), k(·, ·)).   (5)</p>
      <p>
        We assume the input domain for  to be a bounded subset of the real numbers,  ⊂ R.
Technically, this invalidates (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) as  then becomes an infinite-dimensional object for which
a probability density does not exist. Since we are dealing with finite-dimensional datasets
only, this techincal inaccuracy does not pose a problem in our further treatment. To exemplify
our focus on finite dimensional marginals, we will make heavy use of subscripts to match
inter-related objects.
      </p>
      <p>Most importantly, we denote the N × D matrix of input data-points as X and the corresponding marginal GP output as f_X = f(X). This allows us to discuss GPs either at their multivariate Gaussian marginal output or as actual random functions. We will switch between both concepts depending on the situation.</p>
      <p>A common choice for k(·, ·) is the ARD (Automatic Relevance Determination) kernel

k(x, x') = θ · exp(−0.5 (x − x')^T Σ (x − x'))   (6)

where Σ = diag(l_1^2, ..., l_D^2) is a diagonal matrix with entries in R0+ and θ &gt; 0. For θ = 1, (6) is equivalent to an SE (Squared Exponential) kernel. We denote by K_XX the positive semi-definite Gram matrix, obtained as (K_XX)_{ij} = k(x_i, x_j), x_i the i-th row of the training input matrix X. As before, we denote the kernel Gram matrix belonging to Z as K_ZZ and a potential mean vector as m_Z = m(Z).</p>
      <p>Provided that p(y | f_X) = ∏_{i=1}^{N} N(y_i | f_i, σ²), i.e. training observations y_i are i.i.d. univariate Gaussian conditioned on f_X, it is possible to directly calculate a corresponding posterior distribution for new inputs X_* as

p(f_* | y) = N(f_* | Λ_{*X} y, K_{**} − Λ_{*X} (K_{XX} + σ² I) Λ_{*X}^T)   (7)

where Λ_{*X} = K_{*X} (K_{XX} + σ² I)^{−1}, (K_{*X})_{ij} = k(x_{*,i}, x_j), (K_{**})_{ij} = k(x_{*,i}, x_{*,j}); I is the identity matrix with according dimension.</p>
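      <p>For concreteness, the posterior predictive above can be computed directly in NumPy. The sketch below (function names are ours) uses one common parameterization of the ARD kernel with θ = 1 and unit lengthscales:

```python
import numpy as np

def ard_kernel(A, B, theta=1.0, lengthscales=None):
    # One common parameterization of the ARD kernel (6):
    # k(x, x') = theta * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)
    D = A.shape[1]
    ls = np.ones(D) if lengthscales is None else np.asarray(lengthscales, float)
    diff = (A[:, None, :] - B[None, :, :]) / ls
    return theta * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior(X, y, X_star, sigma2=0.1):
    # Exact GP posterior predictive with Lambda = K_*X (K_XX + sigma^2 I)^{-1};
    # a Cholesky solve would be preferred over an explicit inverse in practice.
    K_XX = ard_kernel(X, X)
    K_sX = ard_kernel(X_star, X)
    K_ss = ard_kernel(X_star, X_star)
    Lam = K_sX @ np.linalg.inv(K_XX + sigma2 * np.eye(len(X)))
    return Lam @ y, K_ss - Lam @ K_sX.T   # posterior mean and covariance
```

For small observation noise, the posterior mean interpolates the training targets, as expected from the formula.</p>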
      <p>In order to make GPs feasible for large datasets, the work of [15, 16, 17] developed and refined Sparse Variational Gaussian Processes (SVGPs). SVGPs introduce a set of M so-called inducing locations Z ⊂ X and corresponding inducing variables u = f(Z). The resulting posterior distribution, p(f, u | y), is then approximated through a variational distribution q(f, u) = p(f | u) q(u), often q(u) = N(u | m, S), by maximizing the evidence lower bound (ELBO).</p>
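      <p>The KL term of the ELBO is available in closed form, since both q(u) = N(m, S) and the prior p(u) = N(m_Z, K_ZZ) are multivariate Gaussians. A minimal sketch (our own helper, not a specific library API):

```python
import numpy as np

def gauss_kl(m_q, S_q, m_p, S_p):
    # KL( N(m_q, S_q) || N(m_p, S_p) ) for multivariate normals;
    # this is the regularization term of the SVGP ELBO.
    M = len(m_q)
    S_p_inv = np.linalg.inv(S_p)
    diff = m_p - m_q
    return 0.5 * (np.trace(S_p_inv @ S_q) + diff @ S_p_inv @ diff
                  - M + np.log(np.linalg.det(S_p) / np.linalg.det(S_q)))
```

The divergence vanishes exactly when the variational distribution equals the prior, which is what anchors q(u) to p(u) during training.</p>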
    </sec>
    <sec id="sec-4">
      <title>4. Self-explaining variational posterior distributions</title>
      <p>The preceding two sections easily motivate the replacement of the feedforward neural network in self-explaining models by a GP model. For a given matrix of training data X and target vector y, we obtain the likelihood model given in (10) below. Recall first the standard SVGP ELBO:</p>
      <p>ELBO = Σ_{i=1}^{N} E_{q(f_i | u) q(u)}[log p(y_i | f_i)] − KL(q(u) || p(u))   (8)

where KL(q(·) || p(·)) denotes the KL-divergence between two (multivariate) Normal distributions. Finally, let us recall the following distributional properties of the marginal variational posterior process q̃(f_*) = ∫ p(f_* | u) q(u) du:</p>
      <p>q̃(f_*) = N(f_* | Λ̃_{*Z} m, K_{**} − Λ̃_{*Z} (K_{ZZ} − S) Λ̃_{*Z}^T)   (9)

where Λ̃_{*Z} = K_{*Z} K_{ZZ}^{−1}. Also, we will write μ̃_* := Λ̃_{*Z} m and Σ̃_* := K_{**} − Λ̃_{*Z} (K_{ZZ} − S) Λ̃_{*Z}^T. If two input matrices A and B each consist of a single datapoint, μ̃ and Σ̃ can be viewed as the mean and kernel functions of the variational GP, evaluated at A and B. We then denote the implicit GP mean and kernel functions as μ̃(·) = Λ̃_{·Z} m and Σ̃(·, ·) = K_{··} − Λ̃_{·Z} (K_{ZZ} − S) Λ̃_{·Z}^T.</p>
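      <p>Equation (9) can be evaluated directly once the relevant Gram matrices are available. A minimal NumPy sketch (names are ours; again, solves should replace the explicit inverse in practice):

```python
import numpy as np

def svgp_predict(K_sZ, K_ZZ, K_ss, m, S):
    # Marginal variational posterior, cf. (9):
    # Lambda = K_*Z K_ZZ^{-1}, mean = Lambda m,
    # cov = K_** - Lambda (K_ZZ - S) Lambda^T
    Lam = K_sZ @ np.linalg.inv(K_ZZ)
    return Lam @ m, K_ss - Lam @ (K_ZZ - S) @ Lam.T
```

Setting S = K_ZZ and m = 0 recovers the prior, which is a useful sanity check of an implementation.</p>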
      <p>This allows us to hide the underlying dependencies on Z and u in our notation and treat the variational GP as a separate entity from the original GP whose posterior distribution we are trying to approximate. The likelihood model reads

p(y | f_1, ..., f_D; X) = p(y | X · f_{1:D}(X))   (10)

where "·" means matrix multiplication for clarity (we will omit the "·" from now on) and f_{1:D}(X) = [f_1(X); ...; f_D(X)] stacks the D coefficient GP outputs.</p>
      <p>We explicitly included the input matrix X in (10) to exemplify the relation to self-explaining models. Also, let us require independence between the individual GPs. Now, we are dealing with a linear combination of D independent GPs instead of a single one. Combining (10) and the concept of SVGPs, we can introduce D variational processes and approximate the respective varying-coefficient GPs.</p>
      <p>This directly implies the following ELBO:

L = E_{q̃(f_{1,X}, ..., f_{D,X})}[log p(y | f_{1,X}, ..., f_{D,X})] − Σ_{d=1}^{D} KL(q(u_d) || p(u_d))   (11)

where f_{d,X} denotes the d-th coefficient GP evaluated at the rows x_i of X. The derivation of (11) can be found in Appendix A. Notice that we now have D sets of inducing variables, u_d. Obtaining a posterior predictive distribution for a Gaussian likelihood is also straightforward under this model:

p(y_* | x_*) = ∫ p(y_* | x_*, f_{1,*}, ..., f_{D,*}) q̃(f_{1,*}, ..., f_{D,*}) df_{1,*} ... df_{D,*}
            = N(y_* | Σ_{d=1}^{D} x_*^(d) ⊙ μ̃_{d,*}, Σ_{d=1}^{D} (x_*^(d) (x_*^(d))^T) ⊙ Σ̃_{d,*} + σ² · I)   (12)</p>
      <p>Here, ⊙ denotes element-wise multiplication, I is a unit-diagonal matrix of according dimension and σ² is the variance hyperparameter of the Gaussian likelihood. Finally, we can calculate a posterior distribution of the contribution of the d-th feature for a given input vector x:

x^(d) f_d(x) ∼ N(x^(d) · μ̃_d(x), (x^(d))² · Σ̃_d(x, x))   (13)</p>
      <p>Now, let us introduce D GPs, f̃_1, ..., f̃_D, with the following finite-dimensional marginal distributions:

f̃_{d,*} ∼ N(x_*^(d) ⊙ μ̃_{d,*}, (x_*^(d) (x_*^(d))^T) ⊙ Σ̃_{d,*})   (14)

with μ̃_d, Σ̃_d the mean vector and kernel Gram-matrix per GP as defined in (9). For a given set of inputs and the underlying mean and kernel functions m(·), k(·, ·) fixed, the behavior of the f̃_1, ..., f̃_D can be manipulated by adjusting m_d, S_d, the variational parameters of the underlying inducing variables. Clearly, (14) can be interpreted as the attribution corresponding to the respective marginal SVGP.</p>
      <p>By summing up the f̃_d, we obtain yet another GP, f̃, with trivial marginal distribution:

f̃_* ∼ N(Σ_{d=1}^{D} x_*^(d) ⊙ μ̃_{d,*}, Σ_{d=1}^{D} (x_*^(d) (x_*^(d))^T) ⊙ Σ̃_{d,*})   (15)

Notice that f̃ yields a self-explaining GP whose d-th attribution can easily be queried via the corresponding summand GP, f̃_{d,*}.</p>
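      <p>The decomposition just described can be sketched numerically. The helper below (our own naming, not from the paper) takes the variational posterior means of the D coefficient GPs as given and returns the predictive mean together with the per-feature attributions:

```python
import numpy as np

def sevgp_mean_and_attributions(X_star, mus):
    # mus[d]: variational posterior mean of the d-th coefficient GP at X_star.
    # The attribution of feature d is X_star[:, d] * mus[d] (cf. (14)),
    # and the overall predictive mean is their sum over d (cf. (15)).
    attributions = [X_star[:, d] * mus[d] for d in range(X_star.shape[1])]
    return np.sum(attributions, axis=0), attributions
```

Querying the d-th attribution is then just indexing into the returned list.</p>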
      <p>Our goal now is to use f̃ as a variational posterior distribution for an arbitrary GP f by finding a set of parameters, namely m_d, S_d (and potential hyperparameters for m(·), k(·, ·)), for f̃ that minimize

KL(q̃(f) || p(f | y))   (16)

with q̃(·) the GP distribution as defined in (14). Unfortunately, the usual route for SVGP inference is not possible since q̃(f_X) = ∫ q̃(f_X | u) q̃(u) du and p(f_X) = ∫ p(f_X | u) p(u) du with q̃(f_X | u) ≠ p(f_X | u), and therefore the conditional distributions do not cancel in the derivation of the ELBO. To solve the resulting infinite-dimensional variational problem between the two respective GPs, we apply functional variational inference as proposed by [10]. The authors show that there exists a functional evidence lower bound (fELBO) which can be maximized in order to solve the optimization problem in (16):</p>
      <p>L = E_{q̃(f)}[log p(y | f)] − E_{c(C)}[KL(q̃(f_{(X,C)}) || p(f_{(X,C)}))]   (17)

where C is a so-called measurement set, obtained by sampling uniformly from the space of all possible inputs, X; (X, C) then denotes the union of X and C via row-wise stacking. By applying the fELBO to our prior and variational processes, we obtain:</p>
      <p>L_1 = E_{q̃(f_X)}[log p(y | f_X)] − E_{c(C)}[KL(N(μ̃_{(X,C)}, Σ̃_{(X,C)}) || N(m_{(X,C)}, K_{(X,C)}))]   (18)

where μ̃_{(X,C)}, Σ̃_{(X,C)} denote the evaluation of the mean vector and kernel Gram-matrix from (15) at (X, C), and m_{(X,C)}, K_{(X,C)} denote the mean vector and kernel Gram-matrix of the prior GP, evaluated accordingly.</p>
      <p>In essence, this approach allows us to encode functional prior knowledge via the prior GP as usual. By decomposing the variational posterior GP after optimizing (18) into its summand attribution GPs, we obtain a transparent approximation of the true posterior distribution in the tradition of varying-coefficient or self-explaining models.</p>
      <p>Another promising use-case arises when we place prior distributions on the attribution GPs themselves, e.g. for arbitrary input x:

a_d := x^(d) f_d ∼ GP(m_d(·), k_d(·, ·))⁵   (19)

⁵a_d can be seen as a linear operator on f_d that transforms all finite-dimensional marginals of f_d via x^(d) ⊙ f_d.</p>
      <p>If the respective mean and kernel functions can be decomposed as x^(d) · m_d(x) and x^(d) x'^(d) · k_d(x, x'), (19) is a GP problem of the form discussed before. If this is not the case, however, and if we want to retain transparency of the respective posterior distribution, we can approximate the attribution GPs by f̃_1, ..., f̃_D. As in (16), we want to minimize

KL(q̃(f̃_1, ..., f̃_D) || p(a_1, ..., a_D | y)).   (20)</p>
      <p>By invoking (17) again, and by the fact that the KL-divergence of the joint distribution between prior and variational GPs decomposes as the sum of the KL divergences for mutually independent GPs, we get:

L_2 = E_{q̃(f̃_1, ..., f̃_D)}[log p(y | f̃_1, ..., f̃_D)] − E_{c(C)}[Σ_{d=1}^{D} KL(N(μ̃_{d,(X,C)}, Σ̃_{d,(X,C)}) || N(m_{d,(X,C)}, K_{d,(X,C)}))]   (21)</p>
      <p>As a brief example, we could choose m_d(·) &lt;&lt; 0 to exemplify the prior belief that the attribution of the d-th feature is negative with high probability. Obviously, potential priors could be much more complex. In fact, it might be fruitful to consider implicit processes as introduced in [18] as a prior and use our self-explaining posterior as an approximation.</p>
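      <p>A prior of the kind m_d(·) &lt;&lt; 0 is straightforward to visualize by sampling. The sketch below (an illustrative choice of kernel and scale, not the paper's exact prior) draws attribution functions that are negative with high probability:

```python
import numpy as np

def sample_attribution_prior(x, mean_level=-3.0, n_samples=3, seed=0):
    # Prior draws for an attribution GP whose constant mean lies well
    # below zero, encoding the belief that the attribution is negative
    # with high probability.
    rng = np.random.default_rng(seed)
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * sq) + 1e-8 * np.eye(len(x))   # SE kernel plus jitter
    m = np.full(len(x), mean_level)
    return rng.multivariate_normal(m, K, size=n_samples)
```

With unit kernel variance and a mean of −3, a draw exceeds zero only with very small probability, matching the intended prior belief.</p>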
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>In this section, we evaluate the proposed method on several experimental tasks. In particular, we are interested in the explanations generated by our method, its ability to incorporate prior assumptions, and its predictive performance. All experiments were conducted on regression problems, where the likelihood could be assumed to be Gaussian.</p>
      <p>Extended implementation details can be found in Appendix B.</p>
      <sec id="sec-5-1">
        <title>5.1. Evaluation of explanations</title>
        <p>In addition to point values for the varying coefficients, the SVGP components also allow evaluating the variance of the varying coefficients. In accordance with the typical interpretation of posterior variance in Bayesian models, this can be interpreted as a measure of coefficient uncertainty or explanation uncertainty.</p>
        <p>To evaluate these measures, the coefficient means and variances of a trained SEVGP model (via (11)) were calculated for two datapoints from the Boston housing dataset. Figures 1 and 2 show the results. While the coefficient means are relatively stable for both examples, the variances differ visibly. Interestingly, the coefficients of the left example show high uncertainty for the most influential coefficient (feature CHAS). The respective outputs can be used to check for hidden biases or erroneous reasoning in the respective model.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation of inclusion of prior knowledge</title>
        <p>To verify the model's capability to incorporate existing prior knowledge, a random sample from a quadratic function with Gaussian noise was created in the interval [−2, 2]. A model that is able to handle knowledge about the underlying quadratic function should be able to extrapolate accordingly beyond the range of the observed data (often termed the out-of-distribution problem).</p>
        <p>In order to validate this claim for our approach, the three models implied in (11), (18) and (21) were compared. For (18) (= prior knowledge about f) a GP prior with a second-order polynomial kernel was used. For (21) (= prior knowledge about the feature-wise effects) a GP prior with a linear kernel was placed on a_d, which is technically equivalent to placing a polynomial kernel on f.</p>
        <p>The results in Figure 3 indicate that the model is able to correctly handle the functional prior knowledge about the underlying quadratic function. It can be seen that both models that were trained with additional prior knowledge (middle and right) were able to correctly extrapolate the quadratic function. Without such prior knowledge (left model), the resulting posterior predictive distribution only fits the in-sample data but is unable to extrapolate out of distribution.</p>
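        <p>The spirit of this experiment is easy to reproduce. The NumPy sketch below (a hand-rolled illustration, not the paper's implementation) computes the GP posterior mean under a second-order polynomial kernel and extrapolates well beyond the training interval [−2, 2]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=50)
y = X ** 2 + rng.normal(0, 0.1, size=50)

def poly2_kernel(a, b, sigma0=1.0):
    # Second-order polynomial kernel k(x, x') = (x * x' + sigma0^2)^2;
    # its samples are quadratic functions, encoding the prior belief
    # that the target is quadratic in the input.
    return (np.outer(a, b) + sigma0 ** 2) ** 2

K = poly2_kernel(X, X) + 0.1 ** 2 * np.eye(50)   # add noise variance
alpha = np.linalg.solve(K, y)

X_ood = np.array([4.0, -4.0])            # outside the training interval
pred = poly2_kernel(X_ood, X) @ alpha    # GP posterior mean, extrapolated
```

Because every posterior mean under this kernel is itself a quadratic, the prediction at ±4 lands close to the true value of 16, mirroring the behavior of the middle and right models in Figure 3.</p>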
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation of predictive performance</title>
        <p>To validate the predictive performance of the proposed method, it was evaluated over four regression datasets (Boston housing, concrete, wine red and wine white) via five-fold cross-validation. For comparison, a standard SVGP was also trained and evaluated on the same folds. Table 1 shows the average MSE and its standard deviation over the folds. All GP models used an ARD covariance kernel and zero-mean prior functions.</p>
        <p>Since the SEVGP uses one SVGP per coefficient, the number of inducing points in the SVGP was increased accordingly to account for the increased model capacity of the SEVGP. See Appendix B for more details.</p>
        <p>It can be seen that our proposed method achieves comparable performance to the SVGP. This implies that problems where the latter performs well allow the SVGP to be replaced by a SEVGP in case the discussed benefits are deemed advantageous.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and discussion</title>
      <p>This paper presented a method that combines GPs with recent developments in varying-coefficient/self-explaining methods for machine learning. By taking advantage of the Bayesian properties of GPs, it is also possible to inject prior knowledge into the respective models. One area where both the transparency and the teachability aspects can be helpful is the field of fair and unbiased machine learning. On the one hand, transparency allows detecting biased or discriminating results on a per-instance basis. On the other hand, teachability could help prevent or eliminate potential biases by carefully encoding non-biasing prior knowledge into the model. While this would certainly not be a silver bullet, there might nevertheless be considerable, general potential at the intersection of explainable and human-in-the-loop machine learning.</p>
      <p>A clear limitation is the fact that the notion of explainability we considered in this paper is a statistical one, with a focus on local, per-pixel explanations. In complex problems like image classification, this might not suffice if a class is inferred from multiple symbolic relations of different objects that are present in a given image instance. Nevertheless, statistical approaches have recently been shown to be quite successful on such complex problems despite possessing no inherent capabilities for logic deduction.</p>
      <p>Future work on the proposed method should try to find a way to make it scalable to other, potentially high-dimensional, supervised learning problems. Particularly problems with image inputs, like image classification or reinforcement learning, might greatly benefit from external prior knowledge when training data is only sparsely available.</p>
      <p>[15] M. K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: D. A. V. Dyk, M. Welling (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, volume 5 of JMLR Proceedings, JMLR.org, 2009, pp. 567-574.
[16] J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: A. Nicholson, P. Smyth (Eds.), Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11-15, 2013, AUAI Press, 2013.
[17] J. Hensman, A. G. de G. Matthews, Z. Ghahramani, Scalable variational Gaussian process classification, in: G. Lebanon, S. V. N. Vishwanathan (Eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, volume 38 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015. URL: http://proceedings.mlr.press/v38/hensman15.html.
[18] C. Ma, Y. Li, J. M. Hernández-Lobato, Variational implicit processes, in: International Conference on Machine Learning, PMLR, 2019, pp. 4222-4233. URL: https://proceedings.mlr.press/v97/ma19b.html.</p>
      <p>A. Derivation of (11)

We write p(f_{1:D}) = p(f_1, ..., f_D) and p(u_{1:D}) = p(u_1, ..., u_D). Notice that p(f_{1:D}) does technically not exist, as it involves the infinite-dimensional stochastic processes for which densities don't exist. As these objects will cancel out anyway, and since such notation is commonly seen in the GP literature, we will keep it here for simplicity. Otherwise, to be notationally exact, we would have to work with KL divergences over probability measures, which would make the results much less convenient to derive.</p>
      <p>Minimizing KL(q(f_{1:D}, u_{1:D}) || p(f_{1:D}, u_{1:D} | y)) is equivalent to maximizing

E_{q(f_{1:D}, u_{1:D})}[log p(y | f_{1:D}) + log p(f_{1:D} | u_{1:D}) + log p(u_{1:D}) − log q(f_{1:D} | u_{1:D}) − log q(u_{1:D})].

Since q(f_{1:D} | u_{1:D}) = p(f_{1:D} | u_{1:D}), the conditional densities cancel. Since y depends on f_{1:D} only via f_{1:D,X}, and only on the marginals given by X, the first term reduces to E_{q̃(f_{1:D,X})}[log p(y | f_{1:D,X})], by marginalizing out u_{1:D} and writing q̃ for the marginal variational posterior as explained before. By independence of the prior and variational GPs, and by the standard i.i.d. assumption about the observed datapoints, the remaining terms decompose, yielding

L = E_{q̃(f_{1:D,X})}[log p(y | f_{1:D,X})] − Σ_{d=1}^{D} KL(q(u_d) || p(u_d))

which is (11).</p>
      <p>The derivation of (21) proceeds analogously: y depends on f̃_{1:D} only via the marginals given by X, the prior and variational GPs are mutually independent, and the observed datapoints are i.i.d.; marginalizing out the inducing variables and writing q̃ as before, the KL term again decomposes as Σ_{d=1}^{D} KL(q(u_d) || p(u_d)).</p>
    </sec>
  </body>
  <back>
  </back>
</article>