<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Faithful Explanatory Active Learning with Self-explainable Neural Nets?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Teso</string-name>
          <email>stefano.teso@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven</institution>
          ,
          <addr-line>Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <fpage>4</fpage>
      <lpage>16</lpage>
      <abstract>
<p>From the user's perspective, interaction in active learning is very opaque: the user only sees a sequence of instances to be labeled, and has no idea what the model believes or how it behaves. Explanatory active learning (XAL) tackles this issue by making the model predict and explain its own queries using local explainers. By witnessing the model's (lack of) progress, the user can decide whether to trust it. Despite their promise, existing implementations of XAL rely on post-hoc explainers, which can produce unfaithful and fragile explanations that misrepresent the beliefs of the predictor, confuse the user, and affect the quality of her supervision. As a remedy, we replace post-hoc explainers with self-explainable models, and show how these can be actively learned from both labels and corrected explanations. Our preliminary results showcase the dangers of post-hoc explanations and hint at the promise of our solution.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Active Learning</kwd>
        <kwd>Explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Explainable machine learning has so far mostly focused on black-box models
learned offline [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It was recently observed that interactive protocols can be
black-box too [
        <xref ref-type="bibr" rid="ref16 ref24">24, 16</xref>
        ]. For instance, in active learning (AL), the user receives a
sequence of instances (e.g. images, documents) to be labeled, but can witness
neither the behavior of the predictor nor its beliefs.
      </p>
      <p>
        Explanatory active learning (XAL) tackles this issue by injecting
explanations into the learning loop [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]: whenever asking the user to label an instance,
the model also shows a prediction for that instance and an explanation for the
prediction. The explanations are obtained with a local explainer [
        <xref ref-type="bibr" rid="ref14 ref8">14, 8</xref>
        ], which
summarizes and visualizes the local behavior of the predictor in terms of
interpretable feature relevance or other understandable artifacts. By witnessing the
evolution of the beliefs and decisions of the model, the user can justifiably grant or revoke trust to it.
      </p>
    </sec>
    <sec id="sec-2">
      <p>
        This is analogous to trust between individuals, which
requires developing appropriate expectations through interaction [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. In XAL, the
user is also free to correct the explanations by, e.g., indicating any irrelevant or
sensitive features that the model is currently relying on. This extra supervision
is necessary to correct models that are “right for the wrong reasons” [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Existing implementations of XAL make use of post-hoc local explainers—
for instance, caipi [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] uses lime [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]—which treat the predictor being learned
as a black-box. These approaches extract explanations through an approximate
model translation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] step. The overall process can produce unfaithful, fragile
explanations that misrepresent the model’s beliefs and have high-variance [
        <xref ref-type="bibr" rid="ref1 ref2">1,
2</xref>
        ]. Unfaithful and unstable explanations may confuse the user and affect the
quality of the user-provided corrections. More generally, such explanations are
not trustworthy and conflict with the purpose of explanatory interaction.
      </p>
      <p>
        As a remedy, we propose replacing post-hoc explainers with self-explainable
neural networks (SENNs), a recently proposed class of models that automatically
explain their own predictions. Intuitively, SENNs combine the transparency of
linear models with the flexibility of neural nets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The explanations produced
by SENNs are exact, robust to small perturbations, and cheap to compute. In
contrast with standard interpretable models (e.g. shallow decision trees), SENNs
can tackle complex problems—including representation learning—via gradient
descent. In order to integrate them into XAL, we show how to learn SENNs from
labels and corrections directly by combining classification and ranking losses.
Our preliminary empirical analysis shows that SENNs can substantially improve
the quality of the explanations used in XAL and can be actively learned from
labels and corrections.
      </p>
      <p>Summarizing, we: 1) highlight the risks of unfaithful explanations in
interactive learning; 2) propose a novel implementation of XAL based on
self-explainable neural nets; 3) propose a joint loss to learn SENNs from labels and
corrections directly; 4) report on preliminary experiments that showcase the
behavior of post-hoc explainers and the promise of our solution.
</p>
      <sec id="sec-2-1">
        <title>Background</title>
        <p>In the following, we will stick to binary classification and indicate instances as
x ∈ X and labels as y ∈ Y = {0, 1}. Our observations can be easily generalized
to the multi-class case.
</p>
        <sec id="sec-2-1-1">
          <title>Post-hoc Local Explainers</title>
          <p>
            Given a classifier f : X → Y, for instance a neural network or a random forest,
post-hoc local explainers explain individual predictions without looking at the
exact inference steps performed by f [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. Here we briefly detail lime [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], which
is central in current XAL implementations.
          </p>
          <p>
            In order to explain a prediction y0 = f (x0), lime learns an interpretable
local model g0 that mimics f in the neighborhood of x0, and then reads off an
explanation from it. The local model is a sparse linear predictor or a shallow
decision tree [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] built with (user-provided) interpretable features ψ(x). These
may capture, e.g., individual words in document classification or objects in
image tagging. The local model is learned from a synthetic dataset that describes
counterfactual (“what if”) information about switching on / off the interpretable
features on f (x0).
          </p>
          <p>
            More formally, the process amounts to:
1. Sampling s interpretable instances ξ1, . . . , ξs by randomly perturbing the
interpretable representation of x0, namely ξ0 = ψ(x0);
2. Labeling each instance ξi using the target model yi = f (xi), where xi =
ψ−1(ξi) is the pre-image of ξi;
3. Weighting each example (ξi, yi) by its similarity to ξ0, i.e., k(ξi, ξ0); the
kernel function k determines the size and shape of the neighborhood of ξ0;
4. Fitting a local model g0 on the synthetic dataset via cost-sensitive learning,
so that examples outside of the neighborhood do not have much of an impact;
this is a form of model translation [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ];
5. Extracting an explanation from g0. For instance, if g0 is linear (i.e., g0(x) =
Σj wj ψj(x)), then the explanation describes the contributions of the
interpretable features according to the weights wj. In practice only the largest
weights are used. If g0 is a decision tree, then the feature contributions can
be read off from the path connecting the root to the predicted leaf.
The advantage of this procedure is that it is completely model-agnostic, as it
treats f as a black-box. The downside, however, is that it is not exact, and so
g0 may not approximate f well. We will discuss the consequences later on.
          </p>
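          <p>For concreteness, the following Python sketch mirrors the five steps above for a linear local model. It is only an illustration of the procedure, not the actual lime implementation: the interpretable map psi, its pre-image psi_inv, the binary perturbation scheme, and the RBF kernel bandwidth are all assumptions.</p>
          <preformat>
import numpy as np
from sklearn.linear_model import Ridge

def lime_like_explanation(f, x0, psi, psi_inv, s=1000, bandwidth=0.25, top_k=5, seed=0):
    """Toy LIME-style local surrogate: perturb, label with f, weight, fit, read weights."""
    rng = np.random.default_rng(seed)
    xi0 = psi(x0)                                   # interpretable representation of x0
    masks = rng.integers(0, 2, size=(s, xi0.size))  # 1. randomly switch features on / off
    xis = masks * xi0
    ys = np.array([f(psi_inv(xi)) for xi in xis])   # 2. label each pre-image with the black box
    dists = np.linalg.norm(xis - xi0, axis=1)       # 3. weight samples by similarity to xi0
    weights = np.exp(-(dists ** 2) / bandwidth ** 2)
    g0 = Ridge(alpha=1.0).fit(xis, ys, sample_weight=weights)  # 4. cost-sensitive local fit
    order = np.argsort(-np.abs(g0.coef_))[:top_k]   # 5. keep the largest-magnitude weights
    return [(int(j), float(g0.coef_[j])) for j in order]
          </preformat>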
        </sec>
        <sec id="sec-2-1-2">
          <title>Explanatory Active Learning</title>
          <p>
            Pool-based active learning (AL) is designed for settings where labels are scarce
and expensive to obtain [
            <xref ref-type="bibr" rid="ref22 ref9">22, 9</xref>
            ]. Learning proceeds iteratively. Initially, the model
has access to a small set of labeled examples L ⊆ X × Y and a large pool of
unlabeled instances U ⊆ X . In each iteration, the model asks an oracle (i.e.
a human expert or a measurement device) to label any instance in U —for a
price. The newly labeled instance is then moved to L and the model is adapted
accordingly. These steps are repeated until a labeling budget is exhausted or the
model is deemed good enough. The key challenge in AL is to design a strategy
for selecting informative and representative query instances, so to learn good
predictors at a small labeling cost. A common choice is uncertainty sampling
(US) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], which picks instances where the model is most uncertain in terms of,
e.g., margin or entropy.
          </p>
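          <p>As a minimal illustration of margin-based uncertainty sampling for a binary classifier, the sketch below picks the pool instance whose predicted positive-class probability is closest to 0.5; predict_proba is an assumed callable, not part of any specific library discussed here.</p>
          <preformat>
import numpy as np

def uncertainty_sampling(predict_proba, pool):
    """Return the unlabeled instance with the smallest margin, i.e. the one whose
    positive-class probability is closest to 0.5 (the most uncertain query)."""
    probs = np.array([predict_proba(x) for x in pool])
    margins = np.abs(probs - 0.5)
    return pool[int(np.argmin(margins))]
          </preformat>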
          <p>
            In order to make the interaction more transparent and directable,
explanatory active learning (XAL) injects explanations into the learning loop and
enables the user to interact with them [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]. In XAL, when the model asks the user
to label an instance x, it also presents a prediction for that instance yˆ = fˆ(x)
and an explanation zˆ for the prediction. In the caipi implementation of XAL,
the explanation is computed with lime. In exchange, the user provides the true
label of x and optionally corrects the explanation. The correction indicates, for
instance, which interpretable features (e.g. pixels, words) are erroneously being
used by the model. Since models cannot learn from corrections directly, caipi
converts corrections into counter-examples, as follows: if the user indicated that
some interpretable feature ψi(x) is wrongly being used by the model, then caipi
creates c copies of x where feature ψi is randomized, and attaches the true
label to them. Intuitively, the counter-examples teach the predictor to predict the
correct label independently of the value associated with the irrelevant feature.
It was shown that, for lime, explanation corrections describe local orthogonality
constraints [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], and that counter-examples approximate these constraints.
          </p>
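          <p>A minimal sketch of the counter-example conversion just described is given below. The feature indexing and the per-feature value sampler are illustrative assumptions; caipi's actual code may differ.</p>
          <preformat>
import numpy as np

def corrections_to_counterexamples(x, y_true, wrong_features, c=10, value_sampler=None, seed=0):
    """Turn a correction ("feature j should not matter") into c counter-examples:
    copies of x where each flagged interpretable feature is randomized, all labeled y_true."""
    rng = np.random.default_rng(seed)
    if value_sampler is None:
        value_sampler = lambda j: rng.uniform(0.0, 1.0)   # placeholder domain for feature j
    counterexamples = []
    for _ in range(c):
        x_new = np.array(x, dtype=float, copy=True)
        for j in wrong_features:
            x_new[j] = value_sampler(j)                   # decouple the label from feature j
        counterexamples.append((x_new, y_true))
    return counterexamples
          </preformat>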
          <p>By combining interactions and explanations, XAL helps the user to build a
mental model of the predictor being learned, which is necessary for justifiably
according trust to it. In addition, by virtue of learning from explanation
corrections, XAL makes it less likely that the model learns to predict the right labels
for the wrong reasons. We will see, however, that these promises can be difficult
to keep when post-hoc explainers are involved.
</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Self-explainable Neural Networks</title>
          <p>Sparse linear models over interpretable features are canonically considered to be
highly interpretable. These models have the form f(x) = σ(w&gt;φ(x)) (here and below, the bias term is left implicit), where σ is
a sigmoid function, w ∈ Rn is constant, and φ : X → Rn is a fixed, interpretable
feature map. The contribution of the jth feature to the output of the predictor
is determined by the corresponding weight. Despite their interpretability, sparse
linear models are limited to relatively simple learning tasks.</p>
          <p>
            Self-explainable neural networks (SENNs) upgrade linear models by
substantially increasing their capacity and flexibility while preserving their
interpretability [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. More specifically, SENNs have the same functional form, but allow the
weights w to change across the space, i.e.:
f(x) = σ(w(x)&gt;φ(x))    (1)
Here, both w(x) and φ(x) are (arbitrarily deep) neural networks. The inner
product can be replaced by any appropriate aggregation function [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. In order
to make w(x) act as an explanation, two additional restrictions are put in place:
1. The learned feature function φ has to be interpretable. This can be achieved
either by designing it by hand (as with linear predictors and lime), by
defining it in terms of (learned) prototypes, or by other means; see [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] for details.
2. As with linear models, the explanations should capture the local behavior of
the model and not be affected by small displacements of the input instance.
          </p>
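          <p>The following PyTorch-style sketch instantiates Eq. 1 with an identity concept map φ(x) = x and an instance-dependent weight network w(x); the layer sizes are illustrative and do not reproduce the architectures of [15].</p>
          <preformat>
import torch
import torch.nn as nn

class TinySENN(nn.Module):
    """Minimal self-explainable net: f(x) = sigmoid(w(x) . phi(x)) as in Eq. 1."""
    def __init__(self, n_features, hidden=64, n_hidden_layers=1):
        super().__init__()
        layers, d = [], n_features
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, n_features))    # one relevance score per concept
        self.w_net = nn.Sequential(*layers)        # instance-dependent weights w(x)

    def phi(self, x):
        return x                                   # interpretable concepts; identity here

    def forward(self, x):
        w = self.w_net(x)                          # the explanation: per-concept relevance
        logits = (w * self.phi(x)).sum(dim=-1)     # inner-product aggregation
        return torch.sigmoid(logits), w
          </preformat>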
          <p>
            This is guaranteed by constraining w to vary slowly with respect to φ.
More formally, w is required to be locally difference bounded by φ, as per the
following definition:
Definition 1 ([
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]). A function f : Rn → Ra is locally difference bounded by
a function g : Rn → Rb if for every x0 ∈ Rn there exist two constants δ &gt; 0
and L &gt; 0 such that for every x ∈ Rn:
          </p>
          <p>‖x − x0‖ &lt; δ =⇒ ‖f(x) − f(x0)‖ ≤ L‖g(x) − g(x0)‖
In other words, for every point x0 there is a neighborhood within which the
change in f is bounded by the change in g. (This can be limiting in classification
scenarios, because the output should change abruptly when crossing the decision
boundary; the formulation can be modified to account for this, but we keep it as is,
for simplicity.) Notice that the local Lipschitz “constant” L is allowed to vary with x0.</p>
          <p>This requirement is enforced during learning by penalizing the model for any
deviations from linearity through the following regularization term:
Ω(f) def= ‖∇x f(x) − w(x)&gt;Jx‖
where Jx is the Jacobian of the feature map φ at x. The model f is learned by
minimizing the empirical risk of some classification loss ℓY plus the above
regularizer, i.e.:
minf Ê(x,y) [ℓY(f(x), y) + α Ω(f)]    (2)
Here Ê indicates expectation over mini-batches and α ≥ 0 is a hyperparameter.
As usual with neural networks, minimization is performed with gradient descent
techniques. Of course, an additional term encouraging the output of w(x) to be
sparse can be considered. We will see later on that this is automatically the case
in our extension of SENNs to explanatory active learning.</p>
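          <p>A hedged sketch of how the penalty and Eq. 2 can be evaluated with automatic differentiation is shown below, assuming the TinySENN interface sketched earlier. With the identity concept map, the Jacobian Jx is the identity, so the penalty reduces to the distance between the gradient of the linear score and w(x); taking the gradient of the score rather than of the sigmoid output is a modeling choice made here for simplicity, not something fixed by the text.</p>
          <preformat>
import torch
import torch.nn.functional as F

def senn_objective(model, x, y, alpha=0.01):
    """Sketch of Eq. 2: classification loss plus a linearity penalty Omega(f)."""
    x = x.clone().requires_grad_(True)
    prob, w = model(x)                              # forward pass of the SENN sketch above
    loss_y = F.binary_cross_entropy(prob, y.float())
    score = (w * model.phi(x)).sum(dim=-1)          # linear part w(x)^T phi(x)
    grad_x = torch.autograd.grad(score.sum(), x, create_graph=True)[0]
    omega = (grad_x - w).norm(dim=-1).mean()        # deviation from acting like a linear model
    return loss_y + alpha * omega
          </preformat>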
          <p>It is worth pointing out that SENNs are quite different from attention
models, which are yet another way of explaining (to some extent) neural networks.
Indeed, the former explain exactly which features contribute to the prediction,
as well as their polarity. The features themselves are arbitrary (interpretable)
functions of the inputs. Attention models, in contrast, identify which input
features (e.g. pixels) are relevant, but not how they contribute to the decision nor
whether they are against/in favor.
</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Toward Faithful XAL</title>
        <sec id="sec-2-2-1">
          <title>The Dangers of Post-hoc Explanations</title>
          <p>Let us start by discussing the case of lime, which is at the core of existing XAL
implementations. In this case, the accuracy of the model translation process
(described above) depends critically on the choice of the model class of g0, interpretable
features ψ, kernel k, and number of samples s.</p>
          <p>For instance:
1. If the kernel k is not chosen correctly (i.e. if it is too small, too broad, or has
the wrong topology), then the synthetic dataset may fail to capture label
changes around x0, as shown in Figure 1.
</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>In all these cases, lime produces high-variance explanations that substantially
misrepresent the behavior of f . Increasing the number of samples may improve
the situation, but also the runtime, and may not be enough to stabilize the
explanations anyway.</p>
      <p>Crucially, the same high-level argument applies to all post-hoc explainers,
because all of them treat f as a black-box and thus—by construction—must
include an inexact model translation step.</p>
      <p>From the standpoint of XAL, unfaithfulness has two major consequences.
First, high-variance explanations portray the model as behaving semi-randomly,
regardless of whether it is good or not. The user may also perceive that her
supervision has no effect, and feel a lack of control. In rare cases, caipi may also
accidentally persuade the user into trusting a bad model. Both cases are
problematic in sensitive applications, like the ones that XAL is designed for. More
generally, unfaithful explanations misrepresent the learned model, defeating the
purpose of explanatory active learning. Second, unfaithful explanations
compromise the usefulness of the corrections. If the user is confused by the explanations,
her corrections will not be as informative. Further, the correction may specify
not to use a feature that the model is not using anyway (or vice versa).
Depending on the model being learned, correcting the same feature too many times may
also lead to learning instabilities.</p>
      <p>It is therefore desirable to fix the XAL pipeline to rely on faithful and
trustworthy explanations. In the next section, we show how to do so.</p>
      <p>[Algorithm 1: Pseudocode of cali. L is the set of labeled examples, U is the set
of unlabeled instances, and T is the query budget.]</p>
      <p>Our proposed algorithm—dubbed Calimocho, or cali for short—closely follows
the XAL learning loop; the pseudocode is listed in Algorithm 1. Notice that f
here is a SENN. Explanation corrections are collected in a set C, initially empty.
In each iteration, the algorithm chooses an instance x ∈ U (using uncertainty
sampling, as in caipi, for simplicity), predicts its class yˆ, and generates an
explanation zˆ for the prediction (line 5). The explanation zˆ is simply read off
w(x), i.e., zˆ = w(x), without any projection or sampling step. All components
are presented to the user (lines 6–7), who replies with the true label y of x and
optionally with an improved explanation z¯. Finally, the dataset is updated and
the preference z¯ ≻ zˆ is added to the corrections C.</p>
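      <p>Because only the caption of Algorithm 1 survives in this version, the loop described above is sketched below in Python-style pseudocode; select_by_uncertainty, ask_user, and fit_senn are placeholder names, not functions of the actual implementation.</p>
      <preformat>
def cali(f, L, U, T):
    """Sketch of the cali loop: L = labeled examples, U = unlabeled pool, T = query budget."""
    C = []                                    # explanation corrections, initially empty
    for t in range(T):
        x = select_by_uncertainty(f, U)       # query selection, as in caipi
        y_hat = f.predict(x)
        z_hat = f.explain(x)                  # read off w(x); no projection or sampling step
        y, z_bar = ask_user(x, y_hat, z_hat)  # true label and optional corrected explanation
        U.remove(x)
        L.append((x, y))
        if z_bar is not None:
            C.append((x, z_bar, z_hat))       # store the preference of z_bar over z_hat
        f = fit_senn(L, C)                    # retrain on labels and corrections (Eq. 3)
    return f
      </preformat>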
      <p>
        The learning step requires a strategy to train SENNs from corrections as well.
One option is to follow caipi and use counter-examples. However, these only
approximately capture the constraint imposed by the correction, e.g., that the
label should not depend on a particular interpretable feature. Depending on the
application, there is also a (slim) chance that the counter-examples are actually
wrong. Feature influence supervision solves this issue by asking the user [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
but this can be cognitively costly.
      </p>
      <p>Instead, we opt for learning SENNs directly from corrections, as follows.
Recall that, given an explanation zˆ, a correction specifies, e.g., which features are
erroneously being used by the model. Applying the correction to zˆ leads to a
corrected explanation z¯. It is therefore natural to impose that w(x) generates
explanations that are closer to the corrected explanation than to the
predicted one. This can be accomplished by, e.g., minimizing the squared Euclidean
distance ‖w(x) − z¯‖² while maximizing ‖w(x) − zˆ‖². It is easy to see that this
is equivalent to imposing a ranking loss:</p>
      <p>‖w(x) − z¯‖² − ‖w(x) − zˆ‖² = 2⟨w(x), zˆ − z¯⟩ + ‖z¯‖² − ‖zˆ‖²
Notice that the features that were not corrected by the user (i.e. z¯j − zˆj = 0) do
not contribute to the loss, as expected. Our solution also generalizes the input</p>
    </sec>
    <sec id="sec-4">
      <p>
        gradient constraints introduced in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] from learning from full explanations to
corrections, which contain information about only few interpretable features.
      </p>
      <p>
        Letting the dataset include instances x, labels y, and preferences z¯ ≻ zˆ
(represented with C in the pseudo-code), and denoting the ranking loss as
ℓZ(w(x), z¯ ≻ zˆ) = ⟨w(x), zˆ − z¯⟩, cali fits f by extending Eq. 2 with the above loss:
minf Ê(x, z¯≻zˆ, y) [λ ℓY(f(x), y) + (1 − λ) ℓZ(w(x), z¯ ≻ zˆ) + α Ω(f)]    (3)
Noisy corrections can be handled by tweaking the hyperparameter λ ∈ [0, 1].
      </p>
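      <p>A sketch of how Eq. 3 can be evaluated on a mini-batch is given below, reusing the TinySENN interface assumed earlier; the values of λ and α, and the way corrections are batched, are assumptions for illustration only.</p>
      <preformat>
import torch
import torch.nn.functional as F

def cali_objective(model, x, y, z_bar, z_hat, lam=0.5, alpha=0.01):
    """Sketch of Eq. 3: lambda * label loss + (1 - lambda) * ranking loss + alpha * Omega(f)."""
    x = x.clone().requires_grad_(True)
    prob, w_x = model(x)
    loss_y = F.binary_cross_entropy(prob, y.float())
    loss_z = ((z_hat - z_bar) * w_x).sum(dim=-1).mean()   # l_Z: prefer the corrected explanation
    score = (w_x * model.phi(x)).sum(dim=-1)
    grad_x = torch.autograd.grad(score.sum(), x, create_graph=True)[0]
    omega = (grad_x - w_x).norm(dim=-1).mean()            # linearity penalty, as in Eq. 2
    return lam * loss_y + (1.0 - lam) * loss_z + alpha * omega
      </preformat>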
      <sec id="sec-4-1">
        <title>Discussion</title>
        <p>
          A few remarks are in order. First, cali trades off the model-agnosticism of
caipi for exactness and efficiency. This is desirable, since faithfulness is necessary
to justifiably establish trust, especially in sensitive applications. In addition,
despite their apparent simplicity, SENNs are very flexible non-linear models [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ],
substantially more powerful than “standard” interpretable models (e.g. they
natively support representation learning) and easy to learn with state-of-the-art
gradient descent techniques.
        </p>
        <p>It is natural to ask what kind of SENN architectures can be learned in a
label-scarce framework like active learning. On the one hand, shallow SENNs
can be learned efficiently from few labels alone. On the other hand, by supplementing
label information, we expect explanation corrections to help actively learn
deep(er) models. Our preliminary results suggest that this might be the case.
</p>
        <sec id="sec-4-1-1">
          <title>Experiments</title>
          <p>
            We address the following research questions:
Q1 Are the explanations output by LIME faithful?
Q2 Does cali learn from corrections?
Q3 Does explanatory feedback help learn deeper models?
In order to do so, we implemented cali (code at github.com/stefanoteso/calimocho) and ran a preliminary experiment
with the synthetic color dataset used in [
            <xref ref-type="bibr" rid="ref19 ref24">19, 24</xref>
            ], where the goal is to classify
small synthetic images for the right reasons. The images are 5 × 5 and have four
possible colors. An image is positive if i) either the four corners have the same
color, or ii) the top three middle pixels all have different colors. On all training
images either both rules are satisfied or neither is. This means that labels alone
are not enough to disambiguate between the two potential explanations. Both
images and explanations are represented as 5 × 5 × 4 one-hot arrays. Notice
that the two conditions can be easily expressed as linear concepts in this space.
As in [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], we consider true explanations z that highlight the k most relevant
pixels. The corrections instead highlight (a subset of the) pixels that are wrongly
highlighted in the predicted explanations zˆ. In all cases, we use a SENN where
φ(x) is simply the one-hot encoding of x and w(x) is a fully-connected
feedforward neural network with L = 1, 3, 5 hidden layers. All results are 5-fold
cross-validated.
          </p>
          <p>Q1: Are the explanations output by caipi faithful? We trained a SENN for 1000
epochs and every 250 iterations computed the recall@k of the explanations
produced by LIME on 10 random (but fixed) test examples. This was done by
looking at how many of the k highest scoring features in the SENN explanation
(which is exact) appear among the k highest scoring features found by LIME.
In practice, caipi stabilizes high-variance explanations by running LIME r times
and averaging the obtained explanations. To check whether this technique is
effective, we repeated LIME r = 5, 10, 25 times and then measured the pair-wise
Euclidean distance between the r explanations. The results, in Figure 2, show
that the LIME explanations are never completely faithful, regardless of the
number of samples s and repeats r, thus confirming our arguments about post-hoc
explainers. In addition, while increasing s does have a clear beneficial effect, r
is surprisingly not beneficial. Regardless, increasing either does have an effect
on runtimes, as shown by the right plot. In comparison, SENN explanations are
exact and robust by construction and also essentially free to compute, because
w(x) must be evaluated anyway whenever doing inference. In practice,
evaluating w amounts to forward propagation and takes a tiny fraction of a second
(data not shown). This can be a substantial advantage in interactive applications
to avoid making the user wait.</p>
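          <p>The recall@k computation used above can be summarized with the short sketch below; exact_weights and surrogate_weights are assumed to be aligned vectors of per-feature relevance scores (from the SENN and from LIME, respectively).</p>
          <preformat>
import numpy as np

def recall_at_k(exact_weights, surrogate_weights, k):
    """Fraction of the k most relevant features of the exact (SENN) explanation that
    also appear among the k most relevant features of the surrogate (LIME) explanation."""
    top_exact = set(np.argsort(-np.abs(exact_weights))[:k])
    top_surrogate = set(np.argsort(-np.abs(surrogate_weights))[:k])
    return len(top_exact.intersection(top_surrogate)) / float(k)
          </preformat>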
          <p>Q2: Does cali learn from corrections? In this experiment, we select each rule in
turn and use cali to actively learn a SENN from explanation corrections. The
ground-truth explanations highlight the 4 or 3 pixels that the decision actually
depends on, and the corrections identify the c = 1, . . . , 4 pixels whose predicted
weight wi(x) is farthest away from the correct weight. Figure 3 shows the label
loss `Y and the explanation loss `Z as more queries are made, up to 300. Turning</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>
        on learning from corrections brings a huge benefit, as expected. Indeed, when no
corrections are provided (i.e. λ = 1), the SENN slowly learns the target concept,
as shown by the leftmost plot, while corrections help the SENN to converge
much faster. Even more importantly, unless corrections are enabled, the model
fails to be “right for the right reasons”, as the explanation loss diverges as
more labels are obtained (middle plot). Finally, the rightmost plot shows that
active learning with cali is very efficient, requiring less than 0.2 seconds per iteration
on average. One surprising result is that increasing the number of corrections c
does not monotonically increase performance. This needs to be validated further;
however, the overall trend is clear and is consistent with the findings of [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The
results for the two rules and different query strategies (random and margin-based
uncertainty sampling) show exactly the same trend, and are not reported.
Q3: Does explanatory feedback help learn deeper models? Finally, we look at the
effect of learning from corrections while increasing the number of hidden
layers L = 1, 3, 5 of w(x). The results of increasing L are reported in Figure 4.
Once again, the effect of corrections is very clear: they help the label loss to
decay faster, and are necessary for the explanation loss not to diverge. Most
importantly, these results hold regardless of the choice of L, with larger models
behaving much worse on explanation loss when learned from labels only. This
result is (preliminary but) very interesting especially in the light of the
observation that caipi behaves best when learning sparser models [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. In stark contrast,
cali seems to behave well even for deeper SENNs.
      </p>
      <sec id="sec-5-1">
        <title>Related Work</title>
        <p>
          Despite the surge of interest in explainable AI and machine learning, most
research has focused on passive learning, and specifically on (1) designing
interpretable predictors [
          <xref ref-type="bibr" rid="ref12 ref25">25, 12</xref>
          ] and (2) explaining black-box models such as neural
networks [
          <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
          ]. Instead, we consider explanations in an interactive learning
setting.
        </p>
        <p>
          We specifically study explanatory active learning (introduced in [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] in the
wider context of explanatory interactive learning) which injects explanations
into the active learning loop. Related approaches were proposed in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which
uses lime to communicate the exploration pattern of the active learner to the
user, and in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], where a model is learned explicitly from feature-level feedback.
Like previous work on (interactive) feature selection and dual supervision [
          <xref ref-type="bibr" rid="ref17 ref21 ref4 ref7">17, 7,
4, 21</xref>
          ], these works either ignore the issue of trust or overlook the advantages of learning
from corrections rather than explanations. Please see [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] for a discussion.
Conceptually, XAL is related to techniques like explanatory debugging [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which
however targets simple predictors only. Similar themes have been championed
by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          The importance of explanation faithfulness has long been recognized [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. More
recently, the hidden dangers of local explainability tools have been the subject of
a number of studies, e.g., [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and the limitations of post-hoc explainers
have been studied in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The issue of unfaithful explanations in explanatory
interactive learning, however, has not been considered before. These studies have
led to the development of exact and/or robust explanatory techniques like input
gradients [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] (aka instantaneous causal effects), average causal effects [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and of
course SENNs [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. To the best of our knowledge, however, SENNs are the only
method that 1) supports gradient-based optimization and interpretable
representation learning, 2) generates exact explanations that are robust to
perturbations, and 3) can be learned from labels and—with this paper—from corrections
directly.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Conclusion and Outlook</title>
        <p>The present paper highlights the dangers of post-hoc explainers for explanatory
active learning (XAL). Post-hoc explainers are prone to generating unfaithful
explanations that misrepresent the target predictor and prevent the user from
appropriately allocating trust to it. In order to solve this issue, we extend
existing XAL implementations by replacing post-hoc explainers with self-explainable
neural networks (SENNs). SENNs automatically generate exact and robust
explanations for their own predictions. In order to integrate them with XAL, we
show how to learn SENNs from labels and explanation corrections by
combining classification and ranking losses. Our preliminary experiments showcase the
fragility of post-hoc explainers and the potential of SENNs in explanatory active
learning.</p>
        <p>Of course, our results need further validation, including a more direct
comparison with caipi, which we plan to carry out soon. They also hint at the</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <p>promise of corrections for learning deeper networks even when labels are scarce.
We plan to investigate this direction in future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adebayo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilmer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Local explanation methods for deep neural networks lack sensitivity to parameter values</article-title>
          . arXiv e-prints (
          <year>Oct 2018</year>
          ), arXiv:
          <year>1810</year>
          .03307
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alvarez-Melis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaakkola</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          :
          <article-title>On the robustness of interpretability methods</article-title>
          . arXiv e-prints (
          <year>Jun 2018</year>
          ), arXiv:
          <year>1806</year>
          .08049
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Andrews</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diederich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tickle</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>Survey and critique of techniques for extracting rules from trained artificial neural networks</article-title>
          .
          <source>Knowledge-based systems 8(6)</source>
          ,
          <fpage>373</fpage>
          -
          <lpage>389</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Attenberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>A unified approach to active dual supervision for labeling features and examples</article-title>
          .
          <source>Machine Learning and Knowledge</source>
          Discovery in Databases pp.
          <fpage>40</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buciluaˇ</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Model compression</article-title>
          .
          <source>In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>535</fpage>
          -
          <lpage>541</lpage>
          . ACM (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chattopadhyay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manupriya</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balasubramanian</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          :
          <article-title>Neural network attributions: A causal perspective</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          . pp.
          <fpage>981</fpage>
          -
          <lpage>990</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Druck</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.:
          <article-title>Active learning by labeling features</article-title>
          .
          <source>In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-</source>
          Volume 1. pp.
          <fpage>81</fpage>
          -
          <lpage>90</lpage>
          . Association for Computational Linguistics (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Guidotti</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monreale</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruggieri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannotti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedreschi</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>A survey of methods for explaining black box models</article-title>
          .
          <source>ACM computing surveys (CSUR) 51(5)</source>
          ,
          <volume>93</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hanneke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Theory of disagreement-based active learning</article-title>
          .
          <source>Foundations and Trends R in Machine Learning</source>
          <volume>7</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>131</fpage>
          -
          <lpage>309</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Holzinger</surname>
            ,
            <given-names>A.:</given-names>
          </string-name>
          <article-title>From machine learning to explainable ai</article-title>
          .
          <source>In: 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA)</source>
          . pp.
          <fpage>55</fpage>
          -
          <lpage>66</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kulesza</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , et al.:
          <article-title>Principles of explanatory debugging to personalize interactive machine learning</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Intelligent User Interfaces</source>
          . pp.
          <fpage>126</fpage>
          -
          <lpage>137</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lage</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gershman</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doshi-Velez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Human-in-the-loop interpretability prior</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>10159</fpage>
          -
          <lpage>10168</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          :
          <article-title>A sequential algorithm for training text classifiers</article-title>
          .
          <source>In: SIGIR'94</source>
          . pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . Springer (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lundberg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>An unexpected unity among methods for interpreting model predictions</article-title>
          . arXiv e-prints (
          <year>Nov 2016</year>
          ), arXiv:
          <fpage>1611</fpage>
          .
          <fpage>07478</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Melis</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaakkola</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Towards robust interpretability with self-explaining neural networks</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>7775</fpage>
          -
          <lpage>7784</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Phillips</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedler</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>Interpretable active learning</article-title>
          . In: Conference on Fairness, Accountability, and
          <string-name>
            <surname>Transparency</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , et al.:
          <article-title>Active learning with feedback on features and instances</article-title>
          .
          <source>Journal of Machine Learning Research 7(Aug)</source>
          ,
          <fpage>1655</fpage>
          -
          <lpage>1686</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Why should I trust you?: Explaining the predictions of any classifier</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          . pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          . ACM (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doshi-Velez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Right for the right reasons: training differentiable models by constraining their explanations</article-title>
          .
          <source>In: Proceedings of the 26th International Joint Conference on Artificial Intelligence</source>
          . pp.
          <fpage>2662</fpage>
          -
          <lpage>2670</lpage>
          . AAAI Press (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mardziel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fredrikson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Supervising feature influence</article-title>
          . arXiv e-prints (
          <year>Mar 2018</year>
          ), arXiv:
          <year>1803</year>
          .10815
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances</article-title>
          .
          <source>In: Proceedings of the conference on empirical methods in natural language processing</source>
          . pp.
          <fpage>1467</fpage>
          -
          <lpage>1478</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Active learning</article-title>
          .
          <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>114</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Simpson</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Psychological foundations of trust</article-title>
          .
          <source>Current directions in psychological science 16(5)</source>
          ,
          <fpage>264</fpage>
          -
          <lpage>268</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Teso</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kersting</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Explanatory interactive machine learning</article-title>
          .
          <source>In: Proceedings of AIES'19</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parbhoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zazzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doshi-Velez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Beyond sparsity: Tree regularization of deep models for interpretability</article-title>
          .
          <source>In: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Yeh</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suggala</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inouye</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravikumar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the (In)fidelity and Sensitivity for Explanations</article-title>
          . arXiv e-prints (
          <year>Jan 2019</year>
          ), arXiv:
          <year>1901</year>
          .09392
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>