<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interactive Label Cleaning with Example-based Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Teso</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Bontempelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fausto Giunchiglia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Passerini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jilin University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We tackle sequential learning under label noise in applications where a human supervisor can be queried to relabel suspicious examples. Existing approaches are flawed, in that they only relabel incoming examples that look “suspicious” to the model. As a consequence, mislabeled examples that elude (or don't undergo) this cleaning step end up tainting the training data and the model, with no further chance of being cleaned. We propose cincer, a novel approach that cleans both new and past data by identifying pairs of mutually incompatible examples. Whenever it detects a suspicious example, cincer identifies a counter-example in the training set that, according to the model, is maximally incompatible with the suspicious example, and asks the annotator to relabel either or both examples, resolving this possible inconsistency. The counter-examples are chosen to be maximally incompatible, so as to serve as explanations of the model's suspicion, and highly influential, so as to convey as much information as possible if relabeled. cincer achieves this by leveraging an efficient and robust approximation of influence functions based on the Fisher information matrix (FIM). Our extensive empirical evaluation shows that clarifying the reasons behind the model's suspicions by cleaning the counter-examples helps in acquiring substantially better data and models, especially when paired with our FIM approximation.</p>
      </abstract>
      <kwd-group>
        <kwd>Interactive Learning</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Label Noise</kwd>
        <kwd>Influence Functions</kwd>
        <kwd>Fisher Information</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Label noise is a major issue in machine learning as it can lead to compromised predictive
performance and unreliable models [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. We focus on sequential learning settings in which
a human supervisor, usually a domain expert, can be asked to double-check and relabel any
potentially mislabeled example. Applications include crowd-sourced machine learning and
citizen science, where trained researchers can be asked to clean the labels provided by
crowdworkers [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], and interactive personal assistants [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where the user self-reports the initial
annotations (e.g., about activities being performed) and unreliability is due to memory bias [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
unwillingness to report [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], or conditioning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        This problem is often tackled by monitoring for incoming examples that are likely to be
mislabeled, aka suspicious examples, and asking the supervisor to provide clean (or at least better)
annotations for them. Existing approaches, however, focus solely on cleaning the incoming
examples [
        <xref ref-type="bibr" rid="ref10 ref4 ref5 ref9">9, 4, 10, 5</xref>
        ]. This means that noisy examples that did not undergo the cleaning step
(e.g., those in the initial bootstrap data set) or that managed to elude it are left untouched. This
degrades the quality of the model and prevents it from spotting future mislabeled examples that
fall in regions affected by noise.
      </p>
      <p>We introduce cincer (Contrastive and InflueNt CounterExample stRategy), a new explainable
interactive label cleaning algorithm that lets the annotator observe and fix the reasons behind
the model’s suspicions. For every suspicious example that it finds, cincer identifies a
counterexample, i.e., a training example that maximally supports the machine’s suspicion. The idea
is that the example/counter-example pair captures a potential inconsistency in the data—as
viewed from the model’s perspective—which is resolved by invoking the supervisor. More
specifically, cincer asks the user to relabel the example, the counter-example, or both, thus
improving the quality of, and promoting consistency between, the data and the model. Two
hypothetical rounds of interaction on a noisy version of MNIST are illustrated in Figure 1.</p>
      <p>
        cincer relies on a principled definition of counter-examples derived from a few explicit,
intuitive desiderata, using influence functions [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. The resulting counter-example selection
problem is solved using a simple and efficient approximation based on the Fisher information
matrix [13] that consistently outperforms more complex alternatives in our experiments.
Contributions: We: 1) introduce cincer, an explanatory interactive label cleaning strategy
that leverages example-based explanations to identify inconsistencies in the data—as perceived
by the model—and enable the annotator to fix them; 2) show how to select counter-examples
that explain why the model is suspicious and that are highly informative using (an efficient
approximation of) influence functions; 3) present an extensive empirical evaluation that showcases
the ability of cincer to build cleaner data sets and better models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>We are concerned with sequential learning under label noise. In this setting, the machine
receives a sequence of examples z̃_t := (x_t, ỹ_t), for t = 1, 2, . . ., where x_t ∈ R^d is an instance
and ỹ_t ∈ [c] is a corresponding label, with [c] := {1, . . . , c}. The label ỹ_t is unreliable and
might differ from the ground-truth label y_t^*. The key feature of our setting is that a human
supervisor can be asked to double-check and relabel any example. The goal is to acquire a clean
dataset and a high-quality predictor while asking few relabeling queries, so as to keep the cost of
interaction under control.</p>
      <p>
        The state-of-the-art for this setting is skeptical learning (SKL) [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ]. SKL is designed
primarily for sequential interactive learning from unreliable users: in each iteration the machine
receives an example and updates the model accordingly. However, for each example that it
receives, the machine compares (an estimate of) the quality of the annotation with that of its
own prediction, and if the prediction looks more reliable than the annotation by some factor,
SKL asks the user to double-check the example. The details depend on the implementation:
in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] label quality is estimated using the empirical accuracy for the classifier and the empirical
probability of contradiction for the annotator, while in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the machine measures the margin
between the user’s and machine’s labels. Our approach follows the latter strategy.
      </p>
      <p>
        Another closely related approach is learning from weak annotators (LWA) [
        <xref ref-type="bibr" rid="ref4 ref9">9, 4</xref>
        ], which focuses
on querying domain experts rather than end-users. The most recent approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] jointly learns a
prediction pipeline composed of a classifier and a noisy channel, which allows it to estimate the
noise rate directly, and identifies suspicious examples that have a large impact on the model. A
theoretical foundation is given in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. LWA is however designed for pool-based scenarios, where
the training set is given rather than obtained sequentially. For this reason, in the remainder of
the paper we will chiefly build on and compare to SKL.
      </p>
      <p>Limitations of existing approaches. A major downside of SKL is that it focuses on cleaning
the incoming examples only. This means that if a mislabeled example manages to elude the
cleaning step and gets added to the training set, it is bound to stay there forever. This situation
is actually quite common during the first stage of skeptical learning, in which the model is
highly uncertain and trusts the incoming examples—even if they are mislabeled. The same issue
occurs if the initial training set used to bootstrap the classifier contains mislabeled examples.
As shown by our experiments, the accumulation of noisy data in the training set may have a
detrimental effect on the model’s performance (cf. Figure 2). In addition, it can also affect the
model’s ability to identify suspicious examples: a noisy data point can fool the classifier into
trusting incoming mislabeled examples that fall close to it, further aggravating the situation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Explainable Interactive Label Cleaning with CINCER</title>
      <p>We consider a very general class of probabilistic classifiers f : R^d → [c] of the form
f(x; θ) := argmax_{y ∈ [c]} P(y | x; θ), where the conditional distribution P(Y | X; θ) has been fit on training
data by minimizing the cross-entropy loss ℓ((x, y), θ) = −Σ_{y′ ∈ [c]} 1{y′ = y} log P(y′ | x; θ). In
our implementation, we also assume P to be a neural network with a softmax activation at the
top layer, trained using some variant of SGD and possibly early stopping.</p>
      <sec id="sec-3-1">
        <title>3.1. The cincer Algorithm</title>
        <p>The pseudo-code of cincer is listed in Algorithm 1. At the beginning of iteration t, the machine
has acquired a training set D_{t-1} = {z̃_1, . . . , z̃_{t-1}} and trained a model with parameters θ_{t-1}
on it. At this point, the machine receives a new, possibly mislabeled example z̃_t (line 3) and has
to decide whether to trust it.</p>
        <p>
          Following skeptical learning [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], cincer does so by computing the margin μ(z̃_t, θ_{t-1}), i.e., the
difference in conditional probability between the model’s prediction ŷ := argmax_y P(y | x_t; θ_{t-1})
and the annotation ỹ_t. More formally:
μ(z̃_t, θ_{t-1}) := P(ŷ | x_t; θ_{t-1}) − P(ỹ_t | x_t; θ_{t-1})    (1)
The margin estimates the incompatibility between the model and the example: the larger the
margin, the more suspicious the example. The example z̃_t is deemed compatible if the margin is
below a given threshold τ and suspicious otherwise (line 4); possible choices for τ are discussed
in Section 3.5.
        </p>
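        <p>The margin of Eq. 1 can be computed directly from the model's softmax output. Below is a minimal sketch; the function and variable names are ours, not the authors' code.</p>
        <preformat>
```python
import numpy as np

# Minimal sketch of Eq. 1 (illustrative names, not the authors' code).
def margin(probs, y_tilde):
    """P(y_hat | x) - P(y_tilde | x), from the softmax output `probs`."""
    y_hat = int(np.argmax(probs))
    return float(probs[y_hat] - probs[y_tilde])

probs = np.array([0.7, 0.2, 0.1])
assert margin(probs, 0) == 0.0              # annotation matches the prediction
assert np.isclose(margin(probs, 2), 0.6)    # annotation strongly contradicted
# The example is flagged as suspicious when the margin exceeds the threshold tau.
```
        </preformat>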
        <p>If z̃_t is compatible, it is added to the data set as-is (line 5). Otherwise, cincer computes a
counter-example z ∈ D_{t-1} that maximally supports the machine’s suspicion. The intuition is
that the pair (z̃_t, z) captures a potential inconsistency in the data. For instance, the
counter-example might be a correctly labeled example that is close or similar to z̃_t but has a different
label, or a distant noisy outlier that fools the predictor into assigning low probability to z̃_t. How
to choose an effective counter-example is a major focus of this paper and is discussed in detail in
Section 3.2 and following.</p>
        <p>Next, cincer asks the annotator to double-check the pair (z̃_t, z) and relabel the suspicious
example, the counter-example, or both, thus resolving the potential inconsistency. The data set
and model are updated accordingly (line 9).</p>
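        <p>The loop described above can be sketched as follows; predict_proba, find_ce and ask_user are hypothetical stand-ins for the model, the counter-example selector, and the annotator, not the authors' implementation.</p>
        <preformat>
```python
def cincer_step(dataset, z_tilde, tau, predict_proba, find_ce, ask_user):
    """One iteration of the cincer loop (a sketch of Algorithm 1; the three
    callables stand in for the model, the selector and the user)."""
    x, y_tilde = z_tilde
    probs = predict_proba(x)
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    suspicious = probs[y_hat] - probs[y_tilde] >= tau
    if suspicious:
        # Ask the user to double-check both the example and a counter-example.
        z = find_ce(dataset, z_tilde)
        y_new, y_ce_new = ask_user(z_tilde, z)
        dataset = [e for e in dataset if e is not z]
        dataset = dataset + [(x, y_new), (z[0], y_ce_new)]
    else:
        # Compatible: trust the annotation as-is.
        dataset = dataset + [(x, y_tilde)]
    return dataset

# Tiny illustration with hand-rolled stubs:
out = cincer_step(
    [("b", 0)], ("a", 1), 0.2,
    predict_proba=lambda x: [0.9, 0.1],   # model confident in class 0
    find_ce=lambda D, z: D[0],            # pick the only candidate
    ask_user=lambda z, ce: (0, 0),        # user fixes the noisy label
)
assert sorted(out) == [("a", 0), ("b", 0)]
```
        </preformat>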
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Counter-example Selection</title>
        <p>Counter-examples are meant to illustrate why an example z̃_t is deemed suspicious by the
machine in a way that makes it easy to elicit useful corrective feedback from the supervisor.
We posit that a good counter-example z should be:</p>
        <p>D1. Contrastive: z should explain why z̃_t is considered suspicious by the model, thus
highlighting a potential inconsistency in the data.</p>
        <p>D2. Influential: if z is mislabeled, correcting it should improve the model as much as possible,
so as to maximize the information gained by interacting with the annotator.</p>
        <p>In the following, we show how, for models learned by minimizing the cross-entropy loss, one
can identify counter-examples that satisfy both desiderata.</p>
        <sec id="sec-3-2-1">
          <title>What is a contrastive counter-example?</title>
          <p>We start by tackling the first desideratum. Let
θ_{t-1} be the parameters of the current model. Intuitively, z ∈ D_{t-1} is a contrastive
counter-example for a suspicious example z̃_t if removing it from the data set and retraining leads to a
model with parameters θ_{t-1}^{-z} that assigns higher probability to the suspicious label ỹ_t. The most
contrastive counter-example is then the one that maximally affects the change in probability:
argmax_{z ∈ D_{t-1}} { P(ỹ_t | x_t; θ_{t-1}^{-z}) − P(ỹ_t | x_t; θ_{t-1}) }    (2)
While intuitively appealing, optimizing Eq. 2 directly is computationally challenging, as it
involves retraining the model |D_{t-1}| times. This is impractical for realistically sized models and
data sets, especially in our interactive scenario where a counter-example must be computed in
each iteration.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Influence functions.</title>
          <p>
            We address this issue by leveraging influence functions (IFs), a
computational device that can be used to estimate the impact of specific training examples without
retraining [
            <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
            ]. Let θ_D be the empirical risk minimizer on D, and θ_D(z, ε) the minimizer
obtained after adding an example z with weight ε to D, namely:
θ_D := argmin_θ (1/n) Σ_{i=1}^n ℓ(z_i, θ),    θ_D(z, ε) := argmin_θ (1/n) Σ_{i=1}^n ℓ(z_i, θ) + ε ℓ(z, θ)    (3)
Taking a first-order Taylor expansion, the difference between θ_D = θ_D(z, 0) and θ_D(z, ε) can be
written as θ_D(z, ε) − θ_D(z, 0) ≈ ε · (d θ_D(z, ε)/dε)|_{ε=0}. The derivative appearing on the right-hand
side is the so-called influence function, denoted ℐ_D(z). It follows that the effect on θ_D of adding
(resp. removing) an example z to D can be approximated by multiplying the IF by ε = 1/n (resp.
ε = −1/n). Crucially, if the loss is strongly convex and twice differentiable, the IF can be written
as ℐ_D(z) = −H(θ_D)^{-1} ∇_θ ℓ(z, θ_D), where the curvature matrix H(θ) := (1/n) Σ_{i=1}^n ∇²_θ ℓ(z_i, θ) is
positive definite and invertible. IFs were shown to capture meaningful information even for
non-convex models [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
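          <p>As a sanity check of the expressions above, the toy example below compares the IF prediction against exact leave-one-out retraining on a strongly convex problem (ridge regression); all names and values are illustrative, not from the paper's code.</p>
          <preformat>
```python
import numpy as np

# Toy check of I(z) = -H^{-1} grad l(z, theta) on ridge regression, where the
# minimizer of Eq. 3 has a closed form. Illustrative values throughout.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
lam = 1.0

def fit(X, y, n0):
    # minimiser of (1/n0) sum_i (x_i^T w - y_i)^2 / 2 + (lam/2) ||w||^2
    return np.linalg.solve(X.T @ X / n0 + lam * np.eye(X.shape[1]),
                           X.T @ y / n0)

w = fit(X, y, n)
H = X.T @ X / n + lam * np.eye(d)          # curvature matrix H(theta)
g = X[0] * (X[0] @ w - y[0])               # gradient of the loss at z_1
influence = -np.linalg.solve(H, g)         # I(z_1) = -H^{-1} grad l(z_1)

# Removing z_1 (i.e., eps = -1/n) shifts the parameters by about -I(z_1)/n:
exact = fit(X[1:], y[1:], n) - w
approx = -influence / n
assert np.linalg.norm(exact) > 2 * np.linalg.norm(exact - approx)
```
          </preformat>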
        </sec>
        <sec id="sec-3-2-3">
          <title>Identifying contrastive counter-examples with influence functions.</title>
          <p>To see the link
between contrastive counter-examples and influence functions, notice that the second term of
Eq. 2 is independent of , while the first term can be conveniently approximated with IFs by
applying the chain rule:</p>
          <p>
            −(1/n) · (d/dε) P(ỹ_t | x_t; θ_{t-1}(z, ε)) |_{ε=0} = −(1/n) ∇_θ P(ỹ_t | x_t; θ_{t-1})^⊤ (d θ_{t-1}(z, ε)/dε) |_{ε=0}    (4)
            = −(1/n) ∇_θ P(ỹ_t | x_t; θ_{t-1})^⊤ ℐ_{t-1}(z)    (5)
The constant 1/n can be dropped during the optimization. This shows that Eq. 2 is equivalent
to: argmax_{z ∈ D_{t-1}} ∇_θ P(ỹ_t | x_t; θ_{t-1})^⊤ H(θ_{t-1})^{-1} ∇_θ ℓ(z, θ_{t-1}). This problem can be solved
efficiently by combining two strategies [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]: i) caching the inverse Hessian-vector product
(HVP) ∇_θ P(ỹ_t | x_t; θ_{t-1})^⊤ H(θ_{t-1})^{-1}, so that evaluating the objective on each z becomes a
simple dot product, and ii) solving the inverse HVP with an efficient stochastic estimator like
LISSA [14]. This gives us an algorithm for computing contrastive counter-examples.
          </p>
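          <p>Strategy i) can be sketched as follows: the inverse HVP is computed once, after which scoring each candidate costs a single dot product (illustrative code, not the authors' implementation).</p>
          <preformat>
```python
import numpy as np

# Score each candidate z by grad_p^T H^{-1} grad_loss(z), caching the
# inverse HVP. Names are illustrative.
def most_contrastive(grad_p, hessian, grads_loss):
    v = np.linalg.solve(hessian, grad_p)   # inverse HVP, computed once
    scores = grads_loss @ v                # one dot product per candidate
    return int(np.argmax(scores)), scores

grad_p = np.array([1.0, 0.0])
H = np.eye(2)
grads_loss = np.array([[0.0, 1.0], [2.0, 0.0]])
best, scores = most_contrastive(grad_p, H, grads_loss)
assert best == 1 and np.allclose(scores, [0.0, 2.0])
```
          </preformat>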
        </sec>
        <sec id="sec-3-2-4">
          <title>Contrastive counter-examples are highly influential.</title>
          <p>Can this algorithm be used for
identifying influential counter-examples? It turns out that, as long as the model is obtained
by optimizing the cross-entropy loss, the answer is affirmative. Indeed, note that if ℓ(z, θ) =
−log P(y | x; θ), then:
∇_θ P(ỹ_t | x_t; θ_{t-1}) = P(ỹ_t | x_t; θ_{t-1}) ∇_θ log P(ỹ_t | x_t; θ_{t-1})    (6)
= −P(ỹ_t | x_t; θ_{t-1}) ∇_θ ℓ(z̃_t, θ_{t-1})    (7)
Hence, Eq. 5 can be rewritten as:
−P(ỹ_t | x_t; θ_{t-1}) ∇_θ ℓ(z̃_t, θ_{t-1})^⊤ H(θ_{t-1})^{-1} ∇_θ ℓ(z, θ_{t-1})    (8)
∝ −∇_θ ℓ(z̃_t, θ_{t-1})^⊤ H(θ_{t-1})^{-1} ∇_θ ℓ(z, θ_{t-1})    (9)
It follows that, under the above assumptions and as long as the model satisfies P(ỹ_t | x_t; θ_{t-1}) &gt; 0,
Eq. 2 is equivalent to:
argmax_{z ∈ D_{t-1}} −∇_θ ℓ(z̃_t, θ_{t-1})^⊤ H(θ_{t-1})^{-1} ∇_θ ℓ(z, θ_{t-1})    (10)
This equation recovers exactly the definition of influential examples given in Eq. 2 of [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] and shows that, for the large family of classifiers trained by cross-entropy, highly influential
counter-examples are highly contrastive and vice versa, so that no change to the selection
algorithm is necessary.
          </p>
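          <p>The gradient identity above is easy to verify numerically for a softmax over logits, using the standard closed form ∇ log P(y | x) = e_y − p; the check below (our illustrative code) compares it against a finite-difference gradient of P(y | x).</p>
          <preformat>
```python
import numpy as np

# Numerical check of grad P(y|x) = -P(y|x) * grad loss for a softmax over
# logits z (the closed form grad log P(y|x) = e_y - p is standard).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)                  # logits, standing in for theta
y = 2                                   # the annotated label
p = softmax(z)
grad_loss = -(np.eye(5)[y] - p)         # gradient of -log P(y|x) w.r.t. z

eps = 1e-6
num_grad_p = np.array([                 # finite-difference gradient of P(y|x)
    (softmax(z + eps * np.eye(5)[i])[y] - softmax(z - eps * np.eye(5)[i])[y])
    / (2 * eps) for i in range(5)
])
assert np.allclose(num_grad_p, -p[y] * grad_loss, atol=1e-8)
```
          </preformat>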
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Counter-example Selection with the Fisher information matrix</title>
        <p>
          Unfortunately, we found the computation of IFs to be unreliable in practice, cf. [15]: with
gradient-based training (and possibly early stopping), θ_{t-1} is seldom close to a local minimum
of the empirical risk, rendering the Hessian non-positive definite. This leads to unstable rankings
of the candidates and degrades the quality of the counter-examples. In our setting, the situation is
further complicated by the presence of noise, which dramatically skews the curvature of the
empirical risk. Remedies like fine-tuning the model with L-BFGS [
          <xref ref-type="bibr" rid="ref12">12, 16</xref>
          ], preconditioning, and
weight decay [15] proved unsatisfactory in our experiments.
        </p>
        <p>We take a different approach. The idea is to replace the Hessian with the Fisher information
matrix (FIM). The FIM F(θ) of a discriminative model P(Y | X; θ) and training set D_{t-1} is [17]:
F(θ) := (1/(t−1)) Σ_{i=1}^{t−1} E_{y ∼ P(y | x_i; θ)} [ ∇_θ log P(y | x_i; θ) ∇_θ log P(y | x_i; θ)^⊤ ]    (11)
It can be shown that, if the model approximates the data distribution, the FIM approximates the
Hessian, cf. [19, 20]. Even when this assumption does not hold, as is likely in our noisy setting,
the FIM still captures much of the curvature information encoded in the Hessian [17]. Under
this approximation, Eq. 10 can be rewritten as:
argmax_{z ∈ D_{t-1}} −∇_θ ℓ(z̃_t, θ_{t-1})^⊤ F(θ_{t-1})^{-1} ∇_θ ℓ(z, θ_{t-1})    (12)
The advantage of this formulation is twofold. First, this optimization problem also admits
caching the inverse FIM-vector product (FVP), which makes it viable for interactive usage.
Second, and most importantly, the FIM is positive semi-definite by construction, making the
computation of Eq. 12 much more stable.</p>
        <p>The remaining step is how to compute the inverse FVP. Naïve storage and inversion of the
FIM, which is |θ| × |θ| in size, is impractical for typical models, so the FIM is usually replaced
with a simpler matrix. Three common options are the identity matrix, the diagonal of the FIM,
and a block-diagonal approximation where interactions between parameters of different layers
are set to zero [17]. Our best results were obtained by restricting the FIM to the top layer of the
network. We refer to this approximation as “Top Fisher”. While more advanced approximations
like K-FAC [17] exist, the Top Fisher proved surprisingly effective in our experiments.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Selecting Pertinent Counter-examples</title>
        <p>So far we have discussed how to select contrastive and influential counter-examples. Now we
discuss how to make the counter-examples easier to interpret for the annotator. To this end, we
introduce the additional desideratum that counter-examples should be:</p>
        <p>D3. Pertinent: it should be clear to the user why z is a counter-example for z̃_t.
We integrate D3 into cincer by restricting the choice of possible counter-examples. A simple
strategy, which we employ in all of our examples and experiments, is to restrict the search
to counter-examples whose label in the training set is the same as the prediction for the
suspicious example, i.e., y = ŷ. This way, the annotator can interpret the counter-example as
being in support of the machine’s suspicion. In other words, if the counter-example is labeled
correctly, then the machine’s suspicion is likely right and the incoming example needs cleaning.
Otherwise, if the machine is wrong and the suspicious example is not mislabeled, it is likely the
counter-example – which backs the machine’s suspicions – that needs cleaning.</p>
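        <p>The pertinence restriction can be sketched as a simple filter over the candidates (illustrative code, with scores standing in for the influence scores of Eq. 12):</p>
        <preformat>
```python
# Only candidates whose training label equals the model's prediction y_hat
# are eligible; among them, the highest-scoring one is returned.
def pick_counter_example(labels, scores, y_hat):
    candidates = [i for i, lab in enumerate(labels) if lab == y_hat]
    if not candidates:
        return None
    return max(candidates, key=lambda i: scores[i])

labels = [0, 1, 1, 2]
scores = [9.0, 0.5, 3.0, 7.0]
assert pick_counter_example(labels, scores, y_hat=1) == 2
```
        </preformat>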
        <p>Finally, one drawback of IF-selected counter-examples is that they may be perceptually
different from the suspicious example. For instance, outliers are often highly influential, as they
fool the machine into mispredicting many examples, yet they have little in common with those
examples [20]. This can make it difficult for the user to understand their relationship with
the suspicious examples they are meant to explain. This is not necessarily an issue: first, a
motivated supervisor is likely to correct mislabeled counter-examples regardless of whether they
resemble the suspicious example; second, highly influential outliers are often identified (and
corrected if needed) in the first iterations of cincer (indeed, we did not observe a significant
amount of repetition among suggested counter-examples in our experiments). Still, cincer
can be readily adapted to acquire more perceptually similar counter-examples. One option is
to replace IFs with relative IFs [20], which trade off influence with locality. Alas, the resulting
optimization problem does not support efficient caching of the inverse HVP. A better alternative
is to restrict the search to counter-examples z that are similar enough to z̃_t in terms of some
given perceptual distance ‖·‖ [21], by filtering the candidates using fast nearest neighbor
techniques in perceptual space. This is analogous to FastIF [22], except that the motivation is
to encourage perceptual similarity rather than pure efficiency, although the latter is a nice
bonus.</p>
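        <p>Such a nearest-neighbor pre-filter can be sketched as follows (illustrative code in the spirit of FastIF; a real implementation would use an approximate nearest neighbor index over perceptual features):</p>
        <preformat>
```python
import numpy as np

# Keep only the k candidates closest to the suspicious example in some
# (perceptual) feature space, then rank the survivors by influence.
def nn_prefilter(features, x_feat, k):
    """Return indices of the k candidates closest to x_feat."""
    dists = np.linalg.norm(features - x_feat, axis=1)
    return np.argsort(dists)[:k]

feats = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [4.0, 4.0]])
keep = nn_prefilter(feats, np.array([0.0, 0.1]), k=2)
assert set(keep.tolist()) == {0, 2}
```
        </preformat>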
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Advantages and Limitations</title>
        <p>The main benefit of cincer is that, by asking a human annotator to correct potential
inconsistencies in the data, it acquires substantially better supervision and, in turn, better predictors.
In doing so, cincer also encourages consistency between the data and the model. Another
benefit is that cincer allows the supervisor to spot bugs and justifiably build – or, perhaps more
importantly, reject [23, 24] – trust in the prediction pipeline. cincer only requires setting a
single parameter, the margin threshold τ, which determines how frequently the supervisor is
invoked. The optimal value depends on the ratio between the cost of a relabeling query and
the cost of noise. If the annotator is willing to interact (for instance, because they are paid to do
so), then the threshold can be quite generous. cincer can be readily applied in applications in
which the data set is retained over time, even partially. If retaining data is not an option, cincer
could be adapted to synthesize counter-examples ex novo. This is, however, highly non-trivial
and left to future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We address empirically the following research questions: Q1: Do counter-examples contribute to
cleaning the data? Q2: Which influence-based selection strategy identifies the most mislabeled
counter-examples? Q3: What contributes to the effectiveness of the best counter-example
selection strategy?</p>
      <p>
        We implemented cincer using Python and TensorFlow [25] on top of three classifiers and
compared different counter-example selection strategies on five data sets. The IF code is adapted
from [26]. All experiments were run on a 12-core machine with 16 GiB of RAM and no GPU.
The code for all experiments is available at: https://github.com/abonte/cincer.
Data sets. We used a diverse set of classification data sets: Adult [27]: data set of 48,800
persons, each described by 15 attributes; the goal is to discriminate between people with an income
above/below $50K. Breast [27]: data set of 569 patients described by 30 real-valued features.
The goal is to discriminate between benign and malignant breast cancer cases. 20NG [27]: data
set of newsgroup posts categorized into twenty topics. The documents were embedded using a
pre-trained Sentence-BERT model [28] and compressed to 100 features using PCA. MNIST [29]:
handwritten digit recognition data set of black-and-white, 28 × 28 images with pixel values
normalized to the [0, 1] range. The data set consists of 60K training and 10K test examples.
Fashion [30]: fashion article classification data set with the same structure as MNIST. For adult
and breast a random 80 : 20 training-test split is used, while for MNIST, fashion and 20NG the
split provided with the data set is used. The labels of 20% of training examples were corrupted
at random. Performance was measured in terms of F1 score on the (uncorrupted) test set. Error
bars in the plots indicate the standard error. The experiments were repeated five times, each
time changing the seed used for corrupting the data. All competitors received exactly the same
examples in exactly the same order.
      </p>
      <p>Models. We applied cincer to three models: LR, a logistic regression classifier; FC, a
feedforward neural network with two fully connected hidden layers with ReLU activations; and
CNN, a feed-forward neural network with two convolutional layers and two fully connected
layers. For all models, the hidden layers have ReLU activations and 20% dropout while the
top layer has a softmax activation. LR was applied to MNIST, FC to both the tabular data sets
(namely: adult, breast, german, and 20NG) and image data sets (MNIST and fashion), and CNN
to the image data sets only. Upon receiving a new example, the classifier is retrained from
scratch for 100 epochs using Adam [31] with default parameters, with early stopping when
the accuracy on the training set reaches 90% for FC and CNN, and 70% for LR. This helps
substantially to stabilize the quality of the model and speeds up the evaluation. Before each run,
the models are trained on a bootstrap training set (containing 20% mislabeled examples) of 500
examples for 20NG and 100 for all the other data sets. The margin threshold is set to τ = 0.2.
Due to space constraints, we report the results on one image data set and three tabular data sets,
and we focus on FC and CNN. The other results are consistent with what is reported below;
these plots are reported in the Supplementary Material.</p>
      <sec id="sec-4-1">
        <title>4.1. Q1: Counter-examples improve the quality of the data</title>
        <p>
          To evaluate the impact of cleaning the counter-examples, we compare cincer combined with
the Top Fisher approximation of the FIM, which works best in practice, against two alternatives,
namely: No CE: an implementation of skeptical learning [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] that asks the user to relabel any
incoming suspicious examples identified by the margin and presents no counter-examples.
Drop CE: a variation of cincer that identifies counter-examples using Top Fisher but drops
them from the data set if the user considers the incoming example correctly labeled. The results
are reported in Figure 2. The plots show that cincer cleans by far the most examples on all data
sets, between 33% and 80% more than the alternatives (top row in Figure 2).
        </p>
        <p>Figure 3: Counter-example Pr@5 and Pr@10 (with standard errors). Left to right: results for FC on adult, breast and 20NG, and CNN on MNIST.</p>
        <p>This translates into better predictive performance as measured by F1 score (bottom row). Notice also that cincer
consistently outperforms the Drop CE strategy in terms of F1 score, suggesting that relabeling
the counter-examples provides important information for improving the model. These results
validate our choice of identifying and relabeling counter-examples for interactive label cleaning,
compared to focusing on suspicious incoming examples only, and allow us to answer Q1 in the
affirmative.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Q2: FIM-based strategies identify the most mislabeled CEs</title>
        <p>
          Next, we compare the ability of alternative approximations of IFs to discover mislabeled
counter-examples. To this end, we trained a model on a noisy bootstrap data set, selected 100
examples from the remainder of the training set, and measured how many truly mislabeled
counter-examples are selected by alternative strategies. In particular, we computed influence
using the IF LISSA estimator of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the actual FIM (denoted “Full Fisher” and reported for the
simpler models only, for computational reasons) and its approximations using the identity matrix
(aka “practical Fisher” [32]) and Top Fisher. We computed the precision at k for k ∈ {5, 10}, i.e.,
the fraction of mislabeled counter-examples within the five or ten highest-scoring counter-examples
retrieved by the various alternatives, averaged over 100 iterations for five runs. The results in
Figure 3 show that, in general, FIM-based strategies outperform the LISSA estimator, with Full
Fisher performing best and Top Fisher a close second. Since the full FIM is highly impractical to
store and invert, this confirms our choice of Top Fisher as the best practical strategy.
        </p>
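        <p>For reference, the Pr@k metric used here can be sketched as follows (illustrative code):</p>
        <preformat>
```python
# Pr@k: the fraction of truly mislabeled examples among the k
# highest-scoring counter-example candidates.
def precision_at_k(scores, is_mislabeled, k):
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_mislabeled[i] for i in ranked[:k]) / k

scores = [0.9, 0.1, 0.8, 0.4, 0.7]
noisy = [True, False, False, True, True]
assert precision_at_k(scores, noisy, k=2) == 0.5
```
        </preformat>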
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Q3: Both influence and curvature contribute to the effectiveness of Top Fisher</title>
        <p>Finally, we evaluate the impact of selecting counter-examples using Top Fisher on the model’s
performance, both in terms of the use of influence, by comparing it to an intuitive nearest-neighbor
alternative (NN), and in terms of the modelling of the curvature, by comparing it to the Practical
Fisher. NN simply selects the counter-example that is closest to the suspicious example. The
results are shown in Figure 4. Top Fisher is clearly the best strategy, both in terms of number of
cleaned examples and F1 score. NN is always worse than Top Fisher in terms of F1, even in the
case of adult (first column), where it cleans the same number of examples, confirming the
importance of influence in selecting impactful counter-examples. Practical Fisher clearly
underperforms compared with Top Fisher, showing the importance of modelling the curvature. For
each data set, all methods make a similar number of queries: 58 for 20NG, 21 for breast, 31
for adult and 37 for MNIST. In general, cincer detects around 75% of the mislabeled examples
(compared to 50% for the other methods) and only about 5% of its queries do not involve a
corrupted example or counter-example. The complete values are reported in the Supplementary
Material. As a final remark, we note that cincer cleans more suspicious examples than
counter-examples (in a roughly 2 : 1 ratio), as shown by the number of cleaned suspicious examples
vs. counter-examples reported in the Supplementary Material. Comparing this to the curve for
Drop CE shows that proper data cleaning improves the model’s ability to be suspicious
for the right reasons, as expected.</p>
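        <p>To make the comparison concrete, the following sketch scores the compatibility between a suspicious example and each training point for a toy logistic model, contrasting the full empirical FIM with its identity approximation (the “practical Fisher”). The names, toy data, and damping term are our own, and Top Fisher’s approximation of the FIM is omitted for brevity:</p>

```python
import numpy as np

def grad_loglik(w, x, y):
    """Per-example gradient of the logistic log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (y - p) * x

def empirical_fim(w, X):
    """Empirical Fisher information matrix of logistic regression: E[p(1-p) x x^T]."""
    P = 1.0 / (1.0 + np.exp(-X @ w))
    return (X * (P * (1.0 - P))[:, None]).T @ X / len(X)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(float)
w = rng.normal(size=3)

g_susp = grad_loglik(w, X[0], 1.0 - y[0])        # gradient under a "suspicious" label
F = empirical_fim(w, X) + 1e-3 * np.eye(3)       # damping keeps F invertible

v = np.linalg.solve(F, g_susp)                   # F^{-1} g is computed once and cached
fisher_scores = np.array([grad_loglik(w, xi, yi) @ v for xi, yi in zip(X, y)])
practical_scores = np.array([grad_loglik(w, xi, yi) @ g_susp for xi, yi in zip(X, y)])

counter_example = int(np.argmin(fisher_scores))  # most negative = most incompatible
```

        <p>Ranking candidates by fisher_scores rather than practical_scores is, in essence, what separates the curvature-aware strategies from Practical Fisher in Figure 4.</p>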
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>
        Interactive learning under noise. Most existing approaches are designed for crowd-sourcing
applications in which items are labelled by different annotators of varying quality and the goal
is to aggregate weak annotations into a high-quality consensus label [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Our work is strongly
related to skeptical learning [
        <xref ref-type="bibr" rid="ref10 ref5">10, 5</xref>
        ] and learning from weak annotators [
        <xref ref-type="bibr" rid="ref4 ref9">9, 4</xref>
        ]. These approaches
completely ignore the issue of noise in the training set, which can be quite detrimental, as shown
by our experiments. Moreover, they are completely black-box, making it hard for the annotator to
establish or reject trust in the data and the model.
      </p>
      <sec id="sec-5-1">
        <title>Influence functions and Fisher information</title>
        <p>
          It is well known that mislabeled examples tend to exert a larger influence on the model [
          <xref ref-type="bibr" rid="ref12">33, 12, 34, 20</xref>
          ] and indeed IFs may be a valid
alternative to the margin for identifying suspicious examples. Building on the seminal work of
Koh and Liang [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we instead leverage IFs to define and compute contrastive counter-examples
that explain why the machine is suspicious. The difference is that noisy training examples
influence the model as a whole, whereas contrastive counter-examples influence a specific
suspicious example. To the best of our knowledge, this application of IFs is entirely novel. IFs
have been used to present to the user training examples that negatively influence the model
and, from them, augmented examples are created to fine-tune the model [35]. This approach
does not consider mislabeled examples. Notice also that empirical evidence that IFs recover
noisy examples is restricted to offline learning [
          <xref ref-type="bibr" rid="ref12">12, 34</xref>
          ]. Our experiments extend this to a less
forgiving interactive setting in which only one counter-example is selected per iteration and
the model is trained on the whole training set. The idea of exploiting the FIM to approximate
the Hessian has ample support in the natural gradient descent literature [17, 18]. The FIM has
been used for computing example-based explanations by Khanna et al. [34]. Their approach is
however quite different from ours. cincer is equivalent to maximizing the Fisher kernel [32]
between the suspicious example and the counter-example (Eq. 12) for the purpose of explaining
the model’s margin, and this formulation is explicitly derived from two simple desiderata. In
contrast, Khanna et al. maximize the squared Fisher kernel between the suspicious example and
the counter-example, divided by the RKHS norm of the counter-example, which is not equivalent
to Eq. 12 and does not admit caching the inverse FIM-vector product.
        </p>
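        <p>A toy calculation, with the FIM taken to be the identity purely for illustration, shows that the two selection criteria can rank the same candidates differently:</p>

```python
import numpy as np

def fisher_kernel(g1, g2, F_inv):
    """Fisher kernel between two gradient vectors."""
    return float(g1 @ F_inv @ g2)

F_inv = np.eye(2)                  # identity FIM, for illustration only
g_s = np.array([1.0, 0.0])         # gradient of the suspicious example
g_a = np.array([0.5, 0.0])         # candidate counter-example a
g_b = np.array([2.0, 2.0])         # candidate counter-example b

# cincer-style score: the Fisher kernel itself (Eq. 12).
cincer_a = fisher_kernel(g_s, g_a, F_inv)                                        # 0.5
cincer_b = fisher_kernel(g_s, g_b, F_inv)                                        # 2.0
# Khanna-et-al.-style score: squared kernel over the candidate's RKHS norm.
khanna_a = fisher_kernel(g_s, g_a, F_inv) ** 2 / fisher_kernel(g_a, g_a, F_inv)  # 1.0
khanna_b = fisher_kernel(g_s, g_b, F_inv) ** 2 / fisher_kernel(g_b, g_b, F_inv)  # 0.5
```

        <p>Here the plain kernel prefers candidate b while the normalised variant prefers candidate a, so the two objectives are not interchangeable.</p>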
        <p>Other works. cincer draws inspiration from explanatory active learning, which integrates
local [24, 36, 37, 38] or global [39] explanations into interactive learning and allows the annotator
to supply corrective feedback on the model’s explanations. These approaches differ from cincer
in that they neither consider the issue of noise nor perform label cleaning, and indeed they
explain the model’s predictions rather than the model’s suspicion. Another notable difference is
that they rely on attribution-based explanations, whereas the backbone of cincer is
example-based explanations, which enable users to reason about labels in terms of concrete (training)
examples [40, 41]. Saliency maps could potentially be integrated into cincer to provide more
fine-grained information.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We introduced cincer, an approach for handling label noise in sequential learning that asks
a human supervisor to relabel any incoming suspicious examples. Compared to previous
approaches, cincer identifies the reasons behind the model’s skepticism and asks the supervisor
to double-check them too. This is done by computing a training example that maximally
supports the machine’s suspicions. This enables the user to correct both incoming and old
examples, cleaning inconsistencies in the data that confuse the model. Our experiments show
that, by removing inconsistencies in the data, cincer enables acquiring better data and models
than less informed alternatives.</p>
      <p>
        Our work can be improved in several ways. cincer can be straightforwardly extended to
online active and skeptical learning, in which the label of incoming instances is acquired on
the fly [
        <xref ref-type="bibr" rid="ref10">42, 10</xref>
        ]. cincer can also be adapted to correct multiple counter-examples, as well
as the reasons behind mislabeled counter-examples, using “multi-round” label cleaning and
group-wise measures of influence [43, 44, 45]. This more refined strategy is especially promising
for dealing with systematic noise, in which counter-examples are likely affected by entire groups
of mislabeled examples.
      </p>
      <p>Potential negative impact. Like most interactive approaches, there is a risk that cincer
annoys the user by asking an excessive number of questions. This is currently mitigated by
querying the user only when the model is confident enough in its own predictions (through the
margin-based strategy) and by selecting influential counter-examples that have a high chance
to improve the model upon relabeling, thus reducing the future chance of pointless queries.
Moreover, the margin threshold allows modulating the amount of interaction based on the
user’s commitment. Another potential issue is that cincer could give malicious annotators
fine-grained control over the training data, possibly leading to poisoning attacks. This is however not
an issue for our target applications, like interactive assistants, in which the user benefits from
interacting with a high-quality predictor and is therefore motivated to provide non-adversarial
labels.</p>
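      <p>One possible reading of the margin-based querying strategy, sketched under the assumption that the margin is the gap between the model’s preferred label and the annotated one (the threshold value here is illustrative, not the paper’s):</p>

```python
import numpy as np

def margin(probs, y):
    """Gap between the model's favourite label and the annotated label y."""
    return float(np.max(probs) - probs[y])

def is_suspicious(probs, y, tau=0.2):
    """Query the annotator only when the model confidently disagrees with y."""
    return margin(probs, y) >= tau

probs = np.array([0.7, 0.2, 0.1])     # model strongly prefers class 0
print(is_suspicious(probs, y=1))      # True: margin 0.5 exceeds tau
print(is_suspicious(probs, y=0))      # False: the annotation agrees with the model
```

      <p>Raising the threshold reduces the number of queries and hence the burden on the annotator.</p>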
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research has received funding from the European Union’s Horizon 2020 FET Proactive
project “WeNet - The Internet of us”, grant agreement No. 823783, and from the “DELPhi
DiscovEring Life Patterns” project funded by the MIUR Progetti di Ricerca di Rilevante Interesse
Nazionale (PRIN) 2017 – DD n. 1062 del 31.05.2019. The research of ST and AP was partially
supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme
under GA No 952215.</p>
      <p>[13] E. L. Lehmann, G. Casella, Theory of point estimation, Springer Science &amp; Business Media,
2006.
[14] N. Agarwal, B. Bullins, E. Hazan, Second-order stochastic optimization for machine
learning in linear time, The Journal of Machine Learning Research 18 (2017) 4148–4187.
[15] S. Basu, P. Pope, S. Feizi, Influence functions in deep learning are fragile, arXiv preprint
arXiv:2006.14651 (2020).
[16] C.-K. Yeh, J. S. Kim, I. E. Yen, P. Ravikumar, Representer point selection for explaining
deep neural networks, in: Proceedings of the 32nd International Conference on Neural
Information Processing Systems, 2018, pp. 9311–9321.
[17] J. Martens, R. Grosse, Optimizing neural networks with kronecker-factored approximate
curvature, in: International conference on machine learning, PMLR, 2015, pp. 2408–2417.
[18] F. Kunstner, L. Balles, P. Hennig, Limitations of the empirical fisher approximation for
natural gradient descent, in: 33rd Conference on Neural Information Processing Systems
(NeurIPS 2019), Curran Associates, Inc., 2020, pp. 4133–4144.
[19] D. Ting, E. Brochu, Optimal subsampling with influence functions, in: Advances in Neural</p>
      <p>Information Processing Systems, 2018, pp. 3650–3659.
[20] E. Barshan, M.-E. Brunet, G. K. Dziugaite, Relatif: Identifying explanatory training
examples via relative influence, in: The 23rd International Conference on Artificial
Intelligence and Statistics, 2020.
[21] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, GANs trained by a two
time-scale update rule converge to a local Nash equilibrium, in: Proceedings of the 31st
International Conference on Neural Information Processing Systems, 2017, pp. 6629–6640.
[22] H. Guo, N. F. Rajani, P. Hase, M. Bansal, C. Xiong, Fastif: Scalable influence functions for
efficient model interpretation and debugging, arXiv preprint arXiv:2012.15781 (2020).
[23] C. Rudin, Stop explaining black box machine learning models for high stakes decisions
and use interpretable models instead, Nature Machine Intelligence 1 (2019) 206–215.
[24] S. Teso, K. Kersting, Explanatory interactive machine learning, in: Proceedings of the
2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 239–245.
[25] M. Abadi, et al., TensorFlow: Large-scale machine learning on heterogeneous systems,
2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
[26] T. Ye, Example based explanation in machine learning, https://github.com/Timothy-Ye/
example-based-explanation, 2020.
[27] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[28] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
in: EMNLP-IJCNLP, 2019.
[29] Y. LeCun, C. Cortes, C. Burges, Mnist handwritten digit database, ATT Labs [Online].</p>
      <p>Available: http://yann.lecun.com/exdb/mnist 2 (2010).
[30] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms, 2017. arXiv:cs.LG/1708.07747.
[31] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint
arXiv:1412.6980 (2014).
[32] T. S. Jaakkola, D. Haussler, et al., Exploiting generative models in discriminative classifiers,</p>
      <p>Advances in neural information processing systems (1999) 487–493.
[33] M. Wojnowicz, B. Cruz, X. Zhao, B. Wallace, M. Wolf, J. Luan, C. Crable, “influence
sketching”: Finding influential samples in large-scale regressions, in: 2016 IEEE International
Conference on Big Data (Big Data), IEEE, 2016, pp. 3601–3612.
[34] R. Khanna, B. Kim, J. Ghosh, S. Koyejo, Interpreting black box predictions using fisher
kernels, in: The 22nd International Conference on Artificial Intelligence and Statistics,
2019, pp. 3382–3390.
[35] H. Zylberajch, P. Lertvittayakumjorn, F. Toni, Hildif: Interactive debugging of nli models
using influence functions, in: Proceedings of the First Workshop on Interactive Learning
for Natural Language Processing, 2021, pp. 1–6.
[36] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh, L. Heck, D. Batra, D. Parikh, Taking a
hint: Leveraging explanations to make vision and language models more grounded, in:
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp.
2591–2600.
[37] P. Lertvittayakumjorn, L. Specia, F. Toni, Human-in-the-loop debugging deep text
classifiers, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 2020, pp. 332–348.
[38] P. Schramowski, W. Stammer, S. Teso, A. Brugger, F. Herbert, X. Shao, H.-G. Luigs, A.-K.</p>
      <p>Mahlein, K. Kersting, Making deep neural networks right for the right scientific reasons
by interacting with their explanations, Nature Machine Intelligence 2 (2020) 476–486.
[39] T. Popordanoska, M. Kumar, S. Teso, Machine guides, human supervises: Interactive
learning with global explanations, arXiv preprint arXiv:2009.09723 (2020).
[40] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artificial</p>
      <p>Intelligence (2018).
[41] J. V. Jeyakumar, J. Noor, Y.-H. Cheng, L. Garcia, M. Srivastava, How Can I Explain This to
You? An Empirical Study of Deep Neural Network Explanation Methods, Advances in
Neural Information Processing Systems 33 (2020).
[42] L. Malago, N. Cesa-Bianchi, J. Renders, Online active learning with strong and weak
annotators, in: NIPS Workshop on Learning from the Wisdom of Crowds, 2014.
[43] P. W. W. Koh, K.-S. Ang, H. Teo, P. S. Liang, On the accuracy of influence functions for
measuring group effects, in: Advances in Neural Information Processing Systems, 2019,
pp. 5254–5264.
[44] S. Basu, X. You, S. Feizi, Second-order group influence functions for black-box predictions,
arXiv preprint arXiv:1911.00418 (2019).
[45] A. Ghorbani, J. Zou, Data shapley: Equitable valuation of data for machine learning, in:
International Conference on Machine Learning, PMLR, 2019, pp. 2242–2251.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Frénay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verleysen</surname>
          </string-name>
          ,
          <article-title>Classification in the presence of label noise: a survey</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems 25</source>
          (
          <year>2014</year>
          )
          <fpage>845</fpage>
          -
          <lpage>869</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Learning from noisy labels with deep neural networks: A survey</article-title>
          , arXiv preprint arXiv:
          <year>2007</year>
          .
          <volume>08199</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <article-title>Learning from crowdsourced labeled data: a survey</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>46</volume>
          (
          <year>2016</year>
          )
          <fpage>543</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kremer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Igel</surname>
          </string-name>
          ,
          <article-title>Robust active label correction</article-title>
          ,
          <source>in: International conference on artificial intelligence and statistics</source>
          , PMLR,
          <year>2018</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bontempelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Teso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          ,
          <article-title>Learning in the Wild with Incremental Skeptical Gaussian Processes</article-title>
          ,
          <source>in: Proceedings of the 29th International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tourangeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Rips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasinski</surname>
          </string-name>
          ,
          <article-title>The psychology of survey response</article-title>
          , Cambridge University Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Corti</surname>
          </string-name>
          ,
          <article-title>Using diaries in social research</article-title>
          ,
          <source>Social research update 2</source>
          (
          <year>1993</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. T.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Sinibaldi,</surname>
          </string-name>
          <article-title>The quality of paradata: A literature review</article-title>
          ,
          <source>Improving Surveys with Paradata</source>
          (
          <year>2013</year>
          )
          <fpage>339</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Urner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ben-David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shamir</surname>
          </string-name>
          ,
          <article-title>Learning from weak teachers</article-title>
          ,
          <source>in: Artificial intelligence and statistics</source>
          , PMLR,
          <year>2012</year>
          , pp.
          <fpage>1252</fpage>
          -
          <lpage>1260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , E. Bignotti,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          ,
          <article-title>Fixing mislabeling by human annotators leveraging conflict resolution and prior knowledge</article-title>
          ,
          <source>Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          <fpage>32</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Weisberg</surname>
          </string-name>
          , Residuals and influence in regression, New York: Chapman and Hall,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Understanding black-box predictions via influence functions</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Machine Learning, JMLR. org</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1885</fpage>
          -
          <lpage>1894</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>