<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Multimodal, Afective and Interactive eXplainable Artificial Intelligence, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On Modifying the Perception of a Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel de Sousa Ribeiro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>João Leite</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NOVA LINCS, NOVA School of Science and Technology, NOVA University Lisbon</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>2</volume>
      <fpage>5</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Artificial neural networks are typically regarded as black boxes, given how dificult it is for humans to interpret how these models reach their results. One way in which humans attempt to interpret complex systems is by imagining how they behave in hypothetical scenarios. In this work, we propose a method that allows one to modify what an artificial neural network is perceiving regarding specific human-defined concepts of interest, allowing one to test how they behave under such hypothetical scenarios. Through empirical evaluation, we test the proposed method on diferent models and datasets, assessing its qualities.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Neural Networks</kwd>
        <kwd>Interpretability</kwd>
        <kwd>Explainability</kwd>
        <kwd>Counterfactual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this paper, we investigate how to influence neural network models to identify specific human-defined
concepts of interest not directly encoded in their inputs, with the ultimate goal of better understanding
how such concepts afect a model’s predictions. Our results suggest that this is possible with little
labeled data and without the need to change or retrain the existing neural network model.</p>
      <p>
        As neural networks start to be applied in critical domains, their lack of interpretability [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] becomes
a central problem. This lack of interpretability stems from the fact that, apart from their internal
activations – which lack any declarative meaning – neural networks do not provide any indication to
support their results [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Recent methods to help interpret artificial neural network models – e.g., saliency and attribution
methods [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5, 6, 7, 8</xref>
        ], or proxy-based methods [9, 10, 11, 12, 13, 14] – mostly do so in terms of the
inputs of the model being interpreted, with explanations essentially consisting of sets of input features
and their corresponding contributions.
      </p>
      <p>However, various user studies [15, 16, 17] have shown that explanations such as the ones given by
these methods are often disregarded or unhelpful to end users. One reason is that humans do not
typically reason with features at a very low level, such as the individual pixels of an image, but rather
with higher-level human-defined concepts of interest, which are not necessarily directly available in
the input. E.g., humans would typically expect an explanation of why a particular picture of a train was
classified as being of a ‘freight train’ to be based on the presence of a ‘freight wagon’, and the absence
of a ‘passenger car’, rather than a list of relevant pixels.</p>
      <p>The need to pay attention to these higher-level human-defined concepts of interest 1, together with
the research conducted in the field of neuroscience in identifying highly selective neurons known as
concept cells that seem to “provide a sparse, explicit and invariant representation of concepts” [19, 20],
inspired a recent area of research generally known as concept-based explainable AI [21, 22]. This
research area is based on the idea of peeking into the activations of groups of neurons to comprehend
their function and what they encode, resulting in methods such as TCAV [23], Concept Whitening [24],
and Mapping Networks [25].</p>
      <p>Whereas these methods soundly contribute to the development of better explanations for neural
networks – by bringing into play higher-level human-defined concepts of interest not directly available
in the input – they may still be insuficient in providing humans the insight they usually need to
properly understand a model. It is well known that humans often attempt to determine the causes
for non-trivial phenomena by resorting to counterfactual reasoning [26], i.e., trying to understand
how diferent scenarios would lead to diferent outcomes. E.g., a child might try to understand what a
‘freight train’ is by asking whether the train would still be considered a freight train if it was carrying
passengers, instead of cargo. This approach seems to be helpful for interpreting neural networks [27], as
it emphasizes the causes – what is changed – and the efects – what mutates as a result of the changes.
For example, to better interpret how a neural network is classifying a particular image of a train, the
user could ask what would have been the output had the model identified a passenger car instead of a
freight wagon.</p>
      <p>There has been work on developing methods to generate counterfactuals for artificial neural networks
[28], with some allowing for generating counterfactual samples with respect to particular attributes
[29]. However, such methods are typically complex to implement, often requiring the training of an
additional model to produce the counterfactuals [30], or the use of specific neural network architectures,
e.g., invertible neural networks [31], which is at odds with our goal of understanding existing (deployed)
models, hence not modifiable. But, more importantly, these methods focus on particular changes
to the input samples, neglecting what the model is actually perceiving from those inputs, which is
inadequate when at the center of the counterfactual is a concept not directly encoded in the input.
When questioning whether a neural network model would still consider a train to be a ‘freight train’
had it been carrying passengers, it is not suficient to modify the image to include a passenger car, since
the model may still misinterpret the modified image. What we are interested in knowing is whether
perceiving a passenger car would have changed the model’s output.</p>
      <p>In this work, we address the issue of counterfactual generation at a diferent level of abstraction,
focusing on what a model perceives from a given input regarding specific human-defined concepts of
interest and how that afects the model’s output. Given the evidence that neurons similar to concept cells
seem to emerge in artificial neural networks [ 32], and inspired by the technique of optogenetics which
allows the manipulation of the activity of specific neurons to learn their purpose [ 33], we propose and
test a method to generate counterfactuals by manipulating the outputs of the neurons of an artificial
neural network associated with the concepts of interest, without producing a specific corresponding
input. Our method is based on the idea of “convincing” the model that it is perceiving a particular
concept of interest, for example, a passenger wagon, rather than generating a modified image containing
one.</p>
      <p>By abstracting from the specific input sample and instead focusing on generating counterfactuals
by modifying what a model perceives with respect to a particular concept, we avoid commiting to a
concrete instantiation of that concept in the input sample. In this way, we can focus directly on what a
model is perceiving and how changing it impacts the model’s predictions in a human-understandable
manner.</p>
      <p>Our results show that by changing the activations of these concept-like neurons using our method,
the model reacts as if it had indeed perceived that concept, changing its output accordingly.</p>
      <p>In Section 2, we present the proposed method, with Section 2.1 discussing how to pinpoint which
neurons identify a particular concept of interest in a model, and Sections 2.2 to 2.4 addressing diferent
aspects of the method and providing experimental evidence to support our claims. Section 3 illustrates
two alternative applications of the method. In Section 4, we apply and test the method in two real-world
datasets. In Section 5, we discuss related work, concluding in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>In order to generate counterfactuals regarding what a neural network model is perceiving about
humandefined concepts of interest, we propose a method composed of three main steps. Namely, for a given
concept of interest:</p>
      <p>a) Estimate how sensitive each neuron is to that concept, i.e., how well its activations separate</p>
      <p>samples where that concept is present from those where it is absent;
b) Based on the neurons’ sensitivity values, select which neurons are considered as “concept cell-like”,
which we will refer to as concept neurons;
c) For each concept neuron, compute two activation values, representing, respectively, the output
of that neuron for samples where that concept is present and absent.</p>
      <p>Then, to generate a counterfactual output with regard to the considered concept of interest, one only
has to modify the activations of the selected units to the respective newly computed ones.</p>
      <p>Consider a neural network model ℳ : I → O, which is the model being analyzed, where I is the
input data space and O denotes the model output space. For each human-defined concept C, we assume
a set of positive C ⊆ I and negative C ⊆ I samples where the concept is, respectively, present and
absent. These sets provide an extensional definition of concept C, conveying its meaning. Let ℳ
be the set of neurons of model ℳ. Then, for each neuron  ∈ ℳ, we consider ℳ : I → R to be a
function providing the output of neuron  of ℳ.</p>
      <p>
        The first step of our method consists of estimating how sensitive each neuron is to some concept C,
i.e., how well the activations of each neuron separate samples where a concept is present from those
where it is not. We denote by ℳ : 2I × 2I → [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] the function which, for a given neuron  ∈ ℳ,
takes a set C and C and provides a value representing how sensitive  is to concept C. In Section 2.1,
we consider diferent implementations of this function.
      </p>
      <p>The second step is to select which neurons of ℳ are to be considered as concept neurons for C.
This set is defined as C = { : ℳ(C, C) &gt;  C}, where  C is a sensitivity threshold associated
with concept C. One might vary this threshold to observe how the model reacts when diferent sets of
neurons are selected as concept neurons. In the remaining of the paper, we finetune each sensitivity
threshold by using 20% of the samples in C and C as a validation set, iteratively decreasing  C until
15 neurons are added with no improvement. To estimate the performance of a sensitivity threshold, the
samples in the validation set should be classified with respect to the expected output of ℳ given the
presence or absence of C.</p>
      <p>Lastly, for each selected neuron in C, we compute two activation values representing, respectively,
when that concept is present and absent. This is captured by a function ℳ : 2I → R that computes
an activation value for neuron  ∈ ℳ, which is then used to compute both an activation value
representing the presence of concept C, ℳ(C), and one representing its absence, ℳ(C). In Section
2.2, we discuss diferent implementations of ℳ and compare them.</p>
      <p>Subsequently, when some input  ∈ I is fed to neural network ℳ, to generate a counterfactual
scenario where concept C is present or absent, we only need to replace the activation value ℳ() of
each neuron  in C with ℳ(C) or ℳ(C), respectively. We refer to this step, as injecting a concept
into a neural network model, and illustrate it in Figure 1. The result of the proposed method might be
observed in the model’s output, which now reflects a counterfactual scenario – the model’s response to
some input  ∈ I conditioned by the performed injection.</p>
      <p>Performing a thorough and systematic testing of this method requires a dataset with additional
labels regarding the human-defined concepts of interest related to the task of a neural network model.
Furthermore, the dataset needs to provide a description of how these labels are related to the labels of
the model’s output, allowing for an understanding of what would be expected of the model given the
injection of some concept. Unfortunately, there are no popular benchmarks meeting such criteria, nor
many datasets with these features. One notable exception is the Explainable Abstract Trains Dataset
(XTRAINS) [34]. This is a synthetic dataset composed of varied representations of trains, such as those
shown in Figure 2, inspired by the work of J. Larson and R. S. Michalski in [35]. This dataset contains
labels regarding various visual concepts and a logic-based ontology describing how these concepts are
related. Figure 3 shows a subset of the dataset’s ontology illustrating how diferent concepts relate to
each other. This ontology specifies, for example, that TypeA is either WarTrain or EmptyTrain, and
that WarTrain encompasses those having a ReinforcedCar and a PassengerCar. These concepts have
a visual representation, e.g, a ReinforcedCar is shown as a car having two lines on each wall, such as
the first two cars of the leftmost sample in Figure 2. In the remainder of this section, we discuss and
thoroughly validate the method in the setting of the XTRAINS dataset. The method is further tested
using real-world data, although with a more limited set of concepts, in Section 4.</p>
      <p>In the experiments described below, we adopt a neural network developed in [36], trained to identify
trains of TypeA – referred to as ℳA, which achieves an accuracy of about 99% in a balanced test set of
10 000 images. All experiments use |C| = |C| = 200 samples and were run 5 times reporting the
average result on 400 test samples, unless stated otherwise. We consider the definition of relevancy
given in [25] to establish which concepts are related to the task of a given neural network (w.r.t. the
dataset’s accompanying ontology).</p>
      <sec id="sec-2-1">
        <title>2.1. Identifying concept cell-like neurons</title>
        <p>To modify a neural network model’s perception regarding some human-defined concept of interest, it is
ifrst necessary to determine which neurons in the model are identifying that concept. Based on the
evidence that neural networks seem to have information encoded in their internals regarding concepts
that are related to their tasks [25], and that neurons with “concept cell-like” properties seem to emerge
in neural networks [32], we hypothesize that neural networks have neurons which act as concept cells
for concepts related with the tasks they perform. In this section, we investigate how to determine such
neurons in a neural network model.</p>
        <p>We consider a neuron to be a concept neuron for some concept C based on how it separates samples
− 2
0
2
4 − 2
Activation
0
2
4
where C is present from those where it is absent. Figure 4, illustrates the estimated probability density
function of two neurons for some C and C. The neuron on the left behaves as a concept neuron for
C, showing well-separated distributions for both sample sets. On the other hand, the neuron on the
right does not behave as a concept neuron for this concept, given that its activations do not clearly
distinguish both sets.</p>
        <p>While methods such as TCAV [23] and Mapping Networks [25] allow for an understanding of
whether a given model is sensitive to a certain concept, here we are interested in understanding whether
individual neurons are sensitive to that concept. Our goal is not to check whether the model is sensitive
to a concept, for which only some neurons might be suficient, but rather to find all neurons that are
sensitive to that concept and manipulate their activations.</p>
        <p>We consider three implementations of ℳ to evaluate the adequacy of a neuron  of ℳ as a concept
neuron for C which provide diferent approximations for how well separated two distributions are while
having diferent assumptions regarding the underlying data distributions and diferent computational
cost. They are:</p>
        <p>6(∑2︀− 21) |, where  = ℳ(C ) − ℳ(C ) and  = |C| = |C|;
• Spearman rank-order correlation between ’s activations and C’s presence, computed as |1 −
• Accuracy of a linear binary classifier ( ) predicting C’s presence from ’s activations, computed
as |{:∈C∩(ℳ())=0)}|+|{:∈C∩(ℳ())=1)}| ;</p>
        <p>|C|+|C|
• Area of intersection between the probability density function curves for ’s activations in C
and C, computed as 1 − ∫︀ (C (), C ()) , where C and C are the estimated
probability density functions for ’s activation values in C and C, respectively.2
To test our hypothesis that “concept cell-like” neurons exist in neural network models for concepts
that are related to the task a model is performing, we compute the three metrics described above for
four random concepts that are relevant to the task performed by ℳA: EmptyTrain, ∃has.PassengerCar,
∃has.ReinforcedCar, and WarTrain; and for four random non-relevant concepts: ∃has.LongWagon,
∃has.OpenRoofCar, LongFreightTrain, and MixedTrain.</p>
        <p>According to our hypothesis, we would expect to find neurons for the relevant concepts with high
values in the described metrics, while for non-relevant concepts we should be unable to do so.</p>
        <p>Table 1 shows the mean and standard deviation of the top 20 most sensible neurons identified by that
metric for each concept and each implementation of ℳ. For instance, the 20 neurons most sensible
to the EmptyTrain concept allow on average for this concept to be predicted from their activations
with an accuracy of 99% using a linear binary classifier. It is observable that while sensitive neurons
can be identified for all of the relevant concepts, as indicated by the high sensitivity values, for the
non-relevant concepts we are unable to find such neurons, as witnessed by the low sensitivity values.</p>
        <p>While the number of neurons that are sensible to each concept will vary depending on the specific
sensitivity threshold  C used, for the concepts we considered it varied between 10 and 40. Considering
that ℳA contains about 2 × 106 neurons, the amount of selected concept neurons is minuscule in
2Probability density functions are estimated using kernel density estimation, and integrals using Simpson’s rule.
comparison. Interestingly, most selected neurons belong to the dense part of the model, with those
corresponding to more abstract and complex concepts appearing in the later dense layers. Section 2.2
explores the importance of the selected concept neurons for the overall method’s performance.</p>
        <p>These results seem to support our hypothesis, indicating that some of the neurons in a neural network
model behave as concept cells for specific concepts related with the task of that neural network. We
note that while this does not imply that it is always possible to find such neurons for any given concept,
it shows that neurons with such behavior do arise in neural network models.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Manipulating a neural network’s perception</title>
        <p>Being able to identify concept neurons in a model, we now test our main hypothesis: – that it is possible
to manipulate a neural network’s perception regarding specific human-defined concepts of interest by
manipulating the activations of their respective concept neurons.</p>
        <p>If our hypothesis holds, we expect that when injecting a given concept into a model, the output of the
model reflects the performed injection. For instance, if we feed an image of a train having a passenger
car and no reinforced cars to ℳA, which was trained to identify TypeA trains, we would expect it to
output that this is not a TypeA train (cf. ontology in Figure 3). However, if we identify the concept
neurons for ∃has.ReinforcedCar – having a reinforced car – and modify their outputs to indicate the
presence of a reinforced car, i.e., we inject the concept ∃has.ReinforcedCar, we expect the model to
change its output and identify a TypeA train. To analyze the performance of the method in the setting
of the XTRAINS dataset, we consider, for each of the relevant concepts C analyzed in Section 2.1, four
diferent test sample sets according to the following criteria:
• 1 - injecting C should not change ℳA’s output.
• 2 - injecting C should change ℳA’s output.
• 3 - injecting ¬C should not change ℳA’s output.</p>
        <p>• 4 - injecting ¬C should change ℳA’s output.</p>
        <p>To determine whether the output of ℳA should change, we reason with the ontology provided with
the dataset.</p>
        <p>We consider three alternatives to compute the new activation values for each concept neuron (ℳ):
computing the median of the neuron’s activations; computing the mode; and using the method from
[37], which suggests the use of an auxiliary probe model trained to predict whether a concept is present
in the activations of a trained neural network and computes new activation values that lead to the
desired output of the probe.</p>
        <p>Table 2 shows the average accuracy of the method for the four considered test sets using the diferent
implementations of ℳ and ℳ. The high percentage of samples where the model’s output changed as
expected, given the performed injection, provides evidence that it is possible to manipulate a neural
networks’ perception regarding diferent human-defined concepts by modifying the activations of
specific neurons which seem to identify those concepts. Furthermore, these results suggest that the
sensitivity metric based on the area of the intersection between the probability density functions
of a neuron’s activations for C and C, when using the median value of the neurons’ activations,
provides better results, while modifying the activations of fewer neurons. For this reason, throughout
all remaining experiments, we adopt this metric to compute ℳ, and adopt ℳ as the median value.</p>
        <p>Table 3 shows the percentage of samples where injecting a concept resulted in the expected change
in ℳA’s output for each of the sets described above. Overall the results seem to be quite positive,
indicating that typically the model outputs the expected result given the concept being injected. The
result of injecting ∃has.PassengerCar in set 2 and its negation in set 4 seem to be somewhat inferior.
While inspecting these samples, we found that this seems to be due to ℳA not having properly captured
the relationship between the concept of ∃has.PassengerCar and the concept of EmptyTrain. We further
explore this matter in Section 3.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Importance of the selected neurons</title>
        <p>In this section, we investigate how dependent the proposed method is on the sensitivity threshold
 C. We vary the sensitivity threshold  C for each concept C, while examining the number of concept
neurons it selects and how they afect the method’s performance.</p>
        <p>Figure 5 shows the distribution of the 10 best performing sets of concept neurons obtained when
varying the sensitivity threshold  C for each of the four relevant concepts analyzed in Section 2.1, with
respect to the four test sets introduced in Section 2.2. For instance, Figure 5 shows that the injection of
concept EmptyTrain performs best between 13 to 26 neurons, with its median at about 20 neurons.</p>
        <p>Figure 5 also shows that each concept has a diferent optimal range for its sensitivity threshold  C,
with some being broader than others. This highlights the importance of finetuning the sensitivity
threshold  C for each concept C. Finetuning the sensitivity threshold  C might be done by taking into
account the number of neurons that are selected as concept neurons in set C and gradually considering
additional neurons until the method’s performance stops increasing. This can be done using a small
validation set – here we consider 80 validation samples. Section 2.4 illustrates the usage of fewer
samples.</p>
        <p>Also, by inspecting the results for each test set, we observed that when the sensitivity threshold
 C is too high – excluding neurons that act as concept cells – the method is unable to produce the
desired change in the model’s predictions, resulting in the output of the model being unaltered (poor
performance in the test sets 2 and 4). The performance also degrades for low sensitivity thresholds
since neurons that do not behave as concept cells become selected.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Counterfactual’s cost</title>
        <p>The proposed method has two main costs: the data in C and C; and the labeling of the samples used
to finetune  C. In this section, we verify the feasibility of the proposed method regarding its costs, i.e.,
the amount of data required to modify a model’s perception regarding some concept C.</p>
        <p>Figure 6, shows the average percentage of samples – from the four test sets introduced in Section 2.2
– where the output of the model is consistent with the performed manipulation for diferent quantities
of available data. The number of samples shown corresponds to the total amount of used samples
(including validation data).</p>
        <p>We can observe that even with as few as 60 samples, the method is able to manipulate ℳA’s
perception with success in more than 90% of the test set samples for the considered concepts. The
performance seems to stop improving after around 300 samples.</p>
        <p>Whereas, as expected, the method’s performance seems to drop significantly when the amount of
data is insuficient to properly identify the concept neurons, the low cost associated with the fairly low
amount of data required for the method to properly function, as shown by the experiments, seems to
suggest that it may be applicable in settings with limited available data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Applications and other uses</title>
      <p>The previous section described a method to generate counterfactuals for a neural network model by
modifying the activations of neurons which act as concept cells for human-defined concepts of interest.
This method allows for an understanding of how diferent concepts influence the output of a model
under diferent scenarios. What if one is interested in understanding how a model relates diferent
concepts of interest that are not represented in the model’s output? Or what if one knows a model is
providing an incorrect result and wants to provide additional information? In this section, we illustrate
the application of the proposed method to address these two questions.</p>
      <sec id="sec-3-1">
        <title>Interpreting neural networks</title>
        <p>Consider the situation introduced in Section 2.2, where we hypothesize that ℳA may not have properly
captured the relationship between the concepts of ∃has.PassengerCar and EmptyTrain. Although both
of these concepts are not represented in ℳA’s output, they are both relevant for its task, and one might
be interested in ensuring that the model understands the relationship between both.</p>
        <p>To understand whether a given concept that is not represented in the output of a model is being
identified by it, we use the mapping networks from [25]. These are small probe neural networks trained
to discern whether a given human-defined concept of interest is present in the activations of a neural
network model. Mapping networks receive as input the activations of a trained neural network model,
and output whether a specific human-defined concept of interest is present in those activations. Thus,
when an input sample is fed to a neural network model, we can use a mapping network to verify
whether a specific concept of interest is identified by that model.</p>
        <p>To verify whether a model has captured the relationship between two concepts, one might inject one
of the concepts and use a mapping network to assess how the other concept reacts to the performed
injection. For instance, if ℳA captured that empty trains do not have passenger cars, when the concept
of EmptyTrain is injected, a mapping network for ∃has.PassengerCar should indicate the absence of
passenger cars. Similarly, if ℳA has captured that if a train has a passenger car, then it is not an empty
train, when ∃has.PassengerCar is injected, a mapping network for EmptyTrain should indicate the
absence of an empty train.</p>
        <p>To test this, we train mapping networks for EmptyTrain and ∃has.PassengerCar – both
achieving an accuracy of about 99% on a 1000 balanced test set. By injecting concept EmptyTrain on
1000 samples of trains with passenger cars, we verify that in 95.6% of the samples, after injection,
concept ∃has.PassengerCar goes from being present to absent. However, when injecting concept
∃has.PassengerCar on 1000 samples of empty trains, in only 2% of them the concept EmptyTrain turns
absent. This suggests that while ℳA has captured that empty trains do not have passenger cars, it has
not picked up that if a train has a passenger car, it is not empty.</p>
        <p>This example illustrates a possible application of the proposed method, namely to investigate whether
certain relations between user-defined concepts of interest are encoded in a model. More importantly,
the results suggest that the model was able to naturally integrate the injected information, impacting
related concepts, as if the injected concept was truly being perceived by the model.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Correcting misunderstandings</title>
        <p>When a neural network model provides an incorrect result, it is often dificult to interpret the cause of
the error. A possible reason might be the model being unable to correctly identify what is represented
in its input, in which case the proposed method might be used to “correct” the model’s perception.</p>
        <p>To test this hypothesis, we consider the use of mapping networks to identify the cause for a model’s
misclassification and use our proposed method to inject a concept, generating a counterfactual scenario
where the misidentification is corrected. To this end, mapping networks are trained for the concepts
∃has.ReinforcedCar and ∃has.PassengerCar – with an accuracy of about 99% on a 1000 balanced test
set – and used to identify whether ℳA provides wrong results due to either of these concepts being
misperceived.</p>
        <p>We consider all samples where ℳA provides a false negative, indicating that a sample input was
not of TypeA when it was. We observe that in those where the concept of ∃has.ReinforcedCar was
not identified, but present in the input, injecting ∃has.ReinforcedCar led to the output being corrected
in 96.1% of the samples. Similarly, when considering the samples where ∃has.PassengerCar was not
identified, but present in the input, injecting it led to the output being corrected in 98.7% of the samples.</p>
        <p>These results provide evidence that the method is applicable even when a model yields incorrect
results, allowing one to test whether diferent concepts might be responsible for the incorrect result.
This enables users to further understand why their models achieved an incorrect output and helps
identify potential flaws in a model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Validation with real-world data</title>
      <p>So far, we have tested our method in the setting of a synthetic dataset. We now validate whether we can
replicate similar results in the setting of two real-world datasets, the German Trafic Sign Recognition
Benchmark Dataset (GTSRB) [38] and ImageNet [39].</p>
      <sec id="sec-4-1">
        <title>German Trafic Sign Recognition Benchmark dataset</title>
        <p>Using the GTSRB dataset, we train a VGG model to distinguish four diferent types of trafic signals:
road closed; other prohibitory signs; end of all restrictions; and the end of specific restrictions. These
classes are, respectively, illustrated in Figure 7. The trained model achieves an accuracy of 99% in the
GTSRB test set.</p>
        <p>To test the method in this setting, we consider the concepts of ‘having a black symbol’ and ‘having a
black band’ described in the Vienna Convention on Road Signs and Signals [40]. For each concept, we
define four test sets in the same manner as those described in Section 2.2, allowing for an understanding
of whether the output of the model is changing according to the performed injection. For instance, one
could expect that if a black symbol is added to a road closed sign, the model would modify its output
to predict some other prohibitory sign; or that if a black bar is added to a road closed sign, the model
would modify its output to predict the end of all restrictions sign. We consider |C| = |C| = 200
samples and test our method with the GTSRB test set.</p>
        <p>Table 4 shows the results for each concept and test set. The high percentage of samples where the
model’s output changed as expected given the performed injection provides evidence that the method
is applicable in a real-world data setting. We attribute the higher value of selected concept neurons for
‘having a black symbol‘ to the variety of diferent symbols a prohibitory sign may have, which might
have led the model to allocate more neurons to identify them.</p>
      </sec>
      <sec id="sec-4-2">
        <title>ImageNet dataset</title>
        <p>Using the pre-trained MobileNetV2 model from [41], and considering the setting of the ImageNet
dataset [39], we both test the proposed method and illustrate its use to correct false negatives. To this
end, we select a random output class: ‘Rhodesian ridgeback’, which is a specific dog’s breed. We define
the concept of ‘Rhodesian ridgeback face’ by selecting a set of 98 images from ImagenetNet where a
Rhodesian ridgeback dog is observable and its face is centered and completely visible, as illustrated in
the samples shown in Figure 8.</p>
        <p>To test whether the concept of ‘Rhodesian ridgeback face’ is important for the model to be able
to recognize that breed, we manually censored every ‘Rhodesian ridgeback face’ in the ImageNet’s
validation set, as shown in Figure 8, and compared the model’s accuracy before and after censoring the
dogs’ faces. The results, shown in Table 5, indicate that censoring the dogs’ faces leads to a significant
drop in accuracy, which seems to be evidence of its importance for the model. However, notice that
even after the censoring the model is still able to correctly classify 36% of the images, indicating
that although important, there are other concepts which are relevant for identifying this class. After
censoring the images, we inject the concept of ‘Rhodesian ridgeback face’ into 109 concept neurons and
observe a big increase in accuracy. This indicates that the injected concept was successful in providing
some of the censored information and helping the model provide the correct classification.</p>
        <p>To illustrate the method’s ability to correct false negatives, we select all 329 samples where the model
outputs a false negative for the Rhodesian ridgeback class, considering its top-1 result. The injection of
the concept of ‘Rhodesian ridgeback face’ led 48% of these samples to output the class of Rhodesian
ridgeback. If we consider the 38 samples where the model outputs a false negative considering its top-5
outputs, we were able to modify the model’s output in 71.1% to the Rhodesian ridgeback class.</p>
        <p>These results show us that even in a setting with real data, a moderately sized model (≈ 7 × 106
neurons), and few labeled data, it is possible to identify which neurons in a model act as concept cells
for a given concept and, more importantly, manipulate the neural network’s perception of that concept
by manipulating the activations of those neurons.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related work</title>
      <p>The last few years have seen an increase in the development of methods for interpreting artificial
neural networks, with proxy-based methods being one of the most popular approaches (cf. [42]). These
methods aim to substitute the model being explained for one that is inherently interpretable and which
exhibits similar behavior. In contrast, our approach to generate counterfactual scenarios does not
require changing or substituting the original model, which might not always be feasible or desirable.
Additionally, proxy-based methods, such as LIME [11] or most automatic rule extraction algorithms
[12], attempt to explain a model in terms of its input features, which is not always adequate since they
might not be meaningful to some end-users or might even be subsymbolic, e.g., pixels of an image.
Our method avoids this issue by relying on human-defined concepts of interest, allowing end-users to
experiment with meaningful symbolic concepts that might be tailored to their needs. Automatic rule
extraction algorithms, such as the one introduced in [43], allow for inducing a theory using additional
concept labels. However, being only data annotations, there is no guarantee that the resulting rules will
be faithful to the neural network’s internal classification process.</p>
      <p>Another popular approach is that of saliency and attribution methods (cf. [44]), where a model’s
behavior is explained by attributing a value to each input feature representing its contribution to a
given prediction. Although these methods might provide some insights into which features contributed
most to a prediction, they do not provide clarification regarding why those contributions justify the
output, leaving the burden of understanding how those contributions are related to the specific output
to the user.</p>
      <p>Some counterfactual-based methods (cf. [28]) sufer from similar issues, providing a counterfactual
sample but lacking clarification of why that sample leads to a diferent result. We believe that
counterfactual methods would benefit from abstracting away from only generating specific counterfactual
samples, to describing in a human-understandable way why those counterfactuals lead to some other
particular output based on how they afect the model. One approach that seems conceptually related to
ours is that of [45], which produces counterfactuals by modifying the activations of a given layer to
remove some property (concept). However, it is restricted to answering questions in the form of “How
will the prediction of a task difer without access to some property?”. Our proposal is more general,
allowing one to examine how a model’s prediction difers when a concept of interest is present or
absent.</p>
      <p>Concept-based approaches [21, 22] have aimed at understanding what human-defined concepts of
interest are encoded in a model [46, 47, 48], whether models are sensitive to specific human-defined
concepts, like TCAV [23], and how we can leverage the representations of those concepts in a model to
provide explanations for its outputs [25]. These methods allow for a better understanding of what is
encoded within a neural network model through human-understandable symbolic concepts, but do not
allow for an understanding of how such concepts influence a model’s output. Our method opens this
possibility, allowing end-users to explore counterfactual scenarios by manipulating whether a neural
network perceives such concepts of interest and observing how the model reacts to these scenarios.</p>
      <p>The aforementioned literature provides plenty of evidence that neural networks’ internal
representations often align with human-defined concepts, not in the sense that these models encode these abstract
concepts per se [49], but that their internals present statistical regularities which distinguish samples
where those concepts are present/absent.</p>
      <p>A method that combines the automatic rule extraction and neuro-symbolic approaches is presented in
[36]. This method induces logic-based theories that are faithful to the internal classification process of a
neural network model based on human-defined concepts [ 36]. However, since the method induces the
theory based on a set of observations, it mostly captures correlations between the concepts of interest
and the model’s output and thus is unsuitable for the purpose of examining counterfactual behavior.</p>
      <p>Recently, a set of intervention methods known as activation patching [50], which intervene on
the activations of a model to inspect how specific parts of it contribute to its behavior, has gained
traction. However, the way in which the interventions are executed – e.g., substituting by zeros, noise,
or activations from another sample [51] – does not necessarily convey a specific human-defined concept.
In contrast, our method focuses on how the representation of a human-defined concept of interest
might modify a model’s behavior, rather than on investigating a specific part of a model.</p>
      <p>Our proposed method has two main limitations, which can nevertheless be mitigated by existing
related work – the need for identifying concepts of interest with which to manipulate a given neural
network model, and the need for some labeled data concerning such concepts. Methods like [46]
and [48], which are aimed at discovering concepts that are relevant to a model’s predictions and how
they are related, can help users discover promising concepts of interest with which to manipulate a
neural network. Regarding the labeled data, methods like Mapping Networks [25], which allow for an
understanding of when a model is identifying a given concept of interest based on its activations, might
provide us the necessary labels for each concept of interest. Additionally, literature on repairing neural
networks [52, 53, 54], ofer methods such as KnowledgeEditor [ 55], which modifies the information
encoded in a model to fix undesired predictions, may be used to help address any issues found when
inspecting how a model reacts to diferent counterfactual scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we proposed a method to generate counterfactuals regarding what a neural network
model is perceiving about human-defined concepts of interest, enabling users to inquire the model, and
understand how its outputs are dependent on those concepts. To this end, we explored how to identify
which neurons in a model are sensitive to a given concept of interest, and how to leverage such neurons
to manipulate a model into perceiving that concept. Through experimental evaluation, we showed that
it is possible to manipulate a neural network’s perception regarding diferent concepts, requiring few
labeled data to do so, and without needing to change or retrain the original model.</p>
      <p>We provided a formalization of the proposed method and illustrated how it can be applied to generate
counterfactuals, but also how to use it to better understand how diferent concepts are related within a
model, and how to inspect and correct a model’s misclassifications.</p>
      <p>The two showcased applications of the proposed method illustrate its versatility and how it opens
further possibilities for new ways to inspect and interpret neural network behavior in a
humanunderstandable manner. The first application shows how users might explore the relations between two
user-defined concepts of interest encoded in a neural network model by manipulating the perception of
a model regarding one and examining how the other reacts to the performed manipulation. The second
application demonstrates how the method might be used for correcting a model’s predictions, allowing
users to straightforwardly override some aspect of the model’s internal predictive process.</p>
      <p>We conclude that for some concepts that are relevant to a neural network’s task, it is often the case
that there exist specific neurons that encode such concepts and are responsible for their identification
within the model. Modifying the activations of these neurons, similarly to stimulating concept cells
in the human brain, allows one to trick the model into perceiving that concept. This simple method
seems efective in allowing sophisticated manipulations of high-level abstract concepts, enabling users
to explore, with little efort, how the model would respond in diferent scenarios.</p>
      <p>We believe this work provides a step forward towards making neural networks more interpretable
by addressing one of the natural ways in which humans attempt to understand complex models – by
imagining hypothetical scenarios and thinking about how they would impact the model. By providing
a method that allows end-users to define their own meaningful concepts and to test such hypothetical
scenarios, we seek to answer the necessity that humans have of validating their expectations regarding
how a model works. Ultimately, we hope that this method contributes to promoting better testing and
validation of artificial neural networks, improving the trust that humans might have in these systems.</p>
      <p>In the future, we are interested in exploring how to leverage this method to search for learned biases
and identify missing or undesired associations between concepts in artificial neural networks.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.
[6] A. Ignatiev, N. Narodytska, J. Marques-Silva, Abduction-based explanations for machine learning
models, in: AAAI’19, 2019.
[7] S. Rebufi, R. Fong, X. Ji, A. Vedaldi, There and Back Again: Revisiting Backpropagation Saliency</p>
      <p>Methods, in: CVPR’20, 2020.
[8] M. Ivanovs, R. Kadikis, K. Ozols, Perturbation-based methods for explaining deep neural networks:</p>
      <p>A survey, Pattern Recognit. Lett. 150 (2021) 228–234.
[9] G. P. J. Schmitz, C. Aldrich, F. S. Gouws, ANN-DT: an algorithm for extraction of decision trees
from artificial neural networks, IEEE Trans. Neural Networks 10 (1999) 1392–1401.
[10] M. G. Augasta, T. Kathirvalavakumar, Reverse engineering the neural networks for rule extraction
in classification problems, Neural Process. Lett. 35 (2012) 131–150.
[11] M. T. Ribeiro, S. Singh, C. Guestrin, "why should I trust you?": Explaining the predictions of any
classifier, in: SIGKDD’16, 2016.
[12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A Survey of Methods
for Explaining Black Box Models, ACM Comput. Surv. 51 (2019) 93:1–93:42.
[13] F. Shakerin, G. Gupta, Induction of non-monotonic logic programs to explain boosted tree models
using LIME, in: AAAI’19, 2019.
[14] F. Shakerin, G. Gupta, White-box induction from SVM models: Explainable AI with logic
programming, Theory Pract. Log. Program. 20 (2020) 656–670.
[15] J. Adebayo, M. Muelly, I. Liccardi, B. Kim, Debugging Tests for Model Explanations, in: NeurIPS’20,
2020.
[16] E. Chu, D. Roy, J. Andreas, Are Visual Explanations Useful? A Case Study in Model-in-the-Loop</p>
      <p>Prediction, CoRR abs/2007.12248 (2020).
[17] H. Shen, T. K. Huang, How Useful Are the Machine-Generated Interpretations to General Users?</p>
      <p>A Human Evaluation on Guessing the Incorrectly Predicted Labels, in: HCOMP’20, 2020.
[18] Y. Belinkov, Probing Classifiers: Promises, Shortcomings, and Advances, Comput. Linguistics 48
(2022) 207–219.
[19] R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, I. Fried, Invariant visual representation by single
neurons in the human brain, Nature 435 (2005) 1102–1107.
[20] R. Q. Quiroga, Concept cells: the building blocks of declarative memory functions, Nature Reviews</p>
      <p>Neuroscience 13 (2012) 587–597.
[21] E. Poeta, G. Ciravegna, E. Pastor, T. Cerquitelli, E. Baralis, Concept-based explainable artificial
intelligence: A survey, CoRR abs/2312.12936 (2023).
[22] G. Schwalbe, B. Finzel, A comprehensive taxonomy for explainable artificial intelligence: a
systematic survey of surveys on methods and concepts, Data Min. Knowl. Discov. 38 (2024)
3043–3101.
[23] B. Kim, M. Wattenberg, J. Gilmer, C. J. Cai, J. Wexler, F. B. Viégas, R. Sayres, Interpretability
Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV), in:
ICML’18, 2018.
[24] Z. Chen, Y. Bei, C. Rudin, Concept whitening for interpretable image recognition, Nature Machine</p>
      <p>Intelligence 2 (2020) 772–782.
[25] M. de Sousa Ribeiro, J. Leite, Aligning Artificial Neural Networks and Ontologies towards
Explainable AI, in: AAAI’21, 2021.
[26] T. Miller, Explanation in Artificial Intelligence: Insights from the Social Sciences, Artif. Intell. 267
(2019) 1–38.
[27] R. M. J. Byrne, Counterfactuals in explainable artificial intelligence (XAI): evidence from human
reasoning, in: IJCAI’19, 2019.
[28] R. Guidotti, Counterfactual explanations and how to find them: literature review and benchmarking,</p>
      <p>Data Mining and Knowledge Discovery (2022).
[29] F. Yang, N. Liu, M. Du, X. Hu, Generative Counterfactuals for Neural Networks via
Attribute</p>
      <p>Informed Perturbation, SIGKDD Explor. 23 (2021) 59–68.
[30] I. Stepin, J. M. Alonso, A. Catalá, M. Pereira-Fariña, A Survey of Contrastive and Counterfactual
Explanation Generation Methods for Explainable Artificial Intelligence, IEEE Access 9 (2021)
11974–12001.
[31] F. Hvilshøj, A. Iosifidis, I. Assent, ECINN: Eficient Counterfactuals from Invertible Neural</p>
      <p>Networks, in: BMVC’21, 2021.
[32] G. Goh, N. Cammarata, C. Voss, S. Carter, M. Petrov, L. Schubert, A. Radford, C. Olah, Multimodal</p>
      <p>Neurons in Artificial Neural Networks, Distill (2021).
[33] T. Okuyama, Social memory engram in the hippocampus, Neuroscience Research 129 (2018)
17–23.
[34] M. de Sousa Ribeiro, L. Krippahl, J. Leite, Explainable Abstract Trains Dataset, CoRR abs/2012.12115
(2020).
[35] J. Larson, R. S. Michalski, Inductive inference of VL decision rules, SIGART Newsl. 63 (1977)
38–44.
[36] J. Ferreira, M. de Sousa Ribeiro, R. Gonçalves, J. Leite, Looking Inside the Black-Box: Logic-based</p>
      <p>Explanations for Neural Networks, in: KR’22, 2022.
[37] M. Tucker, P. Qian, R. Levy, What if This Modified That? Syntactic Interventions with
Counterfactual Embeddings, in: ACL/IJCNLP’21, 2021.
[38] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, C. Igel, Detection of trafic signs in real-world
images: The german trafic sign detection benchmark, in: IJCNN’13, 2013.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge,
International Journal of Computer Vision (IJCV) 115 (2015) 211–252.
[40] U. UNECE, Convention on Road Trafic of 1968 and European Agreement Supplementing the</p>
      <p>Convention (2006 Consolidated Versions), United Nations Publications, 2007.
[41] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, MobileNetV2: Inverted Residuals and</p>
      <p>Linear Bottlenecks, in: CVPR’18, 2018.
[42] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. A. Specter, L. Kagal, Explaining Explanations: An</p>
      <p>Overview of Interpretability of Machine Learning, in: DSAA’18, 2018.
[43] M. K. Sarker, N. Xie, D. Doran, M. L. Raymer, P. Hitzler, Explaining trained neural networks with
semantic web technologies: First steps, in: NeSy’17, 2017.
[44] X. Li, Y. Shi, H. Li, W. Bai, C. C. Cao, L. Chen, An Experimental Study of Quantitative Evaluations
on Saliency Methods, in: KDD’21, 2021.
[45] Y. Elazar, S. Ravfogel, A. Jacovi, Y. Goldberg, Amnesic probing: Behavioral explanation with
amnesic counterfactuals, Trans. Assoc. Comput. Linguistics 9 (2021) 160–175.
[46] A. Ghorbani, J. Wexler, J. Y. Zou, B. Kim, Towards automatic concept-based explanations, in:</p>
      <p>NeurIPS’19, 2019.
[47] B. Zhou, D. Bau, A. Oliva, A. Torralba, Interpreting deep visual representations via network
dissection, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2019) 2131–2145.
[48] V. A. C. Horta, I. Tiddi, S. Little, A. Mileo, Extracting knowledge from deep neural networks
through graph analysis, Future Gener. Comput. Syst. 120 (2021) 109–118.
[49] J. Jo, Y. Bengio, Measuring the tendency of cnns to learn surface statistical regularities
abs/1711.11561 (2017).
[50] K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and editing factual associations in GPT, in:</p>
      <p>NeurIPS’22, 2022.
[51] L. Bereska, S. Gavves, Mechanistic interpretability for AI safety - A review, Trans. Mach. Learn.</p>
      <p>Res. (2024).
[52] P. Henriksen, F. Leofante, A. Lomuscio, Repairing misclassifications in neural networks using
limited data, in: SAC’22, 2022.
[53] D. L. Calsi, M. Duran, T. Laurent, X. Zhang, P. Arcaini, F. Ishikawa, Adaptive search-based repair
of deep neural networks, in: GECCO’23, 2023.
[54] J. Kim, N. Humbatova, G. Jahangirova, P. Tonella, S. Yoo, Repairing DNN architecture: Are we
there yet?, in: ICST’23, 2023.
[55] N. D. Cao, W. Aziz, I. Titov, Editing factual knowledge in language models, in: EMNLP’21, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tiño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Leonardis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>A survey on neural network interpretability</article-title>
          ,
          <source>IEEE Trans. Emerg. Top. Comput. Intell</source>
          .
          <volume>5</volume>
          (
          <year>2021</year>
          )
          <fpage>726</fpage>
          -
          <lpage>742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebrahimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Sarker</surname>
          </string-name>
          ,
          <article-title>Neural-symbolic Integration and the Semantic Web</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>3</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Visualizing and Understanding Convolutional Networks</article-title>
          ,
          <source>in: ECCV'14</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundararajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Axiomatic attribution for deep networks</article-title>
          ,
          <source>in: ICML'17</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ancona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ceolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Öztireli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <article-title>Towards better understanding of gradient-based attribution methods for Deep Neural Networks</article-title>
          ,
          <source>in: ICLR'18</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>