<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3701764</article-id>
      <title-group>
        <article-title>Explanation Robustness and Reproducibility under Adversarial Influence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amir Reza Mohammadi</string-name>
          <email>amir.reza@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael M. Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Peintner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beatriz Barroso Gstrein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Zangerle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Günther Specht</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Graph Neural Networks, Counterfactual Explanation, Adversarial Examples, Reproducibility</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Universität Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>139</volume>
      <fpage>8</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>Within Machine Learning, Graph Neural Networks (GNNs) have emerged as prominent techniques, particularly excelling in tasks tailored for graph structures. Due to the intricate nature of GNNs and the essential role in conveying outcomes to users, there is a pressing demand to enhance the explainability of these approaches. Among state-of-the-art explanation strategies, counterfactual explanation provides intuitive and easily understandable insights into model predictions by showing how a small change in the input would lead to a diferent outcome. However, the absence of benchmarks and standardized tasks hampers the evaluation of such approaches. Moreover, there has not been an empirical comparison of counterfactuals and adversarial examples, both aiming to alter model outputs with minimal perturbations. This reproducibility study rigorously analyzes prominent GNNbased counterfactual explanation methods, contrasting them against established adversarial attack baselines. Our objective is to look into counterfactual methods through the lens of adversarials and thereby, explore the interconnectedness of these techniques and foster a deeper understanding of their combined utility and implications. We validate five selected GNN-based counterfactual explanation methods in two levels of local and model-level explanation and compare them to two well-established adversarial attack methods. Our findings reveal that adversarial methods can serve as a competitive baseline for counterfactual explanation on node classification, and in certain tasks, they may even outperform them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Graph Neural Networks (GNNs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have witnessed a surge in prominence within the realm of Machine
Learning (ML), showcasing remarkable effectiveness in tasks explicitly designed for graph-structured
data [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. GNNs offer a powerful framework for capturing intricate relationships and patterns
embedded in graph data. While the empirical success of GNNs is evident [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], their black-box nature
often hinders the interpretability of their decisions, limiting their broader adoption in critical domains.
      </p>
      <p>
        Addressing the need for transparency and interpretability, Explainable Artificial Intelligence (XAI)
has garnered substantial interest across various communities. Among these approaches, counterfactual
explanation (CE) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is dedicated to advancing model explainability. CE not only provides intuitive
and easily understandable insights into model predictions but also enables users to grasp how minor
alterations in the input can lead to divergent outcomes. CE addresses a key question: “For a specific
instance, how should the input features X be subtly perturbed into new features X′ to yield a distinct
predicted label (typically a desired label) from ML models?” CE promotes human interpretation through
the comparison between X and X′. Departing from conventional CE studies centered on tabular or
image data, there is a growing emphasis on CE within graphs [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Despite the popularity of CE
methods, the absence of established tasks, widely used metrics, and standardized benchmarks has
impeded comprehensive evaluations, hindering the establishment of robust and widely used baselines.
      </p>
      <p>
        The exploration of adversarial attacks intersects with the study of counterfactuals (in fact, they have
even been shown to be equivalent [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Adversarial Examples (AEs) are inputs that closely resemble
authentic data but are misclassified by a trained ML model—for instance, an image of a turtle being
classified as a rifle 1. In this context, misclassified implies that the algorithm assigns the incorrect
class or value compared to a predefined (usually human-provided) ground-truth [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The intriguing
convergence of AE and CE in their shared goal of perturbing model outputs with minimal changes has
ignited ongoing discussions within the research community [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref9">9, 11, 12, 13</xref>
        ]. Surprisingly, there are no
empirical comparisons between these two methodological paradigms, especially within the context of
GNN counterfactuals [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Motivated by the lack of quantitative studies and analyses of CE methods, and also of quantitative
comparisons between AE and CE, our study conducts a thorough reproducibility analysis of prominent GNN-based
counterfactual explanation methods, juxtaposing their performance against established adversarial
attack baselines. Our investigation systematically validates five selected GNN-based counterfactual
explanation methods, namely CF2 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], CF-GNNExplainer [16], GCFExplainer [17], CLEAR [18], and
RCExplainer [19], which are the current SOTA methods in both model-level and instance-level explanation
(Figure 1). This examination aims to provide nuanced insights into the strengths, limitations, and
contextual relevance of each approach. Moreover, to ensure a robust and meaningful comparison,
two well-established adversarial attack methods are used as benchmarks against their counterfactual
counterparts. By adopting a holistic perspective, we seek to foster a deeper understanding of their
combined utility and potential applications within the dynamic landscape of GNNs.
      </p>
      <p>We emphasize the significance of transparency and reproducibility in scientific research. Accordingly,
we have made our code publicly accessible2, providing comprehensive details of all comparative
experiments.</p>
      <sec id="sec-1-1">
        <title>Instance-Level</title>
      </sec>
      <sec id="sec-1-2">
        <title>Heuristic-Based</title>
      </sec>
      <sec id="sec-1-3">
        <title>RCExplainer</title>
      </sec>
      <sec id="sec-1-4">
        <title>Graph Counterfactual</title>
      </sec>
      <sec id="sec-1-5">
        <title>Explainer</title>
      </sec>
      <sec id="sec-1-6">
        <title>Learning-Based</title>
        <p>Perturbation</p>
        <p>Matrix
CF-GNNExplainer
CF2
Generative</p>
      </sec>
      <sec id="sec-1-7">
        <title>CLEAR</title>
      </sec>
      <sec id="sec-1-8">
        <title>Model-Level</title>
      </sec>
      <sec id="sec-1-9">
        <title>Heuristic-Based</title>
      </sec>
      <sec id="sec-1-10">
        <title>GCFExplainer</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        The field of interpretable ML has advanced significantly [20, 21, 22], particularly in local interpretability
with early methods like LIME [23] and SHAP [22]. These works laid the foundation for post-hoc
explainability, treating models as black boxes. Initial studies primarily focused on improving the
1https://www.theverge.com/2017/11/2/16597276/google-ai-image-attacks-adversarial-turtle-rifle-3d-printed
2https://github.com/amirreza-m95/CE_vs_AE
interpretability of the models themselves [24, 25]. In the context of GNNs, GNNExplainer [26] marked
a breakthrough by identifying subgraphs responsible for node-level predictions, though it focused
on factual explanations. The shift toward counterfactual explanations (CEs), as introduced by
CF-GNNExplainer [16], enabled reasoning over “what-if” scenarios and inspired a range of new methods [
        <xref ref-type="bibr" rid="ref15">18,
15, 17, 19</xref>
        ]. While several surveys [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8, 27, 28</xref>
        ] review the landscape, reproducibility studies specifically
targeting GNN counterfactual explanations remain absent. To our knowledge, this is the first study to
empirically address that gap.
      </p>
      <p>
        From a high-level view, CEs seek minimal perturbations that flip model predictions, closely resembling
adversarial examples (AEs). While [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] highlight conceptual differences (e.g., “impossible worlds”), others
argue for formal equivalence [29] or stress semantic and application-specific distinctions [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ].
Freiesleben [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] emphasizes that AEs require misclassification, whereas CEs may retain the same class.
This distinction becomes relevant in GNNs, where adversarial attacks (e.g., Nettack [30]) often degrade
global performance rather than target local node predictions. CF-GNNExplainer also highlights this
nuance. Despite these theoretical debates, empirical work comparing CEs and AEs in GNNs is scarce.
This paper fills that gap through a systematic reproducibility and comparison study.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation and Definition</title>
      <p>We denote a graph as G = (V, ℰ), where V represents the set of nodes and ℰ denotes the set of edges.
Each node v ∈ V is characterized by a feature vector x_v ∈ ℝ^d.</p>
      <p>
        The existing body of literature on GNN explainability has predominantly concentrated on scenarios
involving graph classification and node classification, with a focus on categorical output spaces (see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
for a comprehensive survey of CE methods). In the context of graph classification, the input consists of
a set of graphs, each associated with a specific class label. The objective for the GNN Φ is to accurately
predict these class labels. Conversely, in node classification, class labels are linked to individual nodes,
and predictions are made at the node level.
      </p>
      <p>In a message-passing GNN with L layers, the embedding of a node is intricately tied to its L-hop
neighborhood. We introduce the term “neighborhood subgraph” to characterize this L-hop neighborhood.
Henceforth, for the sake of clarity, we will employ the term “graph” to denote the neighborhood subgraph
when referring to node classification.</p>
      <p>Counterfactual Reasoning: Let G be the input graph and Φ(G) the prediction on G. The task of
the counterfactual approach is to introduce the minimal set of perturbations to obtain a new graph
G* such that Φ(G) ≠ Φ(G*). Mathematically, this entails solving the following optimization problem:</p>
      <p>G* = arg min_{G′ ∈ 𝒢, Φ(G) ≠ Φ(G′)} dist(G, G′)   (1)</p>
      <p>where dist(G, G′) quantifies the distance between graphs G and G′ and 𝒢 is the set of all graphs one
may construct by perturbing G. Typically, distance is measured as the number of performed edge
perturbations while keeping the node set fixed.</p>
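      <p>To make the optimization in Eq. (1) concrete, the following minimal Python sketch searches for the smallest set of edge deletions that flips a classifier's prediction. It is illustrative only: the predict_label callable is a hypothetical stand-in for a trained GNN Φ evaluated on a fixed node set, and the brute-force search is not taken from any of the evaluated methods.</p>
      <preformat>
from itertools import combinations

def minimal_counterfactual(edges, predict_label, max_size=3):
    """Brute-force search for a minimal counterfactual by edge deletion.

    edges: iterable of (u, v) tuples describing the input graph G.
    predict_label: callable mapping a set of edges to a class label
        (a hypothetical stand-in for a trained GNN classifier).
    Returns the smallest set of removed edges that flips the prediction,
    or None if no counterfactual of size at most max_size exists.
    """
    edges = frozenset(edges)
    original = predict_label(edges)
    # dist(G, G') is the number of removed edges; try sizes 1, 2, ...
    for size in range(1, max_size + 1):
        for removal in combinations(sorted(edges), size):
            candidate = edges - set(removal)
            if predict_label(candidate) != original:
                return set(removal)
    return None

# Toy usage: a rule-based "model" that predicts 1 iff the triangle on nodes {0, 1, 2} is present.
toy_model = lambda e: int({(0, 1), (1, 2), (0, 2)}.issubset(e))
print(minimal_counterfactual({(0, 1), (1, 2), (0, 2), (2, 3)}, toy_model))  # {(0, 1)}
      </preformat>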
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>In the following, we detail the experimental setup for our studies.</p>
      <sec id="sec-4-1">
        <title>4.1. Research Questions</title>
        <p>Given our overall goals of (i) reproducing state-of-the-art counterfactual explanation methods, and (ii)
comparing counterfactual explanation approaches to adversarial learning baselines, we address the
following research questions in this paper:</p>
        <p>(RQ1) How feasible is it to reproduce the outcomes of the state-of-the-art CE methods? To what
degree do the underlying assumptions in these approaches withstand scrutiny? Which insights can be
gained regarding the error modes associated with these methods?</p>
        <p>(RQ2) What distinguishes counterfactual examples from adversarial examples within the framework
of GNNs? How can these two directions mutually benefit each other?</p>
        <p>(RQ3) Which CE performance evaluation metrics hold more promise? What are the respective
advantages and drawbacks of each?</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Datasets</title>
        <p>
          We evaluate the algorithms on a diverse set of datasets, encompassing both synthetic and real-world
scenarios. Specifically, we employ two synthetic datasets, BA-shapes and Tree-Cycles, introduced
by [26], adhering to the same setup as outlined in their work. BA-Shapes and Tree-Cycles are employed
for node classification, featuring predefined motifs (“house” and “cycle” structures) for interpretability.
For real-world contexts, we utilize Mutagenicity [
          <xref ref-type="bibr" rid="ref15">31, 15</xref>
          ], NCI1 [
          <xref ref-type="bibr" rid="ref15">32, 15</xref>
          ] and Ogbg-molhiv [18]. The
Mutagenicity dataset classifies molecules as either mutagenic or non-mutagenic, while the NCI1 dataset
categorizes chemical compounds as positive or negative with respect to lung cancer cell activity. Moreover, in Ogbg-molhiv,
each graph represents a molecule, with each node denoting an atom and each edge symbolizing a
chemical bond. Due to the unavailability of a ground-truth causal model, methods usually simulate both
the label and the causal relations of interest [18]. Additionally, we utilized Mutag0, a smaller subset
of the Mutagenicity dataset. [34] made the assumption that the nitro group (NO2) and amino group
(NH2) serve as the true contributors to mutagenicity. Consequently, they filtered out mutagens that did
not contain these specific groups. However, NH2 has minimal impact on mutagenicity, with
benzene-NO2 being the sole discriminative motif [35]. In response to this, a sub-dataset, Mutag0, has been
crafted by [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], encompassing chemical compounds featuring benzene-NO2 that exhibit mutagenicity,
or those lacking benzene-NO2 and displaying non-mutagenic properties, while skipping all other instances.
An overview of the details of each dataset can be found in Table 1.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Methods</title>
        <p>This study aims to replicate the most impactful and enduring methods from the CE literature and
time-tested approaches from the AE community. The selection of approaches is guided by the following
considerations:</p>
        <p>(a) The chosen methods must have significant influence and be well known in the community through
their use as SOTA baselines (see Table 2). (b) We prioritize methods with diverse representation
techniques to enhance the generalizability of our research. This diversity is crucial for capturing a
comprehensive understanding of GNN explanation methodologies, as illustrated in Figure 1. (c) The
selected methods are specifically drawn from the GNN context to maintain consistency within the
framework of this study.</p>
        <p>4.3.1. Counterfactual Methods</p>
        <p>We selected five representative counterfactual explanation methods (see Table 2) based on their influence
and diversity.</p>
        <p>
          CF2 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] solves a multi-objective optimization problem to balance factual and counterfactual
reasoning, controlled by the parameter α. We consider both the optimized model (α = 0.6) and the fully
counterfactual variant (α = 0).
        </p>
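        <p>The following schematic (our own reading of such a trade-off objective, with illustrative names and values; it is not the authors' implementation) shows how a parameter α can balance a factual term, a counterfactual term, and an explanation-size penalty:</p>
        <preformat>
def cf2_style_loss(prob_orig_class_keep, prob_orig_class_drop, num_explanation_edges,
                   alpha=0.6, lam=0.01):
    """Schematic CF2-style trade-off between factual and counterfactual strength.

    prob_orig_class_keep: model probability of the original class when only the
        explanation subgraph is kept (factual term: should stay high).
    prob_orig_class_drop: model probability of the original class when the
        explanation subgraph is removed (counterfactual term: should drop).
    num_explanation_edges: size penalty encouraging concise explanations.
    alpha: factual/counterfactual trade-off; alpha=0 is the fully counterfactual variant.
    """
    factual_loss = 1.0 - prob_orig_class_keep        # penalize losing the prediction with S kept
    counterfactual_loss = prob_orig_class_drop       # penalize keeping the prediction once S is removed
    return alpha * factual_loss + (1.0 - alpha) * counterfactual_loss + lam * num_explanation_edges
        </preformat>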
        <p>GCFExplainer [17] is a model-level method for graph classification. It constructs a meta-graph of
candidate counterfactuals and selects diverse explanations using vertex-reinforced random walks [36]
and a greedy algorithm.</p>
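        <p>As a rough illustration of the selection step, the greedy routine below picks candidate counterfactuals that cover as many input graphs as possible within a distance threshold; it is a simplified stand-in under our own assumptions, not the GCFExplainer algorithm itself:</p>
        <preformat>
def greedy_recourse_summary(input_graphs, candidates, dist, theta, k):
    """Greedily pick up to k candidate counterfactuals that cover the most inputs.

    An input graph counts as covered if some selected candidate lies within
    distance theta of it (cf. the Coverage metric in Section 4.4). `dist` is any
    graph distance, e.g. the number of differing edges.
    """
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for c in candidates:
            gain = sum(1 for i, g in enumerate(input_graphs)
                       if i not in covered and theta >= dist(g, c))
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:            # no remaining candidate covers a new input graph
            break
        selected.append(best)
        covered.update(i for i, g in enumerate(input_graphs) if theta >= dist(g, best))
    return selected
        </preformat>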
        <p>CF-GNNExplainer [16] is designed for node classification and learns a binary perturbation mask to
sparsify the graph’s adjacency matrix, minimizing changes needed to alter the prediction.</p>
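        <p>A minimal PyTorch sketch of such a perturbation-mask explainer is shown below. It assumes a classifier that accepts per-edge weights, i.e., model(x, edge_index, edge_weight) returning per-node logits; this interface and all hyper-parameters are assumptions for illustration, not the original code:</p>
        <preformat>
import torch

def edge_mask_counterfactual(model, x, edge_index, target, steps=200, lr=0.1, beta=0.5):
    """Learn a soft deletion mask over edges so that the prediction for node
    `target` changes, while penalizing the number of deleted edges.
    """
    # start near "keep every edge" (sigmoid of 2.0 is roughly 0.88)
    mask = torch.full((edge_index.size(1),), 2.0, requires_grad=True)
    opt = torch.optim.Adam([mask], lr=lr)
    with torch.no_grad():
        orig_class = model(x, edge_index, torch.ones(edge_index.size(1))).argmax(dim=1)[target]
    for _ in range(steps):
        keep = torch.sigmoid(mask)                              # soft "keep" probability per edge
        probs = model(x, edge_index, keep)[target].softmax(dim=-1)
        # push down the original class while penalizing deletions (1 - keep)
        loss = probs[orig_class] + beta * (1.0 - keep).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    kept_edges = edge_index[:, torch.sigmoid(mask) > 0.5]       # binarized counterfactual graph
    return kept_edges
        </preformat>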
        <p>CLEAR [18] uses a variational autoencoder to generate counterfactuals in latent space. It outputs
complete graphs with edge weights reflecting uncertainty, closely resembling the original input.</p>
        <p>RCExplainer [19] identifies linear decision boundaries via an unsupervised strategy, enhancing
robustness by generalizing across instances. It generates concise counterfactuals by selecting edge
subsets guided by a boundary-based loss.</p>
        <p>4.3.2. Adversarial Methods</p>
        <p>We evaluate two widely used adversarial attack methods for graph-structured data [37]:</p>
        <p>Nettack [30] targets node-level predictions by iteratively perturbing node features to deceive the
GNN, while preserving the graph structure. It computes gradients to identify minimal changes that flip
the model’s output.</p>
        <p>Meta Attack [38] uses a meta-learning approach to generate global adversarial attacks. Trained on
various graph datasets and models, it can efficiently poison graph classifiers without requiring gradient
access during inference.</p>
        <p>4.3.3. Configuration</p>
        <p>We employed the adversarial attack methods outlined earlier by utilizing the implementations available
in the DeepRobust open-source project3. Subsequently, we integrated these methods into
our pipeline. We adhere to the recommended hyper-parameter settings provided by DeepRobust. It is
important to highlight that certain modifications were necessary to harmonize these adversarial attack
methods with counterfactual techniques, enabling a meaningful comparison of results on the same
datasets. A noteworthy example is the Nettack method, which, by default, both adds and removes
edges when perturbing the graph. To ensure a fair comparison with CE methods, which primarily
involve edge removal, we introduced constraints to the optimization method. These constraints guide
Nettack to focus solely on edge removal, aligning it with the nature of CE methods. Further
elaboration on these adjustments and their implications is provided in Section 5.3. All training
is performed using an AMD Ryzen 2950X with 128GB RAM and a GeForce RTX 2070 with 8GB memory.
3https://github.com/DSE-MSU/DeepRobust
We repeat our experiments 3 times and report the average performance. We share our dataset
processing scripts, source code, and hyper-parameters in an anonymous repository4.</p>
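        <p>The deletion-only constraint can be pictured as a filter on the attack's candidate perturbations, as in the illustrative helper below (our own sketch; it does not use or mirror the DeepRobust API):</p>
        <preformat>
def deletion_only_candidates(adj, neighborhood_nodes):
    """Candidate perturbations restricted to edge deletions inside a target
    node's L-hop neighborhood: the constraint we impose on Nettack so that its
    perturbation space matches the CE methods.

    adj: dict mapping node id to a set of neighbor ids (undirected graph).
    neighborhood_nodes: set of node ids inside the L-hop neighborhood.
    """
    candidates = []
    for u in neighborhood_nodes:
        for v in adj.get(u, ()):
            if v in neighborhood_nodes and v > u:    # count each undirected edge once
                candidates.append((u, v))            # only existing edges, i.e. deletions
    return candidates
        </preformat>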
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Metrics</title>
        <p>In this section, we discuss different evaluation metrics used in the community and compare them to
clarify the rationale behind the metrics of our choice for this study.</p>
        <p>
          Necessity [
          <xref ref-type="bibr" rid="ref15">15, 28</xref>
          ] measures the percentage of graphs in which the removal of the explanation
subgraph induces a change in the GNN prediction, thereby establishing its necessity in influencing the
model’s output. Intuitively, Necessity quantifies the frequency with which removing subgraphs leads
to prediction changes, divided by the total number of instances. This metric resembles the metric
called Validity or correctness introduced by [18]. In the context of explainable GNNs, Necessity refers
to:
        </p>
        <p>Necessity(𝒢) = (1/|𝒢|) ∑_{i=1}^{|𝒢|} 1[Φ(ℛ_i) ≠ Φ(G_i)]   (2)</p>
        <p>where 𝒢 = {G_1, …, G_|𝒢|} is the set of input graphs, Φ(G_i) is the prediction of the model on G_i,
S_i is the explanation subgraph of G_i, and ℛ_i = G_i − S_i is the residual graph. In the same setting of variables, Fidelity [16] is
the exact opposite metric:</p>
        <p>Fidelity(𝒢) = (1/|𝒢|) ∑_{i=1}^{|𝒢|} 1[Φ(ℛ_i) = Φ(G_i)]   (3)</p>
        <p>As a result, in the context of counterfactual reasoning, we want lower values for Fidelity and higher
values for Necessity and Validity.</p>
        <p>Sufficiency is defined as the percentage of generated explanations that prove to be sufficient for an
instance to achieve the same prediction as using the entire graph. In essence, Sufficiency intuitively
quantifies the percentage of graphs where the explanation subgraph alone is capable of maintaining the
GNN prediction unchanged.</p>
        <p>Sufficiency(𝒢) = (1/|𝒢|) ∑_{i=1}^{|𝒢|} 1[Φ(S_i) = Φ(G_i)]   (4)</p>
        <p>Explanation size serves as a minimality evaluation metric which refers to the count of removed
edges, representing the disparity between the original graph G and the counterfactual graph G′. Given
our aim to minimize explanations, a smaller value for this metric is preferable.</p>
        <p>Coverage is a metric for evaluating a recourse representation ℂ for the graph classification task [17],
defined as the percentage of input graphs that possess nearby counterfactuals from ℂ within a specified
distance threshold θ:</p>
        <p>Coverage(ℂ) = |{G ∈ 𝒢 ∣ min_{C ∈ ℂ} d(G, C) ≤ θ}| / |𝒢|   (5)</p>
        <p>In this context, [17] used the metric Cost, i.e., the recourse cost, representing the distance between
each input graph and its respective counterfactual within the dataset:</p>
        <p>Cost(ℂ) = agg_{G ∈ 𝒢} min_{C ∈ ℂ} d(G, C)   (6)</p>
        <p>This metric also closely resembles the explanation size from local CE.</p>
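        <p>Given per-graph explanation edge sets and a prediction function, Eqs. (2) to (4) can be computed as in the following sketch; the predict_label callable is a hypothetical stand-in for the trained GNN and is not tied to any of the evaluated implementations:</p>
        <preformat>
def explanation_metrics(graphs, explanations, predict_label):
    """Compute Necessity, Fidelity and Sufficiency as in Eqs. (2)-(4).

    graphs: list of edge sets G_i; explanations: list of explanation edge sets S_i.
    predict_label: callable mapping an edge set to a class label.
    """
    n = len(graphs)
    necessity = sufficiency = 0
    for g_edges, s_edges in zip(graphs, explanations):
        g_edges, s_edges = set(g_edges), set(s_edges)
        full_pred = predict_label(g_edges)
        residual_pred = predict_label(g_edges - s_edges)   # R_i = G_i - S_i
        subgraph_pred = predict_label(s_edges)             # explanation subgraph alone
        necessity += int(residual_pred != full_pred)
        sufficiency += int(subgraph_pred == full_pred)
    necessity /= n
    sufficiency /= n
    fidelity = 1.0 - necessity                              # Eq. (3) is the complement of Eq. (2)
    return {"necessity": necessity, "sufficiency": sufficiency, "fidelity": fidelity}
        </preformat>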
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we present the results of our empirical study, shedding light on the performance and
effectiveness of both adversarial examples and counterfactual explanations. Our experiments aim to
provide insights into the comparative aspects of these methodological paradigms, addressing their
impact.</p>
      <sec id="sec-5-1">
        <title>5.1. CE Reproducibility Study (RQ1)</title>
        <p>We assess the reproducibility of each counterfactual explanation method by replicating their experiments
and evaluating consistency with reported results.</p>
        <p>CF2: All experiments ran smoothly except for the NCI1 dataset, due to missing code and data. We
observed large fluctuations in the Necessity metric (0.58–0.90), depending on classifier performance and
preprocessing. This suggests high sensitivity to the underlying model. Using the original Mutagenicity
dataset (vs. Mutag0) also caused a drop in classification accuracy, though explanation performance
remained stable.</p>
        <p>CF-GNNExplainer: Reproduced results successfully despite minor code deprecations. Model
performance was sensitive to hyperparameters, requiring re-optimization to match reported values.</p>
        <p>GCFExplainer: As the only model-level method, it reproduced well using both pretrained and
freshly trained models. However, its pretraining phase (VRRW) was resource-intensive, requiring a
256GB RAM machine due to memory constraints.</p>
        <p>CLEAR: Results were partially reproducible. We matched the validity score on IMDB-M (0.91 vs.
0.96 reported) but could not evaluate the Community dataset due to unavailable data. On larger datasets
(e.g., Mutagenicity), CLEAR failed due to memory issues, though it worked on the smaller Mutag0.</p>
        <p>RCExplainer: The original code link was inactive, but we obtained a working version from the
authors5. With that, we reproduced the results using both pretrained models and our own training.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Comparing Different CE Methods (RQ1)</title>
        <p>In this section, we undertake a comparative analysis of the methods within the two primary categories of
node and graph classification tasks. The evaluation is based on metrics outlined in Section 4.4, specifically
Necessity, Sufficiency, Explanation Size, Coverage and Cost. For node classification, we utilize the
BA-shapes and Tree-Cycles datasets, while for graph classification, we employ the Mutagenicity and
NCI1 datasets. These datasets were chosen due to their widespread adoption in the field, as indicated in
Table 2. To ensure a thorough comparison, we attempted to use the most commonly adopted datasets
and metrics. As a result, for some of the methods, we employed metrics and datasets that differ from
those originally used in their respective research. We believe this approach opens up new possibilities
for integrating various metrics and datasets, potentially enhancing the robustness of evaluations across
different methodologies. The only exception to this approach is CLEAR, which utilized metrics and
datasets not used in other papers. To ensure a fair comparison, we adapted the CLEAR method
to the Mutag0 (non-original for CLEAR) dataset and also adapted CF2 to the Ogbg-molhiv dataset
(from CLEAR), facilitating a more comprehensive evaluation of the method. For an overview of this
experiment, we refer to Tables 3 to 4.</p>
        <p>Our comparison for the graph classification task is shown in Table 3. The evaluation is based on
the sufficiency and explanation size metrics, with higher sufficiency and lower size values indicating
better performance. RCExplainer demonstrates strong performance on the Mutagenicity dataset with
a sufficiency score of 0.64. On the other hand, CF2 (Opt.) displays a lower sufficiency score on both
datasets compared to RCExplainer and GCFExplainer and notably exhibits a much larger size,
particularly on the NCI1 dataset with a size of 17.70, indicating a less concise explanation. Similarly, CF2
(α = 0) also shows lower sufficiency scores on both datasets, as expected and reported by the authors. While
CF2 (α = 0) achieves competitive sufficiency scores, its larger explanation size makes it less concise
than RCExplainer and GCFExplainer. In contrast, RCExplainer, which provides moderate sufficiency
scores and slightly better performance on NCI1, consistently produces smaller explanations, indicating
higher conciseness across both datasets.</p>
        <p>In summary, RCExplainer stands out for its balance of high sufficiency and consistent size across
datasets, making it a strong candidate for applications where interpretability and efficiency are
paramount. Conversely, while CF2 (Opt.) and CF2 (α = 0) offer competitive sufficiency scores,
their larger sizes may indicate more complex explanations. GCFExplainer, with its moderate sufficiency
scores and consistently low sizes, presents a viable alternative for scenarios where a balance between
interpretability and complexity is desired. We also have to consider that GCFExplainer is the only
model-level explanation method.</p>
        <p>In Table 4, we include the Necessity metric to compare node classification methods. Sufficiency values
for CF-GNNExplainer are not presented in the paper as this method does not incorporate sufficiency
in its optimization process. CF-GNNExplainer primarily concentrates on minimizing perturbations to
change the class label without explicitly optimizing for generating a concise summary of the graph.
However, CF-GNNExplainer demonstrates moderate performance with Necessity scores of 0.61 and 0.79 on the
BA-Shapes and Tree-Cycles datasets respectively, indicating its capability to identify essential features
for classification. CF-GNNExplainer exhibits relatively low sizes on both datasets, with values of 2.39 and
2.09, suggesting concise explanations.</p>
        <p>CF2 (α = 0) and CF2 (α = 0.6) outperform CF-GNNExplainer in terms of Necessity on both datasets.
However, both variants of CF2 display larger sizes compared to CF-GNNExplainer, with values ranging from
3.6 to 7.76, suggesting potentially more complex explanations. To facilitate a more comprehensive
comparison between these two methods while maintaining a fixed explanation size, we evaluated their
performance on the Necessity metric. This analysis revealed that CF-GNNExplainer exhibits superior
performance at lower explanation size values. However, as we progress towards higher explanation
size values, CF2 outperforms CF-GNNExplainer. We refer to the results depicted in Figure 2 for further
insights into the comparison between these two methods.</p>
        <p>Overall, CF-GNNExplainer offers concise explanations with moderate Necessity scores, but falls short on
sufficiency. CF2 variants provide more comprehensive explanations with higher Necessity and sufficiency
scores, albeit at the cost of larger sizes, indicating potentially more complex explanations. Depending
on the specific requirements of the application, practitioners may choose between CF-GNNExplainer for its
simplicity and CF2 variants for their comprehensiveness. These findings are novel, as no prior studies
have conducted such an evaluation and comparison of these methods.</p>
        <p>Table 3 compares CF2, CLEAR, and others across Mutag0 and Ogbg-molhiv. CF2 offers compact,
high-sufficiency explanations, while CLEAR suffers from scalability issues, failing on larger datasets
due to memory limitations. These results emphasize the need to balance accuracy, explanation size,
and scalability in real-world graph classification tasks.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. GNN Counterfactuals Compared with Adversarial Examples (RQ2)</title>
        <p>Building on the theoretical connections between counterfactual explanations (CEs) and adversarial
examples (AEs), we compare two representative methods from each: Nettack vs. CF2 on the BA-Shapes
dataset (node classification) and Meta Attack vs. GCFExplainer on Mutagenicity (graph classification).
Results are summarized in Figures 2 and 3.</p>
        <p>To ensure fair comparison, we adapted Nettack to align with CF2’s constraints by limiting it to edge
deletions within 3-hop neighborhoods and using the same evaluation pipeline. As shown in Figure 2,
Nettack achieves higher Necessity scores at low perturbation levels, consistent with its goal of minimal
changes. This supports its role as a strong baseline when minimal perturbations are desired.</p>
        <p>In graph classification, Figure 3 shows that GCFExplainer outperforms Meta Attack in both coverage
and cost, indicating that AEs are less effective for generating diverse, interpretable explanations. Meta
Attack introduces larger changes while remaining less competitive, highlighting the trade-offs in global
adversarial perturbations.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Insights and Observations on Evaluation Metrics (RQ3)</title>
        <p>
          In this section, we delve into the metrics used for comparing and analyzing methods, as outlined
in Table 2. It’s noteworthy that none of the papers we studied used the same metrics for evaluation,
posing challenges when comparing methods. Additionally, conflicts in the interpretation of these
metrics further complicate matters. For example, regarding sufficiency, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] advocates for higher values,
while [28] argues that higher values indicate superior performance for factual explanations, yet lower
values are preferred for counterfactual scenarios where the goal is to flip the class label.
        </p>
        <p>Moreover, these metrics are limited in their ability to guide model improvement. For instance, in
the case of the Fidelity metric, [16] showed that the Random algorithm outperforms all other methods
with an accuracy of 0.0 percent, leaving no room for enhancement. This issue arises because the
metric evaluates correctness based on ground-truth labels rather than predictions, so random
perturbations fail to impact model performance and yield the minimum fidelity score. Additionally,
as observed in the Necessity metric for CF2, results fluctuate depending on classifier performance,
underscoring the need for benchmarked metrics within the CE community. These insights underscore
the complexities in metric interpretation and stress the importance of standardized evaluation protocols
in CE research.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this reproducibility paper, we conducted a comprehensive empirical study on prominent GNN-based
counterfactual explanation methods, juxtaposing their performance against established adversarial
attack baselines. Through our investigation, we systematically validated five selected GNN-based CE
methods, namely CF2, CF-GNNExplainer, GCFExplainer, CLEAR, and RCExplainer, shedding light on
their strengths, limitations, and contextual relevance.</p>
      <p>Our comparative analysis revealed nuanced insights into the performance of these methods across
various datasets and tasks. Notably, RCExplainer emerged as a standout performer in graph classification
tasks, exhibiting a balance of high sufficiency and consistent explanation size. Conversely, while
CF2 variants displayed competitive sufficiency scores, they often presented larger explanation sizes,
potentially indicating more complex explanations.</p>
      <p>Our study lays the groundwork for several avenues of future research aimed at advancing the field
of XAI and GNNs. Here, we outline potential directions for further exploration: (a) The intersection
of CE and adversarial attacks presents opportunities for developing hybrid approaches that leverage
the strengths of both paradigms. Future work could explore the integration of CE and adversarial
defense strategies to develop more robust and interpretable AI systems. (b) Future research could
explore the development of context-aware counterfactual explanation methods tailored to specific
application domains. Context-awareness involves considering additional contextual information such
as user preferences, domain-specific constraints, and situational factors when generating explanations.
(c) Future research could focus on unifying and standardizing metrics used for evaluating counterfactual
explanation methods.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.
network explanations based on counterfactual and factual reasoning, in: F. Laforest, R. Troncy,
E. Simperl, D. Agarwal, A. Gionis, I. Herman, L. Médini (Eds.), WWW ’22: The ACM Web
Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, ACM, 2022, pp. 1018–1027. URL:
https://doi.org/10.1145/3485447.3511948.
[16] A. Lucic, M. A. ter Hoeve, G. Tolomei, M. de Rijke, F. Silvestri, Cf-gnnexplainer: Counterfactual
explanations for graph neural networks, in: G. Camps-Valls, F. J. R. Ruiz, I. Valera (Eds.),
International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022,
Virtual Event, volume 151 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 4499–4511.</p>
      <p>URL: https://proceedings.mlr.press/v151/lucic22a.html.
[17] Z. Huang, M. Kosan, S. Medya, S. Ranu, A. K. Singh, Global counterfactual explainer for graph
neural networks, in: T. Chua, H. W. Lauw, L. Si, E. Terzi, P. Tsaparas (Eds.), Proceedings of the
Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore,
27 February 2023 - 3 March 2023, ACM, 2023, pp. 141–149. URL: https://doi.org/10.1145/3539597.
3570376.
[18] J. Ma, R. Guo, S. Mishra, A. Zhang, J. Li, CLEAR: generative counterfactual
explanations on graphs, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
(Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on
Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA,
November 28 - December 9, 2022, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/
a69d7f3a1340d55c720e572742439eaf-Abstract-Conference.html.
[19] M. Bajaj, L. Chu, Z. Y. Xue, J. Pei, L. Wang, P. C. Lam, Y. Zhang, Robust counterfactual
explanations on graph neural networks, in: M. Ranzato, A. Beygelzimer, Y. N. Dauphin,
P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems 34:
Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December
6-14, 2021, virtual, 2021, pp. 5644–5655. URL: https://proceedings.neurips.cc/paper/2021/hash/
2c8c3a57383c63caef6724343eb62257-Abstract.html.
[20] A. Ghorbani, J. Wexler, J. Y. Zou, B. Kim, Towards automatic concept-based explanations, in: H. M.</p>
      <p>Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances in
Neural Information Processing Systems 32: Annual Conference on Neural Information Processing
Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 9273–9282.
URL: https://proceedings.neurips.cc/paper/2019/hash/77d2afcb31f6493e350fca61764efb9a-Abstract.
html.
[21] A. R. Mohammadi, A. Peintner, M. Müller, E. Zangerle, Are we explaining the same recommenders?
incorporating recommender performance for evaluating explainers, in: RecSys ’24, ACM, 2024, p.
1113–1118.
[22] A. Ghorbani, J. Y. Zou, Neuron shapley: Discovering the responsible neurons, in: H. Larochelle,
M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/
41c542dfe6e4fc3deb251d64cf6ed2e4-Abstract.html.
[23] M. T. Ribeiro, S. Singh, C. Guestrin, ”why should I trust you?”: Explaining the predictions of
any classifier, in: B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, R. Rastogi
(Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August 13-17, 2016, ACM, 2016, pp. 1135–1144. URL:
https://doi.org/10.1145/2939672.2939778.
[24] A. Peintner, A. R. Mohammadi, E. Zangerle, SPARE: shortest path global item relations for
eficient session-based recommendation, in: J. Zhang, L. Chen, S. Berkovsky, M. Zhang, T. D. Noia,
J. Basilico, L. Pizzato, Y. Song (Eds.), Proceedings of the 17th ACM Conference on Recommender
Systems, RecSys 2023, Singapore, Singapore, September 18-22, 2023, ACM, 2023, pp. 58–69. URL:
https://doi.org/10.1145/3604915.3608768.
[25] A. Peintner, A. R. Mohammadi, E. Zangerle, Eficient session-based recommendation with
contrastive graph-based shortest path search, ACM Trans. Recomm. Syst. 3 (2025). URL:</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Scarselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Tsoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagenbuchner</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Monfardini,</surname>
          </string-name>
          <article-title>The graph neural network model</article-title>
          ,
          <source>IEEE Trans. Neural Networks</source>
          <volume>20</volume>
          (
          <year>2009</year>
          )
          <fpage>61</fpage>
          -
          <lpage>80</lpage>
          . URL: https://doi.org/10.1109/TNN.
          <year>2008</year>
          .
          <volume>2005605</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yanardag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <article-title>Deep graph kernels</article-title>
          , in: L.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , T. Joachims,
            <given-names>G. I.</given-names>
          </string-name>
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          <string-name>
            <surname>Margineantu</surname>
          </string-name>
          , G. Williams (Eds.),
          <source>Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , Sydney,
          <string-name>
            <surname>NSW</surname>
          </string-name>
          , Australia,
          <source>August 10-13</source>
          ,
          <year>2015</year>
          , ACM,
          <year>2015</year>
          , pp.
          <fpage>1365</fpage>
          -
          <lpage>1374</lpage>
          . URL: https://doi.org/10.1145/2783258.2783417.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          , G. Namata,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilgic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          Eliassi-Rad,
          <article-title>Collective classification in network data</article-title>
          ,
          <source>AI Mag</source>
          .
          <volume>29</volume>
          (
          <year>2008</year>
          )
          <fpage>93</fpage>
          -
          <lpage>106</lpage>
          . URL: https://doi.org/10.1609/aimag.v29i3.
          <fpage>2157</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Semi-supervised classification with graph convolutional networks</article-title>
          ,
          <source>in: 5th International Conference on Learning Representations, ICLR</source>
          <year>2017</year>
          , Toulon, France,
          <source>April 24-26</source>
          ,
          <year>2017</year>
          , Conference Track Proceedings, OpenReview.net,
          <year>2017</year>
          . URL: https://openreview.net/forum? id=
          <fpage>SJU4ayYgl</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Velickovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cucurull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Casanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Graph attention networks</article-title>
          ,
          <source>in: 6th International Conference on Learning Representations, ICLR</source>
          <year>2018</year>
          , Vancouver, BC, Canada, April 30 - May 3,
          <year>2018</year>
          , Conference Track Proceedings, OpenReview.net,
          <year>2018</year>
          . URL: https:// openreview.net/forum?id=rJXMpikCZ.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Towards a rigorous science of interpretable machine learning</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <volume>1702</volume>
          .
          <fpage>08608</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Explainability in graph neural networks: A taxonomic survey</article-title>
          , CoRR abs/
          <year>2012</year>
          .15445 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2012</year>
          .15445. arXiv:
          <year>2012</year>
          .15445.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Prado-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Prenkaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <article-title>A survey on graph counterfactual explanations: Definitions, methods</article-title>
          , evaluation,
          <source>CoRR abs/2210</source>
          .12089 (
          <year>2022</year>
          ). URL: https://doi.org/10. 48550/arXiv.2210.12089. arXiv:
          <volume>2210</volume>
          .
          <fpage>12089</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wachter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Mittelstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanations without opening the black box: Automated decisions and the GDPR</article-title>
          ,
          <source>CoRR abs/1711</source>
          .00399 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/ 1711.00399. arXiv:
          <volume>1711</volume>
          .
          <fpage>00399</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Hsieh, ZOO: zeroth order optimization based blackbox attacks to deep neural networks without training substitute models</article-title>
          , in: B.
          <string-name>
            <surname>Thuraisingham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Biggio</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          Sinha (Eds.),
          <source>Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security</source>
          ,
          <source>AISec@CCS</source>
          <year>2017</year>
          , Dallas, TX, USA, November 3,
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>26</lpage>
          . URL: https://doi.org/10.1145/3128572.3140448.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Dickerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hines</surname>
          </string-name>
          ,
          <article-title>Counterfactual explanations for machine learning: A review</article-title>
          , CoRR abs/
          <year>2010</year>
          .10596 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
          <year>2010</year>
          .10596. arXiv:
          <year>2010</year>
          .10596.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>McGrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Costabello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Van</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kamiab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lécué</surname>
          </string-name>
          ,
          <article-title>Interpretable credit application predictions with counterfactual explanations</article-title>
          , CoRR abs/
          <year>1811</year>
          .05245 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/
          <year>1811</year>
          .05245. arXiv:
          <year>1811</year>
          .05245.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Laugel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lesot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Marsala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Renard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Detyniecki</surname>
          </string-name>
          ,
          <article-title>Unjustified classification regions and counterfactual explanations in machine learning</article-title>
          , in: U. Brefeld, É. Fromont,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Knobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Maathuis</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          Robardet (Eds.),
          <source>Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD</source>
          <year>2019</year>
          , Würzburg, Germany,
          <source>September 16-20</source>
          ,
          <year>2019</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          , volume
          <volume>11907</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2019</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -46147-
          <issue>8</issue>
          _
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Freiesleben</surname>
          </string-name>
          ,
          <article-title>The intriguing relation between counterfactual explanations and adversarial examples</article-title>
          ,
          <source>Minds Mach</source>
          .
          <volume>32</volume>
          (
          <year>2022</year>
          )
          <fpage>77</fpage>
          -
          <lpage>109</lpage>
          . URL: https://doi.org/10.1007/s11023-021-09580-9.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Learning and evaluating graph neural network explanations based on counterfactual and factual reasoning, in: F. Laforest, R. Troncy, E. Simperl, D. Agarwal, A. Gionis, I. Herman, L. Médini (Eds.), WWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, ACM, 2022, pp. 1018–1027. URL: https://doi.org/10.1145/3485447.3511948.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>