<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extending Merlin-Arthur Classifiers for Improved Interpretability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Berkant Turan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department for AI in Society, Science, and Technology, Zuse Institute Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In my doctoral research, I aim to address the interpretability challenges associated with deep learning by extending the Merlin-Arthur Classifier framework. This novel approach employs a pair of feature selectors, including an adversarial player, to generate informative saliency maps. My research focuses on enhancing the classifier's performance and exploring its applicability to complex datasets, including a recently established human benchmark for detecting pathologies in X-ray images. Tackling the min-max optimization challenge inherent in the Merlin-Arthur Classifier for high-dimensional data, I will explore and apply diverse stabilization strategies to bolster the framework's robustness and training stability. Finally, the goal is to expand the framework beyond pixel-level saliency maps to encompass modalities, such as text and learned feature spaces, fostering a comprehensive understanding of interpretability across various domains and data types.</p>
      </abstract>
      <kwd-group>
        <kwd>Interactive Classification</kwd>
        <kwd>Mutual Information</kwd>
        <kwd>Merlin-Arthur Classifier</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>
        Over the past decade, machine learning has made tremendous progress, especially through
advances in deep learning. Despite this progress, major concerns have been raised about the
safety of Artificial Intelligence (AI) in view of its large-scale
application [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. One of the safety-related issues concerns the lack of interpretability of deep
neural networks deployed in mission-critical tasks. To address this issue, different techniques
have been developed in the field of Explainable AI (XAI), including local and global methods [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,
4, 5</xref>
        ]. A common feature of these methods is that they are often based on heuristics [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Although
heuristic methods have had their successes, such as unmasking biases of established classifiers
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a large body of research has highlighted the growing concern about the disadvantages of
non-formal interpretability techniques [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Additionally, several XAI methods have shown
vulnerability to manipulation through the strategic design of neural networks [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ].
      </p>
      <p>
        These concerns emphasize the need for the development and adoption of formal and
robust explanation methods in the field of AI. Approaches to interpretability, such as Mutual
Information [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or Shapley values [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], have been proposed as more rigorous alternatives
to heuristic-based methods. However, these formal techniques require a faithful model
of the underlying distribution, which is often difficult to obtain for non-synthetic data. This has been
achieved in practice with generative models [
        <xref ref-type="bibr" rid="ref13 ref15">13, 15</xref>
        ], but doing so still requires trust
in the underlying generative model. We can circumvent this requirement with an
interactive classification setup that allows us to provide bounds on the precision, which in turn
provide bounds on the mutual information.
      </p>
      <p>[Figure 1: The Merlin-Arthur Classifier. Merlin (feature selector) passes a feature to Arthur (classifier).]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Merlin-Arthur Classifier</title>
      <p>
        In this section, we discuss Merlin-Arthur Classifiers, a novel classification framework
developed by our research group that uses multi-agent interaction to provide theoretical
interpretability guarantees [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This framework is inspired by the Interactive Proof System
(IPS), specifically the Merlin-Arthur protocol. The prover (Merlin) selects a feature from a data
point and presents it to a verifier (Arthur) who determines its class, as illustrated in Figure 1.
This interactive approach has already been studied in [16]. However, it has been noted that
Merlin and Arthur can cooperate to achieve high accuracy with uninformative features [
        <xref ref-type="bibr" rid="ref6">17, 6</xref>
        ],
see Figure 2. Thus, the adversarial aspect of the interactive proof system is crucial. We introduce a
second, adversarial prover (Morgana) with the objective to convince Arthur of the wrong class.
In essence, our theory shows that the only strategy that Merlin and Arthur can use is one that
cannot be exploited by Morgana, see Figure 3 for an illustration.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Formal Description</title>
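        <p>Consider a dataset <inline-formula><tex-math>D</tex-math></inline-formula> with a ground truth class map <inline-formula><tex-math>c : D \to C</tex-math></inline-formula>, where <inline-formula><tex-math>C</tex-math></inline-formula> is the set of classes,
and a feature space <inline-formula><tex-math>\Sigma \subset 2^{D}</tex-math></inline-formula>. We say a data point <inline-formula><tex-math>\mathbf{x} \in D</tex-math></inline-formula> has the feature <inline-formula><tex-math>\phi \in \Sigma</tex-math></inline-formula> if <inline-formula><tex-math>\mathbf{x} \in \phi</tex-math></inline-formula>. We
define the notion of a feature selector as a map <inline-formula><tex-math>M : D \to \Sigma</tex-math></inline-formula> such that <inline-formula><tex-math>\forall \mathbf{x} \in D : \mathbf{x} \in M(\mathbf{x})</tex-math></inline-formula>.
Such a feature selector can, for example, be realized by using a pixel-based saliency method and
selecting the <inline-formula><tex-math>k</tex-math></inline-formula> most salient pixels. For a visual representation, refer to Figure 1. The quality of
a selected feature <inline-formula><tex-math>\phi</tex-math></inline-formula> is stated in terms of the mutual information</p>
        <disp-formula><label>(1)</label><tex-math>I_{\mathbf{y} \sim \mathcal{D}}(c(\mathbf{y}); \mathbf{y} \in \phi) := H_{\mathbf{y} \sim \mathcal{D}}(c(\mathbf{y})) - H_{\mathbf{y} \sim \mathcal{D}}(c(\mathbf{y}) \mid \mathbf{y} \in \phi),</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>H_{\mathbf{y} \sim \mathcal{D}}(c(\mathbf{y}) \mid \mathbf{y} \in \phi)</tex-math></inline-formula> is the class conditional entropy given that <inline-formula><tex-math>\mathbf{y}</tex-math></inline-formula> contains the feature <inline-formula><tex-math>\phi</tex-math></inline-formula>,
and <inline-formula><tex-math>\mathcal{D}</tex-math></inline-formula> is the data distribution on <inline-formula><tex-math>D</tex-math></inline-formula>. We extend this definition to a feature selector by taking
an average over the features selected from the dataset, i.e., <inline-formula><tex-math>\mathbb{E}_{\mathbf{x} \sim \mathcal{D}} H_{\mathbf{y} \sim \mathcal{D}}(c(\mathbf{y}) \mid \mathbf{y} \in M(\mathbf{x}))</tex-math></inline-formula>. This
quantity should be close to zero for a high-quality feature selector. The problem is that for most
datasets, measuring <inline-formula><tex-math>H_{\mathbf{y} \sim \mathcal{D}}(c(\mathbf{y}) \mid \mathbf{y} \in \phi)</tex-math></inline-formula> is not feasible, particularly for high-dimensional data.
This complexity, amplified by continuous features or large datasets, results from the curse of
dimensionality and therefore requires approximations or alternative methods.</p>
        <p>In our setup, we define notions of</p>
        <disp-formula><label>(2)</label><tex-math>\text{Completeness:} \quad \mathbb{P}_{\mathbf{x} \sim \mathcal{D}}\left[A(M(\mathbf{x})) = c(\mathbf{x})\right] \quad \text{and}</tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math>\text{Soundness:} \quad 1 - \max_{l \in C \setminus \{c(\mathbf{x})\}} \mathbb{P}_{\mathbf{x} \sim \mathcal{D}}\left[A(\widehat{M}(\mathbf{x})) = l\right],</tex-math></disp-formula>
        <p>
          which are estimated on a test dataset and can be used to bound the conditional entropy for the
features exchanged between Merlin and Arthur, see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>The challenge now lies in developing a training process for this setup that effectively handles
complex data while simultaneously achieving high levels of completeness and soundness.</p>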
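        <p>As a minimal sketch of how completeness and soundness could be estimated on a test set, the following Python example uses a hypothetical toy setting in which data points are tuples of tokens and a feature is a single token. The players arthur, merlin, and morgana are hand-written stand-ins, not the trained networks of the framework, and Arthur returns None when he abstains. The soundness function encodes one possible reading of the definition above (for each class, the rate at which Morgana makes Arthur output that class on points of a different class), which is an assumption.</p>
        <preformat>
```python
def completeness(arthur, merlin, data, labels):
    """Fraction of points where Arthur, shown Merlin's feature, returns the true class."""
    hits = sum(1 for x, y in zip(data, labels) if arthur(merlin(x)) == y)
    return hits / len(data)


def soundness(arthur, morgana, data, labels, classes):
    """One minus the highest rate at which Morgana steers Arthur to a single wrong class."""
    worst = 0.0
    for wrong in classes:
        fooled = sum(
            1 for x, y in zip(data, labels)
            if y != wrong and arthur(morgana(x)) == wrong
        )
        worst = max(worst, fooled / len(data))
    return 1.0 - worst


# Hypothetical toy players (assumptions for illustration only).
def arthur(feature):
    # Arthur maps a token to a class, or abstains (None) on unknown tokens.
    return {"a": 0, "b": 1}.get(feature)


def merlin(x):
    # Cooperative prover: always shows the first token of the point.
    return x[0]


def morgana(x):
    # Adversarial prover: always shows the last token of the point.
    return x[-1]


data = [("a", "n"), ("b", "n"), ("a", "b"), ("b", "n")]
labels = [0, 1, 0, 1]

comp = completeness(arthur, merlin, data, labels)
sound = soundness(arthur, morgana, data, labels, classes=[0, 1])
print(comp, sound)
```
        </preformat>
        <p>In this toy run, Merlin always convinces Arthur of the true class (completeness 1.0), while Morgana succeeds on the one point that also contains a token of the other class (soundness 0.75).</p>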
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research Objective</title>
      <p>The primary benefit of Merlin-Arthur Classifiers over existing approaches is the ability
to quantitatively bound the quality of the features in terms of mutual information under
reasonable assumptions and without relying on heuristics. Calculating the mutual information,
however, may not always be feasible, particularly for complex datasets, such as Street View
House Numbers (SVHN) [18], CIFAR-10 [19] or ImageNet [20]. To date, the numerical assessment
of the framework has been restricted to basic datasets, such as MNIST [21] and the UCI Census
dataset [22], where the mutual information between the selected feature and the target class
can be computed.</p>
      <p>This leads to the following question:</p>
      <p>How can Merlin-Arthur Classifiers be extended to complex datasets while maintaining a strong
alignment between theoretical foundations and practical implementation?</p>
      <p>In my doctoral research, my objective is to expand the scope of Merlin-Arthur Classifiers
and to evaluate their effectiveness on more demanding datasets while preserving this alignment
between theoretical foundations and practical implementation.</p>
      <sec id="sec-3-1">
        <title>3.1. Dimensionality Reduction through Generative Models</title>
        <p>The development of the Merlin-Arthur Classifiers framework would involve several key steps.
First, it is crucial to identify alternative measures or techniques that can be used in place of
mutual information for complex datasets. A potential research direction includes generative
models, such as Variational Autoencoders (VAE) [23], to represent the high-dimensional data
point on a lower-dimensional manifold. More precisely, the feature selectors, Merlin and
Morgana, would not select individual pixels, but instead select features derived from the latent
representation that typically correspond to more generalized features and present them to the
classifier. This approach would result in a two-fold reduction in complexity, as it not only
decreases the dimensionality but also permits maintaining a smaller maximum size for the
feature.</p>
        <p>Can generative models, like Variational Autoencoders, enhance Merlin-Arthur Classifiers for
complex datasets while maintaining feature quality?</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Improving the Stability of Training on Complex Data</title>
        <p>As the Merlin-Arthur Classifiers framework features a min-max objective, addressing potential
instabilities that may arise during the training process is a crucial aspect of enhancing its
performance on complex datasets. To stabilize the training process, it is important to explore
various techniques and strategies that have been successful in addressing similar issues in other
models with min-max objectives.</p>
        <p>One prominent example is Generative Adversarial Networks (GANs) [24], which have been
the subject of extensive research on stabilizing min-max objectives [25]. Drawing from this
research, the Merlin-Arthur Classifiers can be stabilized by combining several techniques, such as
alternative activation functions (e.g., Leaky ReLU), spectral normalization, and different
optimizers or replay strategies [26]. These techniques can be tailored and integrated into the Merlin-Arthur
framework to ensure a robust training process.</p>
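        <p>To make two of the named techniques concrete, the sketch below shows spectral normalization (dividing a weight matrix by an estimate of its largest singular value) and a Leaky ReLU activation in plain Python. It is an illustration under stated assumptions, not the training code of the framework: the helper names are hypothetical, the power iteration starts from an all-ones vector (which would fail if that vector were orthogonal to the top singular direction), and in practice deep learning libraries provide both operations as built-in layers.</p>
        <preformat>
```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iter=100):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = transpose(W)
    v = [1.0] * len(W[0])  # assumed start vector
    for _ in range(n_iter):
        v = matvec(Wt, matvec(W, v))
        scale = norm(v)
        if scale == 0.0:
            return 0.0
        v = [x / scale for x in v]
    return norm(matvec(W, v))


def spectrally_normalize(W, n_iter=100):
    """Rescale W so that its largest singular value is approximately 1."""
    sigma = spectral_norm(W, n_iter)
    if sigma == 0.0:
        return [row[:] for row in W]
    return [[w / sigma for w in row] for row in W]


def leaky_relu(x, slope=0.01):
    """Identity for positive inputs, small linear slope otherwise."""
    return x if x > 0 else slope * x


W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = spectrally_normalize(W)
print(sigma, W_sn, leaky_relu(-2.0))
```
        </preformat>
        <p>Bounding the largest singular value of each layer limits how sharply a player's output can change with its input, which is the property that makes spectral normalization a common stabilizer for min-max training.</p>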
        <p>Can we extend Merlin-Arthur classification to complex data in a stable way?</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Validation and Performance Comparison</title>
        <p>Once suitable alternatives have been identified and the framework has been adapted, validation
of the adapted framework is essential and should involve diverse datasets, spanning multiple
domains and complexity levels, to demonstrate the framework’s robustness and versatility.</p>
        <p>
          For example, a recent study in the field of medical imaging has established a human benchmark
for detecting pathologies in X-ray images. This study found that all investigated state-of-the-art
interpretability methods lack accuracy and reliability [27]. Therefore, it is essential to benchmark
the extended Merlin-Arthur Classifiers framework against existing methods such as LIME [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
SHAP [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and LRP [28] to showcase its potential advantages and limitations in the medical
domain. By doing so, the effectiveness and reliability of the framework can be better understood
and established.
        </p>
        <p>How can we rigorously validate the extended Merlin-Arthur Classifiers framework across diverse
datasets? Can we compare and identify synergies between the framework and other XAI methods?</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Contributions</title>
      <p>In the course of my doctoral research, I have already made progress towards achieving the
objectives and key steps outlined earlier. In this section, I will present preliminary results and
contributions that have been made to date.</p>
      <sec id="sec-4-1">
        <title>4.1. Feature Quality Measurement</title>
        <p>
          To assess the quality of the feature selection, appropriate datasets must be identified. While the
lack of a ground truth is a common issue in XAI, we addressed this by using a modified version
of the UCI Census dataset in our preprint [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For image data, popular datasets like CIFAR-10
and ImageNet are often used, with bounding boxes serving as the ground truth to calculate
Intersection over Union (IoU) between the saliency map and the box [29].
        </p>
        <p>We opted for the original SVHN dataset, which has variable-resolution color images with
digits at different locations, as a better alternative to datasets in which the target object dominates
the entire image. We trained the Merlin-Arthur Classifier to identify images containing the
digit "1" and to generate saliency maps that highlight the regions where the digit appears, as shown
in Figure 4(b). Comparing these maps with the bounding boxes, we calculated the IoU to assess
map quality. Our preliminary results suggest that the Merlin-Arthur Classifiers framework
effectively highlights the target regions of interest in the SVHN dataset. However, further
improvements are needed for more complex datasets and classification tasks.</p>
        <p>[Figure 4: (a) Samples from SVHN. (b) Saliency maps with bounding boxes.]</p>
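        <p>The IoU comparison between a saliency map and a bounding box can be sketched as follows. This is a minimal illustration, assuming that the saliency map is binarized by keeping its k most salient pixels and that a box is given as (top, left, bottom, right) with exclusive bottom and right edges; the function names and the tiny 4x4 example are hypothetical and do not reproduce the SVHN experiments.</p>
        <preformat>
```python
def top_k_mask(saliency, k):
    """Binary mask that keeps the k most salient pixels of a 2D saliency map."""
    scored = sorted(
        ((value, r, c) for r, row in enumerate(saliency) for c, value in enumerate(row)),
        reverse=True,
    )
    keep = {(r, c) for _, r, c in scored[:k]}
    return [
        [1 if (r, c) in keep else 0 for c in range(len(row))]
        for r, row in enumerate(saliency)
    ]


def box_mask(height, width, box):
    """Binary mask of a box (top, left, bottom, right); bottom and right are exclusive."""
    top, left, bottom, right = box
    rows, cols = range(top, bottom), range(left, right)
    return [
        [1 if (r in rows and c in cols) else 0 for c in range(width)]
        for r in range(height)
    ]


def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of equal shape."""
    pairs = [(a, b) for row_a, row_b in zip(mask_a, mask_b) for a, b in zip(row_a, row_b)]
    inter = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return inter / union if union else 0.0


# Hypothetical 4x4 saliency map and ground-truth box.
saliency = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
selected = top_k_mask(saliency, k=3)
truth = box_mask(4, 4, box=(1, 1, 3, 3))
score = iou(selected, truth)
print(score)
```
        </preformat>
        <p>Here the three selected pixels all fall inside the four-pixel box, so the IoU is 3/4. The choice of k directly affects the score, so it should be reported alongside any IoU value.</p>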
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Directions and Potential Contributions</title>
      <p>As we continue to extend the Merlin-Arthur Classifiers framework to handle complex datasets,
there are several potential contributions that this research can make to the field of AI, and to XAI
in particular:</p>
      <list list-type="order">
        <list-item>
          <p>Establishing a robust and versatile framework that can handle a wide range of datasets
and classification tasks, thereby enhancing the applicability of Merlin-Arthur Classifiers
in real-world scenarios.</p>
        </list-item>
        <list-item>
          <p>Comparing the Merlin-Arthur Classifiers framework with other methods, highlighting
strengths and weaknesses, and guiding future explainable AI research. Specifically,
evaluating its applicability and effectiveness on medical benchmarks, such as detecting
pathologies in X-ray images, where other methods have underperformed.</p>
        </list-item>
        <list-item>
          <p>Exploring challenges, such as transcending pixel-level relevance by incorporating
text-based agent conversations or leveraging Variational Autoencoder feature embeddings,
to significantly advance the capabilities and impact of the Merlin-Arthur Classifiers
framework.</p>
        </list-item>
      </list>
      <p>By addressing these challenges, my doctoral research aims to advance Merlin-Arthur Classifiers,
broadening their applicability, impact, and potential in XAI.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>I would like to express my gratitude to Stephan Wäldchen for his invaluable guidance in
shaping this doctoral proposal. My thanks also go to Prof. Dr. Sebastian Pokutta for his diligent
supervision, and to Kartikey Sharma for his constructive feedback during this initial phase.</p>
      <p>[15] M. T. Ribeiro, S. Singh, C. Guestrin, Anchors: High-precision model-agnostic explanations, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018). URL: https://ojs.aaai.org/index.php/AAAI/article/view/11491. doi:10.1609/aaai.v32i1.11491.</p>
      <p>[16] T. Lei, R. Barzilay, T. Jaakkola, Rationalizing neural predictions, arXiv preprint arXiv:1606.04155 (2016).</p>
      <p>[17] M. Yu, S. Chang, Y. Zhang, T. Jaakkola, Rethinking cooperative rationalization: Introspective extraction and complement control, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019. doi:10.18653/v1/D19-1420.</p>
      <p>[18] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. Ng, Reading digits in natural images with unsupervised feature learning, in: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL: http://ufldl.stanford.edu/housenumbers, accessed: 2022-02-23.</p>
      <p>[19] A. Krizhevsky, Learning multiple layers of features from tiny images, Technical Report, 2009.</p>
      <p>[20] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25 (2012).</p>
      <p>[21] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (1998) 2278–2324.</p>
      <p>[22] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.</p>
      <p>[23] D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, in: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. arXiv:1312.6114v10.</p>
      <p>[24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 27, Curran Associates, Inc., 2014. URL: https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.</p>
      <p>[25] M. Wiatrak, S. V. Albrecht, A. Nystrom, Stabilizing generative adversarial networks: A survey, arXiv preprint arXiv:1910.00927 (2019).</p>
      <p>[26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, arXiv preprint arXiv:1606.03498 (2016).</p>
      <p>[27] A. Saporta, X. Gui, A. Agrawal, A. Pareek, S. Q. Truong, C. D. Nguyen, V.-D. Ngo, J. Seekins, F. G. Blankenberg, A. Y. Ng, et al., Benchmarking saliency methods for chest X-ray interpretation, Nature Machine Intelligence 4 (2022) 867–878.</p>
      <p>[28] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLOS ONE 10 (2015) 1–46. doi:10.1371/journal.pone.0130140.</p>
      <p>[29] P. Dabkowski, Y. Gal, Real time image saliency for black box classifiers, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/0060ef47b12160b9198302ebdb144dcf-Paper.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <article-title>Concrete problems in AI safety</article-title>
          ,
          <source>arXiv preprint arXiv:1606.06565</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohseni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zarei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Ragan</surname>
          </string-name>
          ,
          <article-title>A multidisciplinary survey and framework for design and evaluation of explainable AI systems</article-title>
          ,
          <source>ACM Trans. Interact. Intell. Syst</source>
          .
          <volume>11</volume>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1145/3387166. doi:10.1145/3387166.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>51</volume>
          (
          <year>2018</year>
          ). doi:10.1145/3236009.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '16, Association for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          . URL: https://doi.org/10.1145/2939672.2939778. doi:10.1145/2939672.2939778.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <article-title>The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery</article-title>
          .,
          <source>Queue</source>
          <volume>16</volume>
          (
          <year>2018</year>
          )
          <fpage>31</fpage>
          -
          <lpage>57</lpage>
          . doi:10.1145/3236386.3241340.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wäldchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Turan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pokutta</surname>
          </string-name>
          ,
          <article-title>Formal interpretability with Merlin-Arthur Classifiers</article-title>
          ,
          <source>arXiv preprint arXiv:2206.00759</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lapuschkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wäldchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Binder</surname>
          </string-name>
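          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Montavon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Samek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>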
          ,
          <article-title>Unmasking clever hans predictors and assessing what machines really learn</article-title>
          ,
          <source>Nature Communications</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>1096</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Marques-Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ignatiev</surname>
          </string-name>
          ,
          <article-title>Delivering trustworthy AI through formal XAI</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>36</volume>
          (
          <year>2022</year>
          )
          <fpage>12342</fpage>
          -
          <lpage>12350</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/21499. doi:10.1609/aaai.v36i11.21499.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Slack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hilgard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          ,
          <article-title>Fooling lime and shap: Adversarial attacks on post hoc explanation methods</article-title>
          ,
          <source>in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society</source>
          , AIES '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>180</fpage>
          -
          <lpage>186</lpage>
          . URL: https://doi.org/10.1145/3375627.3375830. doi:10.1145/3375627.3375830.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Slack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hilgard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lakkaraju</surname>
          </string-name>
          ,
          <article-title>Fooling lime and shap: Adversarial attacks on post hoc explanation methods</article-title>
          (
          <year>2020</year>
          ). doi:10.1145/3375627.3375830.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dimanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <article-title>You shouldn't trust me: Learning models which conceal unfairness from multiple explanation methods</article-title>
          , in: SafeAI@AAAI,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Heo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <article-title>Fooling neural network interpretations via adversarial model manipulation</article-title>
          ,
          <source>in: Neural Information Processing Systems</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Learning to explain: An information-theoretic perspective on model interpretation</article-title>
          ,
          <source>arXiv preprint arXiv:1802.07814</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          , NIPS'17, Curran Associates Inc., Red Hook, NY, USA,
          <year>2017</year>
          , p.
          <fpage>4768</fpage>
          -
          <lpage>4777</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Anchors: High-precision model-agnostic explanations,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>