<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Check for Trustworthy AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sascha Mücke</string-name>
          <email>sascha.muecke@tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Pfahler</string-name>
          <email>lukas.pfahler@tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AI Group, TU Dortmund University</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Methods of Explainable AI (XAI) try to illuminate the decision-making process of complex Machine Learning models by generating explanations. However, for most real-world data there is no “ground-truth” explanation, which makes evaluating the correctness of XAI methods and model decisions difficult. Often visual assessment or anecdotal evidence is the only type of evaluation. In this work we propose to use the game of chess as a source of “near ground-truth” (NGT) explanations, which XAI methods can be compared against using various metrics, serving as a “sanity check”. We demonstrate this process in an experiment with a deep convolutional neural network, to which we apply a range of commonly used XAI methods. As our main contribution, we publish our data set of 30 million chess positions along with their NGT explanations for free use in XAI research.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI</kwd>
        <kwd>Trustworthy AI</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Chess</kwd>
      </kwd-group>
      <conference>
        <conf-name>LWDA'22: Lernen, Wissen, Daten, Analysen</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>https://www-ai.cs.tu-dortmund.de/PERSONAL/muecke.html (S. Mücke);
https://www-ai.cs.tu-dortmund.de/PERSONAL/pfahler.html (L. Pfahler)</p>
      <p>© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[…] the position is called check. If, in addition, the player cannot prevent their King from being captured
on the next move, the position is called checkmate, and the game is over.</p>
      <p>
        Chess is strictly rule-based, yet highly complex regarding the number of possible game
positions. It has historical significance for AI, being perceived as an important milestone in
approaching (and surpassing) human-level intelligence, which was reached when IBM’s chess
computer Deep Blue defeated then world chess champion Garry Kasparov in 1997 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Today,
chess computers are far stronger than any human player.
      </p>
      <p>However, in this paper we follow a different path and use chess as a data source for an XAI
benchmark. To this end, we extract millions of chess positions from a large database of real
chess games between human players. We sort all positions into one of three classes (no check,
check or checkmate) and generate several binary 8 × 8 masks each, indicating which squares
are important to explain the position at hand (see fig. 1, which will be explained in detail in
section 2). We call these near ground-truth (NGT) explanations, as there are numerous equally
valid representations of an explanation, and we select some of them.</p>
      <p>Specifically, in this work we restrict the explanations to binary 8 × 8 masks, which highlight
certain squares. The advantage is that feature attributions produced by common XAI methods
can easily be compared to these masks by means of distance metrics such as Euclidean or cosine.
If one finds a high correspondence between XAI output and any of our NGT explanations, one
can be more confident that (i) their model pays attention to useful features, and (ii) the XAI
method at hand works as intended.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          Explainable Machine Learning is a very active field of research, and a multitude of post-hoc model
explanation methods has been proposed [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15">10, 11, 12, 13, 14, 15</xref>
          ]. We should note that evaluating
explanations is a non-trivial task on its own [
          <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
          ]. Particularly visual explanations
and example-based explanations [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] are often subjective and defining quantitative metrics is
difficult. Datasets where expert explanations are provided as an additional layer of annotation
may allow quantitative evaluations. As such, the benchmark we propose can be classified as a
“Controlled Synthetic Data Check” according to [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], though our data is not synthetic but is
taken from actual chess games played by humans.
        </p>
        <p>
          However, it is not always apparent that there is a single, unique explanation. Synthetic data
can ensure this. For instance, Ismail et al. use synthetic time-series data [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and
Sanchez-Lengeling et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] provide a set of synthetic graph-classification tasks where labeled sets
of nodes are responsible for the target variable. Tritscher et al. generate datasets of synthetic
categorical features and use randomly generated boolean functions as labeling functions with
3 to 10 important features [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. In these checks, the XAI methods should identify only truly
relevant parts of the input as relevant. More recently, Tritscher et al. provide a more realistic
dataset for fraud detection where domain experts have manually annotated the “problematic”
features of simulated fraud cases [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>Our contribution is somewhat more general, as we provide a dataset of raw chess positions for
a generic classification task, which can be approached with a wide variety of feature extraction
methods and model types. Also, to the best of our knowledge, the domain of chess has not yet
been explored as an XAI benchmark.</p>
        <p>
          Other related research focuses on evaluating how well XAI methods perform from a user’s
perspective [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. We focus instead on the objective similarity between generated and NGT
explanations, which are independent from human users.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Structure of this Paper</title>
        <p>In section 2 we introduce our dataset of chess positions and NGT explanations, detailing our
pre-processing steps and our reasoning behind the specific explanations we chose. In section 3
we train an ML model to classify the chess positions, apply post-hoc XAI methods to the trained
model and compare the output to our NGT explanations using various metrics. Finally, in
section 4 we discuss other possible applications of our data set, and further research directions
we identified. Information about how to access our data set can be found there, as well.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Data Set</title>
      <p>Our data set is based on the Lichess open database<sup>1</sup>, which contains records of over 3 billion
games of chess played online by human players on the free chess website lichess.org. The
records are saved in Portable Game Notation (PGN) format, containing the series of moves played
as well as meta information, such as player ratings, timings and computer evaluations. The
database is split into downloadable files for each month since January 2013. For our data set,
we arbitrarily chose May 2021, as it contains over 100 million games. To read and process the
games and to create the explanations, we used the Python package chess<sup>2</sup>.</p>
      <p><sup>1</sup> https://database.lichess.org/</p>
      <p><sup>2</sup> https://python-chess.readthedocs.io/en/latest/</p>
      <p>[Figure: board diagrams of example chess positions]</p>
      <p>We selected only those games that end in checkmate, excluding those that end by timeout or
resignation. In a first pre-processing step, we iterate through the games and extract random
positions that occur during the game. Each position falls into one of three classes: no check (0),
check (1) or checkmate (2).</p>
      <p>There are many more non-check positions than check positions in a typical game of chess,
and again many more checks than checkmates. To approximately balance the classes while
iterating over the positions, we accept each position at random, with a probability
inversely proportional to the observed proportion of its class. Moreover, we skip the first ten moves
of each game, as there is considerable overlap between games within the first few moves, which
would lead to a large number of duplicate data points. In total, we extracted 30 million chess
positions. The data set is summarized in table 1.</p>
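      <p>The class balancing described above can be sketched as rejection sampling. The following is a minimal illustration rather than the code used to build the data set: the function names and the example counts are hypothetical, and scaling the probabilities so that the rarest class is always accepted is an assumption, since the text only states that the probability is inversely proportional to the class proportion.</p>
      <preformat><![CDATA[
```python
import random


def acceptance_probabilities(class_counts):
    """Map each label to an acceptance probability that is inversely
    proportional to its observed proportion.

    Assumption: probabilities are scaled so that the rarest class is
    always accepted (probability 1).
    """
    total = sum(class_counts.values())
    proportions = {label: n / total for label, n in class_counts.items()}
    rarest = min(proportions.values())
    return {label: rarest / p for label, p in proportions.items()}


def balanced_sample(labeled_positions, class_counts, rng):
    """Keep each (position, label) pair with its class-specific probability."""
    probs = acceptance_probabilities(class_counts)
    return [(pos, label) for pos, label in labeled_positions
            if rng.random() < probs[label]]


# Hypothetical counts for the labels no check (0), check (1), checkmate (2).
example_counts = {0: 900, 1: 90, 2: 10}
example_probs = acceptance_probabilities(example_counts)
```
]]></preformat>
      <p>With these hypothetical counts, every class contributes the same expected number of accepted positions, because the count of each class multiplied by its acceptance probability is identical.</p>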
      <p>The data is saved as a CSV file containing the chess positions in Forsyth–Edwards Notation
(FEN) and the label (0-2) as columns. The FEN string can be read by most chess software
packages and encodes the current piece setup, whose turn it is and some more game-specific
information (castling rights, en-passant squares).</p>
      <sec id="sec-2-1">
        <title>2.1. Explanations</title>
        <p>For each position we generate explanations consisting of 8 × 8 bit masks that can be laid over
the chess board, highlighting certain squares. For each class, we identified one explanation type
that characterizes it most accurately:</p>
        <list list-type="bullet">
          <list-item>
            <p>No check (0): All squares that are controlled by the enemy player, i.e., all squares that
any enemy piece can move to or capture on. The fact that the King is not on any of
these squares proves that the position is not check.</p>
          </list-item>
          <list-item>
            <p>Check (1): All squares (origin or target) of legal moves. As a checkmate is a check where
the player under attack has no more legal moves, highlighting legal moves is sufficient to
disprove a checkmate. Note that the piece giving check is only highlighted when it can
be legally captured.</p>
          </list-item>
          <list-item>
            <p>Checkmate (2): All squares with pieces that are essential for creating the checkmate.
This includes attackers, friendly pieces blocking the King, enemy pieces guarding escape
squares and enemy pieces protecting attackers.</p>
          </list-item>
        </list>
        <p>Examples of explanations for 0 and 1 can be seen in fig. 2, for 2 in fig. 3. As discussed in
section 1, none of these explanations is the perfect explanation. For instance, the explanation
given for class 0 is also a valid explanation for class 1, and vice versa. However, in this work
we aim to provide a variety of explanations with different semantics, and the choice of which
explanation to apply to which class is up to the users of our data set. We therefore generate
each type of explanation mentioned above for each data point, regardless of its class.</p>
        <p>We save the explanations in a CSV file as unsigned integers representing 64-bit binary masks.
Using the SquareSet class from the chess package, these integers can be converted back to
binary masks (or other representations). Along with our data set, we provide code to convert
between different representations.</p>
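        <p>As an illustration of this encoding, the following plain-Python sketch unpacks a 64-bit integer into an 8 × 8 grid and back. It assumes the square numbering of the chess package (bit 0 is a1, bit 7 is h1, bit 63 is h8); the function names and the choice to put rank 8 in the first row are hypothetical and not taken from the published code.</p>
        <preformat><![CDATA[
```python
def mask_to_rows(mask):
    """Unpack a 64-bit integer into an 8 x 8 list of 0/1 values.

    Assumed convention: bit i belongs to square i, with a1 = 0,
    b1 = 1, ..., h8 = 63. Row 0 of the result is rank 8, so the grid
    prints with White at the bottom.
    """
    rows = []
    for rank in range(7, -1, -1):
        rows.append([(mask >> (8 * rank + file)) & 1 for file in range(8)])
    return rows


def rows_to_mask(rows):
    """Inverse of mask_to_rows: pack an 8 x 8 grid into one integer."""
    mask = 0
    for row_index, row in enumerate(rows):
        rank = 7 - row_index
        for file, bit in enumerate(row):
            if bit:
                mask |= 1 << (8 * rank + file)
    return mask


# Example: a mask that highlights e1 (square 4) and h8 (square 63).
example_rows = mask_to_rows((1 << 4) | (1 << 63))
```
]]></preformat>
        <p>Keeping the masks as integers makes the CSV file compact, while the grid form lines up directly with 8 × 8 attribution maps.</p>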
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>We transformed the FEN representation of each chess position into a 12 × 8 × 8 binary feature tensor
<italic>X</italic>, where the first dimension represents all combinations of 6 chess piece types and 2 colors, and
the remaining dimensions are board coordinates. Feature <italic>X</italic><sub><italic>k</italic>,<italic>i</italic>,<italic>j</italic></sub> = 1 if there is a piece of type <italic>k</italic>
on square (<italic>i</italic>, <italic>j</italic>). We used this representation because it contains a minimal amount of domain
knowledge, though other representations are possible, too (e.g., 8 × 8 categorical features).</p>
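      <p>A minimal sketch of such a feature extraction, written in plain Python, is given below. The channel order (white pieces first, then black) and the board orientation (row 0 is rank 8) are assumptions made for illustration, and the function name is hypothetical; the published code may use a different layout.</p>
      <preformat><![CDATA[
```python
# Assumed channel order: six white piece types, then six black ones.
PIECES = "PNBRQKpnbrqk"


def fen_to_tensor(fen):
    """Return a 12 x 8 x 8 nested list of 0/1 values.

    Entry [k][i][j] is 1 if piece PIECES[k] stands on row i (0 = rank 8)
    and column j (0 = file a), and 0 otherwise.
    """
    tensor = [[[0] * 8 for _ in range(8)] for _ in range(12)]
    placement = fen.split()[0]  # first FEN field: piece placement
    for i, rank in enumerate(placement.split("/")):
        j = 0
        for char in rank:
            if char.isdigit():
                j += int(char)  # a digit encodes that many empty squares
            else:
                tensor[PIECES.index(char)][i][j] = 1
                j += 1
    return tensor


START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
start_tensor = fen_to_tensor(START_FEN)
```
]]></preformat>
      <p>Only the piece placement field of the FEN string is used here; side to move, castling rights and en-passant squares are ignored in this sketch.</p>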
      <p>We train a convolutional neural network with 4 convolution layers with a total of 782,851
trainable parameters on the given 3-class decision problem. A detailed description can be found
in appendix A. The convolutions use filter widths 7, 5 and 3, respectively, with centered padding.
The larger filters help the network detect how far-away positions interplay. We add weight
decay of 0.001 to combat overfitting in light of these large filter matrices. We use the Adam
optimizer for a total of 400,000 weight updates with an initial learning rate of 0.001 that is
reduced by a factor of 0.1 four times during training. The final model achieves a classification
accuracy of 99.02% on 10,000 hold-out examples. The training accuracy also converges at 99%.
It is not entirely surprising that a sufficiently large deep network can learn a deterministic
function given millions of data points.</p>
      <p>We want to stress that the point of this work is not to propose to use ML to classify chess
positions. A deterministic, polynomial-time algorithm can do that just fine, such as the one
included in the chess Python package we used. Instead we want to explore how some commonly
used post-hoc explanation approaches applied to our convolutional neural network explain
the model’s decision, showcasing a possible use case of our data set. Because we have domain
knowledge on the rules of chess and access to the true decision boundary, we can inspect the
explanations and see if they are sound and in concordance with the rules of chess. In addition,
our NGT explanations allow us to compute distances, reducing the necessity of subjective
evaluation through domain experts.</p>
      <p>[Figure 4, panel (b) Saliency Map: board diagram]</p>
      <p>
        The explanation methods we consider are Saliency maps and Gradient times Input [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
Feature Permutation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Guided Grad-CAM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Guided Backprop [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Integrated Gradient [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Occlusion [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. All these methods are implemented in the Captum software library<sup>3</sup>
for explainable machine learning. We apply Guided Grad-CAM to the first and last convolution
layers of our model, respectively.
      </p>
      <p>[Figure: attribution maps for one position; panels (a) Integrated Gradient, (b) Saliency Map, (c) Input times Gradient, (d) Guided Backpropagation]</p>
      <p>As distance measures, we choose Chebyshev (ℓ∞), correlation, cosine and Euclidean (ℓ2)
distance. To make the 12 × 8 × 8 feature attributions compatible with our 8 × 8 NGT explanations,
we take the absolute maximum ‖⋅‖∞ over the first dimension, which gives us the strongest
attribution found for each square.</p>
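      <p>The aggregation and the four distance measures can be sketched in plain Python as follows. This is an illustration under assumptions, not the evaluation code of the experiment: cosine distance is taken as one minus the cosine similarity and correlation distance as the cosine distance of the mean-centered vectors, which are common conventions but are not spelled out in the text. All function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def aggregate(attribution):
    """Collapse a 12 x 8 x 8 attribution to 64 values by taking, for each
    square, the largest absolute value over the 12 piece channels."""
    return [max(abs(attribution[k][i][j]) for k in range(12))
            for i in range(8) for j in range(8)]


def chebyshev(u, v):
    """Largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """One minus the cosine similarity; undefined for all-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def correlation(u, v):
    """Cosine distance between the mean-centered vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])
```
]]></preformat>
      <p>To compare against an NGT explanation, the 8 × 8 mask would be flattened in the same row-major order as the aggregated attribution before calling any of the distance functions.</p>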
      <p>We apply each distance measure to each combination of XAI method and class-specific NGT
explanation (as described in section 2).</p>
      <sec id="sec-3-1">
        <title>3.1. Results</title>
        <p>We report the results of our experiment in tables 2 to 4. Cosine distance turned out to be almost
equal to 1 in all cases, which is why we dropped it from the result tables. Although there is
no clear overall trend, Guided Grad-CAM is often closest to NGT among the XAI methods we
tested on classes 0 and 1. On class 2, Integrated Gradient is closest, though Guided Grad-CAM
also performs very well.</p>
        <p>Figures 4 and 5 show some exemplary visualizations of explanations generated by various
methods. In fig. 4 we observe high attributions in the general vicinity of the King and its attacker
(the Queen on h5), although important squares that are contained in our NGT explanation
(Pawns on f4 and g2 blocking the King’s escape squares) have no significant attribution. A
possible explanation is that XAI methods may recognize that the Queen is often involved in
checkmate positions when it is in the King’s vicinity, but fail to capture the more subtle rules
of the game, which make the Pawns on f4 and g2 essential pieces for this checkmate. To understand why they
are essential, one has to understand the movement patterns of both the King and the Pawn,
while also looking one move ahead. Understanding the Queen attacking the King requires only
the movement pattern of one piece, and no looking ahead, which makes it arguably easier to
recognize as a reason for check or checkmate.</p>
        <p>Figure 5 shows radically different attributions from our NGT, indicating that the model uses
other criteria to judge that a position is not a check. This is not entirely surprising, as explaining
why a position is not a check is far more difficult and vague than showing the opposite.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We have presented a data set of 30 million chess positions falling into one of three classes (no
check, check or checkmate), for which we generated three different types of explanations. These
explanations are based on domain knowledge and can be considered “near ground-truth”, as the
game of chess has perfect information, and every position can be classified according to a fixed
set of rules. Such a data set of explanations is a valuable benchmark for evaluating post-hoc
explanation methods which are developed within the research area of XAI. We showed this by
training a deep convolutional neural network on the classification problem, applying a range of
feature attribution methods and comparing the results to our NGT explanations using various
distance metrics. This practically eliminates the need for a domain expert assessment, which is
still the most prevalent way to judge the quality of XAI methods. We found that, among the
XAI methods we looked into, Guided Grad-CAM was closest to NGT according to our metrics.</p>
      <p>
        With the application we showed in this paper, we have barely scratched the surface of
possible use cases for our data set. Our NGT explanations may be used to guide model training
to obtain more interpretable models [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. This applies to all gradient-based methods that we
used for our experiments. By using an oracle instead of a trained model, one could evaluate
the performance of black-box XAI methods, factoring out the model performance. Instead of
varying the explanation method, one could vary the model type in order to investigate the
overall decision-making strategies of different models and architectures.
      </p>
      <p>The three classes we chose (no check, check and checkmate) are more or less arbitrary, and
any other class that is deterministically decidable could be used in a similar way, e.g. which
player has more material on the board, which player controls more squares, etc. Also it is
easily conceivable to apply our method to other games or rule-based systems in general as a
similar data source. Board games such as Go or Chinese Chess are straightforward examples.
Exploring more such NGT explanations from various domains could lead to even more thorough
automated performance evaluation of XAI methods.</p>
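      <p>As a sketch of one such alternative labeling, the following plain-Python function decides which player has more material from the piece placement field of a FEN string. The piece values (Pawn 1, Knight 3, Bishop 3, Rook 5, Queen 9) are the conventional ones and are an assumption here, as are the function name and the label encoding.</p>
      <preformat><![CDATA[
```python
# Conventional piece values (an assumption); the King is not counted.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}


def material_label(fen):
    """Return 0 if White has more material, 1 if Black has more,
    and 2 if both sides are equal (hypothetical label encoding)."""
    white = black = 0
    for char in fen.split()[0]:
        # Kings, digits and "/" are not in the table and count as 0.
        value = PIECE_VALUES.get(char.lower(), 0)
        if char.isupper():
            white += value
        else:
            black += value
    if white > black:
        return 0
    if black > white:
        return 1
    return 2


start_label = material_label(
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
```
]]></preformat>
      <p>A natural NGT explanation for this hypothetical task would be the mask of all occupied squares, or only those of the pieces that account for the material difference.</p>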
      <sec id="sec-4-1">
        <title>4.1. Using our Data Set</title>
        <p>We strongly encourage active use of our data set for evaluating XAI methods or any other
research project. We have published our data set along with some code for processing and
feature-extraction on https://www.kaggle.com/datasets/smuecke/chess-xai-benchmark.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research has been funded by the Federal Ministry of Education and Research of Germany
and the state of North Rhine-Westphalia as part of the Lamarr-Institute for Machine Learning
and Artificial Intelligence, LAMARR22B. This work has further been supported by Deutsche
Forschungsgemeinschaft (DFG), as part of the Collaborative Research Center SFB 876 “Providing
Information by Resource-Constrained Analysis”, project A1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , et al.,
          <article-title>Language Models are Few-Shot Learners</article-title>
          ,
          <source>in: Proceedings of NeurIPS 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <article-title>Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity</article-title>
          ,
          <source>arXiv preprint arXiv:2101.03961</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Norick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>LeGresley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajbhandari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Casper</surname>
          </string-name>
          , et al.,
          <article-title>Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model</article-title>
          ,
          <source>arXiv preprint arXiv:2201.11990</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>25</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          ,
          <source>Nature</source>
          <volume>521</volume>
          (
          <year>2015</year>
          )
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. R.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dua</surname>
          </string-name>
          (Eds.),
          <source>Machine Learning in Healthcare Informatics</source>
          , volume
          <volume>56</volume>
          <source>of Intelligent Systems Reference Library</source>
          , Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esteva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Robicquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ramsundar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuleshov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>DePristo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chou</surname>
          </string-name>
          , et al.,
          <article-title>A guide to deep learning in healthcare</article-title>
          ,
          <source>Nature Medicine</source>
          <volume>25</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Linardatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Papastefanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kotsiantis</surname>
          </string-name>
          ,
          <article-title>Explainable ai: A review of machine learning interpretability methods</article-title>
          ,
          <source>Entropy</source>
          <volume>23</volume>
          (
          <year>2020</year>
          )
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
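          <string-name>
            <given-names>A. J.</given-names>
            <surname>Hoane</surname>
            <suffix>Jr</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>F.-h.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Deep Blue</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>134</volume>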
          (
          <year>2002</year>
          )
          <fpage>57</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps</article-title>
          ,
          <source>arXiv preprint arXiv:1312.6034</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Molnar</surname>
          </string-name>
          ,
          <source>Interpretable Machine Learning</source>
          ,
          <year>2020</year>
          . URL: https://christophm.github.io/interpretable-ml-book/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>128</volume>
          (
          <year>2019</year>
          )
          <fpage>336</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Striving for simplicity: The all convolutional net</article-title>
          ,
          <source>in: Workshop Track Proceedings of ICLR</source>
          <year>2015</year>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundararajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Axiomatic attribution for deep networks</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          <year>2017</year>
          ,
          <publisher-name>JMLR.org</publisher-name>
          ,
          <year>2017</year>
          , pp.
          <fpage>3319</fpage>
          -
          <lpage>3328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Visualizing and understanding convolutional networks</article-title>
          ,
          <source>in: Proceedings of ECCV 2014, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Gandomi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holzinger</surname>
          </string-name>
          ,
          <article-title>Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics</article-title>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bodria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naretto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rinzivillo</surname>
          </string-name>
          ,
          <article-title>A Survey Of Methods For Explaining Black-Box Models</article-title>
          <volume>51</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nauta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trienes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlötterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <article-title>From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI</article-title>
          ,
          <source>Technical Report 1</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.-p.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>On quantitative aspects of model interpretability</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nauta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Trienes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlötterer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <article-title>From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI</article-title>
          ,
          <source>arXiv preprint arXiv:2201.08164</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ismail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Gunady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Bravo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feizi</surname>
          </string-name>
          ,
          <article-title>Benchmarking Deep Learning Interpretability in Time Series Predictions</article-title>
          ,
          <source>in: NeurIPS</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sanchez-Lengeling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McCloskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Colwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wiltschko</surname>
          </string-name>
          ,
          <article-title>Evaluating Attribution for Graph Neural Networks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <publisher-name>Curran Associates, Inc.</publisher-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tritscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlör</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          ,
          <article-title>Evaluation of Post-hoc XAI Approaches Through Synthetic Tabular Data</article-title>
          ,
          <source>in: Foundations of Intelligent Systems</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>422</fpage>
          -
          <lpage>430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Julian Tritscher</surname>
          </string-name>
          , Fabian Gwinner, Daniel Schlör, Anna Krause,
          <article-title>Open ERP System Data For Occupational Fraud Detection Julian</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <article-title>Metrics for explainable AI: Challenges and prospects</article-title>
          ,
          <source>arXiv preprint arXiv:1812.04608</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Beckh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Toborek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Rueden</surname>
          </string-name>
          ,
          <article-title>Explainable machine learning with prior knowledge: an overview</article-title>
          ,
          <source>arXiv preprint arXiv:2105.10172</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>