
Check for Trustworthy AI

Sascha Mücke (sascha.muecke@tu-dortmund.de), Lukas Pfahler (lukas.pfahler@tu-dortmund.de)

AI Group, TU Dortmund University, Dortmund, Germany

LWDA'22: Lernen, Wissen, Daten, Analysen

Keywords: Explainable AI, Trustworthy AI, Convolutional Neural Networks, Chess

Methods of Explainable AI (XAI) try to illuminate the decision-making process of complex Machine Learning models by generating explanations. However, for most real-world data there is no “ground-truth” explanation, which makes evaluating the correctness of XAI methods and model decisions difficult. Often visual assessment or anecdotal evidence is the only type of evaluation. In this work we propose to use the game of chess as a source of “near ground-truth” (NGT) explanations, which XAI methods can be compared against using various metrics, serving as a “sanity check”. We demonstrate this process in an experiment with a deep convolutional neural network, to which we apply a range of commonly used XAI methods. As our main contribution, we publish our data set of 30 million chess positions along with their NGT explanations for free use in XAI research.

1. Introduction

https://www-ai.cs.tu-dortmund.de/PERSONAL/muecke.html (S. Mücke); https://www-ai.cs.tu-dortmund.de/PERSONAL/pfahler.html (L. Pfahler)

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[...] the position is called check. If, in addition, the player cannot prevent their King from being captured on the next move, the position is called checkmate, and the game is over.

Chess is strictly rule-based, yet highly complex regarding the number of possible game positions. It has historical significance for AI, being perceived as an important milestone in approaching (and surpassing) human-level intelligence, which was reached when IBM’s chess computer Deep Blue defeated then world chess champion Garry Kasparov in 1997 [9]. Today, chess computers are far stronger than any human player.

However, in this paper we follow a different path and use chess as a data source for an XAI benchmark. To this end, we extract millions of chess positions from a large database of real chess games between human players. We sort all positions into one of three classes (no check, check or checkmate) and generate several binary 8 × 8 masks each, indicating which squares are important to explain the position at hand (see fig. 1, which will be explained in detail in section 2). We call these near ground-truth (NGT) explanations, as there are numerous equally valid representations of an explanation, and we select some of them.

Specifically, in this work we restrict the explanations to binary 8 × 8 masks, which highlight certain squares. The advantage is that feature attributions produced by common XAI methods can easily be compared to these masks by means of distance metrics such as Euclidean or cosine. If one finds a high correspondence between XAI output and any of our NGT explanations, one can be more confident that (i) their model pays attention to useful features, and (ii) the XAI method at hand works as intended.

1.1. Related Work

Explainable Machine Learning is a very active field of research, and a multitude of post-hoc model explanation methods has been proposed [10, 11, 12, 13, 14, 15]. We should note that evaluating explanations is a non-trivial task on its own [16, 17, 18]. Visual explanations and example-based explanations [19] in particular are often subjective, and defining quantitative metrics for them is difficult. Datasets where expert explanations are provided as an additional layer of annotation may allow quantitative evaluations. As such, the benchmark we propose can be classified as a “Controlled Synthetic Data Check” according to [20], though our data is not synthetic but is taken from actual chess games played by humans.

However, it is not always apparent that there is a single, unique explanation. Synthetic data can ensure this. For instance, Ismail et al. use synthetic time-series data [21] and Sanchez-Lengeling et al. [22] provide a set of synthetic graph-classification tasks where labeled sets of nodes are responsible for the target variable. Tritscher et al. generate datasets of synthetic categorical features and use randomly generated boolean functions as labeling functions with 3 to 10 important features [23]. In these checks, the XAI methods should identify only truly relevant parts of the input as relevant. More recently, Tritscher et al. provide a more realistic dataset for fraud detection where domain experts have manually annotated the “problematic” features of simulated fraud cases [24].

Our contribution is somewhat more general, as we provide a dataset of raw chess positions for a generic classification task, which can be approached with a wide variety of feature extraction methods and model types. Also, to the best of our knowledge, the domain of chess has not yet been explored as an XAI benchmark.

Other related research focuses on evaluating how well XAI methods perform from a user’s perspective [25]. We focus instead on the objective similarity between generated and NGT explanations, which is independent of human users.

1.2. Structure of this Paper

In section 2 we introduce our dataset of chess positions and NGT explanations, detailing our pre-processing steps and our reasoning behind the specific explanations we chose. In section 3 we train an ML model to classify the chess positions, apply post-hoc XAI methods to the trained model and compare the output to our NGT explanations using various metrics. Finally, in section 4 we discuss other possible applications of our data set, and further research directions we identified. Information about how to access our data set can be found there, as well.

2. Data Set

Our data set is based on the Lichess open database¹, which contains records of over 3 billion games of chess played online by human players on the free chess website lichess.org. The records are saved in Portable Game Notation (PGN) format, containing the series of moves played as well as meta information, such as player ratings, timings and computer evaluations. The database is split into downloadable files for each month since January 2013. For our data set, we arbitrarily chose May 2021, as it contains over 100 million games. To read and process the games and to create the explanations, we used the Python package chess².

¹ https://database.lichess.org/
² https://python-chess.readthedocs.io/en/latest/

[Figure: chess board diagrams of example positions]

We selected only those games that end in checkmate, excluding those that end by timeout or resignation. In a first pre-processing step, we iterate through the games and extract random positions that occur during the game. Each position falls into one of three classes: no check (0), check (1) or checkmate (2).

There are many more non-check positions than check positions in a typical game of chess, and again many more checks than checkmates. To approximately balance the classes while iterating over the positions, we accept each position at random with a probability inversely proportional to the observed proportion of its class. Moreover, we skip the first ten moves of each game, as there is considerable overlap between games within the first few moves, which would lead to a large number of duplicate data points. In total, we extracted 30 million chess positions. The data set is summarized in table 1.
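The balancing step can be illustrated with a small rejection-sampling sketch. It is not the code used to build the data set: the class counts are made-up numbers, the function names are hypothetical, and it assumes the class proportions are known before sampling, whereas the text describes estimating them while iterating.

```python
import random
from collections import Counter


def acceptance_probabilities(class_counts):
    """Acceptance probability per class, inversely proportional to its
    observed frequency; the rarest class is always accepted."""
    rarest = min(class_counts.values())
    return {label: rarest / count for label, count in class_counts.items()}


def balanced_sample(labels, probs, rng):
    """Indices of the positions kept by independent accept/reject draws."""
    return [i for i, label in enumerate(labels) if rng.random() < probs[label]]


# Hypothetical class counts (0: no check, 1: check, 2: checkmate).
counts = {0: 9000, 1: 900, 2: 100}
probs = acceptance_probabilities(counts)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
labels = [label for label, n in counts.items() for _ in range(n)]
kept = balanced_sample(labels, probs, rng)
kept_counts = Counter(labels[i] for i in kept)
print(probs)
print(kept_counts)
```

Each class ends up with roughly as many samples as the rarest class, which is why the balance is only approximate.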

The data is saved as a CSV file containing the chess positions in Forsyth–Edwards Notation (FEN) and the label (0-2) as columns. The FEN string can be read by most chess software packages and encodes the current piece setup, whose turn it is and some more game-specific information (castling rights, en-passant squares).

2.1. Explanations

For each position we generate explanations consisting of 8 × 8 bit masks that can be laid over the chess board, highlighting certain squares. For each class, we identified one explanation type that characterizes it most accurately:

• No check (0): All squares that are controlled by the enemy player, i.e., all squares that any enemy piece can move to or capture on. The fact that the King is not on any of these squares proves that the position is not check.

• Check (1): All squares (origin or target) of legal moves. As a checkmate is a check where the player under attack has no more legal moves, highlighting legal moves is sufficient to disprove a checkmate. Note that the piece giving check is only highlighted when it can be legally captured.

• Checkmate (2): All squares with pieces that are essential for creating the checkmate. This includes attackers, friendly pieces blocking the King, enemy pieces guarding escape squares and enemy pieces protecting attackers.

Examples of explanations for 0 and 1 can be seen in fig. 2, for 2 in fig. 3. As discussed in section 1, none of these explanations is the perfect explanation. For instance, the explanation given for class 0 is also a valid explanation for class 1, and vice versa. However, in this work we aim to provide a variety of explanations with different semantics, and the choice of which explanation to apply to which class is up to the users of our data set. We therefore generate each type of explanation mentioned above for each data point, regardless of its class.

We save the explanations in a CSV file as unsigned integers representing 64-bit binary masks. Using the SquareSet class from the chess package, these integers can be converted back to binary masks (or other representations). Along with our data set, we provide code to convert between different representations.
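For readers without the chess package at hand, the integer encoding can also be decoded with plain bit operations. The sketch below assumes the bit layout used by SquareSet, where bit 0 is a1, bit 7 is h1 and bit 63 is h8. That the published masks follow this layout is an inference from the text, and the helper names are hypothetical.

```python
def mask_to_board(mask):
    """Decode a 64-bit integer into 8 rows of 8 bits, with rank 8 first
    so that the rows read like a board seen from White's side."""
    return [[(mask >> (rank * 8 + file)) & 1 for file in range(8)]
            for rank in range(7, -1, -1)]


def board_to_mask(rows):
    """Inverse of mask_to_board: pack 8 rows (rank 8 first) into an integer."""
    mask = 0
    for row_index, row in enumerate(rows):
        rank = 7 - row_index
        for file, bit in enumerate(row):
            if bit:
                mask |= 1 << (rank * 8 + file)
    return mask


def square_names(mask):
    """List the highlighted squares in algebraic notation, a1 first."""
    return ["abcdefgh"[i % 8] + str(i // 8 + 1)
            for i in range(64) if (mask >> i) & 1]


example = (1 << 4) | (1 << 63)  # squares e1 and h8 under the assumed layout
board = mask_to_board(example)
print(square_names(example))
for row in board:
    print(row)
```

The row-major 8 × 8 form is convenient for comparing a mask directly against per-square feature attributions.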

3. Experiments

We transformed the FEN representation of each chess position into a 12 × 8 × 8 binary feature tensor $X$, where the first dimension represents all combinations of 6 chess piece types and 2 colors, and the remaining dimensions are board coordinates. Feature $X_{p,i,j} = 1$ if there is a piece of type $p$ on square $(i, j)$, and 0 otherwise. We used this representation because it contains a minimal amount of domain knowledge, though other representations are possible, too (e.g., 8 × 8 categorical features).
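A minimal sketch of such an encoding is given below, using only the piece-placement field of the FEN string. The channel order in `PIECES` and the choice to map rank 8 to row 0 are assumptions made for illustration. The source does not specify them.

```python
# Assumed channel order: white pieces first, then black (not from the paper).
PIECES = "PNBRQKpnbrqk"


def fen_to_tensor(fen):
    """Encode the piece placement of a FEN string as a 12x8x8 nested list
    of 0/1 values; row 0 is rank 8 and column 0 is file a."""
    placement = fen.split()[0]
    tensor = [[[0] * 8 for _ in range(8)] for _ in range(12)]
    for row, rank_str in enumerate(placement.split("/")):
        col = 0
        for ch in rank_str:
            if ch.isdigit():
                col += int(ch)  # a digit encodes a run of empty squares
            else:
                tensor[PIECES.index(ch)][row][col] = 1
                col += 1
    return tensor


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
x = fen_to_tensor(start)
total_pieces = sum(sum(sum(r) for r in channel) for channel in x)
print(total_pieces)
```

The remaining FEN fields (side to move, castling rights, en-passant square) are ignored here, so this particular sketch discards that information.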

We train a convolutional neural network with 4 convolution layers with a total of 782,851 trainable parameters on the given 3-class decision problem. A detailed description can be found in appendix A. The convolutions use filter widths 7, 5 and 3, respectively, with centered padding. The larger filters help the network detect how far-away positions interplay. We add weight decay of 0.001 to combat overfitting in light of these large filter matrices. We use the Adam optimizer for a total of 400,000 weight updates with an initial learning rate of 0.001 that is reduced by a factor of 0.1 four times during training. The final model achieves a classification accuracy of 99.02% on 10,000 hold-out examples. The training accuracy also converges at 99%. It is not entirely surprising that a sufficiently large deep network can learn a deterministic function given millions of data points.

We want to stress that the point of this work is not to propose to use ML to classify chess positions. A deterministic, polynomial-time algorithm can do that just fine, such as the one included in the chess Python package we used. Instead we want to explore how some commonly used post-hoc explanation approaches applied to our convolutional neural network explain the model’s decision, showcasing a possible use case of our data set. Because we have domain knowledge on the rules of chess and access to the true decision boundary, we can inspect the explanations and see if they are sound and in concordance with the rules of chess. In addition, our NGT explanations allow us to compute distances, reducing the necessity of subjective evaluation through domain experts.

The explanation methods we consider are Saliency maps and Gradient times Input [10], Feature Permutation [11], Guided Grad-CAM [12], Guided Backprop [13], Integrated Gradient [14] and Occlusion [15]. All these methods are implemented in the Captum software library for explainable machine learning. We apply Guided Grad-CAM to the first and last convolution layers of our model, respectively.

[Figure: feature attributions over an example position, with panels (a) Integrated Gradient, (b) Saliency Map, (c) Input times Gradient, (d) Guided Backpropagation]

As distance measures, we choose Chebyshev (ℓ∞), correlation, cosine and Euclidean (ℓ2) distance. To make the 12 × 8 × 8 feature attributions compatible with our 8 × 8 NGT explanations, we take the absolute maximum ‖⋅‖∞ over the first dimension, which gives us the strongest attribution found for each square.
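The reduction and the four distance measures can be sketched in plain Python as follows. The formulas follow the common conventions (cosine distance is one minus cosine similarity, and correlation distance is one minus the Pearson correlation). Which implementation was used for the reported numbers is not stated, so this is an assumption. The toy attribution values are invented.

```python
import math


def reduce_attribution(attr):
    """Collapse a 12x8x8 attribution to 64 values (row-major) by taking the
    largest absolute value over the 12 piece channels for each square."""
    return [max(abs(channel[i][j]) for channel in attr)
            for i in range(8) for j in range(8)]


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """1 - cosine similarity; raises ZeroDivisionError for an all-zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def correlation(u, v):
    """1 - Pearson correlation, i.e. cosine distance of the centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


# Toy attribution: two channels disagree in sign on one square.
attr = [[[0.0] * 8 for _ in range(8)] for _ in range(12)]
attr[0][0][0] = -2.0
attr[3][0][0] = 1.0
attr[5][7][7] = 0.5
reduced = reduce_attribution(attr)

ngt = [0.0] * 64  # flattened 8x8 NGT mask highlighting one square
ngt[0] = 1.0

for name, fn in [("chebyshev", chebyshev), ("euclidean", euclidean),
                 ("cosine", cosine), ("correlation", correlation)]:
    print(name, fn(reduced, ngt))
```

Because the reduction discards the sign of the attribution, a strongly negative attribution counts as much as a strongly positive one when it is compared against the binary mask.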

We apply each distance measure to each combination of XAI method and class-specific NGT explanation (as described in section 2).

3.1. Results

We report the results of our experiment in tables 2 to 4. Cosine distance turned out to be almost equal to 1 in all cases, which is why we dropped it from the result tables. Although there is no clear overall trend, Guided Grad-CAM is often closest to NGT among the XAI methods we tested on classes 0 and 1. On class 2, Integrated Gradient is closest, though Guided Grad-CAM also performs very well.

Figures 4 and 5 show some exemplary visualizations of explanations generated by various methods. In fig. 4 we observe high attributions in the general vicinity of the King and its attacker (the Queen on h5), although important squares that are contained in our NGT explanation (Pawns on f4 and g2 blocking the King’s escape squares) have no significant attribution. A possible explanation is that XAI methods may recognize that the Queen is often involved in checkmate positions when it is in the King’s vicinity, but fail to capture the more subtle rules of the game, which make the Pawns on f4 and g2 essential for this checkmate. To understand why they are essential, one has to understand the movement patterns of both the King and the Pawn, while also looking one move ahead. Understanding the Queen attacking the King requires only the movement pattern of one piece, and no looking ahead, which makes it arguably easier to recognize as a reason for check or checkmate.

Figure 5 shows attributions that differ radically from our NGT, indicating that the model uses other criteria to judge that a position is not a check. This is not entirely surprising, as explaining why a position is not a check is far more difficult and vague than showing the opposite.

4. Conclusion

We have presented a data set of 30 million chess positions falling into one of three classes (no check, check or checkmate), for which we generated three different types of explanations. These explanations are based on domain knowledge and can be considered “near ground-truth”, as the game of chess has perfect information, and every position can be classified according to a fixed set of rules. Such a data set of explanations is a valuable benchmark for evaluating post-hoc explanation methods which are developed within the research area of XAI. We showed this by training a deep convolutional neural network on the classification problem, applying a range of feature attribution methods and comparing the results to our NGT explanations using various distance metrics. This practically eliminates the need for a domain expert assessment, which is still the most prevalent way to judge the quality of XAI methods. We found that, among the XAI methods we looked into, Guided Grad-CAM was closest to NGT according to our metrics.

With the application we showed in this paper, we have barely scratched the surface of possible use cases for our data set. Our NGT explanations may be used to guide model training to obtain more interpretable models [26]. This applies to all gradient-based methods that we used for our experiments. By using an oracle instead of a trained model, one could evaluate the performance of black-box XAI methods, factoring out the model performance. Instead of varying the explanation method, one could vary the model type in order to investigate the overall decision-making strategies of different models and architectures.

The three classes we chose (no check, check and checkmate) are more or less arbitrary, and any other class that is deterministically decidable could be used in a similar way, e.g. which player has more material on the board, which player controls more squares, etc. Also it is easily conceivable to apply our method to other games or rule-based systems in general as a similar data source. Board games such as Go or Chinese Chess are straightforward examples. Exploring more such NGT explanations from various domains could lead to even more thorough automated performance evaluation of XAI methods.
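As an illustration of such an alternative, deterministically decidable class, the following sketch labels a position by which side has more material, reading only the FEN piece-placement field. The piece values (1, 3, 3, 5, 9) are the conventional ones and the label encoding is hypothetical. Neither is taken from our data set.

```python
# Conventional piece values; the King is not counted as material.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}


def material_label(fen):
    """Return 0 if material is equal, 1 if White has more, 2 if Black has more
    (hypothetical label encoding)."""
    placement = fen.split()[0]
    white = sum(PIECE_VALUES[c.lower()] for c in placement if c.isupper())
    black = sum(PIECE_VALUES[c] for c in placement if c.islower())
    if white == black:
        return 0
    return 1 if white > black else 2


print(material_label("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"))
print(material_label("4k3/8/8/8/8/8/8/4KQ2 w - - 0 1"))
```

A natural explanation mask for this class would highlight the squares of the pieces that account for the material difference, although other choices are equally conceivable.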

4.1. Using our Data Set

We strongly encourage active use of our data set for evaluating XAI methods or any other research project. We have published our data set along with some code for processing and feature-extraction on https://www.kaggle.com/datasets/smuecke/chess-xai-benchmark.

Acknowledgments

This research has been funded by the Federal Ministry of Education and Research of Germany and the state of North Rhine-Westphalia as part of the Lamarr-Institute for Machine Learning and Artificial Intelligence, LAMARR22B. This work has further been supported by Deutsche Forschungsgemeinschaft (DFG), as part of the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, project A1.

References

[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, et al., Language Models are Few-Shot Learners, in: Proceedings of NeurIPS 2020, 2020.

[2] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, arXiv preprint arXiv:2101.03961 (2021).

[3] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, et al., Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model, arXiv preprint arXiv:2201.11990 (2022).

[4] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems 25 (2012).

[5] Y. LeCun, Y. Bengio, G. E. Hinton, Deep learning, Nat. 521 (2015) 436-444.

[6] S. Dua, U. R. Acharya, P. Dua (Eds.), Machine Learning in Healthcare Informatics, volume 56 of Intelligent Systems Reference Library, Springer, 2014.

[7] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, et al., A guide to deep learning in healthcare, Nature medicine 25 (2019).

[8] P. Linardatos, V. Papastefanopoulos, S. Kotsiantis, Explainable ai: A review of machine learning interpretability methods, Entropy 23 (2020) 18.

[9] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep blue, Artificial intelligence 134 (2002) 57-83.

[10] K. Simonyan, A. Vedaldi, A. Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, arXiv preprint arXiv:1312.6034 (2014).

[11] C. Molnar, Interpretable Machine Learning, 2020. URL: https://christophm.github.io/interpretable-ml-book/.

[12] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, International Journal of Computer Vision 128 (2019) 336-359.

[13] J. T. Springenberg, A. Dosovitskiy, T. Brox, M. A. Riedmiller, Striving for simplicity: The all convolutional net, in: Workshop Track Proceedings of ICLR 2015, 2015.

[14] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of ICML 2017, JMLR.org, 2017, pp. 3319-3328.

[15] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of ECCV 2014, Lecture Notes in Computer Science, Springer, 2014.

[16] J. Zhou, A. H. Gandomi, F. Chen, A. Holzinger, Evaluating the Quality of Machine Learning Explanations: A Survey on Methods and Metrics (2021) 1-19.

[17] F. Bodria, F. Giannotti, R. Guidotti, F. Naretto, D. Pedreschi, S. Rinzivillo, A Survey Of Methods For Explaining Black-Box Models 51 (2021) 1-33.

[18] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. van Keulen, C. Seifert, From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI, Technical Report 1, 2022.

[19] A.-p. Nguyen, M. R. Martínez, On quantitative aspects of model interpretability, Technical Report, 2020.

[20] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. van Keulen, C. Seifert, From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI, arXiv preprint arXiv:2201.08164 (2022).

[21] A. A. Ismail, M. K. Gunady, H. C. Bravo, S. Feizi, Benchmarking Deep Learning Interpretability in Time Series Predictions, in: NeurIPS 2020, 2020.

[22] B. Sanchez-Lengeling, J. Wei, B. Lee, E. Reif, P. Y. Wang, W. Qian, K. McCloskey, L. Colwell, A. Wiltschko, Evaluating Attribution for Graph Neural Networks, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1-13.

[23] J. Tritscher, M. Ring, D. Schlör, L. Hettinger, A. Hotho, Evaluation of Post-hoc XAI Approaches Through Synthetic Tabular Data, in: Foundations of Intelligent Systems, Springer International Publishing, Cham, 2020, pp. 422-430.

[24] J. Tritscher, F. Gwinner, D. Schlör, A. Krause, A. Hotho, Open ERP System Data For Occupational Fraud Detection, Technical Report, 2022.

[25] R. R. Hoffman, S. T. Mueller, G. Klein, J. Litman, Metrics for explainable AI: Challenges and prospects, arXiv preprint arXiv:1812.04608 (2018).

[26] K. Beckh, S. Müller, M. Jakobs, V. Toborek, H. Tan, R. Fischer, P. Welke, S. Houben, L. von Rueden, Explainable machine learning with prior knowledge: an overview, arXiv preprint arXiv:2105.10172 (2021).