Towards Automated Human-Centered Recommendation of Explainable AI Solutions

Nils Ole Breuer1,2,*, Sahin Albayrak1,2
1 GT-ARC gGmbH, Ernst-Reuter-Platz 7, 10587 Berlin
2 DAI-Labor, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin

Abstract
Finding a suitable XAI method for a specific use case among the many available options is a non-trivial task. Recommendation algorithms for XAI methods exist, but they are often based solely on the underlying ML model's technical characteristics and do not consider the needs of the target group. Moreover, these systems often recommend only off-the-shelf XAI methods, which frequently fail to achieve the desired explanatory goal. We therefore introduce an automated recommendation framework that tackles both of these problems. On the one hand, we created a low-threshold process in which the needs of the target group can be captured in natural language. On the other hand, we recommend XAI Solutions that include both a suitable XAI method and actionable human-centered design guidelines, which describe how the explanation should be adjusted to be useful for the target group. Our recommendation framework consists of a customized GPT that offers suitable XAI Solutions based on the given design principles and an XAI database. We evaluate our recommendation framework in two real-world scenarios. The evaluation shows that it can generate human-centered XAI Solutions that meet the needs of the target group.

Keywords
Explainable AI, XAI, Recommendation framework, Interactive explanations

Multimodal, Affective and Interactive eXplainable AI Workshop, 27th European Conference on Artificial Intelligence
* Corresponding author.
nils.breuer@gt-arc.com (N. O. Breuer)

1. Introduction

The advent of accessible AI technology presents a significant opportunity for individuals without a technical background to realize their ideas for AI systems. While this democratization of AI signifies a beneficial advancement, as it fosters the creation of new interdisciplinary AI systems incorporating a diverse range of ideas, it also introduces risks. At the same time, it emphasizes the need for explainable and transparent AI. This underlines the central role of explainable AI (XAI) in ensuring transparency and control. Despite the potential benefits of XAI, there are two challenges for applying XAI methods. On the one hand, it is very difficult to select the most suitable method for a specific use case from the large and confusing range of XAI methods. On the other hand, off-the-shelf XAI methods are usually not easily applicable, as explanations for AI systems usually have to be generated in a very context-specific way and the explanations have to be adapted to the specific target group to be interpretable. Consequently, the laudable intentions to make the AI system as transparent and explainable as possible usually fail because an unsuitable XAI method was selected or the target group does not understand the explanation, as the helpfulness in specific use cases is questionable [1, 2, 3, 4, 5, 6].
Our analysis reveals two problem areas in the practical application of XAI methods: first, the selection of an XAI method, and second, the adaptation of the explanation to a specific target group. There are solutions for both problem areas. Recommendation algorithms exist to support the search process for a suitable XAI method. However, these recommendations are mostly based on the technical details of the ML model and not on the audience to whom the explanation is directed [7, 8]. Additionally, there is an awareness among researchers in the XAI field that explanations should be tailored to the audience to be understandable and useful. The research direction of human-centered XAI (HCXAI) addresses this problem and has already produced useful concepts [9, 10, 11]. However, these concepts have rarely been formulated as practical, actionable guidelines that can be quickly and easily applied to an explanation.

In this work, we aim to connect and address these two problems with a new human-centered recommendation framework for XAI Solutions. The objective of our recommendation framework is twofold. First, we want to enable practitioners without extensive experience in XAI to easily access a suitable XAI Solution. This requires a low-threshold and interactive process to gather relevant information. For this, we implemented a natural language-based procedure to collect information about the target group, i.e., the explainees, and also the technical characteristics of the AI model to be explained. Second, we want to recommend XAI Solutions, which we define not merely as off-the-shelf XAI methods: we use the information about the target group to generate a combination of a suitable XAI method and actionable human-centered design guidelines, based on HCXAI concepts, that make the explanation as understandable as possible for the target group. For this, we formulate human-centered design principles based on theories of HCXAI and the social sciences. Additionally, we build an XAI database with the most relevant XAI methods. We create a custom GPT for the recommendation process, using the options provided by OpenAI's ChatGPT platform. The custom GPT uses knowledge from our human-centered design principles and from our XAI database to recommend a custom XAI Solution according to our definition. In summary, our work provides the following contributions:

• We formulate human-centered design principles that are based on socio-cognitive theory and HCXAI [12, 13] with which explanations of XAI methods can be easily adapted.
• We propose an automated process that allows a wide range of individuals, including non-XAI-experts, to receive recommendations for an XAI Solution for their use case, grounded in human-centered principles.
• We create a custom GPT that can use pre-defined knowledge to generate customized XAI Solutions for a specific target group.

2. Related Work

2.1. Human-Centered XAI (HCXAI)

Many explanations of XAI approaches currently rely on an algorithm-centric perspective and are therefore based on the intuitions and explanatory objectives of XAI researchers. These perspectives and explanatory objectives diverge significantly from the requirements of layperson-friendly explanations [11]. This leads to a dissonance between the theory of XAI methods and their practical application [14]. For this reason, researchers posit that in crafting explanations for AI systems, primacy should be accorded to the human recipient for whom the explanation is intended [15].
This concept led to the research field of human-centered XAI (HCXAI). The idea that many HCXAI scientists pursue is that explanations will be more understandable and useful if they correspond to the social and cognitive processes of human beings. This human-centered perspective also builds on the article of Miller [12], which examines socio-cognitive theories and XAI through extensive analysis. HCXAI has developed many theories and principles to date, as can be seen from the sheer number of recently published articles [10, 11, 16, 17, 18, 9]. These theories and principles have already been put into practice in some studies [19, 20, 21, 22, 23].

2.2. XAI Method Recommendation

The problem of selecting a suitable XAI method is ubiquitous in the literature. The usual approach is to use one of the many available XAI libraries [24, 25, 26] and then select a method that fits the model architecture. Another option is to compile a selection of XAI methods, analyze them using technical evaluation methods, and select the "best" one [8, 27]. These processes require a high level of expertise and are demanding even for data scientists and machine learning engineers. Based on interviews with data scientists, Retzlaff et al. [28] introduce a decision tree-based approach for selecting the most suitable XAI method for a specific use case. The decision tree enables data scientists to understand the tradeoffs between different XAI methods and also shows the user how methods can be combined to ensure the best possible use. One of the few works that aim to automate the recommendation process is the AutoXAI framework by Cugny et al. [7]. In this framework, a user can specify different context variables, technical data about the ML model, and the data set. The XAI Question Bank from Liao et al. [16] is then used to select a suitable XAI method. In addition, Caro-Martínez et al. [29] created a holistic platform for the recommendation of personalized XAI experiences based on a case-based reasoning approach with an ontology. For Caro-Martínez et al. [29], an explanation experience consists of a solution to an explainability problem and an evaluation of the proposed solution. Studies that focus more on the needs of the target group, such as [30, 31], offer helpful analyses of the target group and also conceptual approaches to what should be considered to recommend an XAI method in a human-centered way. The actual matching and recommendation process in these studies is left to future work.

Figure 1: Our Recommendation Framework: On the left side is the input (information about the target group) that the user provides to the recommendation GPT (turquoise), in the middle is the custom recommendation GPT (blue) that uses our formulated actionable human-centered design principles (3.1) and the XAI method database (3.3), and on the right is the final XAI Solution.

3. Recommending Human-Centered XAI Solutions

In this section, we delve deeper into the individual components of our recommendation framework. First, we explain our human-centered design principles, then discuss how we gather information about the target group of the XAI Solution, how we map this information to XAI methods, and finally, how the individual components are processed through our recommendation procedure. Figure 1 shows an overview of all the components of our recommendation framework.
3.1. Human-Centered Design Principles

As already described in the introduction, existing XAI recommendation algorithms [7] are limited to simply recommending the raw XAI methods, mostly based on the technical characteristics of the AI application. Evaluation and user studies have shown that these are often misleading and do not lead to a causal understanding of the ML model [1, 6]. That is why we refer to XAI Solutions in this paper and not just XAI methods. As stated above, we define an XAI Solution as a comprehensive recommendation that encompasses not only an XAI method but also actionable human-centered design guidelines, which can be used to tailor the explanation to benefit the target group, leading to a deep understanding of the behavior of the ML algorithm.

From the insights of human-centered XAI research [11] and the social sciences [12], we derive design principles for four categories [13, 32, 23] that are relevant for understanding an explanation. The design principles are therefore a broad collection of theories, which are then adapted by the recommender to form actionable guidelines for the explainee target group. The guidelines for the four categories can then be used to easily tailor the explanation to be more interpretable for the target group.

• Communication. Social science theories indicate that explaining is a form of social dialogue [12]. User studies in the XAI domain also show that explainees primarily seek some form of social dialogue in explanations or prefer explanations to be supplemented with verbal descriptions [17].
• Interactivity. Studies have shown that the mere presentation of diagrams as explanations is an obstacle to interpretability, for both experts and non-experts [33]. Research in HCXAI indicates that it is preferable to design explanations for an AI system as an interactive process in which the behavior of the algorithm can be understood [5, 34].
• Selectivity / Complexity. The social sciences have shown that explanations are selective, which means that the explainer only selects the most important and relevant causes that are necessary to form an explanation [12]. To apply this to explainable AI, a way should be found to query the needs of the consumer of the explanation before the explanation is generated in order to select the most important parts of the explanation [22].
• Customizability. Studies have shown that if the person receiving the explanation has the opportunity to personalize the explanation to their level of knowledge, mental model, and preferences, it can have a positive impact on the understanding of the explanation [35]. Such personalizations could include, for example, adjusting the complexity or the form of presentation independently.

The full design principles for all categories can be found in Appendix A. These design principles form one of the knowledge files (see Figure 1, blue box) that the recommendation GPT uses to generate the actionable design guidelines of the XAI Solution.

3.2. Background Knowledge and Explanatory Needs of the Target Group

As can be seen from Figure 1, the recommendation GPT receives information about the background knowledge and the explanatory needs of the target group as input.

Background Knowledge. In many studies where a target group is characterized to create more personalized explanations, strict stakeholder groups are defined, each with specific needs for an explanation [36, 37]. However, this rigid classification is outdated due to the widespread use of AI applications in society.
Therefore, for our recommendation framework, we use a more detailed analysis to identify the needs of the target group. To do this, we reference the expertise definition of [38], which was adapted within a framework for XAI purposes by [31]. Following the framework of [31], we decompose expertise into types of knowledge in specific contexts. The types of knowledge relate to formal knowledge, e.g., familiarity with theories, instrumental knowledge, e.g., programming experience, and personal knowledge, e.g., information that a person knows from the media. To retrieve this information about the explainee group, we formulate open-ended questions that the user of the recommendation framework should answer in natural language. We provide exemplary answers for each question to guide the answers in the right direction. All questions can be found in Appendix B.1. By allowing users themselves to describe the target group in their own language, we generate a much more accurate description of the target group. However, through our example questions and answers, we still guide them in a specific direction that aligns with our design principles.

Explanatory Needs. Further important information about the target group includes the needs they have for an explanation. Explanatory needs can be divided into tasks that the target group wants to solve with the explanation and goals that the target group wants to achieve with an explanation [31]. Tasks correspond to low-level questions that the target group wants an answer for, e.g., which features are most important for the ML model? Goals are more high-level objectives that the target group wants to achieve, e.g., ensuring that the ML model complies with regulations and laws. Just as for the collection of background information, we formulated open questions that the user of the recommendation framework should answer in their own words. The questions and sample answers can be found in Appendix B.2.

3.3. XAI Database

In addition to the actionable human-centered design guidelines, XAI methods are also part of our XAI Solution. For this purpose, we created an XAI database with common XAI methods. For each method, the database defines the technical characteristics with which the method is compatible. Beyond that, we also match goals and tasks to the XAI methods so that the explanatory needs of the target group can be used to find a suitable XAI method. For example, the above-mentioned task (Which features are most important for the ML model?) is assigned to a feature attribution method.
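The paper does not publish the schema of this database (the methods.json knowledge file named in Appendix C). Purely as an illustration, a single entry and a simple task-based lookup might look like the sketch below; all field names and values are assumptions, not the authors' actual database.

```python
# Hypothetical sketch of one entry in the XAI database (methods.json).
# Field names and values are illustrative only; the real schema is not published.
from typing import Dict, List

shap_entry: Dict[str, object] = {
    "name": "SHAP",
    "type": "feature attribution",
    # Technical characteristics the method is compatible with
    "compatible_models": ["tree ensemble", "neural network", "model-agnostic"],
    "data_types": ["tabular", "image", "text"],
    # Explanatory needs (tasks and goals) the method can serve
    "tasks": ["Which features are most important for the ML model?"],
    "goals": ["error detection and improvement", "justify decisions"],
}

def candidate_methods(database: List[Dict], task_keyword: str) -> List[str]:
    """Return the names of methods whose listed tasks mention the requested keyword."""
    return [
        m["name"]
        for m in database
        if any(task_keyword.lower() in t.lower() for t in m["tasks"])
    ]

print(candidate_methods([shap_entry], "most important"))  # ['SHAP']
```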
3.4. Recommendation Procedure

Due to the high context specificity, the individual requirements of the target group, and also the many different types of ML models, it is very challenging to create a structured selection process for XAI methods and human-centered design principles. Many of our attempts with structured databases failed: we kept identifying new use cases with alternative ML models and specific target-group requirements for which no suitable XAI method with well-founded design guidelines could be found. This is why we have opted for a more open approach. This approach is reflected in the way the information is collected, namely in the form of guided free text.

To match the freely formulated explanatory needs and background knowledge of the target group with actionable human-centered design guidelines and a suitable XAI method from our XAI database, we utilize the remarkable association possibilities and few-shot learning capabilities of OpenAI's large language model GPT-4 [39]. We create a custom XAI Solution recommendation GPT. For this, we formulated a detailed instruction prompt explaining the recommender's procedure, and we also provided few-shot prompting examples. As shown in Figure 1, we additionally provide the recommender GPT with two so-called "knowledge files": our human-centered design principles (compare Section 3.1) and our XAI database (compare Section 3.3). As can be seen in Figure 1, the recommendation GPT considers the explanatory needs and the background knowledge of the target group. Furthermore, it also uses technical aspects of the ML model if they are mentioned in the answers of the user. Even though the information retrieval process is structured as a question-answering scheme, in which the users of the framework formulate the knowledge and explanatory needs of the target group by answering questions, the recommender GPT does not act as a chat agent. Rather, it receives all of the answers of the information retrieval process concatenated as one text section. Based on that input, it then uses the "knowledge files" to generate an individual XAI Solution consisting of actionable human-centered design guidelines and an XAI method. The detailed instruction prompt can be found in Appendix C.
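The recommender itself is configured as a custom GPT on OpenAI's ChatGPT platform rather than implemented in code. Purely as an illustration of the single-shot procedure described above, a rough, hypothetical equivalent using the chat completions API could look like the following sketch; only the knowledge-file names (HCDesignPrinciples.json, methods.json) come from Appendix C, while the prompt wording and file handling are placeholders.

```python
# Rough, hypothetical approximation of the recommendation step via the OpenAI API.
# The authors use a custom GPT on the ChatGPT platform; this only mimics the idea of
# combining an instruction prompt, the two knowledge files, and the concatenated answers.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

design_principles = Path("HCDesignPrinciples.json").read_text()
xai_database = Path("methods.json").read_text()
user_answers = Path("target_group_answers.txt").read_text()  # answers concatenated as one text

system_prompt = (
    "You are an assistant that gives human-centered XAI recommendations. "
    "Recommend design guidelines for communication, interactivity, selectivity, and "
    "customizability based on the design principles, plus XAI methods from the database.\n\n"
    f"Design principles:\n{design_principles}\n\nXAI database:\n{xai_database}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_answers},
    ],
)
print(response.choices[0].message.content)  # the generated XAI Solution
```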
4. Evaluation of XAI Solutions

We evaluate the capabilities of our recommendation GPT with two use cases. As there is still no standardized methodology for evaluating the output of LLMs, we evaluated the XAI Solutions of our recommendation GPT in three different ways. Human evaluation is still the gold standard for assessing the quality of LLM output. Because conducting an extensive human-evaluation study is time-consuming, we limited ourselves to a qualitative analysis of the generated XAI Solutions. In addition, we use two automated evaluation metrics to assess the quality of the design guidelines of the XAI Solutions: an embedding-based metric for semantic textual similarity called SemScore [40], and an LLM-based evaluation method called G-Eval [41].

4.1. Use Case 1: Oncology AI

The first scenario that we analyzed is that of a senior doctor working in the field of oncology. Recently, the hospital where she works decided to install a new type of AI algorithm to help detect early breast cancer. However, there are concerns among the doctors, as they have heard of other medical algorithms that have had negative effects. The doctors would therefore like to have an explanation for the algorithm's decision so that they can evaluate whether the decision is flawed and at the same time build confidence in working with the algorithm. The whole information retrieval process, together with the XAI Solution provided by our recommendation GPT, can be viewed in Appendix D.1. Figure 2 visualizes an excerpt of the information retrieval process for the formal knowledge of the target group.

4.2. Use Case 2: Extreme Weather AI

In the second scenario, a group of ML engineers is searching for an explanation for an ML algorithm with which extreme weather events can be predicted based on climate data. The problem is that the engineers do not have any formal or instrumental knowledge about the weather data. Therefore, they want to use the explanation to improve the model and also understand the climate data in more detail. Again, the whole fictitious interaction and the XAI Solution can be found in Appendix D.2. Figure 3 visualizes the information retrieval process for the instrumental knowledge and the explanatory needs of the target group.

Figure 2: Information retrieval of formal knowledge of the target group doctors.

Figure 3: Information retrieval of instrumental knowledge and explanatory needs of the target group ML engineers.

4.3. Qualitative Analysis

The XAI Solution (Appendix D.1, Figure 4) shows that for each of the four explanatory categories, the recommender GPT gives an actionable design recommendation. The recommendation follows our definitions and design principles, which we extracted from HCXAI research. For example, the recommendation is to use clear and concise language that integrates medical terms for the explanation. This corresponds to our rationale for using language as a supporting medium for visual explanations. This shows that the recommender GPT uses the provided knowledge file to generate the design recommendation. The selected XAI methods also meet the needs of the target group. Counterfactual explanation methods are especially often used to create a causal understanding of an ML model. Since the goal of the explanation for the target group is to evaluate how trustworthy the algorithm is, a counterfactual explanation fits well, as it allows decision characteristics to be recognized. These characteristics can then be compared with the physicians' medical expertise to decide whether the prediction is trustworthy or not. From a technical perspective, both recommended XAI methods can also be used for the ML model.

As for Use Case 1, the XAI Solution for Use Case 2 (Appendix D.2, Figure 5) shows some favorable properties that are based on the needs and goals of the target group. For example, one goal is to understand how the model uses climate data to make its prediction. The XAI Solution therefore recommends adding an interactive component to the explanation in which the engineers can construct different scenarios and understand how different climate features impact the prediction of the ML model. Additionally, the XAI Solution suggests that the explanation can use highly specific technical language and statistical concepts because the target group consists of ML engineers with several years of working experience. Another objective of the target group is to know which features are most relevant for the ML model; hence the XAI methods recommended in the XAI Solution are suitable, as two of them are feature attribution methods.

4.4. Semantic Similarity Evaluation

We also use two automated evaluation metrics so that the evaluation does not depend solely on our, possibly biased, perception. SemScore [40] compares the semantic content of a model's output and a reference text using embeddings. This method fits well because we want to check whether the design guidelines conform to our proposed human-centered design principles. SemScore computes a similarity score between the embedding of the model output and that of the reference text. G-Eval [41] is a framework that uses chain-of-thought prompting to assess the quality of LLM output based on given evaluation criteria, in our case semantic similarity on a scale from 1 to 5. Meta-evaluations show high human alignment values for both metrics.
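As an illustration of this embedding-based comparison, a minimal sketch is shown below. It is not the authors' evaluation code: the official SemScore implementation may use a different embedding model, and the model name and example texts are placeholders.

```python
# Minimal approximation of a SemScore-style check: cosine similarity between the embedding
# of the generated design guidelines and the embedding of the reference design principles.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder embedding model

reference = (
    "Explanations should be supplemented with verbal descriptions and an interpretation "
    "guide, structured as an interactive process, selective, and customizable."
)
generated = (
    "Use clear, concise language with medical terminology, offer an interactive interface, "
    "focus on clinically relevant features, and provide layered levels of detail."
)

ref_emb = model.encode(reference, convert_to_tensor=True)
gen_emb = model.encode(generated, convert_to_tensor=True)

score = util.cos_sim(ref_emb, gen_emb).item()  # value in [-1, 1]; higher means more similar
print(f"semantic similarity: {score:.3f}")
```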
Table 1: Results of both semantic similarity evaluation methods, SemScore [40] and G-Eval [41]. SemScore has a scale from -1 to 1 and G-Eval has a scale from 1 to 5.

Use Case             SemScore   G-Eval
Oncology AI          0.555      4.38
Extreme Weather AI   0.675      4.625

Table 1 shows the results of both metrics, SemScore and G-Eval, for our two use cases. SemScores are between -1 and 1, where a value close to one means high semantic similarity. G-Eval is an evaluation metric where an LLM assigns a score between 1 and 5, where 5 indicates high semantic similarity. The evaluation results show that we get moderately good similarity scores (0.555 and 0.675) with SemScore. This means that the recommended design guidelines of the recommendation GPT align positively with our human-centered design principles. Also, the scores of the G-Eval framework show positive values for the semantic similarity criterion. Table 1 shows the mean values (4.38 and 4.63) of 20 trials of the G-Eval framework with the same LLM as a backbone.
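For context, a simplified, hypothetical illustration of such an LLM-as-judge call with a 1-5 semantic-similarity criterion is sketched below. It is not the authors' evaluation setup: the real G-Eval framework [41] uses more elaborate chain-of-thought prompts and probability-weighted scoring, and the prompt wording and model choice here are placeholders.

```python
# Simplified illustration of a G-Eval-style judgement on a 1-5 semantic similarity scale.
# The actual G-Eval framework uses more elaborate prompts and scoring; this is a sketch only.
from openai import OpenAI

client = OpenAI()

def judge_similarity(principles: str, guidelines: str, model: str = "gpt-4") -> str:
    prompt = (
        "Rate how semantically similar the generated design guidelines are to the reference "
        "design principles.\n"
        "Steps: 1) summarize both texts, 2) compare their key recommendations, "
        "3) output a single integer score from 1 (not similar) to 5 (very similar).\n\n"
        f"Reference principles:\n{principles}\n\nGenerated guidelines:\n{guidelines}\n\nScore:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Repeating the call (e.g., 20 times, as in Table 1) and averaging gives a more stable score.
```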
5. Discussion

In this study, we embarked on developing a recommendation framework that finds and suggests human-centered XAI Solutions for individual use cases. For that, we first had to introduce our notion of an XAI Solution, which is a combination of conventional XAI methods and actionable human-centered design guidelines with which these methods can be tailored to a specific target group. To recommend the design guidelines, we formulated human-centered design principles based on research, drawing on studies from human-centered XAI [11] and socio-cognitive analysis [12]. These principles describe how explanations should be created for different target groups so that they are as understandable as possible. Through the formulation of our design principles, we can recommend actionable guidelines. Because a human-centered XAI Solution is highly individual and context-specific, it is almost impossible to implement a hard-coded and structured approach with database queries or decision trees. This conclusion has also been reached by numerous predecessors [30, 31, 7] who have made the theoretical formulations but were unable to implement a recommendation algorithm. Hence, we decided to utilize the impressive association capabilities of the large language model GPT-4. We built a customized recommendation GPT that uses, among other things, "knowledge files" as information bases to generate its recommendation.

We evaluated the results of the recommendation framework for two use cases with three evaluation methods. The qualitative analysis of the XAI Solutions shows that the recommender GPT is capable of using the information about the target group to formulate actionable human-centered design guidelines that correspond to the design principles we formulated. The recommendation framework grasps the important parts of the target group information and matches them to the design principles: for example, for the ML engineer target group, it recommends that statistical and machine learning concepts should be used in the explanation. In contrast, for the doctor target group, it recommends avoiding technical terms of machine learning and instead conveying the explanation in medical jargon. Furthermore, if possible, the explanation should not be centered around ML theories, as this information is not helpful for them and is rather counterproductive. The positive direction suggested by our qualitative analysis is also backed by the quantitative evaluation metrics.

We focused the automated evaluation metrics on the criterion of semantic similarity. The rationale behind that choice is that we want to assess whether the generated human-centered design guidelines semantically follow our design principles, i.e., have the same meaning. The evaluation with both metrics shows positive results, which means that even though both evaluation techniques are based on completely different methodologies, our recommendation GPT produces trustworthy actionable design guidelines that are in agreement with our design principles. Furthermore, we argue that the framework pays attention to the needs of the target group when selecting a suitable XAI method. In both use cases, the recommender provides suitable help on which XAI method should be used to achieve the explanatory goal. For example, the group of ML engineers wants to know which features are most important for the ML model in order to use this information to improve the model. The recommendation framework rightfully suggests using feature attribution methods that can provide precisely this information.

Another distinguishing feature of our recommendation framework is that it is language-based. This opens up the possibility for people who are not necessarily familiar with AI to describe their needs in their own words. Natural language enables a whole new level of customization, and users can describe their use cases much more precisely than with approaches that only offer a limited choice for specific use cases [7, 20, 42]. Our recommendation framework thus addresses both issues regarding XAI introduced in the introduction. It assists in selecting an appropriate XAI method, considering not only the technical characteristics of the ML model but also the knowledge level and explanation goals of the target group. Furthermore, the recommendation framework provides recommendations and guidelines, based on well-established theories of HCXAI, on how explanations should be tailored to maximize understandability for the target group. To the best of our knowledge, our approach is the first to integrate these two issues and propose a solution. Furthermore, unlike any previous methods, our approach offers unprecedented individuality in the selection and deployment of XAI methods.

Of course, we are aware of the risks and limitations associated with the use of large language models [43, 44]. We are also aware that it is somewhat contradictory to utilize a type of ML architecture that is inherently unexplainable for a problem within the domain of XAI. Nevertheless, we believe that our evaluation approach, combining qualitative human analysis, which is still considered the gold standard, with automated evaluation metrics, could be a way forward to reliably assess the output of custom GPTs if a large-scale expert survey is not possible. Another promising direction for evaluating our recommendation framework is human-in-the-loop feedback, where practitioners provide feedback through conversational interactions, as has been implemented by [29]. Unfortunately, due to the basic structure and proprietary nature of most large language models, it cannot be completely ruled out that fatal hallucinations will be generated despite extensive evaluation. However, we believe that the task to be solved involves such a high degree of complexity, and that linguistic exchange is such a fundamental feature of the framework, that only an LLM is capable of producing satisfactory results.
6. Conclusion

In this work, we built a human-centered recommendation framework for XAI Solutions for specific use cases. Our XAI Solutions consist of a suitable XAI method and actionable human-centered design guidelines with which the explanation can be tailored to the target group for better understanding. For this, we built a customized GPT that uses well-defined human-centered design principles and an XAI database to generate the XAI Solution. The evaluation of the recommendation framework on two use cases shows that practitioners can overcome two long-lasting problems in the application of XAI: first, finding a good XAI method, and second, making the explanation understandable to the target group. There are numerous possibilities for expanding the framework in future work. On the one hand, we aim to expand the human-centered design principles and incorporate even more insights from interdisciplinary sciences. On the other hand, the XAI database can be endlessly expanded with ever-new XAI methods. To make the recommender GPT more robust with larger datasets, methods like retrieval-augmented generation (RAG) could be employed.

Acknowledgments

This work was conducted as part of the Go-KI project (Offenes Innovationslabor KI zur Förderung gemeinwohlorientierter KI-Anwendungen), funded by the German Federal Ministry of Labour and Social Affairs (BMAS) under the funding reference number DKI.00.00032.21.

References

[1] H. Vasconcelos, M. Jörke, M. Grunde-McLaughlin, T. Gerstenberg, M. S. Bernstein, R. Krishna, Explanations can reduce overreliance on ai systems during decision-making, Proceedings of the ACM on Human-Computer Interaction 7 (2023) 1–38.
[2] C. Chen, S. Feng, A. Sharma, C. Tan, Machine explanations and human understanding, arXiv preprint arXiv:2202.04092 (2022).
[3] S. S. Kim, N. Meister, V. V. Ramaswamy, R. Fong, O. Russakovsky, Hive: Evaluating the human interpretability of visual explanations, in: European Conference on Computer Vision, Springer, 2022, pp. 280–298.
[4] R. S. Zimmermann, J. Borowski, R. Geirhos, M. Bethge, T. Wallis, W. Brendel, How well do feature visualizations support causal understanding of cnn activations?, Advances in Neural Information Processing Systems 34 (2021) 11730–11744.
[5] A. Bertrand, T. Viard, R. Belloum, J. R. Eagan, W. Maxwell, On selective, mutable and dialogic xai: a review of what users say about different types of interactive explanations, in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023, pp. 1–21.
[6] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, B. Kim, Sanity checks for saliency maps, Advances in Neural Information Processing Systems 31 (2018).
[7] R. Cugny, J. Aligon, M. Chevalier, G. Roman Jimenez, O. Teste, Autoxai: A framework to automatically select the most adapted xai solution, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 315–324.
[8] P. Bommer, M. Kretschmer, A. Hedström, D. Bareeva, M. M.-C. Höhne, Finding the right xai method–a guide for the evaluation and ranking of explainable ai methods in climate science, arXiv preprint arXiv:2303.00652 (2023).
[9] K. Sokol, P. Flach, One explanation does not fit all: The promise of interactive explanations for machine learning transparency, KI-Künstliche Intelligenz 34 (2020) 235–250.
[10] U. Ehsan, M. O. Riedl, Human-centered explainable ai: Towards a reflective sociotechnical approach, in: HCI International 2020-Late Breaking Papers: Multimodality and Intelligence: 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings 22, Springer, 2020, pp. 449–466.
[11] Q. V. Liao, K. R. Varshney, Human-centered explainable ai (xai): From algorithms to user experiences, arXiv preprint arXiv:2110.10790 (2021).
[12] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artificial Intelligence 267 (2019) 1–38.
[13] K. Sokol, P. Flach, Explainability fact sheets: A framework for systematic assessment of explainable approaches, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 56–67.
[14] H. Shen, T.-H. Huang, How useful are the machine-generated interpretations to general users? a human evaluation on guessing the incorrectly predicted labels, in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, 2020, pp. 168–172.
[15] M. Ribera, A. Lapedriza, Can we do better explanations? a proposal of user-centered explainable ai, CEUR Workshop Proceedings, 2019.
[16] Q. V. Liao, M. Pribić, J. Han, S. Miller, D. Sow, Question-driven design process for explainable ai user experiences, arXiv preprint arXiv:2104.03483 (2021).
[17] H. Lakkaraju, D. Slack, Y. Chen, C. Tan, S. Singh, Rethinking explainability as a dialogue: A practitioner's perspective, arXiv preprint arXiv:2202.01875 (2022).
[18] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, H. Müller, Causability and explainability of artificial intelligence in medicine, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (2019) e1312.
[19] M. Chromik, Making shap rap: Bridging local and global insights through interaction and narratives, in: Human-Computer Interaction–INTERACT 2021: 18th IFIP TC 13 International Conference, Bari, Italy, August 30–September 3, 2021, Proceedings, Part II 18, Springer, 2021, pp. 641–651.
[20] D. Slack, S. Krishna, H. Lakkaraju, S. Singh, Explaining machine learning models with interactive natural language conversations using talktomodel, Nature Machine Intelligence 5 (2023) 873–883.
[21] Y. Pi, Infeature: An interactive feature-based-explanation framework for non-technical users, in: International Conference on Human-Computer Interaction, Springer, 2023, pp. 262–273.
[22] V. Lai, Y. Zhang, C. Chen, Q. V. Liao, C. Tan, Selective explanations: Leveraging human input to align explainable ai, arXiv preprint arXiv:2301.09656 (2023).
[23] X. Kong, S. Liu, L. Zhu, Toward human-centered xai in practice: A survey, Machine Intelligence Research (2024) 1–31.
[24] N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, et al., Captum: A unified and generic model interpretability library for pytorch, arXiv preprint arXiv:2009.07896 (2020).
[25] A. Saucedo, U. Iqbal, S. Krishna, Xai - an explainability toolbox for machine learning, 2018.
[26] V. Arya, R. K. E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, Y. Zhang, One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques, 2019. URL: https://arxiv.org/abs/1909.03012.
[27] A. Perotti, C. Borile, A. Miola, F. P. Nerini, P. Baracco, A. Panisson, Explainability, quantified: Benchmarking xai techniques, in: World Conference on Explainable Artificial Intelligence, Springer, 2024, pp. 421–444.
[28] C. O. Retzlaff, A. Angerschmid, A. Saranti, D. Schneeberger, R. Roettger, H. Mueller, A. Holzinger, Post-hoc vs ante-hoc explanations: xai design guidelines for data scientists, Cognitive Systems Research (2024) 101243.
[29] M. Caro-Martínez, J. A. Recio-García, B. Díaz-Agudo, J. M. Darias, N. Wiratunga, K. Martin, A. Wijekoon, I. Nkisi-Orji, D. Corsar, P. Pradeep, et al., isee: A case-based reasoning platform for the design of explanation experiences, Knowledge-Based Systems (2024) 112305.
[30] T. Vermeire, T. Laugel, X. Renard, D. Martens, M. Detyniecki, How to choose an explainability method? towards a methodical implementation of xai in practice, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2021, pp. 521–533.
[31] H. Suresh, S. R. Gomez, K. K. Nam, A. Satyanarayan, Beyond expertise and roles: A framework to characterize the stakeholders of interpretable machine learning and their needs, in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–16.
[32] T. A. Schoonderwoerd, W. Jorritsma, M. A. Neerincx, K. Van Den Bosch, Human-centered xai: Developing design patterns for explanations of clinical decision support systems, International Journal of Human-Computer Studies 154 (2021) 102684.
[33] H. Kaur, H. Nori, S. Jenkins, R. Caruana, H. Wallach, J. Wortman Vaughan, Interpreting interpretability: understanding data scientists' use of interpretability tools for machine learning, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–14.
[34] M. Chromik, A. Butz, Human-xai interaction: a review and design principles for explanation user interfaces, in: Human-Computer Interaction–INTERACT 2021: 18th IFIP TC 13 International Conference, Bari, Italy, August 30–September 3, 2021, Proceedings, Part II 18, Springer, 2021, pp. 619–640.
[35] J. Schneider, J. Handali, Personalized explanation in machine learning: A conceptualization, arXiv preprint arXiv:1901.00770 (2019).
[36] S. Mohseni, N. Zarei, E. D. Ragan, A multidisciplinary survey and framework for design and evaluation of explainable ai systems, ACM Transactions on Interactive Intelligent Systems (TiiS) 11 (2021) 1–45.
[37] M. Langer, D. Oster, T. Speith, H. Hermanns, L. Kästner, E. Schmidt, A. Sesing, K. Baum, What do we want from explainable artificial intelligence (xai)?–a stakeholder perspective on xai and a conceptual model guiding interdisciplinary xai research, Artificial Intelligence 296 (2021) 103473.
[38] J. Fleck, Expertise: knowledge, power and tradeability, in: Exploring Expertise: Issues and Perspectives, Springer, 1998, pp. 143–171.
[39] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[40] A. Aynetdinov, A. Akbik, Semscore: Automated evaluation of instruction-tuned llms based on semantic textual similarity, arXiv preprint arXiv:2401.17072 (2024).
[41] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: Nlg evaluation using gpt-4 with better human alignment, arXiv preprint arXiv:2303.16634 (2023).
[42] V. B. Nguyen, J. Schlötterer, C. Seifert, From black boxes to conversations: Incorporating xai in a conversational agent, in: World Conference on Explainable Artificial Intelligence, Springer, 2023, pp. 71–96.
[43] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 610–623.
[44] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al., Siren's song in the ai ocean: a survey on hallucination in large language models, arXiv preprint arXiv:2309.01219 (2023).

A. Human-Centered Design Principles

This is the content of our knowledge file which the recommender GPT uses to build the XAI Solution. The knowledge file contains design principles for four explanatory characteristics.

COMMUNICATION. From a psychological and philosophical perspective, explanations are a form of social dialogue. Explanations based solely on visualizations can be hard to interpret. In addition to just using images and tables, the visual explanation should also be described verbally, like with text. Moreover, each explanation should be accompanied by some sort of interpretation guide. This is especially helpful for people who don't have a high level of formal, instrumental, and personal knowledge about AI. For AI experts with significant knowledge, the verbal component isn't as crucial, but an interpretation guide still improves understanding.

INTERACTIVITY. Simply presenting raw visualizations like saliency maps and tabular explanations does not provide complete understanding. From a social science perspective, explanations are often viewed as an interactive process. Thus, explanations for a machine learning model should also be interpreted and structured as an interactive process. This means there should be a back-and-forth between the explaining medium and the person receiving the explanation. Through this process, most audiences can gain a reliable mental model of the behavior of an ML model.

SELECTIVITY / COMPLEXITY. Traditional explanations like saliency maps or tabular visualizations are often cognitively demanding and can overwhelm people who don't have significant formal, instrumental, or personal knowledge because they are too complex. Social sciences have shown that explanations are selective, meaning the explainer chooses only the most important and relevant causes to build an explanation. To apply this to explainable AI, a way should be found to ask for the needs of the explanation consumer before generating the explanation, allowing the most critical parts to be selected. For example, in feature-importance methods, not all features should be displayed, just those that are most important or of interest to the consumer. Similarly, saliency maps can be simplified to make them less complex.

PERSONALIZABILITY. Studies have shown that when the recipient of the explanation can customize it to their knowledge level, mental model, and preferences, it positively impacts understanding. These personalizations could include adjusting the complexity or presentation style. However, this should be available only to those with a higher level of AI knowledge. Simpler personalizations, like changing the color scheme or size, could be offered to those with less technical expertise, giving them opportunities to engage with the explanation and better understand it.
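As an illustration only, the principles above could be serialized into the HCDesignPrinciples.json knowledge file referenced in Appendix C roughly as follows; the actual file structure used by the authors is not published, and the keys and condensed wording are hypothetical.

```python
# Hypothetical sketch of how the design principles might be serialized into
# HCDesignPrinciples.json. Keys and condensed wording are illustrative only.
import json

design_principles = {
    "communication": (
        "Explanations are a form of social dialogue; supplement visual explanations with "
        "verbal descriptions and an interpretation guide, especially for non-experts."
    ),
    "interactivity": (
        "Structure the explanation as an interactive back-and-forth so the audience can "
        "build a reliable mental model of the ML model's behavior."
    ),
    "selectivity_complexity": (
        "Ask for the consumer's needs first and present only the most important causes, "
        "e.g., the top features instead of all features."
    ),
    "personalizability": (
        "Let recipients adjust complexity and presentation to their knowledge level; offer "
        "simpler personalizations (color, size) to less technical audiences."
    ),
}

with open("HCDesignPrinciples.json", "w", encoding="utf-8") as f:
    json.dump(design_principles, f, indent=2)
```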
B. Information Retrieval for Background Information and Explanatory Needs

B.1. Background Knowledge of the Target Group

We retrieve the background knowledge and expertise in the same way as the explanatory needs: we let the user of the framework describe the knowledge of the target group themselves and just give initial ideas on what to describe. First, we ask if the target group can be categorized into specific stakeholder groups, for example:

• Affected: People who do not actively use the system but are affected by its decisions.
• Users: People who actively use the system.
• Developers: People who implement and build the system.
• Owners: People who own the system but do not necessarily develop or use it.
• Validators: People who have a supervisory function.

Second, questions about the formal knowledge of the target group:

• What theoretical/formal knowledge do they have about AI and ML, e.g., do they know the mathematical background of the ML algorithm?
• What theoretical/formal knowledge do they have about the domain in which the ML model is applied?
• What is the target group's level of knowledge about similar types of ML algorithms from the public or from training programs?

Third, questions about the instrumental knowledge of the target group:

• What practical knowledge about programming and ML does the target group have?
• Does the target group have practical experience in the domain?
• Does the target group use a lot of AI systems in general?

B.2. Explanatory Needs of the Target Group

First, the user of the framework can name questions that the target group wants an answer for. We ask the question: What type of question does the target group of the explanation want to be answered? and provide some examples of what we mean by the question.

• How reliable is the prediction of the ML model?
• Is the prediction of my ML model faulty/discriminatory/random?
• What information and features does the ML model use to generate the prediction?
• Which factors are particularly important for the output of the ML model?
• What are the limitations of the ML model?

Second, the user of the framework can name goals that the target group wants to achieve with the explanation. Again, we first ask the question: What goals should be achieved by the explanation? and then provide some examples for inspiration.

• Error detection and improvement of the ML model
• I want to ensure that the ML model complies with regulations and laws
• I want to understand how the ML model can be used for specific tasks
• I want to justify and explain the decisions of the ML model with explanations
• I want to know how the data is utilized
• I want to learn about the knowledge domain in which the model is deployed
• I want to learn how to modify the decisions of the ML model
• I want to create trust through explanation so that users of the ML model have a positive experience when using it

C. Recommender GPT Instructions

The following is the instruction prompt for the recommender GPT on the OpenAI ChatGPT platform. The JSON files refer to the design principles (Section 3.1) and the XAI database (Section 3.3):

This GPT is an assistant that gives human-centered XAI recommendations. The recommendation consists of human-centered design principles which are based on the HCDesignPrinciples.json file and on XAI Methods which should be from the methods.json file. The recommendation for human-centered design principles consists of recommendations in four characteristics. The recommendation says how explanations should be tailored to a target group in these four characteristics.
The characteristics are communication, interactivity, selectivity, and customizability. Guidelines for all four characteristics are discussed in the knowledge file (HCDesignPrinciples.json). Based on these guidelines, recommendations for the four characteristics will be made for a specific target group. The GPT receives answers to questions that describe the knowledge of the target group and also information about what the explanatory goal of the target group is. Based on this information, the GPT matches the guidelines in the knowledge file (HCDesignPrinciples.json) to the information about the target group and makes recommendations for all four characteristics on how an explanation should be adjusted so that it is understandable to the target group.

D. Use Case Interactions and XAI Solutions

D.1. Use Case 1: Explanation for Doctors

Information Retrieval

Q: For whom are you seeking an explanation? Describe the target group. Can the target group be assigned to one of the following categories (Affected, User, Developers, Owners, Validators)?
A: I am seeking an explanation for doctors at a hospital who are the users of an ML algorithm.

Q: We will now ask you more about the knowledge of the target group. First, we want to know about the formal knowledge of the target group. What theoretical/formal knowledge do they have about AI and ML, e.g., do they know the mathematical background of the ML algorithm?
A: The doctors in the hospital do not know anything about AI and ML. Some of them have a solid mathematical background, but in general, they do not know how the ML algorithm works in theory.

Q: What theoretical/formal knowledge do they have about the domain in which the ML model is applied?
A: The doctors who use the algorithm are well-trained doctors in the domain of oncology, specifically breast cancer.

Q: What is the target group's level of knowledge about similar types of ML algorithms from the public or from training programs?
A: Before the algorithm was used, all users attended an introduction and information event in which the use of the algorithm was explained.

Q: Now we will ask about the instrumental knowledge of the target group. What practical knowledge about programming and ML does the target group have?
A: The doctors using the ML system do not have any knowledge about practical programming or ML.

Q: Does the target group have practical experience in the domain?
A: The target group works as doctors and is experienced in practice.

Q: Does the target group use a lot of AI systems in general?
A: All users have an average amount of contact with AI in general.

Q: What type of question does the target group of the explanation want to be answered? (E.g., How faithful is the prediction of the algorithm? Which features are most important for the algorithm?)
A: The target group wants to know if the prediction is incorrect and also how trustworthy the prediction of the algorithm is.

Q: What goals should be achieved by the explanation?
A: The goal is to use the explanation to build trust among the doctors using the algorithm but also to justify the decisions that we make based on the algorithm's prediction.

D.2. Use Case 2: Explanation for ML Engineers

Information Retrieval

Q: For whom are you seeking an explanation? Describe the target group. Can the target group be assigned to one of the following categories (Affected, User, Developers, Owners, Validators)?
A: The target group is ML Engineers working at an AI software company that is implementing an ML algorithm.

Individual XAI Solution

Communication: The explanation should employ clear, concise language that integrates medical terminology familiar to oncologists with basic statistical concepts to explain the algorithm's outputs. Use of visual aids like graphs or charts to show statistical reliability (like confidence intervals or prediction probabilities) can be helpful. Since the doctors have attended an introductory event, the explanation can build on the foundational knowledge they already possess about the algorithm's intended use, focusing on practical application rather than theoretical aspects.

Interactivity: An interactive explanation framework would be highly beneficial. This could include tools that allow the doctors to input hypothetical patient data or modify existing data points to see how these changes affect the algorithm's predictions. Such features would help them understand the conditions under which the algorithm performs best, and when it might be prone to errors, thus fostering a deeper understanding of its reliability and limitations.

Selectivity / Complexity: The explanation should selectively focus on aspects of the algorithm that directly relate to clinical decision-making in oncology. This might include an emphasis on the most significant features the algorithm considers when making a prediction, such as tumor size, type, and location, and how these are weighted within the model. Presenting simplified case studies where the algorithm's predictions were particularly successful, as well as scenarios where it was less accurate, can illustrate its performance spectrum in a clinically relevant manner.

Customizability: The explanation system should be adaptable to cater to varying levels of interest and time constraints among the doctors. While some may prefer a quick overview of the prediction outcomes and trust metrics, others might be interested in a more detailed analysis of the algorithm's decision-making process. Offering layered explanations that users can dive into at their discretion will make the system more user-friendly and effective across the board.

Recommended XAI Methods

1. Counterfactual Explanations: This method would allow doctors to see how altering certain input variables (like changing a measurement or condition slightly) could lead to different predictions. This is particularly useful for understanding edge cases or close-call decisions made by the algorithm.
2. Feature Importance: Methods like SHAP (SHapley Additive exPlanations) can elucidate which features are most influential in the algorithm's predictions. This aligns well with a need to understand and trust the predictive factors, especially in complex cases of breast cancer diagnosis.

These explanations, tailored to the needs and backgrounds of the doctors, aim to enhance their trust in the AI system by making its operations transparent and directly relevant to their clinical practice. This approach should also aid in justifying clinical decisions influenced by the algorithm's predictions, ensuring that the tool is a valuable asset to the oncological team.

Figure 4: Recommended XAI Solution of our recommendation GPT for Use Case 1.
Q: We will now ask you more about the knowledge of the target group. First, we want to know about the formal knowledge of the target group. What theoretical/formal knowledge do they have about AI and ML, e.g., do they know the mathematical background of the ML algorithm?
A: We are all ML Engineers and have extensive knowledge about the theory behind most AI and ML algorithms.

Q: What theoretical/formal knowledge do they have about the domain in which the ML model is applied?
A: The algorithm that we are implementing is concerned with predicting extreme weather events from climate data. We do not have knowledge about meteorology and climate in general. We only use the data.

Q: What is the target group's level of knowledge about similar types of ML algorithms from the public or from training programs?
A: We researched similar algorithms online that also use climate data to predict extreme weather events.

Q: Now we will ask about the instrumental knowledge of the target group. What practical knowledge about programming and ML does the target group have?

Individual XAI Solution

Communication: Given the target group's extensive knowledge in ML, the explanations can use technical language, incorporating advanced statistical and machine learning concepts. However, since the engineers are not familiar with meteorology, it's crucial to clearly define any meteorological terms and concepts used in the model. Visual aids like advanced data visualizations that map climate data inputs to prediction outputs can be beneficial. Documentation or tooltips that provide brief explanations of meteorological terms and their relevance to the model's predictions can enhance understanding.

Interactivity: An interactive explanation interface would be highly beneficial for this group. It could allow engineers to manipulate input data points and observe how these changes affect the model's predictions. Tools that enable them to test model behavior under various synthetic scenarios they construct can deepen their understanding of how different climate features influence predictions. This interactive process facilitates a practical exploration of the model's behavior, providing insights that are crucial for both trust and model refinement.

Selectivity / Complexity: The explanation should focus on elements of the model that directly impact its performance and decision-making process. This includes a detailed presentation of feature importance, showing which climatic variables most significantly affect predictions and how. It would also be helpful to illustrate the model's limitations by discussing conditions under which the model may perform poorly, such as certain climatic scenarios not well-represented in the training data.

Customizability: Providing options to customize the level of detail in the explanations would cater to individual preferences within the group. Some engineers might prefer a high-level overview of model performance metrics, while others might delve into the nuances of algorithmic decisions and data usage. Allowing engineers to choose the complexity of the information presented or to toggle between a basic and advanced view of explanations could enhance usability and engagement.

Recommended XAI Methods

1. SHAP (SHapley Additive exPlanations): This method can provide insights into the contribution of each feature to individual predictions, helping understand feature importance and the decision-making process of the model in a granular way.
2. LIME (Local Interpretable Model-agnostic Explanations): Useful for showing how individual predictions are made by approximating the model locally with an interpretable model.
3. Global Surrogate Models: These can be used to approximate the behavior of the complex model with a simpler one, providing an overall understanding of the model behavior across different regions of the input space. This method is not explicitly listed in the provided files but follows logically from the principles of XAI for highly technical users.

These approaches will enable ML engineers to gain deeper insights into the machine learning model, facilitating improvements and fostering trust through transparency and a better understanding of how the model processes climate data to predict extreme weather events.

Figure 5: Recommended XAI Solution of our recommendation GPT for Use Case 2.

A: We are all experienced ML engineers with several years of working experience.

Q: Does the target group have practical experience in the domain?
A: No, we do not have any knowledge about meteorology.

Q: Does the target group use a lot of AI systems in general?
A: We are very proficient with AI systems and frequently use them in our daily lives.

Q: What type of question does the target group of the explanation want to be answered? (E.g., How faithful is the prediction of the algorithm? Which features are most important for the algorithm?)
A: We want to learn the behavior of the model and want to know which features are most important for the ML model. We also want to know what the limitations of the ML model are.

Q: What goals should be achieved by the explanation?
A: We want to improve the ML model by knowing the behavior of the model, and additionally, we want to know how the data is used by the ML model.