Going Beyond Explainability in Medical AI Systems

Ulrich Reimer1, Edith Maier1, Beat Tödtli1

Abstract: Despite the promises of artificial intelligence (AI) technology, its adoption in medicine has met with formidable obstacles due to the inherent opaqueness of internal decision processes that are based on models which are difficult or even impossible to understand. In particular, the increasing use of (deep) neural nets and the resulting black-box algorithms has led to widespread demands for explainability. Apart from discussing how explainability might be achieved, the paper also looks at other approaches to building trust, such as certification or controlling bias. Trust is a crucial prerequisite for the use and acceptance of AI systems in the medical domain.

Keywords: Explainability; Certification; Machine Learning; AI; Bias; Medical Device

1 Introduction: Medical AI Systems

AI systems are beginning to have an impact in the medical domain. They comprise applications for clinicians that allow rapid and accurate image interpretation or genome sequencing, or that recommend treatments for patients (such as IBM Watson Health’s cancer algorithm). AI technology can also be found in applications for lay people, such as biosensors with continuous output of physiologic metrics that enable them to process their own data to promote their health. According to [9], algorithms are desperately needed for these purposes because analysing such huge amounts of data exceeds human capacities. The question is how AI and human intelligence can best be integrated.

The decision on a patient’s next treatment is shared between the patient, the physician and the medical AI system.2 The degree of autonomy and responsibility of each role in the decision-making process is central to the attribution of trust since, clearly, systems require less trust if they are supervised by a human or another system. Therefore, it may be useful to distinguish between decision support systems and decision making systems, as well as recognising the continuum between these extremes. For example, a skin melanoma detection system may be viewed as a decision making system if false-negative detections are not re-examined by a physician. Medical information retrieval systems, on the other hand, are clearly only capable of providing decision support.

1 Institute for Information & Process Management, University of Applied Sciences St.Gallen, firstname.lastname@fhsg.ch
2 One may object to the anthropomorphic notion of a system making a decision. Alternatively, the decision of an AI system may be viewed as a shared decision by the physician and the engineers who built the system.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Trust-Enhancing Features of Medical AI Systems

An analysis of trust-influencing features of automated systems in general is given by Chien et al. [1]. The authors suggest three main categories, each with several dimensions, some of which are particularly relevant for our topic:

• The category Performance Expectancy is defined “as an individual’s belief that applying automation will help her to enhance job performance”. One particularly relevant dimension of this category is perceived usefulness, which we will discuss briefly in Section 2.5.
• The second category Process Transparency refers to factors that influence an individual’s perceived difficulty in using an automated system. Relevant dimensions are understandability (see Section 2.1) and reliability (see Section 2.4) of the system.

• Finally, the category of Purpose Influence relates to “a person’s knowledge of what the automation is supposed to do”, i.e. the alignment of a user’s expectations and the system designers’ goals. An important dimension of this category is the certification of an automated system, which will be discussed in Section 2.2.

An additional aspect, specific to data-driven AI systems and not mentioned in [1], is controlling sampling bias. We will deal with this aspect in Section 2.3.

In this position paper we focus on the following questions:

• What makes a user trust a (medical) AI system to provide correct and adequate advice or decisions?

• Will (medical) AI systems override experience and intuition, i.e. the largely tacit knowledge harboured by experts, when it comes to taking decisions?

While the above-mentioned issues are relevant for AI systems in general, they deserve particular attention in medicine, where lives may be at stake. The models AI systems are based on primarily determine their behaviour. We therefore argue that certain characteristics of these models, such as interpretability and representativeness, play a crucial role for the usage and acceptance of AI systems. In the following subsections we take a closer look at each of the above-mentioned dimensions and discuss how they may influence a physician’s or patient’s trust in a medical AI system.

2.1 Explainability

The trust-enhancing dimension of understandability is closely related to the notion of explainability as discussed in the machine learning community [2]. Gilpin et al. [3] distinguish between explainability and interpretability. According to them, the goal of interpretability is to describe the internals of a system in a way that is understandable to humans. Since humans are biased towards simple descriptions, an oversimplification of the description of the actual system is unavoidable and even necessary. Explainability includes interpretability but additionally requires the explanation to be complete, i.e. not lacking relevant aspects. Clearly, there is a trade-off between an explanation being complete and being comprehensible. Other authors refuse to make a distinction between explainability and interpretability (e.g. [6]).

Explainability is typically problematic for sub-symbolic models, i.e. (deep) neural networks, as opposed to symbolic models such as decision trees, which can in principle be inspected by a human. Nevertheless, inspecting a complex decision tree with hundreds or even thousands of nodes would quite probably be pointless, since reading does not automatically imply understanding. Thus we run into the performance–explainability trade-off between having simple (symbolic) models that facilitate explainability and complex (possibly sub-symbolic) models that result in better performance of the AI application but can no longer be understood [5]. Since sub-symbolic models cannot (easily) be inspected, several researchers have come up with the idea of having an additional, simpler (symbolic) model just for the purpose of explainability, while the actual system performance relies on the full-fledged, complex (sub-symbolic) model [7].
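As a minimal sketch of this idea (not taken from [7]; the data set, model choices and parameters below are synthetic and purely illustrative), a shallow decision tree can be trained as a global surrogate that imitates the predictions of a more complex model, and its fidelity to that model can be measured:

```python
# Illustrative sketch of a global surrogate model; all data and parameters
# are synthetic and chosen only for demonstration purposes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Complex model responsible for the actual predictions (the "black box")
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
y_bb = black_box.predict(X)

# Simple symbolic surrogate trained to imitate the black box, not the true labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_bb)

# Fidelity: how often the surrogate reproduces the black-box decision
fidelity = accuracy_score(y_bb, surrogate.predict(X))
print(f"Surrogate fidelity to the black box: {fidelity:.2f}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(10)]))
```

The fidelity score indicates how far the surrogate’s explanation can be trusted as a description of the complex model; a low fidelity means the simple model explains something other than what the deployed system actually computes. Note that [7] explains individual predictions with local surrogates rather than a single global one.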
Other authors suggest transferring part of the input-to-output transformation complexity from the modelling stage to the preprocessing stage [8]. This may help to improve the explainability of a model at the expense of a more complex and opaque preprocessing procedure.

It can be stipulated that explainability, or at least interpretability, is an essential capability of a medical AI system. Otherwise, a physician or clinician either just has to trust the conclusions of the system or has to go through a subsequent verification process, which may well be costly and time-consuming and thus nullify any potential efficiency benefits of the AI system. At the same time, he or she may not be willing to go through a lengthy explanation to understand the decision offered by the system. Since neither approach is desirable or practicable, the task of verification is normally taken on by officially recognised agencies or regulatory bodies such as the Food and Drug Administration (FDA).

2.2 Certification of Medical AI Systems

Medical AI systems need to be certified as medical devices by regulatory bodies such as the already-mentioned FDA in the US or national bodies such as Swissmedic in Switzerland before they can be used in practice. A clinical trial is usually at the core of the certification process. By assuming responsibility for the adequacy of the medical AI system, regulatory bodies provide an established source of trust.

However, the certification of a medical AI system requires a different approach than that for approving, e.g., a new drug. The decisions suggested by an AI system must be compared to the decisions of a physician, whereas clinical trials evaluate new treatments by comparing them with traditional treatments or treatment as usual. Since the model underlying an AI system is a generalization of a possibly large but nevertheless limited set of input data, it will sometimes come up with inadequate decisions. Thus, for AI systems to be certifiable at all it is important that other criteria are fulfilled – e.g. that a clinician can easily cross-check the system’s decisions against his/her own expertise, by simple lab tests, or via the explanation given by the system.

Depending on its complexity, the certification of an AI system can require a huge effort. One possibility to simplify the process is to narrow down the functional range of the system, e.g. by having it diagnose only one kind of disease or a small range of diseases. Breaking up the diagnostic scope of an AI system into smaller systems, however, poses the problem that a disease might not be diagnosed if it presents with atypical symptoms. A system with a broader range can more easily perform a differential diagnosis between several possible diagnoses.

Another way to narrow down the scope of an AI system is to restrict the range of patients it can be applied to. For example, if the system was developed on data from people of a specific ethnic group and gender, the clinical trial can be narrowed down to the same kind of patient sample. This approach would also help with the issue of bias, as is further discussed in Section 2.3.

Certification amounts to model testing. This means the absence of errors (wrong diagnoses, wrong therapies) cannot be shown. When the certification process uses a sample with a similar bias to the sample used for developing the AI system, fundamental flaws may exist that will not be uncovered during certification.
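To make this risk concrete, one simple screening step (purely illustrative and not part of any certification standard) is to compare the distribution of key patient attributes in the planned trial sample against both the training data and a reference sample for the intended target population. The sketch below, assuming a single numeric attribute such as age and entirely synthetic data, uses a two-sample Kolmogorov–Smirnov test from SciPy:

```python
# Illustrative sketch only: screen one patient attribute for shared skew between
# the training data and a planned certification sample, relative to a reference
# sample of the intended target population. All data here are synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_age = rng.normal(48, 9, size=5000)       # hypothetical development cohort
trial_age = rng.normal(49, 9, size=300)        # hypothetical certification trial
reference_age = rng.normal(63, 12, size=2000)  # hypothetical target population

for label, (a, b) in [("trial vs. training data", (trial_age, train_age)),
                      ("trial vs. target population", (trial_age, reference_age))]:
    stat, p = ks_2samp(a, b)
    print(f"{label}: KS statistic = {stat:.2f}, p = {p:.1e}")
```

If the trial sample merely mirrors the training data while both deviate from the target population, the clinical trial is unlikely to reveal the corresponding sampling flaw; such checks, of course, only flag obvious mismatches in individual attributes.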
Thus, instead of only doing model testing via a clinical trial, a certification process should additionally include inspecting the model and checking it for plausibility. Here, an explanation component could provide considerable support and help to “ensure algorithmic fairness, identify potential bias/problems in the training data, and to ensure that the algorithms perform as expected” [3].

2.3 Reducing Sampling Bias in Medical AI Systems

Sampling bias refers to the bias incurred due to the specific data set chosen to train an AI system. The resulting system extrapolates from the training data to the general case. If the training set is skewed, the system does not generalize well. For example, if a medical AI system is trained on data from Asian people, it might not work well for Africans or Europeans. While bias concerning gender and ethnic group can be controlled relatively easily [10], the feature space is huge, so that other, less obvious biases can exist that neither the developer nor the certification agency is aware of. The problem is that it is usually unknown what effect a feature has on the generated model and how its values should be distributed to provide a representative sample.

Bias can be reduced by utilizing domain knowledge about features and their impact on the learned model [4]. Another approach is testing for specific biases. When using medical AI systems, a patient may rightfully ask how well she or he is represented in the training data. One approach might be to go back to the original dataset and determine the nearest neighbours of the data point representing the patient. Comparing the number and closeness of these nearest neighbours to those of the average person in the sample gives an estimate of how well that patient is covered or whether he or she is an outlier.

2.4 Reliability

The reliability of an AI system may influence the willingness to use it [1]. Reliability can be estimated by classical measures from machine learning such as precision, specificity, sensitivity (or recall), false positive rate and false negative rate. An important aspect in this context is whether the system is optimized to reduce false positives or false negatives.

2.5 Perceived Usefulness

Finally, a further criterion for deciding to use a (medical) AI system is its perceived usefulness, which refers to a user’s belief that the system would assist in achieving a certain goal [1]. This criterion is strongly influenced by reliability (see above) but also by other factors such as the degree of support the system gives and how often it fails to come up with a viable solution.

3 Points for Discussion

Let us summarize the main points to be addressed when discussing the prerequisites for accepting AI technology in the medical domain:

• Physicians do not need an explanation component, because a medical AI system has already undergone a formal certification process, which relieves them of the responsibility to ensure that advice obtained from the system is correct.

• Even if a medical AI system has obtained official certification, clinicians might at least want to check the plausibility of a system’s decision. Therefore, an explanation component should be available that provides succinct and intuitive (but not necessarily complete) explanations.

• An AI system can only be certified if it is not too complex, i.e. if it has a limited functional range and/or patient range;
otherwise, the required effort would make certification too costly and impractical.

• Certification should not only focus on system performance, i.e. on testing the underlying model with concrete cases, but should also include inspection of the underlying decision model as well as of the training and test sets used for creating the model.

• An AI system might be more reliable than a human expert, make decisions in less time and at lower cost, and come up with better suggestions, but it is still bound to make errors eventually. The error rate, however, should be lower than that of a physician.

• The sample used to train the decision model of an AI system will always be biased. To make up for this, it should be possible to determine for a specific patient how well he or she is covered by the training sample, in order to estimate how much to trust the AI system in that specific case.

• Instead of applying a pure machine learning approach to generate the model underlying an AI system, the model should also comprise (manually engineered) domain knowledge to increase its reliability, e.g. for plausibility checking of a decision.

4 Conclusions and Future Work

As we have seen, there are approaches to building trust in medical AI systems that go beyond explainability, such as certification, controlling the bias of the training set or checking how well an individual patient is covered by the learned model. But even if all of these are applied, we would argue that full autonomy is unlikely to ever be attained for medical AI systems. Humans will always be required to oversee the algorithmic interpretation of images and data in this sensitive realm.

References

[1] Shih-Yi Chien et al. “An Empirical Model of Cultural Factors on Trust in Automation”. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting 58 (Oct. 2014), pp. 859–863.

[2] Filip Karlo Došilović, Mario Brčić, and Nikica Hlupić. “Explainable artificial intelligence: A survey”. In: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 2018, pp. 0210–0215.

[3] L. H. Gilpin et al. “Explaining Explanations: An Overview of Interpretability of Machine Learning”. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). Oct. 2018, pp. 80–89.

[4] L. Clark Johnson et al. “Sampling bias and other methodological threats to the validity of health survey research”. In: International Journal of Stress Management 7.4 (2000), pp. 247–267.

[5] David Martens et al. “Performance of classification models from a user perspective”. In: Decision Support Systems 51.4 (2011), pp. 782–793.

[6] Tim Miller. “Explanation in Artificial Intelligence: Insights from the Social Sciences”. In: CoRR abs/1706.07269 (2017). arXiv: 1706.07269.

[7] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2016, pp. 1135–1144.

[8] Cynthia Rudin. “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead”. In: Nature Machine Intelligence 1.5 (2019), pp. 206–215.

[9] Eric J. Topol. “High-performance medicine: the convergence of human and artificial intelligence”. In: Nature Medicine 25.1 (2019), pp. 44–56.

[10] M. B. Zafar et al.
“Fairness Constraints: A Flexible Approach for Fair Classification”. In: Journal of Machine Learning Research 20.75 (2019), pp. 1–42.