Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis

Yao Xie (UCLA, ECE, Los Angeles, California, yaoxie@g.ucla.edu), Xiang 'Anthony' Chen (UCLA, ECE, Los Angeles, California, xac@ucla.edu), Ge Gao (University of Maryland, iSchool, College Park, Maryland, gegao@umd.edu)

ABSTRACT
The adoption of intelligent systems creates opportunities as well as challenges for medical work. On the positive side, intelligent systems have the potential to compute complex patient data and generate automated diagnosis recommendations for doctors. However, medical professionals often perceive such systems as "black boxes" and, therefore, feel concerned about relying on system-generated results to make decisions. In this paper, we contribute to the ongoing discussion of explainable artificial intelligence (XAI) by exploring the concept of explanation from a human-centered perspective. We hypothesize that medical professionals would perceive a system as explainable if the system were designed to think and act like doctors. We report a preliminary interview study that collected six medical professionals' reflections on how they interact with data for diagnosis and treatment purposes. Our data reveal when and how doctors prioritize among various types of data as a central part of their diagnosis process. Based on these findings, we outline future directions for the design of XAI systems in the medical context.

CCS CONCEPTS
• Human-centered computing ~ Interactive systems and tools • Human-centered computing ~ HCI design and evaluation methods

KEYWORDS
Explainable artificial intelligence; human-centered design; medical data; system design

ACM Reference format:
Yao Xie, Xiang 'Anthony' Chen and Ge Gao. 2019. Outlining the Design Space for Explainable Intelligent Systems for Medical Diagnosis. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 Introduction
Intelligent systems, computational agents that employ algorithms to process and make sense of data, are becoming increasingly ubiquitous in modern workplaces [1]. Despite the promise of assisting human decision making through a data-driven approach, non-computing professionals often find it challenging to understand how the system transforms their initial input into a final decision, and why.

In the medical field, systems such as CheXNet [40] have been developed to interpret a patient's chest X-ray scan using deep learning. While the system can perform faster than human doctors with impressive accuracy, it offers little clue to what happens within the "black box". Human doctors, who hold the medical responsibility, can hardly trust the system's results without understanding its underlying decision-making process [21].

[Figure 1. The input and output image of CheXNet [40].]
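To make the "black box" concrete, consider how a CheXNet-style classifier is queried. The sketch below assumes a PyTorch DenseNet-121 backbone with a single pneumonia output, following the description in [40]; it is an illustration of the interaction, not the released CheXNet code:

```python
# Minimal sketch of querying a CheXNet-style classifier (assumed setup:
# PyTorch + DenseNet-121 backbone with one pneumonia logit, following [40];
# illustrative only, not the authors' released code).
import torch
import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=None)   # backbone architecture used by CheXNet
model.classifier = nn.Linear(1024, 1)      # single "pneumonia" output
model.eval()

xray = torch.randn(1, 3, 224, 224)         # stand-in for a preprocessed chest X-ray
with torch.no_grad():
    prob = torch.sigmoid(model(xray)).item()

# Everything the clinician sees of the model's reasoning is this one number.
print(f"P(pneumonia) = {prob:.2f}")
```

The clinician receives a prediction without a rationale, which is precisely the gap that work on explainable AI tries to close.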
To help non-computing professionals better comprehend results generated by intelligent systems, a growing body of research has pursued the goal of building explainable AI (XAI). It provides various system-centric solutions, such as developing accountable and transparent algorithms [11,43], visualizing obscure features [12,49], and employing theories from cognitive psychology to explore effective explanations [28,29,32]. The current limitation of these approaches is the lack of empirical evidence that domain professionals actually understand the resulting explanations [24].

In this project, we tackle the challenge of XAI from a user-centric perspective. We chose the medical domain as the focus of our research, given the proliferation of AI-powered diagnosis systems in recent years. We hypothesize that human doctors will find a system more explainable when the system 'speaks the language' of a doctor and 'thinks like' a doctor.

The remainder of this paper presents our first step toward the design of an explainable AI system that takes the perspective of medical professionals. We first review prior research on XAI, intelligent systems in the medical field, and the mental models of medical professionals. We then report a preliminary interview study with six doctors that examines how medical professionals interact with data for diagnosis and treatment purposes in their daily work. Based on findings from the interviews, we discuss how interaction designers can incorporate human doctors' data-processing models into medical intelligent systems and thereby make such systems more explainable for their users.

2 Background & Related Work
In this section, we first lay out a background review of XAI research and then zoom into HCI-oriented approaches to XAI. Since our focus is medicine, we further discuss prior work on medical AI and, most relevant to our interest, the literature on the reasoning process of medical professionals.

Explainable Artificial Intelligence (XAI) Systems
Explainable artificial intelligence (XAI) has attracted considerable attention in recent years [20]. Since the 1970s, researchers have studied explanations for expert systems [31,46]. Recently, the need for explainable artificial intelligence has been raised again by the development of machine learning and artificial intelligence: algorithms such as deep learning are intrinsically difficult to understand, which creates the need for more explainable systems.

A large body of work on interpretable machine learning explains the inner workings of machine learning models through mathematical and algorithmic solutions [4].
The main methods of interpretable machine learning include explanations of complex algorithms such as deep learning, causal inference, Bayesian rules, and visual analytics. Algorithmic accountability means that the algorithm should explain its decisions, as required, for example, by the "right to explanation" regulation in the EU [18]. Planning oversight, retrospective analyses, and continuous review are needed to make algorithms accountable [19]. However, many challenges remain in XAI. Lipton [27] proposed a taxonomy of the reasons for interpretability and of the ways to achieve it, but there is still no consensus on the definition of interpretability. Other researchers have studied how to evaluate whether a system is interpretable and proposed evaluation methods [13]. Attempts have also been made to map intelligibility, interpretable algorithms, and explainability onto the related work. In social science, researchers study how people define, select, generate, evaluate and express explanations [33].

Intelligibility and Explainable Systems Research in HCI
HCI researchers focus on users' interaction with intelligent systems, and explanation is one important topic in this space. Artificial intelligent systems have long been criticized for rigid conceptual models that are incompatible with how humans actually behave [45]. Explainable AI in HCI spans topics including context awareness, cognitive psychology, and software learnability [44]. Context awareness is used to recognize user reactions and activities; in the early 2000s, it attracted much attention with the development of mobile devices and sensors [9,42].
People should understand what is sensed and what reaction is taken in a specific situation. A context-aware system should let users know "what they know, how they know it and what they are going to do next" [3]. Simplistic representations of context have been called for in explainable AI, so that users are aware of what the system has obtained and which action it will take [14]. Cognitive psychology contributes explanation theory: Lombrozo studied cognitive explanations [28] and found that explanation is strongly connected to causal reasoning. XAI thus concerns not only human cognitive psychology but also the understanding of social context [33]. Software learnability is an important part of usability; it addresses how people learn to use complex software applications with the help of demonstrations or in-context videos [19] and evaluates how easy a system is to use.

Systems need to provide users with not only results but also accounts of their behaviors [3]. Researchers have also built tailored interfaces that provide visual or textual explanations of context-aware rules [10], studied interaction design strategies that help users predict system behavior through feedforward [2,3,47,48], and investigated how users understand and control machine learning programs, working toward debuggable and intelligible machine learning [50]. Understandability and predictability are especially important in artificial intelligence applications such as autonomous vehicles [37]. Besides algorithmic accountability, transparency, and fairness, data visualization is another stream of work from the computational side of HCI, one that so far seems isolated from what machine learning researchers do [7].

Intelligent Systems in Medical Fields
Artificial intelligent systems also have broad prospects in medicine. With the growing availability of medical data and data-processing techniques, such systems can be applied across the healthcare domain. They can extract useful information from amounts of data that are difficult for doctors to process, and thus assist medical decision making [16,35]. In the medical field, AI has three major applications: early detection, diagnosis, and treatment planning. Systems can support the diagnostic process for diseases in domains such as cardiology, oncology, and neurology [23]. Research in medical artificial intelligence concentrates on pathology and radiology; for example, systems can read radiographs and recognize patterns for radiologists and pathologists, acting as information specialists during the diagnostic process [22]. Beyond image analysis in radiology and pathology, AI systems are also applied to read the medical scientific literature and integrate electronic medical records, and they may optimize and predict the treatment of chronic diseases [32].

However, compared to the booming industry, the actual usage of auto-diagnostic systems in hospitals remains relatively low. One study examined doctors' acceptance of, and intention to adopt, these systems [15]. Another proposed methods for evaluating the clinical performance and effect of artificial intelligent systems in medical diagnosis, one of which involves explanations [39].

Explanation capabilities for knowledge-based artificial intelligence systems were first added for medical decision making and computer-aided diagnosis in 1983 [46]. Later, a theory of diagnostic reasoning was used to find the system components that lead to, and explain, discrepancies between expected results and observed behaviors [41]; it applies to a variety of settings, including medical auto-diagnostic systems. How doctors make decisions in uncertain and information-overloaded situations has also raised much concern, and an argument-based interaction, flexible and easily understood by human users, has been proposed to help doctors make such decisions [17]. Fuller explanations have been shown to have a positive effect on users' trust in such systems and to mitigate reliance issues: better explanations let users follow the reasoning chain, enhancing confidence in the system and helping doctors provide better diagnoses [5]. An interactive visual analytics system has also been designed to support interactive dependence diagnostics through feature representation and visualization [24].
Medical Reasoning, Decision Making & Mental Models
Cosby summarized two models of clinical reasoning: analytical and intuitive [8]. The analytical approach is based on the hypothetico-deductive model common in scientific research and discovery, whereas the intuitive approach amounts to recognizing common patterns in a patient's symptoms rather than deliberately going through a methodical decision-making process. Doctors often choose between these models based on how experienced they are and how complicated a case is.

Also, due to the uniqueness of the medical field, medical reasoning and decision making mean more than they do in other fields. From the doctors' perspective, an explanation of the decision-making process covers not only how the results come about but also the cost of medical decisions, such as responsibility and risk [6]. The requirements for explanations also vary across scenarios. In addition, medical decision making can be regarded as a combination of basic medical knowledge such as pathology, experience gained from previous patients in similar conditions, and awareness of the patient's demographic information. It is far more complex than a routine decision process or a mental model that can be captured by splitting features into "yes" or "no" [8].

Broadly, the term 'mental model' comes from cognitive psychology. It denotes people's internal account of how things work [38]. A mental model can also be regarded as an internal representation of external factors, and it is important in cognition, decision making, and reasoning [36]. These internal conceptualizations, including users' beliefs about and understanding of system behavior, guide their interaction with systems [38], and they develop differently for different users over the course of interaction. In general, most mental models are simpler than the actual systems, and this simplicity is usually sufficient for users to understand system behavior [34]. In complex cases such as medical diagnosis, however, if mental models cannot reflect the actual complexity of the artificial intelligent systems, users may find it difficult to understand, explain or predict system behavior [38]. To help users understand and explain how a system works, the system should be transparent and expose a model of its operation similar to humans' mental models [25,26]. Otherwise, users are likely to build flawed mental models when interacting with such systems and be confused about the decision-making process [38]. Systems that support sound mental models enhance users' satisfaction, perceived control, and overall trust, which in turn facilitates understanding [25].
3 Interview
Although much research has sought to explain intelligent systems, it seldom looks into specific domains or incorporates empirical domain knowledge into the explanations. We approach this problem from the doctors' perspective, which is why we investigate the following research question:

RQ: How do medical professionals interact with patients' data for diagnosis and/or treatment purposes?

Overview
We conducted an interview study to explore the research question presented above. Our current sample consists of six licensed medical professionals working in California, United States. Each interview lasted about one hour. During the research process, we iterated between collecting new data, generating codes, and revising/elaborating the existing coding book, as suggested by grounded theory [30]. Findings from these interviews offered insights into the relationship between medical professionals, data, and intelligent systems from a human-centered perspective.

Participants and Data Collection

Table 1. Background information of our six interviewees, including their participant ID, domain of expertise, gender, and number of years working/studying in the medical field.

ID   Domain of Expertise/Specialty   Gender   # of Years in the Medical Field
P1   Pathologist                     Male     22
P2   Orthopedist                     Female   17
P3   Neurologist                     Male     7
P4   Family physician                Male     10
P5   General physician               Male     5
P6   Cardiologist                    Male     18

All interviewees joined this study by responding to an online participant call posted by the research team. We intentionally looked for participants who hold different domains of expertise within the medical field, so that the interview data could capture both the commonalities and the differences between the thinking styles of various medical professionals. Table 1 summarizes the background information of each interviewee. For anonymity, we replaced their names with randomly assigned IDs.

Between September and November of 2018, the first author of this paper conducted semi-structured interviews with each participant. The interview protocol was initially developed through in-group brainstorming sessions among the authors of this paper. It was then revised based on two pilot interviews with senior M.D. students at UCLA. The final protocol consisted of questions revolving around four issues: 1) the interviewee's work and education experience in the medical field; 2) how s/he accesses, processes and interprets medical data during daily work practice; 3) challenges and solutions s/he has experienced, if any, when working with medical data; and 4) experience with and/or expectations of computer-based systems that facilitate daily medical work. All interviews were conducted face-to-face in English and audio-taped for transcription.

Analysis
Three authors of this paper analyzed the interview data together following an inductive approach. The initial open coding generated 60 codes and 201 quotations. They captured participants' self-reflections regarding the forms of data they interact with in daily medical work, the thinking process they go through when interacting with various data, the decisions they try to make based on data processing, and the types of work they have been delegating, or hope to delegate, to computer-based systems. We iteratively discussed and compared codes as they were generated. During the discussion, prioritization emerged as a focal theme: a central task medical professionals perform during diagnosis is to prioritize among various, and sometimes conflicting, information given by patients, other doctors, and computer-based systems. We then went through further coding to identify connections between this focal theme and the other emerging themes and categories. The following section presents our detailed findings. Words and phrases directly quoted from participants are written in italics.
4 Findings
Our interviewees frequently described the process of generating a proper diagnosis and/or treatment plan as context-dependent, data-intensive, and open to alternative possibilities. In many cases, there is no one-to-one correspondence between signs, symptoms, and diseases. Medical professionals in the field, therefore, are often required to integrate various kinds of data and think outside the box, as the following two participants point out:

For medicine, it's usually the grey area that matters. Everything is hardly black and white, and that's why it is always difficult. … People say that medicine is both a science and an art, because every disease is different, and every patient's representation will be different. Every doctor obviously has different steps in making the decision. [P6, Cardiologist]

Authorities, like the American Heart Association, will publish guidelines and flow charts that we can refer [to]. It prevents physicians from making ridiculous mistakes. But for more complex diseases, the guideline cannot include all of them. It will depend on the doctor's experience or some innovations to accomplish the treatment. [P4, Family physician]

In the rest of this section, we describe how medical professionals navigate the complexity of their interaction with medical data. We identify three critical steps from interviewees' reflections: detecting and reacting to borderline cases, generating prioritization matrices, and coordinating with computer-based systems. Across all these steps, medical professionals keep prioritizing and re-prioritizing among information collected at different stages of the diagnosis process.

Borderline Cases: When Challenges Emerge
All participants in our interviews reported running into borderline cases as the moments when processing and interpreting medical data turn challenging. One representative situation is when the symptoms are still at an early stage:

At the very early state [of cancer], it is difficult to tell if the cell is abnormal. The architecture is minimally disrupted. You may think it is abnormal, but you don't know whether it is malignance. We will show the cases to other colleagues to get consensus, or we have to say this case is inconclusive. [P1, Pathologist]

In other borderline cases, medical professionals receive conflicting information that points the diagnosis in different directions:

Many of us have run into cases when the MRI doesn't confirm [our diagnosis]. We think the problem is in the right brain, but the image shows nothing there. In that case, we may do the test again. We can also go back to the patient to ask them again, or we discuss with other doctors. [P3, Neurologist]

To clear up the ambiguity indicated by the two quotations above, doctors often need to cross-validate their initial evaluation of the patient by requesting further data. Our interviews with the six medical professionals documented multiple types of such data, including but not limited to the patient's demographic information, cardinal symptoms, results from further physical examinations and lab tests, historical data from reference groups, and evaluations given by other doctors.

Participants in our study offered similar insights into how they deal with this rich yet complex medical data. Instead of following one hard rule of data processing, interviewees tend to weight and interpret each type of data differently based on their personalized prioritization matrices.

Prioritization Matrices: Validity and Beyond
We identified six parameters from participants' self-reflections that reveal how they prioritize data for diagnosis and/or treatment purposes. These parameters are labeled as below:

Theoretical validity. Robustness of the connections between signs, symptoms, and diseases as established by theories, medical textbooks, and guidelines;

Severity of consequence. Quality and quantity of the potential consequences if the detected signs/symptoms are put aside at this moment; side effects of a treatment; interactions between different treatments;

Time constraint. Timing; urgency; the sequential order of taking care of different symptoms and diseases;

Domain of expertise. The extent to which the signs and symptoms connect to the doctor's specialty; the level of confidence in offering a candidate treatment;

Risk avoidance. Responsibility assigned to a specific doctor; power dynamics between junior vs. senior doctors;

Technical feasibility. The sensitivity of the measurement; the reliability of the technique; the false positive/negative rate of symptom detection.

[Figure 2. A diagram that compares the prioritization matrices held by two different medical professionals (blue vs. orange line) when making diagnosis decisions. While one doctor uses severity as the primary parameter to weight various data during diagnosis/treatment, the other cares most about the calculation of risks and responsibilities.]

Our interviewees sometimes referred to these personalized prioritization matrices (or styles) to explain disagreements between the diagnosis suggestions provided by different doctors (see Figure 2 for an illustration).
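For system designers, one way to make this notion concrete is to read the six parameters as the axes of a per-doctor weight vector that scores incoming evidence. The following is a minimal sketch of that reading; the parameter names come from our findings, while the weights, ratings, and linear scoring rule are hypothetical:

```python
# Illustrative formalization of a "prioritization matrix": a per-doctor set of
# weights over the six parameters identified above. The weights and the linear
# scoring rule are assumptions for illustration, not something the
# interviewees specified.
PARAMETERS = [
    "theoretical_validity", "severity_of_consequence", "time_constraint",
    "domain_of_expertise", "risk_avoidance", "technical_feasibility",
]

def score(evidence: dict, matrix: dict) -> float:
    """Weight one piece of evidence (rated 0-1 on each parameter) by a
    doctor's personal prioritization matrix."""
    return sum(matrix[p] * evidence.get(p, 0.0) for p in PARAMETERS)

# Two hypothetical styles, echoing Figure 2: one severity-driven, one
# risk/responsibility-driven.
severity_first = dict.fromkeys(PARAMETERS, 0.1)
severity_first["severity_of_consequence"] = 0.5
risk_averse = dict.fromkeys(PARAMETERS, 0.1)
risk_averse["risk_avoidance"] = 0.5

mri_finding = {"theoretical_validity": 0.9, "severity_of_consequence": 0.3,
               "risk_avoidance": 0.8, "technical_feasibility": 0.7}

# The same evidence scores differently under the two matrices, which is how
# different "styles" can yield disagreeing diagnoses.
print(score(mri_finding, severity_first))  # 0.39
print(score(mri_finding, risk_averse))     # 0.59
```

An explainable system could expose such weights directly, letting a doctor inspect and adjust the style the system is emulating.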
Participants often used styles to describe the detailed prioritization matrices held by different doctors. Similar to other dispositional attributes such as personality, a medical professional's prioritization matrix is perceived as self-aware and consistent across the various diagnoses made by the same individual:

The diagnosis depends on many factors – severity, possibility, consistency with the patient's history, and others. Some doctors will make the most severe issues their priority, others will make the most possible ones their priority. It depends on their perspective. It also depends on the time concern. For example, neurologists may have a longer period of diagnosis, but surgeons and ER doctors don't. [P2, Orthopedist]

Some doctors trust images [over other information], like MRI, to tell what's happening. About 80% of the time you would have good images. You are very confident about the diagnosis from the images. But I think the most important information [to facilitate diagnosis] is what the patient tells you. It helps to track the patient's history. [P6, Cardiologist]

Coordination Between Medical Professionals & Systems
All medical professionals in our study reported using computer-based tools and systems to facilitate their daily work practice. Most participants, for instance, rely heavily on cloud-based platforms to store their local medical data and connect it with other databases [P1, P2, P4, P5, P6]. They also used various systems to automate the counting of chromosomes [P1], identify the degree of scoliosis [P2], check possible interactions between medications [P5], etc. The primary function of such tools is to "provide quantified information to doctors, but not [to give] answers in terms of high-level decisions" [P3].

While participants were confident that the auto-quantified information given by systems is usually trustworthy and helpful, this optimism does not carry over to their narratives about auto-diagnoses or treatments recommended by systems. The following quotation from P6 reflects an attitude shared across all six interviews:

There is a lot of advanced analysis involving machine learning, and some of them have entered the clinical realm. For example, you will have the nuclear images, and you will have the software telling you "it's abnormal here and there." It's as if you have a second reader next to you. I would love to have the system generating results, but ultimately, it's you that's deciding on the diagnosis. When there is a disagreement, me and everyone will be overwriting the machine-generated interpretation.
[P6, Cardiologist]

To step forward from quantifying information to directly assisting diagnosis and treatment, systems are expected to "give an argument for why the data should be interpreted in that way" [P5]. The majority of our participants proposed reference and comparison as one approach to grounding a system's diagnostic reasoning in that of human doctors:

Any machine has to give an evidence for the top reasons like in descending order for why in some matrices. It's like if I say something and you think differently, then we should be able to really compare the two. Otherwise, it doesn't matter if the machine's suggestion is right. I don't know what its thinking is and, ultimately, I take all the responsibility in this decision. [P1, Pathologist]

There are different ways [to help validate the systems' diagnosis recommendations]. One is showing me past examples in the database - will that support its conclusion? Another one is sources of data, something like research articles or convincing cases have been done. That's upper-level evidence. [P5, General physician]

Interviewees further suggested that to build an ideal auto-diagnosis/treatment system, the algorithm should be able to contextualize its reference data with a patient's personalized information. Such contextualization is what human doctors are good at, based on their professional training and experience, but it is perceived to be the major obstacle for systems to overcome.
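The "past examples" idea maps naturally onto case-based retrieval: alongside a recommendation, the system surfaces the most similar prior patients and their confirmed outcomes as inspectable evidence. Below is a minimal sketch of this idea, assuming a nearest-neighbor search over a toy feature encoding; the features, cases, and library choice are hypothetical, not a design our participants prescribed:

```python
# Sketch of evidence-by-example: retrieve the k most similar past cases so a
# recommendation can be checked against confirmed outcomes. The k-NN approach
# and the numeric feature encoding are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical past cases: [age, systolic_bp, troponin], with confirmed diagnoses.
past_cases = np.array([[64, 150, 0.40], [58, 135, 0.02], [70, 160, 0.55]])
diagnoses = ["myocardial infarction", "stable angina", "myocardial infarction"]

index = NearestNeighbors(n_neighbors=2).fit(past_cases)

new_patient = np.array([[66, 155, 0.45]])
_, neighbors = index.kneighbors(new_patient)

# Present the precedents, not just the label, so the doctor can compare the
# machine's reasoning with their own ("will that support its conclusion?").
for i in neighbors[0]:
    print(f"similar case {i}: features={past_cases[i]}, outcome={diagnoses[i]}")
```

In practice, the features would need normalization and, as our interviewees stressed, contextualization with the individual patient's information, which remains the hard part.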
5 Implications for Design
Based on the findings of the preliminary interview study, we outline design suggestions for explainable medical AI systems. Specifically, we envisage a system that can:

• Allow a medical professional to prioritize different types and sources of data by directly manipulating a user interface akin to our proposed prioritization matrix (Figure 2);

• Support the gradual engagement of medical AI systems in a medical professional's diagnosis process, spanning from low-level automated measurement tasks, to mid-level constraint-aware planning of medical tests, to high-level suggestions of plausible diagnoses.

ACKNOWLEDGMENTS
We thank Xiaohe Yang for assisting us in completing the interviews. We thank all the anonymous interviewees for their contributions to our study. We also thank Maie St. John, Peter Pellionisz and Jeff Liang for their valuable comments on earlier drafts of this paper.

REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18), 582:1–582:18. https://doi.org/10.1145/3173574.3174156
[2] Victoria Bellotti, Maribeth Back, W. Keith Edwards, Rebecca E. Grinter, Austin Henderson, and Cristina Lopes. 2002. Making sense of sensing systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '02). https://doi.org/10.1145/503447.503450
[3] Victoria Bellotti and Keith Edwards. 2001. Intelligibility and Accountability: Human Considerations in Context-Aware Systems. Human-Computer Interaction 16, 2-4: 193–212. https://doi.org/10.1207/S15327051HCI16234_05
[4] O. Biran and C. Cotton. 2017. Explanation and justification in machine learning: A survey. IJCAI-17 Workshop on Explainable AI (XAI). Retrieved from http://www.intelligentrobots.org/files/IJCAI2017/IJCAI-17_XAI_WS_Proceedings.pdf#page=8
[5] A. Bussone, S. Stumpf, and D. O'Sullivan. 2015. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. In 2015 International Conference on Healthcare Informatics, 160–169. https://doi.org/10.1109/ICHI.2015.26
[6] C. K. Cassel and A. L. Jameton. 1981. Dementia in the elderly: an analysis of medical responsibility. Annals of Internal Medicine 94, 6: 802–807. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/7235423
[7] CHI Conference Committee. 2016. CHI '16 Vol 1: CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. Retrieved from https://market.android.com/details?id=book-MK8IMQAACAAJ
[8] Pat Croskerry, Karen Cosby, Mark L. Graber, and Hardeep Singh. 2017. Diagnosis: Interpreting the Shadows. CRC Press. Retrieved from https://www.taylorfrancis.com/books/9781351652926
[9] Anind K. Dey. 2001. Understanding and Using Context. Personal and Ubiquitous Computing 5, 1: 4–7. https://doi.org/10.1007/s007790170019
[10] Anind K. Dey and Alan Newberger. 2009. Support for context-aware intelligibility and control. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 859–868. https://doi.org/10.1145/1518701.1518832
[11] Nicholas Diakopoulos. 2016. Accountability in Algorithmic Decision Making. Communications of the ACM 59, 2: 56–62. https://doi.org/10.1145/2844110
[12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning, 647–655. Retrieved December 14, 2018 from http://proceedings.mlr.press/v32/donahue14.html
[13] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv [stat.ML]. Retrieved from http://arxiv.org/abs/1702.08608
[14] Paul Dourish. 1995. Developing a reflective model of collaborative systems. ACM Transactions on Computer-Human Interaction 2, 1: 40–63. https://doi.org/10.1145/200968.200970
[15] Wenjuan Fan, Jingnan Liu, Shuwan Zhu, and Panos M. Pardalos. 2018. Investigating the impacting factors for the healthcare professionals to adopt artificial intelligence-based medical diagnosis support system (AIMDSS). Annals of Operations Research. https://doi.org/10.1007/s10479-018-2818-y
[16] M. Fieschi. 2013. Artificial Intelligence in Medicine: Expert Systems. Springer. Retrieved from https://market.android.com/details?id=book-_Vf0BwAAQBAJ
[17] J. Fox, D. Glasspool, D. Grecu, S. Modgil, M. South, and V. Patkar. 2007. Argumentation-Based Inference and Decision Making: A Medical Perspective. IEEE Intelligent Systems 22, 6: 34–41. https://doi.org/10.1109/MIS.2007.102
[18] Bryce Goodman and Seth Flaxman. 2016. European Union regulations on algorithmic decision-making and a "right to explanation." arXiv [stat.ML]. Retrieved from http://arxiv.org/abs/1606.08813
[19] Tovi Grossman and George Fitzmaurice. 2010. ToolClips: An Investigation of Contextual Video Assistance for Functionality Understanding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '10), 1515–1524. https://doi.org/10.1145/1753326.1753552
[20] D. Gunning. 2016. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency.
[21] Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. 2017. What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923. Retrieved from https://arxiv.org/abs/1712.09923
[22] Saurabh Jha and Eric J. Topol. 2016. Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA 316, 22: 2353–2354. https://doi.org/10.1001/jama.2016.17438
[23] Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. 2017. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 4: 230–243. https://doi.org/10.1136/svn-2017-000101
[24] Josua Krause, Adam Perer, and Kenney Ng. 2016. Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16), 5686–5697. https://doi.org/10.1145/2858036.2858529
[25] Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. 2012. Tell Me More?: The Effects of Mental Model Soundness on Personalizing an Intelligent Agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12), 1–10. https://doi.org/10.1145/2207676.2207678
[26] T. Kulesza, S. Stumpf, M. Burnett, W. Wong, Y. Riche, T. Moore, I. Oberst, A. Shinsel, and K. McIntosh. 2010. Explanatory Debugging: Supporting End-User Debugging of Machine-Learned Programs. In 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, 41–48. https://doi.org/10.1109/VLHCC.2010.15
[27] Zachary C. Lipton. 2016. The Mythos of Model Interpretability. arXiv [cs.LG]. Retrieved from http://arxiv.org/abs/1606.03490
[28] Tania Lombrozo. 2010. Causal–explanatory pluralism: How intentions, functions, and mechanisms influence causal ascriptions. Cognitive Psychology 61, 4: 303–332. https://doi.org/10.1016/j.cogpsych.2010.05.002
[29] Tania Lombrozo and Susan Carey. 2006. Functional explanation and the function of explanation. Cognition 99, 2: 167–204. https://doi.org/10.1016/j.cognition.2004.12.009
[30] Donna Reseigh Long. 1993. Basics of qualitative research: Grounded theory procedures and techniques.
[31] William van Melle, Edward H. Shortliffe, and Bruce G. Buchanan. 1984. EMYCIN: A knowledge engineer's tool for constructing rule-based expert systems. Rule-based expert systems: The MYCIN experiments of the Stanford Heuristic Programming Project: 302–313. Retrieved from http://www.aaai.org/Papers/Buchanan/Buchanan17.pdf
[32] D. Douglas Miller and Eric W. Brown. 2018. Artificial Intelligence in Medical Practice: The Question to the Answer? The American Journal of Medicine 131, 2: 129–133. https://doi.org/10.1016/j.amjmed.2017.10.035
[33] Tim Miller. 2017. Explanation in Artificial Intelligence: Insights from the Social Sciences. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1706.07269
[34] Neville Moray. 1987. Intelligent aids, mental models, and the theory of machines. International Journal of Man-Machine Studies 27, 5: 619–629. https://doi.org/10.1016/S0020-7373(87)80020-2
[35] Travis B. Murdoch and Allan S. Detsky. 2013. The inevitable application of big data to health care. JAMA 309, 13: 1351–1352. https://doi.org/10.1001/jama.2013.393
[36] Jakob Nielsen. 2010. Mental models. Jakob Nielsen's Alertbox.
[37] Don Norman. 2009. The Design of Future Things. Basic Books. Retrieved from https://market.android.com/details?id=book-aeIUGq1EL24C
[38] Donald A. Norman. 2014. Some observations on mental models. In Mental Models. Psychology Press, 15–22. Retrieved from https://www.taylorfrancis.com/books/e/9781317769408/chapters/10.4324%2F9781315802725-5
[39] Seong Ho Park and Kyunghwa Han. 2018. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 286, 3: 800–809. https://doi.org/10.1148/radiol.2017171920
[40] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. 2017. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv [cs.CV]. Retrieved from http://arxiv.org/abs/1711.05225
[41] Raymond Reiter. 1987. A theory of diagnosis from first principles. Artificial Intelligence 32, 1: 57–95. https://doi.org/10.1016/0004-3702(87)90062-2
[42] B. Schilit, N. Adams, and R. Want. 1994. Context-aware computing applications. In Workshop on Mobile Computing Systems and Applications, 85–90. https://doi.org/10.1109/MCSA.1994.512740
[43] Ben Shneiderman. 2016. Opinion: The dangers of faulty, biased, or malicious algorithms requires independent oversight. Proceedings of the National Academy of Sciences 113, 48: 13538–13540. https://doi.org/10.1073/pnas.1618211113
[44] Ben Shneiderman, Catherine Plaisant, Maxine Cohen, Steven Jacobs, Niklas Elmqvist, and Nicholas Diakopoulos. 2016. Grand Challenges for HCI Researchers. Interactions 23, 5: 24–25. https://doi.org/10.1145/2977645
[45] Lucy A. Suchman. 1987. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press. Retrieved from https://market.android.com/details?id=book-AJ_eBJtHxmsC
[46] William R. Swartout. 1983. XPLAIN: A system for creating and explaining expert consulting programs. University of Southern California, Marina del Rey, Information Sciences Institute. Retrieved from http://www.dtic.mil/docs/citations/ADA130597
[47] Jo Vermeulen, Kris Luyten, Karin Coninx, and Nicolai Marquardt. 2014. The design of slow-motion feedback. In Proceedings of the 2014 Conference on Designing Interactive Systems, 267–270. https://doi.org/10.1145/2598510.2598604
[48] Jo Vermeulen, Kris Luyten, Elise van den Hoven, and Karin Coninx. 2013. Crossing the Bridge over Norman's Gulf of Execution: Revealing Feedforward's True Identity. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13), 1931–1940. https://doi.org/10.1145/2470654.2466255
[49] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
[50] 2001. UIST '01: Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, Orlando, Florida, November 11-14, 2001. Association for Computing Machinery. Retrieved from https://market.android.com/details?id=book-YLtQAAAAYAAJ