=Paper=
{{Paper
|id=Vol-2327/ExSS20
|storemode=property
|title=Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-ExSS2019-18.pdf
|volume=Vol-2327
|authors=Yao Xie,Xiang Chen,Ge Gao
|dblpUrl=https://dblp.org/rec/conf/iui/XieCG19
}}
==Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis==
Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis

Yao Xie, UCLA, ECE, Los Angeles, California (yaoxie@g.ucla.edu)
Xiang ‘Anthony’ Chen, UCLA, ECE, Los Angeles, California (xac@ucla.edu)
Ge Gao, University of Maryland, iSchool, College Park, Maryland (gegao@umd.edu)
ABSTRACT
The adoption of intelligent systems creates opportunities as well as challenges for medical work. On the positive side, intelligent systems have the potential to compute complex data from patients and generate automated diagnosis recommendations for doctors. However, medical professionals often perceive such systems as “black boxes” and, therefore, feel concerned about relying on system-generated results to make decisions. In this paper, we contribute to the ongoing discussion of explainable artificial intelligence (XAI) by exploring the concept of explanation from a human-centered perspective. We hypothesize that medical professionals would perceive a system as explainable if the system was designed to think and act like doctors. We report a preliminary interview study that collected six medical professionals’ reflections on how they interact with data for diagnosis and treatment purposes. Our data reveals when and how doctors prioritize among various types of data as a central part of their diagnosis process. Based on these findings, we outline future directions regarding the design of XAI systems in the medical context.

CCS CONCEPTS
• Human-centered computing ~ Interactive systems and tools • Human-centered computing ~ HCI design and evaluation methods

KEYWORDS
Explainable artificial intelligence; human-centered design; medical data; system design.

ACM Reference format:
Yao Xie, Xiang ‘Anthony’ Chen and Ge Gao. 2019. Outlining the Design Space for Explainable Intelligent Systems for Medical Diagnosis. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

IUI Workshop’19, March 20, 2019, Los Angeles, USA. Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 Introduction
Intelligent systems, computational agents that employ algorithms to process and make sense of data, are becoming increasingly ubiquitous in modern workplaces [1]. Despite the promise of assisting human decision making through a data-driven approach, non-computing professionals often find it challenging to understand how the system transforms their initial input into a final decision, and why.
In the medical field, systems such as CheXNet [40] have been developed to interpret a patient’s chest X-ray scan using deep learning. While the system can perform faster than human doctors with impressive accuracy, it offers little clue to indicate what happens within the “black box”. Human doctors holding medical responsibility can hardly trust the system’s results without understanding its underlying decision-making process [21].
Figure 1. The input and output image of CheXNet [40].
To help non-computing professionals better comprehend results generated by intelligent systems, a growing body of research has been conducted with the goal of building explainable AI (XAI). It provides various system-centric solutions, such as developing accountable and transparent algorithms [11,43], visualizing obscure features [12,49], and employing theories from cognitive psychology to explore effective explanations [28,29,32]. The current limitation of these approaches is that there is a lack of empirical evidence to support the understanding by domain professionals [24].
In this project, we tackle the challenge of XAI from a user-centric perspective. We identify the medical domain as the focus of our research given the proliferation of AI-powered diagnosis systems in recent years. We hypothesize that human doctors will find a system more explainable when the system ‘speaks the language’ of a doctor and ‘thinks like’ a doctor.
The remainder of this paper presents our first step toward the design of an explainable AI system that takes the perspective of medical professionals. We first review prior research on XAI, intelligent systems in the medical field, and the mental models of medical professionals. After that, we report a preliminary interview study with six doctors that tells how medical professionals interact with data for diagnosis and treatment purposes in their daily work practice. Based on findings from the interview, we discuss how interaction designers can incorporate human doctors’ data processing model into medical intelligent systems and make such systems more explainable for the users.
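To make the “black box” concrete, the following is a minimal sketch written by us for illustration only. It assumes a DenseNet-121 backbone of the kind CheXNet builds on, with a hypothetical, abridged label set and untrained stand-in weights; it is not the actual CheXNet model. The point it shows is that such a system maps an X-ray to per-finding probabilities and exposes nothing about how it reached them.

```python
# Illustrative sketch only: not the actual CheXNet [40] model or its weights.
# A DenseNet-121 backbone re-headed to score a few chest X-ray findings.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

FINDINGS = ["Atelectasis", "Cardiomegaly", "Effusion", "Pneumonia"]  # hypothetical, abridged label set

class ChestXRayClassifier(nn.Module):
    def __init__(self, num_labels: int = len(FINDINGS)):
        super().__init__()
        self.backbone = models.densenet121()                   # untrained stand-in weights
        self.backbone.classifier = nn.Linear(
            self.backbone.classifier.in_features, num_labels)  # new multi-label head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.backbone(x))                 # per-finding probabilities

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def diagnose(image_path: str) -> dict:
    """Map an X-ray image to per-finding probabilities; no explanation is attached."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = ChestXRayClassifier()(x).squeeze(0)
    return {name: float(p) for name, p in zip(FINDINGS, probs)}
```

The output of `diagnose` is exactly what a doctor receives from such a pipeline: a list of scores, with the reasoning hidden inside the network.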
2 Background & Related Work
In this section, we first lay out a background review of XAI research, and then zoom into an HCI-oriented approach towards XAI. Since our focus is the medical field, we further discuss prior work in medical AI and, specifically related to our interest, literature on the reasoning process of medical professionals.

Explainable Artificial Intelligence (XAI) Systems
Explainable artificial intelligence (XAI) has attracted a lot of attention in recent years [20]. Since the 1970s, researchers have worked on explanations for expert intelligent systems [31,46]. Recently, the need for explainable artificial intelligence has been raised again because of the development of machine learning and artificial intelligence. In particular, algorithms like deep learning are intrinsically difficult to understand, which creates the need for more explainable systems.
A lot of work on interpretable machine learning has been done to explain the inner principles of machine learning models with mathematical and algorithmic solutions [4]. The main methods of interpretable machine learning are explanations of complex algorithms like deep learning, causal inference, Bayesian rules, and visual analytics. Algorithm accountability means that the algorithm should explain its decisions; one example is the “right to explanation” law in the EU [18]. Planning oversight, retrospective analyses, and continuous review are needed to make algorithms accountable [19]. However, there are still many challenges in XAI. Lipton [27] proposed a taxonomy of the reasons for interpretability as well as ways to interpret, but there is still no consensus about the definition of interpretability. Some researchers have studied how to evaluate whether a system is interpretable, and evaluation methods have been proposed [13]. Attempts have also been made to map intelligibility, interpretable algorithms, and explainability onto the related work. In social science, researchers also study how people define, select, generate, evaluate, and express an explanation [33].
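As one concrete instance of the visualization methods mentioned in this subsection, the sketch below shows gradient-based saliency, a common way to highlight which input features a deep model relied on. It is our illustration in the spirit of feature-visualization work such as [12,49], not a reproduction of those papers’ methods, and the ResNet used here is an untrained stand-in.

```python
# Gradient-based saliency sketch (illustrative only): rank input pixels by how
# strongly they influence the model's top-scoring class.
import torch
from torchvision import models

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W). Returns an (H, W) map of |d top-class score / d pixel|."""
    model.eval()
    image = image.clone().requires_grad_(True)
    top_score = model(image).max()          # score of the most likely class
    top_score.backward()                    # gradients with respect to the input pixels
    return image.grad.abs().max(dim=1).values.squeeze(0)  # max over color channels

model = models.resnet18()                                  # untrained stand-in model
heatmap = saliency_map(model, torch.rand(1, 3, 224, 224))  # higher value = more influential pixel
```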
Intelligibility and Explainable Systems Research in HCI
In HCI, researchers focus on users’ interaction with intelligent systems, and explanation is one important topic. HCI researchers focus more on the interaction between the artificial intelligent system and its users, and a lot of work has been done from this aspect. Artificial intelligent systems have been criticized because their rigid concepts are incompatible with human behavior styles [45]. Explainable artificial intelligence in HCI covers topics including context awareness, cognitive psychology, and software learnability [44]. Context awareness is used to recognize user reactions and activities. In the early 2000s, context awareness attracted a lot of attention with the development of mobile devices and sensors [9,42]. People should understand what is sensed and what reaction is taken in a specific situation. A context-aware system should let users know “what they know, how they know it and what they are going to do next” [3]. The need for simple representations of context in explainable AI has been raised so that users are aware of what is obtained and which action will be taken by the system [14]. Cognitive psychology is more about explanation theory. Lombrozo studied cognitive explanations [28] and found that explanation is strongly connected with causal reasoning. Also, XAI focuses not only on human cognitive psychology but also on the understanding of social context [33]. Software learnability is an important part of usability. It focuses on how to use complex software applications with the help of demonstrations or in-context videos [19], and it evaluates the easiness of using a system.
Systems need to provide users with not only results but also an account of their behaviors [3]. Furthermore, research has been done on tailored interfaces that provide visual or textual explanations for context-aware rules [10]. Researchers have also studied the design strategies of interaction and how to help users predict system behavior through feedforward [2,3,47,48]. How users understand and control machine learning programs is also a relevant trend, which works towards debuggable and intelligible machine learning [50]. Understandability and predictability are very important in artificial intelligence applications such as autonomous vehicles [37]. Besides algorithmic accountability, transparency, and fairness, data visualization is also a stream from the computational perspective of HCI, which seems to be isolated from what machine learning researchers do [7].

Intelligent Systems in Medical Fields
In medical fields, artificial intelligent systems also have broad prospects. With the growing availability of medical data and data processing techniques, artificial intelligent systems can be applied in the healthcare domain. They are able to dig out useful information from large amounts of data that are difficult for doctors to process, and thus assist medical decision making [16,35]. In the medical field, there are three major applications: early detection, diagnosis, and treatment planning. These systems can also help with the diagnostic process in domains such as cardiology, cancer, and neurology [23]. Research in medical artificial intelligence mainly focuses on pathology and radiology. For example, systems are able to read radiographs and recognize patterns for radiologists and pathologists, working as an information specialist during the diagnostic process [22]. Besides the image analysis applications in radiology and pathology, artificial intelligent systems are also applied to read the medical scientific literature and integrate electronic medical records. In addition, they may optimize and predict the treatment of chronic disease [32].
However, compared to the booming industry, the actual usage of auto-diagnostic systems in hospitals is relatively low. A study has investigated doctors’ acceptance of and intention to adopt these systems [15]. Another study proposed methods for evaluating the clinical performance and effect of artificial intelligent systems in medical diagnosis, and one of the methods mentioned explanations [39].
Explanation capabilities of artificial intelligence systems using knowledge bases were first added for applications in medical decision making and computer-aided diagnosis in 1983 [46]. After that, a diagnostic reasoning theory was used to find the components of systems that lead to, and explain, the discrepancy between expected results and observed behaviors [41]; it applies to a variety of settings such as medical auto-diagnostic systems. Further, how doctors make decisions in uncertain and information-overloaded cases raises a lot of concerns. An argument-based interaction that is flexible and easily understood by human users has been proposed to help doctors make decisions in such situations [17].
It has also been shown that a fuller explanation has a positive effect on users’ trust of such systems and also helps to resolve reliance issues. Better explanations can let users better understand the reasoning chain, thus enhancing their confidence in the system and helping doctors provide better diagnoses [5]. An interactive visual analytics system has also been designed to support interactive dependence diagnostics through feature representation and visualization [24].

Medical Reasoning, Decision Making & Mental Models
Cosby summarized two models of clinical reasoning: analytical and intuitive [8]. The analytical approach is based on the hypothetic-deductive model that is common in scientific research and discovery, whereas the intuitive approach is akin to recognizing common patterns from a patient’s symptoms rather than deliberately going through a methodological decision-making process. Doctors often choose one of these models based on how experienced they are and how complicated a case is.
Also, due to the uniqueness of the medical field, medical reasoning and decision-making mean more than what they mean in other fields. From the doctors’ perspective, explanation of the decision-making process covers not only how the results come out, but also the cost of medical decisions, such as responsibility and risk [6]. In different scenarios, the requirement for explanations also varies. In addition, the decision-making process in the medical field can be regarded as a combination of basic medical knowledge such as pathology, the experience gained from previous patients in similar conditions, and the cognition of the patient’s demographic information. It is a lot more complex than a regular decision-making process and mental model that can be reached by splitting different features with “yes” or “no” [8].
Broadly, the term ‘mental model’ is a concept derived from cognitive psychology. It is the explanation of people’s thought process about how things work [38]. The mental model can also be regarded as an internal representation of external factors, and it is important in cognition, decision making, and reasoning [36]. These internal conceptualizations, including users’ beliefs and understanding about the system behavior, will guide their interaction with the systems [38]. Also, during the interaction, mental models develop individually for different users. In general, most mental models are simpler than the actual systems, and that is sufficient to allow users to understand system behavior [34]. However, when it comes to complex cases, for example medical diagnosis, if mental models cannot reflect the actual complexity of these artificial intelligent systems, users might find it difficult to understand, explain, or predict the system behavior [38]. In order to make users better understand and explain how the system works, the system should be transparent and present a model similar to humans’ mental models [25,26]. Otherwise, users are likely to build flawed mental models when interacting with such systems and be confused about the process of decision making [38]. For systems with improved mental models, users’ satisfaction, perceived control, and overall trust of the system will all be enhanced, which will also facilitate understanding [25].

3 Interview
Even though a lot of research has been done to explain intelligent systems, it seldom looks into specific domains or incorporates empirical knowledge when explaining. We try to understand this problem from the doctors’ perspective, which is why we seek to investigate the following research question:
RQ: How do medical professionals interact with patients’ data for diagnosis and/or treatment purposes?

Overview
We conducted an interview study to explore the research question presented above. Our current sample consists of six licensed medical professionals working in California, United States. Each interview lasted about one hour. During the research process, we iterated between collecting new data, generating codes, and revising/elaborating the existing coding book, as suggested by grounded theory [30]. Findings from these interviews offered insights revealing the relationship between medical professionals, data, and intelligent systems from a human-centered perspective.

Participants and Data Collection

ID | Domain of Expertise/Specialty in the Medical Field | Gender | # of Years
P1 | Pathologist | Male | 22
P2 | Orthopedist | Female | 17
P3 | Neurologist | Male | 7
P4 | Family physician | Male | 10
P5 | General physician | Male | 5
P6 | Cardiologist | Male | 18
Table 1. Background information of our six interviewees, including their participant ID, domain of expertise, gender, and number of years working/studying in the medical field.

All interviewees joined this study by responding to an online participant call posted by the research team. We intentionally looked for participants who hold different domains of expertise within the medical field, so that the interview data can best capture both the commonalities and the differences between the thinking styles of various medical professionals. Table 1 summarizes the background information of each interviewee. For anonymity, we replaced their names with randomly assigned IDs.
Between September and November of 2018, the first author of this paper conducted semi-structured interviews with each participant. The interview protocol was initially developed through in-group brainstorming sessions among the authors of this paper. It was then revised based on two pilot interviews with senior M.D. students at UCLA.
The final protocol consisted of questions revolving around four issues: 1) the interviewee’s work and education experience in the medical field, 2) how s/he accesses, processes, and interprets medical-related data during daily work practice, 3) challenges and solutions s/he has experienced, if any, when working with medical data, and 4) experience and/or expectations of using computer-based systems to facilitate daily medical work. All interviews were conducted face-to-face in English and audio-taped for transcription.

Analysis
Three authors of this paper analyzed the interview data together following an inductive approach. There were 60 codes and 201 quotations generated from the initial open coding. They capture participants’ self-reflections regarding the forms of data they interact with in daily medical work, the thinking process they go through when interacting with various data, the decisions they try to make based on data processing, and the types of work they have been delegating or hope to delegate to computer-based systems. We iteratively discussed and compared codes as they were generated. During the discussion, prioritization emerged as a focal theme from the data. It indicates that a central task medical professionals perform during diagnosis is to prioritize among various and sometimes conflicting information given by patients, other doctors, and computer-based systems. We went through further coding to identify connections between this focal theme and other emergent themes and categories. The following section presents our detailed findings. Words and phrases directly quoted from participants are written in italic.

4 Findings
The process of generating a proper diagnosis and/or treatment plan is frequently described by our interviewees as being context-dependent, data-intensive, and open to alternative possibilities. In many cases, there is no one-to-one correspondence between signs, symptoms, and diseases. Medical professionals in the field, therefore, are often required to integrate various kinds of data and think outside the box, as pointed out by the following two participants:
For medicine, it’s usually the grey area that matters. Everything is hardly black and white, and that’s why it is always difficult. … People say that medicine is both a science and an art, because every disease is different, and every patient’s representation will be different. Every doctor obviously has different steps in making the decision. [P6, Cardiologist]
Authorities, like the American Heart Association, will publish guidelines and flow charts that we can refer to. It prevents physicians from making ridiculous mistakes. But for more complex diseases, the guideline cannot include all of them. It will depend on the doctor’s experience or some innovations to accomplish the treatment. [P4, Family physician]
In the rest of this section, we describe how medical professionals navigate the complexity of their interaction with medical data. We identify three critical steps from interviewees’ reflections, including detecting/reacting to borderline cases, generating prioritization matrices, and coordinating with computer-based systems. Across all these steps, medical professionals keep prioritizing and re-prioritizing among information collected at different stages of the diagnosis process.

Borderline Cases: When Challenges Emerge
All participants of our interview reported running into borderline cases as the moments when the processing and interpreting of medical data turn challenging. One representative situation of encountering borderline cases is when the symptoms are still in their early state:
At the very early state [of cancer], it is difficult to tell if the cell is abnormal. The architecture is minimally disrupted. You may think it is abnormal, but you don’t know whether it is malignant. We will show the cases to other colleagues to get consensus, or we have to say this case is inconclusive. [P1, Pathologist]
In other situations of borderline cases, medical professionals receive conflicting information that indicates different directions of the diagnosis:
Many of us have run into cases when the MRI doesn’t confirm [our diagnosis]. We think the problem is in the right brain, but the image shows nothing there. In that case, we may do the test again. We can also go back to the patient to ask them again, or we discuss with other doctors. [P3, Neurologist]
To clear up the ambiguity indicated by the two quotations above, doctors often need to cross-validate their initial evaluation of the patient by requesting further data. Our interviews with the six medical professionals documented multiple types of such data, including, but not limited to, the patient’s demographic information, cardinal symptoms, results from further physical examinations and lab tests, historical data from reference groups, and evaluations given by other doctors.
Participants in our study yielded similar insights regarding how they deal with the rich yet complex medical data. Instead of following one hard rule of data processing, interviewees tend to weight/interpret each type of data differently based on their personalized prioritization matrices.
Prioritization Matrices: Validity and Beyond
We identified six parameters from participants’ self-reflections that reveal how they perform data prioritization for diagnosis and/or treatment purposes. These parameters are labeled as below:
Theoretical validity. Robustness of connections between signs, symptoms, and diseases as proved by theories, medical textbooks, and guidelines;
Severity of consequence. Quality and quantity of potential consequences if the detected signs/symptoms are put aside at this moment; side-effects of a treatment; interactions between different treatments;
Time constraint. Timing; urgency; sequential order of taking care of different symptoms and diseases;
Domain of expertise. The extent to which the signs and symptoms are connected to the doctor’s specialty; the level of confidence in offering a candidate treatment;
Risk avoidance. Responsibility assigned to a specific doctor; power dynamics between junior vs. senior doctors;
Technical feasibility. The sensitivity of the measurement; reliability of the technique; the false positive/negative rate of symptom detection.
Participants often used styles to describe the detailed prioritization matrices held by different doctors. Similar to other dispositional attributes such as personality, the prioritization matrix of a medical professional is perceived as being self-aware and consistent across the various diagnoses made by the same individual:
The diagnosis depends on many factors – severity, possibility, consistency with the patient’s history, and others. Some doctors will make the most severe issues their priority, others will make the most possible ones their priority. It depends on their perspective. It also depends on the time concern. For example, neurologists may have a longer period of diagnosis, but surgeons and ER doctors don’t. [P2, Orthopedist]
Figure 2. A diagram that compares the prioritization matrices held by two different medical professionals (blue vs. orange line) when making diagnosis decisions. While one doctor uses severity as the primary parameter to weight various data during diagnosis/treatment, the other cares most about the calculation of risks and responsibilities.
Our interviewees sometimes referred to these personalized prioritization matrices (or styles) to explain the disagreement between diagnosis suggestions provided by different doctors (see Figure 2 for illustration).
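The following is a minimal sketch, written by us for illustration only, of how such a prioritization matrix could be represented in software: one weight per parameter from the list above, applied to how a candidate diagnosis or data source scores on each parameter. All names, weights, and scores are hypothetical, and the two weightings stand in for the severity-driven and risk-driven styles compared in Figure 2.

```python
# Illustrative sketch of a prioritization matrix: weights over the six
# parameters identified in the interviews; different weightings model
# different "styles". All values below are hypothetical.
from dataclasses import dataclass

PARAMETERS = [
    "theoretical_validity", "severity_of_consequence", "time_constraint",
    "domain_of_expertise", "risk_avoidance", "technical_feasibility",
]

@dataclass
class PrioritizationMatrix:
    weights: dict  # parameter name -> weight held by one doctor

    def score(self, candidate: dict) -> float:
        """Weighted sum over the six parameters for one candidate diagnosis or data source."""
        return sum(self.weights.get(p, 0.0) * candidate.get(p, 0.0) for p in PARAMETERS)

# Two hypothetical styles (cf. Figure 2): severity-driven vs. risk/responsibility-driven.
severity_first = PrioritizationMatrix({**{p: 1.0 for p in PARAMETERS}, "severity_of_consequence": 3.0})
risk_first = PrioritizationMatrix({**{p: 1.0 for p in PARAMETERS}, "risk_avoidance": 3.0})

candidate = {"theoretical_validity": 0.8, "severity_of_consequence": 0.9, "risk_avoidance": 0.2}
print(severity_first.score(candidate), risk_first.score(candidate))  # same data, different priorities
```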
Coordination Between Medical Professionals & Systems
All medical professionals in our study reported that they have been using computer-based tools and systems to facilitate their daily work practice. Most participants, for instance, have greatly relied on cloud-based platforms to store and connect their local medical data with other databases [P1, P2, P4, P5, P6]. They also used various systems to generate automated calculations of chromosomes [P1], identify the degree of scoliosis [P2], check possible interactions between medications [P5], etc. The primary function of such tools is to “provide quantified information to doctors, but not [to give] answers in terms of high-level decisions” [P3].
While participants were confident that the auto-quantified information given by systems is usually trustworthy and helpful, this optimism does not carry over to their narratives of auto-diagnosis or treatment recommended by systems. The following quotations from P6 indicate a shared attitude reflected across all six interviews:
Some doctors trust images [over other information], like MRI, to tell what’s happening. About 80% of the time you would have good images. You are very confident about the diagnosis from the images. But I think the most important information [to facilitate diagnosis] is what the patient tells you. It helps to track the patient’s history. [P6, Cardiologist]
There is a lot of advanced analysis involving machine learning, and some of them have entered the clinical realm. For example, you will have the nuclear images, and you will have the software telling you “it’s abnormal here and there.” It’s as if you have a second reader next to you. I would love to have the system generating results, but ultimately, it’s you that’s deciding on the diagnosis. When there is a disagreement, me and everyone will be overwriting the machine-generated interpretation. [P6, Cardiologist]
To step forward from quantifying information to directly assisting diagnosis and treatment, systems are expected to “give an argument for why the data should be interpreted in that way” [P5]. The majority of our participants proposed the concept of reference and comparison as one approach to ground the systems’ diagnostic reasoning with that of human doctors:
Any machine has to give an evidence for the top reasons, like in descending order, for why in some matrices. It’s like if I say something and you think differently, then we should be able to really compare the two. Otherwise, it doesn’t matter if the machine’s suggestion is right. I don’t know what its thinking is and, ultimately, I take all the responsibility in this decision. [P1, Pathologist]
There are different ways [to help validate the systems’ diagnosis recommendations]. One is showing me past examples in the database - will that support its conclusion? Another one is sources of data, something like research articles or convincing cases have been done. That’s upper-level evidence. [P5, General physician]
Interviewees further suggested that to build an ideal auto-diagnosis/treatment system, the algorithm should be able to contextualize its reference data with the personalized information of a patient. Such contextualization work is what human doctors are good at, based on their professional training and experience, but it is perceived to be the major obstacle for systems to overcome.

5 Implication for Design
Based on the findings of the preliminary interview, we outline design suggestions for explainable medical AI systems. Specifically, we envisage a system that can:
• Allow a medical professional to prioritize different types and sources of data by directly manipulating a user interface akin to our proposed prioritization matrix (Figure 2);
• Support gradual engagement of medical AI systems in a medical professional’s diagnosis process, spanning from low-level automated measurement tasks, to mid-level constraint-aware planning of medical tests, and to high-level suggestions of plausible diagnoses (see the sketch after this list).
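The sketch below is our own illustration of the second suggestion, not a system described in this paper: it assumes three hypothetical engagement levels and made-up field names, and shows the idea that the system surfaces only what the doctor’s chosen level permits, attaching supporting evidence at the highest level in line with P1’s and P5’s requests.

```python
# Illustrative sketch of "gradual engagement" (hypothetical levels and fields).
from enum import IntEnum

class Engagement(IntEnum):
    MEASUREMENT = 1   # low-level: automated measurements (e.g., degree of scoliosis)
    PLANNING = 2      # mid-level: constraint-aware planning of medical tests
    DIAGNOSIS = 3     # high-level: suggestions of plausible diagnoses, with evidence

def system_output(level: Engagement, case: dict) -> dict:
    """Return only what the chosen engagement level permits for one case."""
    out = {"measurements": case.get("measurements", {})}
    if level >= Engagement.PLANNING:
        out["suggested_tests"] = case.get("suggested_tests", [])
    if level >= Engagement.DIAGNOSIS:
        out["candidate_diagnoses"] = case.get("candidate_diagnoses", [])
        out["supporting_evidence"] = case.get("supporting_evidence", [])  # cf. P1 and P5
    return out

# Usage: the same (made-up) case, surfaced at increasing levels of engagement.
case = {"measurements": {"cobb_angle_deg": 23},
        "suggested_tests": ["MRI, lumbar spine"],
        "candidate_diagnoses": ["adolescent idiopathic scoliosis"],
        "supporting_evidence": ["similar past cases in the database"]}
for level in Engagement:
    print(level.name, system_output(level, case))
```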
ACKNOWLEDGMENTS
We thank Xiaohe Yang for assisting us to complete the interviews. We thank all the anonymous interviewees for their contributions to our study. We also thank Maie St. John, Peter Pellionisz and Jeff Liang for their valuable comments on earlier drafts of this paper.

REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18), 582:1–582:18. https://doi.org/10.1145/3173574.3174156
[2] Victoria Bellotti, Maribeth Back, W. Keith Edwards, Rebecca E. Grinter, Austin Henderson, and Cristina Lopes. 2002. Making sense of sensing systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing Our World, Changing Ourselves (CHI ’02). https://doi.org/10.1145/503447.503450
[3] Victoria Bellotti and Keith Edwards. 2001. Intelligibility and Accountability: Human Considerations in Context-Aware Systems. Human-Computer Interaction 16, 2-4: 193–212. https://doi.org/10.1207/S15327051HCI16234_05
[4] O. Biran and C. Cotton. 2017. Explanation and justification in machine learning: A survey. IJCAI-17 Workshop on Explainable AI (XAI). Retrieved from http://www.intelligentrobots.org/files/IJCAI2017/IJCAI-17_XAI_WS_Proceedings.pdf#page=8
[5] A. Bussone, S. Stumpf, and D. O’Sullivan. 2015. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. In 2015 International Conference on Healthcare Informatics, 160–169. https://doi.org/10.1109/ICHI.2015.26
[6] C. K. Cassel and A. L. Jameton. 1981. Dementia in the elderly: an analysis of medical responsibility. Annals of Internal Medicine 94, 6: 802–807. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/7235423
[7] CHI Conference Committee. 2016. CHI 16 Vol 1: CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. Retrieved from https://market.android.com/details?id=book-MK8IMQAACAAJ
[8] Pat Croskerry, Karen Cosby, Mark L. Graber, and Hardeep Singh. 2017. Diagnosis: Interpreting the Shadows. CRC Press. Retrieved from https://www.taylorfrancis.com/books/9781351652926
[9] Anind K. Dey. 2001. Understanding and Using Context. Personal and Ubiquitous Computing 5, 1: 4–7. https://doi.org/10.1007/s007790170019
[10] Anind K. Dey and Alan Newberger. 2009. Support for context-aware intelligibility and control. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 859–868. https://doi.org/10.1145/1518701.1518832
[11] Nicholas Diakopoulos. 2016. Accountability in Algorithmic Decision Making. Communications of the ACM 59, 2: 56–62. https://doi.org/10.1145/2844110
[12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning, 647–655. Retrieved December 14, 2018 from http://proceedings.mlr.press/v32/donahue14.html
[13] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv [stat.ML]. Retrieved from http://arxiv.org/abs/1702.08608
[14] Paul Dourish. 1995. Developing a reflective model of collaborative systems. ACM Transactions on Computer-Human Interaction 2, 1: 40–63. https://doi.org/10.1145/200968.200970
[15] Wenjuan Fan, Jingnan Liu, Shuwan Zhu, and Panos M. Pardalos. 2018. Investigating the impacting factors for the healthcare professionals to adopt artificial intelligence-based medical diagnosis support system (AIMDSS). Annals of Operations Research. https://doi.org/10.1007/s10479-018-2818-y
[16] M. Fieschi. 2013. Artificial Intelligence in Medicine: Expert Systems. Springer. Retrieved from https://market.android.com/details?id=book-_Vf0BwAAQBAJ
[17] J. Fox, D. Glasspool, D. Grecu, S. Modgil, M. South, and V. Patkar. 2007. Argumentation-Based Inference and Decision Making--A Medical Perspective. IEEE Intelligent Systems 22, 6: 34–41. https://doi.org/10.1109/MIS.2007.102
[18] Bryce Goodman and Seth Flaxman. 2016. European Union regulations on algorithmic decision-making and a “right to explanation.” arXiv [stat.ML]. Retrieved from http://arxiv.org/abs/1606.08813
[19] Tovi Grossman and George Fitzmaurice. 2010. ToolClips: An Investigation of Contextual Video Assistance for Functionality Understanding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’10), 1515–1524. https://doi.org/10.1145/1753326.1753552
[20] D. Gunning. 2016. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency.
[21] Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. 2017. What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923. Retrieved from https://arxiv.org/abs/1712.09923
[22] Saurabh Jha and Eric J. Topol. 2016. Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA: The Journal of the American Medical Association 316, 22: 2353–2354. https://doi.org/10.1001/jama.2016.17438
[23] Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. 2017. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 4: 230–243. https://doi.org/10.1136/svn-2017-000101
[24] Josua Krause, Adam Perer, and Kenney Ng. 2016. Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 5686–5697. https://doi.org/10.1145/2858036.2858529
[25] Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. 2012. Tell Me More?: The Effects of Mental Model Soundness on Personalizing an Intelligent Agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’12), 1–10. https://doi.org/10.1145/2207676.2207678
[26] T. Kulesza, S. Stumpf, M. Burnett, W. Wong, Y. Riche, T. Moore, I. Oberst, A. Shinsel, and K. McIntosh. 2010. Explanatory Debugging: Supporting End-User Debugging of Machine-Learned Programs. In 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, 41–48. https://doi.org/10.1109/VLHCC.2010.15
[27] Zachary C. Lipton. 2016. The Mythos of Model Interpretability. arXiv [cs.LG]. Retrieved from http://arxiv.org/abs/1606.03490
[28] Tania Lombrozo. 2010. Causal–explanatory pluralism: How intentions, functions, and mechanisms influence causal ascriptions. Cognitive Psychology 61, 4: 303–332. https://doi.org/10.1016/j.cogpsych.2010.05.002
[29] Tania Lombrozo and Susan Carey. 2006. Functional explanation and the function of explanation. Cognition 99, 2: 167–204. https://doi.org/10.1016/j.cognition.2004.12.009
[30] Donna Reseigh Long. 1993. Basics of qualitative research: Grounded theory procedures and techniques.
[31] William van Melle, Edward H. Shortliffe, and Bruce G. Buchanan. 1984. EMYCIN: A knowledge engineer’s tool for constructing rule-based expert systems. Rule-based expert systems: The MYCIN experiments of the Stanford Heuristic Programming Project: 302–313. Retrieved from http://www.aaai.org/Papers/Buchanan/Buchanan17.pdf
[32] D. Douglas Miller and Eric W. Brown. 2018. Artificial Intelligence in Medical Practice: The Question to the Answer? The American Journal of Medicine 131, 2: 129–133. https://doi.org/10.1016/j.amjmed.2017.10.035
[33] Tim Miller. 2017. Explanation in Artificial Intelligence: Insights from the Social Sciences. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1706.07269
[34] Neville Moray. 1987. Intelligent aids, mental models, and the theory of machines. International Journal of Man-Machine Studies 27, 5: 619–629. https://doi.org/10.1016/S0020-7373(87)80020-2
[35] Travis B. Murdoch and Allan S. Detsky. 2013. The inevitable application of big data to health care. JAMA: The Journal of the American Medical Association 309, 13: 1351–1352. https://doi.org/10.1001/jama.2013.393
[36] Jakob Nielsen. 2010. Mental models. Jakob Nielsen’s Alertbox.
[37] Don Norman. 2009. The Design of Future Things. Basic Books. Retrieved from https://market.android.com/details?id=book-aeIUGq1EL24C
[38] Donald A. Norman. 2014. Some observations on mental models. In Mental Models. Psychology Press, 15–22. Retrieved from https://www.taylorfrancis.com/books/e/9781317769408/chapters/10.4324%2F9781315802725-5
[39] Seong Ho Park and Kyunghwa Han. 2018. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 286, 3: 800–809. https://doi.org/10.1148/radiol.2017171920
[40] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. 2017. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv [cs.CV]. Retrieved from http://arxiv.org/abs/1711.05225
[41] Raymond Reiter. 1987. A theory of diagnosis from first principles. Artificial Intelligence 32, 1: 57–95. https://doi.org/10.1016/0004-3702(87)90062-2
[42] B. Schilit, N. Adams, and R. Want. 1994. Context-aware computing applications. In Workshop on Mobile Computing Systems and Applications, 85–90. https://doi.org/10.1109/MCSA.1994.512740
[43] Ben Shneiderman. 2016. Opinion: The dangers of faulty, biased, or malicious algorithms requires independent oversight. Proceedings of the National Academy of Sciences of the United States of America 113, 48: 13538–13540. https://doi.org/10.1073/pnas.1618211113
[44] Ben Shneiderman, Catherine Plaisant, Maxine Cohen, Steven Jacobs, Niklas Elmqvist, and Nicholas Diakopoulos. 2016. Grand Challenges for HCI Researchers. Interactions 23, 5: 24–25. https://doi.org/10.1145/2977645
[45] Lucy A. Suchman. 1987. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press. Retrieved from https://market.android.com/details?id=book-AJ_eBJtHxmsC
[46] William R. Swartout. 1983. XPLAIN: A system for creating and explaining expert consulting programs. University of Southern California, Marina del Rey, Information Sciences Institute. Retrieved from http://www.dtic.mil/docs/citations/ADA130597
[47] Jo Vermeulen, Kris Luyten, Karin Coninx, and Nicolai Marquardt. 2014. The design of slow-motion feedback. In Proceedings of the 2014 Conference on Designing Interactive Systems, 267–270. https://doi.org/10.1145/2598510.2598604
[48] Jo Vermeulen, Kris Luyten, Elise van den Hoven, and Karin Coninx. 2013. Crossing the Bridge over Norman’s Gulf of Execution: Revealing Feedforward’s True Identity. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13), 1931–1940. https://doi.org/10.1145/2470654.2466255
[49] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
[50] 2001. UIST ’01: Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, Orlando, Florida, November 11-14, 2001. Association for Computing Machinery. Retrieved from https://market.android.com/details?id=book-YLtQAAAAYAAJ