Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis

Yao Xie, UCLA ECE, Los Angeles, California (yaoxie@g.ucla.edu)
Xiang ‘Anthony’ Chen, UCLA ECE, Los Angeles, California (xac@ucla.edu)
Ge Gao, University of Maryland iSchool, College Park, Maryland (gegao@umd.edu)

ABSTRACT
The adoption of intelligent systems creates opportunities as well as challenges for medical work. On the positive side, intelligent systems have the potential to compute complex data from patients and generate automated diagnosis recommendations for doctors. However, medical professionals often perceive such systems as “black boxes” and, therefore, feel concerned about relying on system-generated results to make decisions. In this paper, we contribute to the ongoing discussion of explainable artificial intelligence (XAI) by exploring the concept of explanation from a human-centered perspective. We hypothesize that medical professionals would perceive a system as explainable if the system were designed to think and act like doctors. We report a preliminary interview study that collected six medical professionals’ reflections on how they interact with data for diagnosis and treatment purposes. Our data reveal when and how doctors prioritize among various types of data as a central part of their diagnosis process. Based on these findings, we outline future directions for the design of XAI systems in the medical context.

CCS CONCEPTS
• Human-centered computing ~ Interactive systems and tools
• Human-centered computing ~ HCI design and evaluation methods

KEYWORDS
Explainable artificial intelligence; human-centered design; medical data; system design.

ACM Reference format:
Yao Xie, Xiang ‘Anthony’ Chen and Ge Gao. 2019. Outlining the Design Space of Explainable Intelligent Systems for Medical Diagnosis. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

IUI Workshop’19, March 20, 2019, Los Angeles, USA. Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 Introduction
Intelligent systems, computational agents that employ algorithms to process and make sense of data, are becoming increasingly ubiquitous in modern workplaces [1]. Despite their promise of assisting human decision making through a data-driven approach, non-computing professionals often find it challenging to understand how such a system transforms their initial input into a final decision, and why.

In the medical field, systems such as CheXNet [40] have been developed to interpret a patient’s chest X-ray scan using deep learning. While the system can perform faster than human doctors with impressive accuracy, it offers little clue about what happens within the “black box”. Human doctors, who hold the medical responsibility, can hardly trust the system’s results without understanding its underlying decision-making process [21].

[Figure 1. The input and output image of CheXNet [40].]

To help non-computing professionals better comprehend results generated by intelligent systems, a growing body of research has been conducted with the goal of building explainable AI (XAI). It provides various system-centric solutions, such as developing accountable and transparent algorithms [11,43], visualizing obscure features [12,49], and employing theories from cognitive psychology to explore effective explanations [28,29,32]. The current limitation of these approaches is the lack of empirical evidence that they support understanding by domain professionals [24].

In this project, we tackle the challenge of XAI from a user-centric perspective. We identify the medical domain as the focus of our research, given the proliferation of AI-powered diagnosis systems in recent years. We hypothesize that human doctors will find a system more explainable when the system ‘speaks the language’ of a doctor and ‘thinks like’ a doctor.

The remainder of this paper presents our first step toward the design of an explainable AI system that takes the perspective of medical professionals. We first review prior research on XAI, intelligent systems in the medical field, and the mental models of medical professionals. We then report a preliminary interview study with six doctors that shows how medical professionals interact with data for diagnosis and treatment purposes in their daily work practice. Based on findings from the interview, we discuss how interaction designers can incorporate human doctors’ data processing models into medical intelligent systems and make such systems more explainable for their users.
2 Background & Related Work
In this section, we first lay out a background review of XAI research and then zoom into HCI-oriented approaches to XAI. Since our field of focus is medicine, we further discuss prior work on medical AI and, specific to our interest, the literature on the reasoning processes of medical professionals.

Explainable Artificial Intelligence (XAI) Systems
Explainable artificial intelligence (XAI) has attracted much attention in recent years [20]. Since the 1970s, researchers have worked on explanations for expert intelligent systems [31,46]. Recently, the need for explainable artificial intelligence has been raised again because of the development of machine learning; in particular, algorithms like deep learning are intrinsically difficult to understand, which creates the need for more explainable systems.

A lot of work on interpretable machine learning explains the inner principles of machine learning models with mathematical and algorithmic solutions [4]. The main methods of interpretable machine learning are explanations of complex algorithms like deep learning, causal inference, Bayesian rules, and visual analytics. Algorithmic accountability means that the algorithm should explain its decisions; the “right to explanation” law in the EU is one example [18]. Planning oversight, retrospective analyses, and continuous review are needed to make algorithms accountable [19]. However, there are still many challenges in XAI. Lipton [27] proposed a taxonomy of the reasons for interpretability as well as ways to interpret, but there is still no consensus about the definition of interpretability. Other researchers have studied how to evaluate whether a system is interpretable and proposed evaluation methods [13]. Attempts have also been made to map intelligibility, interpretable algorithms, and explainability onto the related work. In social science, researchers also study how people define, select, generate, evaluate, and express explanations [33].

Intelligibility and Explainable Systems Research in HCI
In HCI, researchers focus on users’ interaction with intelligent systems, and explanation is one important topic. Artificial intelligent systems have been criticized for rigid concepts that are incompatible with human behavior [45]. Explainable AI in HCI contains topics including context awareness, cognitive psychology, and software learnability [44]. Context awareness is used to recognize user reactions and activities; in the early 2000s it attracted a lot of attention with the development of mobile devices and sensors [9,42]. People should understand what is sensed and what reaction is taken in a specific situation; a context-aware system should let users know “what they know, how they know it and what they are going to do next” [3]. Simplistic representations of context have been called for in explainable AI so that users are aware of what is captured and which action the system will take [14]. Cognitive psychology contributes explanation theory: Lombrozo studied cognitive explanations [28] and found that explanation is strongly connected with causal reasoning. XAI thus concerns not only human cognitive psychology but also the understanding of social context [33]. Software learnability is an important part of usability; it focuses on helping users operate complex software applications with the help of demonstrations or in-context videos [19], and it evaluates the ease of using a system.

Systems need to provide users with not only results but also an account of their behaviors [3]. Researchers have built tailored interfaces that provide visual or textual explanations for context-aware rules [10], studied interaction design strategies, and explored how to help users predict system behavior through feedforward [2,3,47,48]. How users understand and control machine learning programs is another relevant trend, which works toward debuggable and intelligible machine learning [50]. Understandability and predictability are especially important in artificial intelligence applications such as autonomous vehicles [37]. Besides algorithmic accountability, transparency, and fairness, data visualization is another stream within the computational perspective of HCI, one that appears isolated from what machine learning researchers do [7].

Intelligent Systems in Medical Fields
In the medical field, artificial intelligent systems also have broad prospects. With the growing availability of medical data and data processing techniques, such systems can be applied in the healthcare domain. They are able to dig out useful information from large amounts of data that are difficult for doctors to process and can thus assist medical decision making [16,35]. In medicine, they have three major applications: early detection, diagnosis, and treatment planning. They can assist the diagnostic process for diseases in areas such as cardiology, cancer, and neurology [23]. Research in medical artificial intelligence focuses mainly on pathology and radiology. For example, systems can identify radiographs, recognize patterns for radiologists and pathologists, and work as an information specialist during the diagnostic process [22]. Besides image analysis applications in radiology and pathology, artificial intelligent systems have also been applied to read the medical scientific literature and to integrate electronic medical records. In addition, they may optimize and predict the treatment of chronic diseases [32].

However, compared to the booming industry, the actual usage of auto-diagnostic systems in hospitals is relatively low. One study investigated doctors’ acceptance of, and intention to adopt, these systems [15]. Another proposed methods for evaluating the clinical performance and effect of artificial intelligent systems in medical diagnosis, one of which concerns explanations [39].

Explanation capabilities based on knowledge bases were first added to artificial intelligence systems for medical decision making and computer-aided diagnosis in 1983 [46]. After that, a theory of diagnostic reasoning was used to find the system components that lead to, and explain, the discrepancy between expected results and observed behaviors [41]; it applies to a variety of settings, including medical auto-diagnostic systems.
Furthermore, how doctors make decisions under uncertainty and information overload has raised a lot of concern. An argument-based interaction that is flexible and easily understood by human users has been proposed to help doctors make such decisions [17]. It has also been shown that a fuller explanation has a positive effect on users’ trust in such systems and helps resolve reliance issues. Better explanations let users understand the reasoning chain, thus increasing confidence in the system and helping doctors provide better diagnoses [5]. An interactive visual analytics system has also been designed to support interactive dependence diagnostics through feature representation and visualization [24].

Medical Reasoning, Decision Making & Mental Models
Cosby summarized two models of clinical reasoning: analytical and intuitive [8]. The analytical approach is based on the hypothetico-deductive model that is common in scientific research and discovery, whereas the intuitive approach is akin to recognizing common patterns in a patient’s symptoms rather than deliberately going through a methodical decision-making process. Doctors often choose one of these models based on how experienced they are and how complicated a case is.

Due to the uniqueness of the medical field, medical reasoning and decision making also mean more than they do in other fields. From the doctors’ perspective, an explanation of the decision-making process covers not only how the results come out but also the cost of medical decisions, such as responsibility and risk [6]. The requirement for explanations also varies across scenarios. In addition, the decision-making process in the medical field can be regarded as a combination of basic medical knowledge such as pathology, the experience gained from previous patients in similar conditions, and awareness of the patient’s demographic information. It is far more complex than a regular decision-making process or a mental model that can be reached by splitting different features into “yes” or “no” [8].

Broadly, the term ‘mental model’ is a concept derived from cognitive psychology. It is the explanation of people’s thought processes about how things work [38]. A mental model can also be regarded as an internal representation of external factors, and it is important in cognition, decision making, and reasoning [36]. These internal conceptualizations, including users’ beliefs and understanding about system behavior, guide their interaction with systems [38]. During the interaction, mental models develop individually for different users. In general, most mental models are simpler than the actual systems, and this is sufficient to allow users to understand system behavior [34]. However, in complex cases such as medical diagnosis, if mental models cannot reflect the actual complexity of artificial intelligent systems, users may find it difficult to understand, explain, or predict system behavior [38]. To help users better understand and explain how a system works, the system should be transparent and present a model similar to human mental models [25,26]. Otherwise, users are likely to build flawed mental models when interacting with such systems and be confused about the decision-making process [38]. For systems with improved mental models, users’ satisfaction, perceived control, and overall trust in the system are all enhanced, which in turn facilitates understanding [25].

3 Interview
Although a lot of research has been done to explain intelligent systems, it seldom looks into specific domains or incorporates empirical knowledge into the explanation. We try to understand this problem from the doctors’ perspective, which is why we investigate the following research question:

RQ: How do medical professionals interact with patients’ data for diagnosis and/or treatment purposes?

Overview
We conducted an interview study to explore the research question presented above. Our current sample consists of six licensed medical professionals working in California, United States. Each interview lasted about one hour. During the research process, we iterated between collecting new data, generating codes, and revising/elaborating the existing coding book, as suggested by grounded theory [30]. Findings from these interviews offered insights into the relationship between medical professionals, data, and intelligent systems from a human-centered perspective.

Participants and Data Collection

ID | Domain of Expertise/Specialty | Gender | # of Years in the Medical Field
P1 | Pathologist        | Male   | 22
P2 | Orthopedist        | Female | 17
P3 | Neurologist        | Male   | 7
P4 | Family physician   | Male   | 10
P5 | General physician  | Male   | 5
P6 | Cardiologist       | Male   | 18

Table 1. Background information of our six interviewees, including their participant ID, domain of expertise, gender, and number of years working/studying in the medical field.

All interviewees joined this study by responding to an online participant call posted by the research team. We intentionally looked for participants who hold different domains of expertise within the medical field, so that the interview data could best capture both the commonalities and the differences between the thinking styles of various medical professionals. Table 1 summarizes the background information of each interviewee. For anonymity, we replaced their names with randomly assigned IDs.

Between September and November of 2018, the first author of this paper conducted semi-structured interviews with each participant. The interview protocol was initially developed through brainstorming sessions among the authors of this paper.
It was then revised based on two pilot interviews with senior M.D. students at UCLA. The final protocol consisted of questions revolving around four issues: 1) the interviewee’s work and education experience in the medical field, 2) how s/he accesses, processes, and interprets medical data during daily work practice, 3) challenges and solutions s/he has experienced, if any, when working with medical data, and 4) experience with and/or expectations of using computer-based systems to facilitate daily medical work. All interviews were conducted face-to-face in English and audio-taped for transcription.

Analysis
Three authors of this paper analyzed the interview data together following an inductive approach. The initial open coding generated 60 codes and 201 quotations. They capture participants’ self-reflections regarding the forms of data they interact with in daily medical work, the thinking processes they go through when interacting with various data, the decisions they try to make based on data processing, and the types of work they have been delegating, or hope to delegate, to computer-based systems. We iteratively discussed and compared codes as they were generated. During the discussion, prioritization emerged as a focal theme from the data: a central task medical professionals perform during diagnosis is to prioritize among various, and sometimes conflicting, pieces of information given by patients, other doctors, and computer-based systems. We then went through further coding to identify connections between this focal theme and other emerging themes and categories. The following section presents our detailed findings. Words and phrases directly quoted from participants are written in italics.

4 Findings
The process of generating a proper diagnosis and/or treatment plan was frequently described by our interviewees as context-dependent, data-intensive, and open to alternative possibilities. In many cases, there is no one-to-one correspondence between signs, symptoms, and diseases. Medical professionals, therefore, are often required to integrate various kinds of data and think outside the box, as pointed out by the following two participants:

For medicine, it’s usually the grey area that matters. Everything is hardly black and white, and that’s why it is always difficult. … People say that medicine is both a science and an art, because every disease is different, and every patient’s representation will be different. Every doctor obviously has different steps in making the decision. [P6, Cardiologist]

Authorities, like the American Heart Association, will publish guidelines and flow charts that we can refer to. It prevents physicians from making ridiculous mistakes. But for more complex diseases, the guideline cannot include all of them. It will depend on the doctor’s experience or some innovations to accomplish the treatment. [P4, Family physician]

In the rest of this section, we describe how medical professionals navigate the complexity of their interaction with medical data. We identify three critical steps from interviewees’ reflections: detecting/reacting to borderline cases, generating prioritization matrices, and coordinating with computer-based systems. Across all these steps, medical professionals keep prioritizing and re-prioritizing among information collected at different stages of the diagnosis process.

Borderline Cases: When Challenges Emerge
All participants in our interviews reported running into borderline cases as the moments when processing and interpreting medical data turn challenging. One representative situation is when the symptoms are still in their early state:

At the very early state [of cancer], it is difficult to tell if the cell is abnormal. The architecture is minimally disrupted. You may think it is abnormal, but you don’t know whether it is malignance. We will show the cases to other colleagues to get consensus, or we have to say this case is inconclusive. [P1, Pathologist]

In other borderline situations, medical professionals receive conflicting information that points the diagnosis in different directions:

Many of us have run into cases when the MRI doesn’t confirm [our diagnosis]. We think the problem is in the right brain, but the image shows nothing there. In that case, we may do the test again. We can also go back to the patient to ask them again, or we discuss with other doctors. [P3, Neurologist]

To clear up the ambiguity indicated by the two quotations above, doctors often need to cross-validate their initial evaluation of the patient by requesting further data. Our interviews with the six medical professionals documented multiple types of such data, including, but not limited to, the patient’s demographic information, cardinal symptoms, results from further physical examinations and lab tests, historical data from reference groups, and evaluations given by other doctors.

Participants yielded similar insights regarding how they deal with this rich yet complex medical data. Instead of following one hard rule of data processing, interviewees tend to weight and interpret each type of data differently based on their personalized prioritization matrices.
Prioritization Matrices: Validity and Beyond
We identified six parameters from participants’ self-reflections that reveal how they prioritize data for diagnosis and/or treatment purposes. These parameters are labeled as follows (a speculative encoding of such a matrix is sketched at the end of this subsection):

Theoretical validity. Robustness of the connections between signs, symptoms, and diseases, as established by theories, medical textbooks, and guidelines;

Severity of consequence. Quality and quantity of the potential consequences if the detected signs/symptoms are put aside at this moment; side effects of a treatment; interactions between different treatments;

Time constraint. Timing; urgency; the sequential order of attending to different symptoms and diseases;

Domain of expertise. The extent to which the signs and symptoms connect to the doctor’s specialty; the level of confidence in offering a candidate treatment;

Risk avoidance. Responsibility assigned to a specific doctor; power dynamics between junior vs. senior doctors;

Technical feasibility. The sensitivity of the measurement; the reliability of the technique; the false positive/negative rate of symptom detection.

Participants often used styles to describe the detailed prioritization matrices held by different doctors. Similar to other dispositional attributes such as personality, the prioritization matrix of a medical professional is perceived as self-aware and consistent across the various diagnoses made by the same individual:

The diagnosis depends on many factors – severity, possibility, consistency with the patient’s history, and others. Some doctors will make the most severe issues on the priority, others will make the most possible ones their priority. It depends on their perspective. It also depends on the time concern. For example, neurologists may have a longer period of diagnosis, but surgeons and ER doctors don’t. [P2, Orthopedist]

Some doctors trust images [over other information], like MRI, to tell what’s happening. About 80% of the time you would have good images. You are very confident about the diagnosis from the images. But I think most important information [to facilitate diagnosis] is what the patient tells you. It helps to track the patient’s history. [P6, Cardiologist]

Our interviewees sometimes referred to these personalized prioritization matrices (or styles) to explain disagreements between the diagnosis suggestions provided by different doctors (see Figure 2 for an illustration).

[Figure 2. A diagram comparing the prioritization matrices held by two different medical professionals (blue vs. orange line) when making diagnosis decisions. While one doctor uses severity as the primary parameter to weight various data during diagnosis/treatment, the other cares most about the calculation of risks and responsibilities.]
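To make the notion of a prioritization matrix concrete, the following is a minimal sketch, assuming a hypothetical encoding we devised for illustration: the parameter names follow the six labels above, but the weights, the 0–1 evidence ratings, and the weighted-sum scoring rule are our own assumptions, not something our participants specified.

```python
from dataclasses import dataclass, field

# The six prioritization parameters identified in our interviews.
PARAMETERS = [
    "theoretical_validity", "severity_of_consequence", "time_constraint",
    "domain_of_expertise", "risk_avoidance", "technical_feasibility",
]

@dataclass
class PrioritizationMatrix:
    """A doctor's personal 'style': one weight per parameter (hypothetical encoding)."""
    weights: dict = field(default_factory=dict)

    def score(self, evidence: dict) -> float:
        # 'evidence' maps each parameter to a 0..1 rating of one piece of data
        # (e.g., an MRI finding); the weighted sum ranks competing data sources.
        return sum(self.weights.get(p, 0.0) * evidence.get(p, 0.0) for p in PARAMETERS)

# Two doctors with the contrasting styles illustrated in Figure 2:
severity_first = PrioritizationMatrix({
    "severity_of_consequence": 0.5, "theoretical_validity": 0.2, "time_constraint": 0.1,
    "domain_of_expertise": 0.1, "risk_avoidance": 0.05, "technical_feasibility": 0.05})
risk_first = PrioritizationMatrix({
    "risk_avoidance": 0.5, "severity_of_consequence": 0.15, "theoretical_validity": 0.15,
    "time_constraint": 0.1, "domain_of_expertise": 0.05, "technical_feasibility": 0.05})

mri_finding = {"theoretical_validity": 0.8, "severity_of_consequence": 0.9,
               "technical_feasibility": 0.7, "risk_avoidance": 0.3}
# The same evidence ranks differently under the two styles:
print(severity_first.score(mri_finding), risk_first.score(mri_finding))
```

Such an explicit encoding is one way a system could surface, and let a doctor adjust, the weighting behind its suggestions; we return to this idea in Section 5.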
Coordination Between Medical Professionals & Systems
All medical professionals in our study reported that they have been using computer-based tools and systems to facilitate their daily work practice. Most participants, for instance, rely heavily on cloud-based platforms to store their local medical data and connect it with other databases [P1, P2, P4, P5, P6]. They also use various systems to generate automated counts of chromosomes [P1], identify the degree of scoliosis [P2], check possible interactions between medications [P5], etc. The primary function of such tools is to “provide quantified information to doctors, but not [to give] answers in terms of high-level decisions [P3]”.

While participants were confident that the auto-quantified information given by systems is usually trustworthy and helpful, this optimism did not carry over to their narratives of auto-diagnoses or treatments recommended by systems. The following quotation from P6 reflects an attitude shared across all six interviews:
There is a lot of advanced analysis involving machine learning, and some of them have entered the clinical realm. For example, you will have the nuclear images, and you will have the software telling you “it’s abnormal here and there.” It’s as if you have a second reader next to you. I would love to have the system generating results, but ultimately, it’s you that’s deciding on the diagnosis. When there is a disagreement, me and everyone will be overwriting the machine-generated interpretation. [P6, Cardiologist]

To step forward from quantifying information to directly assisting diagnosis and treatment, systems are expected to “give an argument for why the data should be interpreted in that way [P5]”. The majority of our participants proposed the concept of reference and comparison as one approach to ground the systems’ diagnostic reasoning in that of human doctors:

Any machine has to give an evidence for the top reasons like in descending order for why in some matrices. It’s like if I say something and you think differently, then we should be able to really compare the two. Otherwise, it doesn’t matter if the machine’s suggestion is right. I don’t know what its thinking is and, ultimately, I take all the responsibility in this decision. [P1, Pathologist]

There are different ways [to help validate the systems’ diagnosis recommendations]. One is showing me past examples in the database - will that support its conclusion? Another one is sources of data, something like research articles or convincing cases have been done. That’s upper-level evidence. [P5, General physician]

Interviewees further suggested that to build an ideal auto-diagnosis/treatment system, the algorithm should be able to contextualize its reference data with personalized information about a patient. Such contextualization is what human doctors are good at, based on their professional training and experience, but it is perceived to be the major obstacle for systems to overcome.
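One way to read the reference-and-comparison idea is as case-based evidence retrieval: the system justifies a suggested diagnosis by surfacing the most similar past cases that share its conclusion. The sketch below is a speculative illustration of that idea under our own assumptions (the PastCase structure, the feature vectors, and the cosine-similarity measure are hypothetical), not a description of any deployed system.

```python
import math
from dataclasses import dataclass

@dataclass
class PastCase:
    case_id: str
    features: list   # e.g., normalized lab values and demographics (assumed encoding)
    diagnosis: str   # the confirmed outcome of this historical case

def cosine(a: list, b: list) -> float:
    # Cosine similarity between two feature vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def supporting_evidence(patient: list, suggestion: str, database: list, k: int = 3):
    """Return the k past cases most similar to this patient that share the
    system's suggested diagnosis -- the 'past examples in the database' P5 asked for."""
    matches = [c for c in database if c.diagnosis == suggestion]
    return sorted(matches, key=lambda c: cosine(patient, c.features), reverse=True)[:k]

db = [PastCase("c1", [0.9, 0.2, 0.4], "pneumonia"),
      PastCase("c2", [0.8, 0.3, 0.5], "pneumonia"),
      PastCase("c3", [0.1, 0.9, 0.2], "scoliosis")]
for case in supporting_evidence([0.85, 0.25, 0.45], "pneumonia", db):
    print(case.case_id)  # evidence shown alongside the machine's suggestion
```

Presenting retrieved cases (or literature, as P5 suggested) next to a recommendation lets the doctor compare the machine’s reasoning with their own, rather than accept an unexplained verdict.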
5 Implications for Design
Based on the findings of the preliminary interview, we outline design suggestions for explainable medical AI systems. Specifically, we envisage a system that can:

• Allow a medical professional to prioritize different types and sources of data by directly manipulating a user interface akin to our proposed prioritization matrix (Figure 2);

• Support gradual engagement of medical AI systems in a medical professional’s diagnosis process, spanning from low-level automated measurement tasks, to mid-level constraint-aware planning of medical tests, to high-level suggestions of plausible diagnoses (see the sketch below).
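As a rough illustration of the second suggestion, the sketch below gates what an assistant surfaces by an engagement level chosen by the doctor. The level names mirror the low/mid/high distinction above, while the Engagement enum, the output keys, and the gating function are hypothetical constructs of ours, not a specification.

```python
from enum import Enum

class Engagement(Enum):
    MEASUREMENT = 1    # low level: automated measurements only
    TEST_PLANNING = 2  # mid level: constraint-aware planning of medical tests
    DIAGNOSIS = 3      # high level: suggestions of plausible diagnoses

def surface(outputs: dict, level: Engagement) -> dict:
    """Expose only the outputs permitted at the doctor's chosen engagement level."""
    allowed = {
        Engagement.MEASUREMENT: ["measurements"],
        Engagement.TEST_PLANNING: ["measurements", "test_plan"],
        Engagement.DIAGNOSIS: ["measurements", "test_plan", "diagnoses"],
    }[level]
    return {key: outputs[key] for key in allowed if key in outputs}

outputs = {"measurements": {"cobb_angle_deg": 23.5},
           "test_plan": ["follow-up X-ray in 6 months"],
           "diagnoses": ["mild scoliosis"]}
print(surface(outputs, Engagement.MEASUREMENT))  # a cautious doctor sees measurements only
```

The design intent is that a doctor can start at the measurement level, where our participants already trust system output, and opt into higher-level assistance as the system earns that trust.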
ACKNOWLEDGMENTS
We thank Xiaohe Yang for assisting us to complete the interviews. We thank all the anonymous interviewees for their contributions to our study. We also thank Maie St. John, Peter Pellionisz and Jeff Liang for their valuable comments on earlier drafts of this paper.

REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18), 582:1–582:18. https://doi.org/10.1145/3173574.3174156
[2] Victoria Bellotti, Maribeth Back, W. Keith Edwards, Rebecca E. Grinter, Austin Henderson, and Cristina Lopes. 2002. Making sense of sensing systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’02). https://doi.org/10.1145/503447.503450
[3] Victoria Bellotti and Keith Edwards. 2001. Intelligibility and Accountability: Human Considerations in Context-Aware Systems. Human-Computer Interaction 16, 2-4: 193–212. https://doi.org/10.1207/S15327051HCI16234_05
[4] O. Biran and C. Cotton. 2017. Explanation and justification in machine learning: A survey. IJCAI-17 Workshop on Explainable AI (XAI). Retrieved from http://www.intelligentrobots.org/files/IJCAI2017/IJCAI-17_XAI_WS_Proceedings.pdf#page=8
[5] A. Bussone, S. Stumpf, and D. O’Sullivan. 2015. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. In 2015 International Conference on Healthcare Informatics, 160–169. https://doi.org/10.1109/ICHI.2015.26
[6] C. K. Cassel and A. L. Jameton. 1981. Dementia in the elderly: an analysis of medical responsibility. Annals of Internal Medicine 94, 6: 802–807. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/7235423
[7] CHI Conference Committee. 2016. CHI 16 Vol 1: CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. Retrieved from https://market.android.com/details?id=book-MK8IMQAACAAJ
[8] Pat Croskerry, Karen Cosby, Mark L. Graber, and Hardeep Singh. 2017. Diagnosis: Interpreting the Shadows. CRC Press. Retrieved from https://www.taylorfrancis.com/books/9781351652926
[9] Anind K. Dey. 2001. Understanding and Using Context. Personal and Ubiquitous Computing 5, 1: 4–7. https://doi.org/10.1007/s007790170019
[10] Anind K. Dey and Alan Newberger. 2009. Support for context-aware intelligibility and control. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 859–868. https://doi.org/10.1145/1518701.1518832
[11] Nicholas Diakopoulos. 2016. Accountability in Algorithmic Decision Making. Communications of the ACM 59, 2: 56–62. https://doi.org/10.1145/2844110
[12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning, 647–655. Retrieved December 14, 2018 from http://proceedings.mlr.press/v32/donahue14.html
[13] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv [stat.ML]. Retrieved from http://arxiv.org/abs/1702.08608
[14] Paul Dourish. 1995. Developing a reflective model of collaborative systems. ACM Transactions on Computer-Human Interaction 2, 1: 40–63. https://doi.org/10.1145/200968.200970
[15] Wenjuan Fan, Jingnan Liu, Shuwan Zhu, and Panos M. Pardalos. 2018. Investigating the impacting factors for the healthcare professionals to adopt artificial intelligence-based medical diagnosis support system (AIMDSS). Annals of Operations Research. https://doi.org/10.1007/s10479-018-2818-y
[16] M. Fieschi. 2013. Artificial Intelligence in Medicine: Expert Systems. Springer. Retrieved from https://market.android.com/details?id=book-_Vf0BwAAQBAJ
[17] J. Fox, D. Glasspool, D. Grecu, S. Modgil, M. South, and V. Patkar. 2007. Argumentation-Based Inference and Decision Making--A Medical Perspective. IEEE Intelligent Systems 22, 6: 34–41. https://doi.org/10.1109/MIS.2007.102
[18] Bryce Goodman and Seth Flaxman. 2016. European Union regulations on algorithmic decision-making and a “right to explanation.” arXiv [stat.ML]. Retrieved from http://arxiv.org/abs/1606.08813
[19] Tovi Grossman and George Fitzmaurice. 2010. ToolClips: An Investigation of Contextual Video Assistance for Functionality Understanding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’10), 1515–1524. https://doi.org/10.1145/1753326.1753552
[20] D. Gunning. 2016. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency.
[21] Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. 2017. What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923. Retrieved from https://arxiv.org/abs/1712.09923
[22] Saurabh Jha and Eric J. Topol. 2016. Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA: The Journal of the American Medical Association 316, 22: 2353–2354. https://doi.org/10.1001/jama.2016.17438
[23] Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. 2017. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 4: 230–243. https://doi.org/10.1136/svn-2017-000101
[24] Josua Krause, Adam Perer, and Kenney Ng. 2016. Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 5686–5697. https://doi.org/10.1145/2858036.2858529
[25] Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. 2012. Tell Me More?: The Effects of Mental Model Soundness on Personalizing an Intelligent Agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’12), 1–10. https://doi.org/10.1145/2207676.2207678
[26] T. Kulesza, S. Stumpf, M. Burnett, W. Wong, Y. Riche, T. Moore, I. Oberst, A. Shinsel, and K. McIntosh. 2010. Explanatory Debugging: Supporting End-User Debugging of Machine-Learned Programs. In 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, 41–48. https://doi.org/10.1109/VLHCC.2010.15
[27] Zachary C. Lipton. 2016. The Mythos of Model Interpretability. arXiv [cs.LG]. Retrieved from http://arxiv.org/abs/1606.03490
[28] Tania Lombrozo. 2010. Causal–explanatory pluralism: How intentions, functions, and mechanisms influence causal ascriptions. Cognitive Psychology 61, 4: 303–332. https://doi.org/10.1016/j.cogpsych.2010.05.002
[29] Tania Lombrozo and Susan Carey. 2006. Functional explanation and the function of explanation. Cognition 99, 2: 167–204. https://doi.org/10.1016/j.cognition.2004.12.009
[30] Donna Reseigh Long. 1993. Basics of qualitative research: Grounded theory procedures and techniques.
[31] William van Melle, Edward H. Shortliffe, and Bruce G. Buchanan. 1984. EMYCIN: A knowledge engineer’s tool for constructing rule-based expert systems. Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project: 302–313. Retrieved from http://www.aaai.org/Papers/Buchanan/Buchanan17.pdf
[32] D. Douglas Miller and Eric W. Brown. 2018. Artificial Intelligence in Medical Practice: The Question to the Answer? The American Journal of Medicine 131, 2: 129–133. https://doi.org/10.1016/j.amjmed.2017.10.035
[33] Tim Miller. 2017. Explanation in Artificial Intelligence: Insights from the Social Sciences. arXiv [cs.AI]. Retrieved from http://arxiv.org/abs/1706.07269
[34] Neville Moray. 1987. Intelligent aids, mental models, and the theory of machines. International Journal of Man-Machine Studies 27, 5: 619–629. https://doi.org/10.1016/S0020-7373(87)80020-2
[35] Travis B. Murdoch and Allan S. Detsky. 2013. The inevitable application of big data to health care. JAMA: The Journal of the American Medical Association 309, 13: 1351–1352. https://doi.org/10.1001/jama.2013.393
[36] Jakob Nielsen. 2010. Mental models. Jakob Nielsen’s Alertbox.
[37] Don Norman. 2009. The Design of Future Things. Basic Books. Retrieved from https://market.android.com/details?id=book-aeIUGq1EL24C
[38] Donald A. Norman. 2014. Some observations on mental models. In Mental Models. Psychology Press, 15–22. Retrieved from https://www.taylorfrancis.com/books/e/9781317769408/chapters/10.4324%2F9781315802725-5
[39] Seong Ho Park and Kyunghwa Han. 2018. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 286, 3: 800–809. https://doi.org/10.1148/radiol.2017171920
[40] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. 2017. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv [cs.CV]. Retrieved from http://arxiv.org/abs/1711.05225
[41] Raymond Reiter. 1987. A theory of diagnosis from first principles. Artificial Intelligence 32, 1: 57–95. https://doi.org/10.1016/0004-3702(87)90062-2
[42] B. Schilit, N. Adams, and R. Want. 1994. Context-aware computing applications. In Workshop on Mobile Computing Systems and Applications, 85–90. https://doi.org/10.1109/MCSA.1994.512740
[43] Ben Shneiderman. 2016. Opinion: The dangers of faulty, biased, or malicious algorithms requires independent oversight. Proceedings of the National Academy of Sciences 113, 48: 13538–13540. https://doi.org/10.1073/pnas.1618211113
[44] Ben Shneiderman, Catherine Plaisant, Maxine Cohen, Steven Jacobs, Niklas Elmqvist, and Nicholas Diakopoulos. 2016. Grand Challenges for HCI Researchers. Interactions 23, 5: 24–25. https://doi.org/10.1145/2977645
[45] Lucy A. Suchman. 1987. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press. Retrieved from https://market.android.com/details?id=book-AJ_eBJtHxmsC
[46] William R. Swartout. 1983. XPLAIN: A system for creating and explaining expert consulting programs. University of Southern California, Marina del Rey, Information Sciences Institute. Retrieved from http://www.dtic.mil/docs/citations/ADA130597
[47] Jo Vermeulen, Kris Luyten, Karin Coninx, and Nicolai Marquardt. 2014. The design of slow-motion feedback. In Proceedings of the 2014 Conference on Designing Interactive Systems, 267–270. https://doi.org/10.1145/2598510.2598604
[48] Jo Vermeulen, Kris Luyten, Elise van den Hoven, and Karin Coninx. 2013. Crossing the Bridge over Norman’s Gulf of Execution: Revealing Feedforward’s True Identity. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13), 1931–1940. https://doi.org/10.1145/2470654.2466255
[49] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
[50] 2001. UIST ’01: Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, Orlando, Florida, November 11-14, 2001. Association for Computing Machinery. Retrieved from https://market.android.com/details?id=book-YLtQAAAAYAAJ