Designing XAI-based Computer-aided Diagnostic Systems: Operationalising User Research Methods

Elsa Oliveira1,†, Cristiana Braga1,†, Ana Sampaio1, Tiago Oliveira2, Filipe Soares1,* and Luís Rosado1,*

1 Fraunhofer Portugal AICOS, Rua Alfredo Allen 455/461, 4200-135, Porto, Portugal
2 First Solutions - Sistemas de Informação, S.A., Rua Conselheiro Costa Braga, Matosinhos, Portugal

Abstract
AI technology has the potential to support human processes and tasks by augmenting human capabilities and effectiveness. Computer-aided systems have been implemented in healthcare mainly to support clinical decisions. As in other areas, the impact, complexity, and opacity of AI operations have led to the establishment of guidelines for trustworthy AI, which implies being understandable. This study describes the user research work carried out by a multidisciplinary team composed of ML engineers, design researchers, and medical experts to inform the design of algorithms and user interfaces for two XAI-based clinical decision support tools targeted at Cervical cancer and Glaucoma screening. In particular, we sought to leverage and bridge individual and collective expertise to understand the context, decision-making processes and criteria, and values that frame the respective clinical decisions. The article describes how we operationalised the research activities with expert users and what strategies we followed for the subsequent content analysis, ending with lessons learned as valuable insights for other research teams interested in designing computer-aided diagnostic systems based on human-centred XAI approaches.

Keywords
Explainable AI, Computer-aided detection, Decision Support System, Ophthalmology, Glaucoma, Cytology, Cervical cancer, Retinal Imaging, Microscopy

Elsa Oliveira, Cristiana Braga, Ana Sampaio, Tiago Oliveira, Filipe Soares and Luís Rosado. 2023. Designing XAI-based Computer-aided Diagnostic Systems: Operationalising User Research Methods. Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia, 11 pages.
* Corresponding author.
† These authors contributed equally.
elsa.oliveira@aicos.fraunhofer.pt (E. Oliveira); cristiana.braga@aicos.fraunhofer.pt (C. Braga); ana.sampaio@aicos.fraunhofer.pt (A. Sampaio); tiago.oliveira@first-global.com (T. Oliveira); filipe.soares@aicos.fraunhofer.pt (F. Soares); luis.rosado@aicos.fraunhofer.pt (L. Rosado)
ORCID: 0000-0002-7105-9654 (E. Oliveira); 0000-0002-9384-2252 (C. Braga); 0000-0003-1770-4429 (A. Sampaio); 0000-0002-2881-313X (F. Soares); 0000-0002-8060-831X (L. Rosado)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Despite its potential, AI has struggled to be understandable. This requirement has been critical in several areas, mainly in healthcare [1, 2], where AI can support clinical decisions. There has been consensus on the need to promote accountable and trustworthy AI. The European Commission's High-Level Expert Group on Artificial Intelligence (AI HLEG) states that whenever an AI system has a significant impact on people's lives, it should be possible to demand a suitable explanation of the AI system's decision-making process [3]. These considerations have led AI towards Explainable AI (XAI), which in turn has leveraged human-centred design methods to uncover what to explain, why, how, and for whom [4, 5, 6, 7].

We share a study of how we operationalised Human-Centred Design (HCD) methods to inform the design of algorithms and user interfaces for two XAI-based clinical decision support tools for Cervical cancer and Glaucoma screening. We were concerned with grasping medical experts' mental models and reasoning processes. While mental models are mental constructs that represent a distinct possibility and derive a conclusion from it, reasoning implies a process of deriving a conclusion and depends on envisaging the possibilities (mental models) consistent with a starting point [8]. So, to access the reasoning behind a diagnosis and identify the decision-making data and the explanation structures to apply in the design of XAI-based clinical decision support tools, we needed to get inside the diagnostic process with those who practise it - the medical experts [9, 10].

This paper is structured into 6 sections. Section 1 introduces the demand for XAI systems. Section 2 identifies the objectives and design of the study, subdivided into three phases: contextualisation, elicitation, and validation. Section 3 briefly introduces the medical context of Cervical cancer and Glaucoma, on which the work was focused. Section 4 describes how we operationalised the research work, focusing on the research activities with the users and the analysis of the collected content. Finally, in section 5 we share lessons learned from this study, and section 6 indicates the main conclusions and future work.

2. Goals and Study Design

As a multidisciplinary team composed of Machine Learning (ML) engineers, design researchers, and medical experts, we sought to leverage and bridge individual and collective expertise, especially from the medical area for which the systems were conceived, to inform the design of algorithms and user interfaces for explainable decision support software targeted at Cervical cancer and Glaucoma screening. We based our study on the results of the user research activities, which aimed to understand the context, processes, and values that frame clinical decisions in the above-mentioned health areas.

The research process was guided by three phases: contextualisation, elicitation, and validation. The user research methods applied in each phase (described below) returned a considerable amount of fieldwork material, i.e., written, verbal, and visual content, that researchers needed to analyse to enhance understanding of the data. In analysing these data, we initially focused on codifying what the clinicians said (written transcription) about their decision-making process. However, most of their explanations evoked visual aspects of the images. As non-experts, we quickly realised that we needed to match what the doctors were saying with the respective visual elements they were characterising in their explanations. For example, when clinicians explained that a cell was abnormal "because it had a halo around the nucleus", HCD and ML researchers could not understand what a halo was without a visual reference to a cell containing one. The content analysis process, based on Transcription, Coding, and Systematisation, paved the way to the decision-making data and the inherent reasoning structure.

2.1. Contextualisation

As a first step, the researchers sought to become familiarised with the jargon, clinical practices, and decision-making processes used by health professionals. Initially, the researchers carried out preliminary research in online medical articles to acquire the basic knowledge needed to prepare the interviews with medical experts. The contextualisation was accomplished mainly through semi-structured interviews whose script applied Task Reflection and Retrospection methods to prompt participants to reflect on and describe their daily clinical tasks and diagnostic practices. The interviews gave us an overview of clinical practices, decision-making processes, and values, and a quick window into participants' mental models as they gave examples of clinical cases and how they decided on them.

2.2. Elicitation

The elicitation phase asked for more detail on the decision-making process, the decision-making data, and the explanation structures that support it. To this end, the research team relied on referenced methods for mental model elicitation [11], such as semi-structured interviews, Observation, and Think-Aloud [12, 13], together with co-creation practices that made use of imaging data and other design materials to help participants demonstrate their processes of analysis and decision-making. Nielsen describes the Think-Aloud method as effective in giving insights into users' mental models of a given task. The study also drew on the procedures of a field study method based on observation and interviews to understand work practices and behaviours - Contextual inquiry [14, 15]. Kim Salazar, on the Nielsen Norman Group website, highlights the value of the contextual inquiry method - inquiring in context - which results in a collaborative interpretation between researchers and expert users about work practices and behaviours, with a more in-depth understanding of experts' reasoning. With these references in mind, the research team set up workshops to observe and question medical experts analysing and deciding on clinical cases from clinical data.

2.3. Validation

The validation stage allowed us to discuss, correct, complete, and refine the research findings with the medical experts. Through co-creation design practices, researchers designed group and individual workshops, in both remote and in-person versions, to display the decision-making criteria within the respective structures, to be discussed and easily edited and iterated in real time. For some questions, we used the A/B testing method for participants to select the best option.

3. Cervical cancer and Glaucoma

As mentioned in the introduction, this study addresses two distinct health areas, Anatomical Pathology and Ophthalmology - more specifically, Cervical cancer and Glaucoma.
The main goal of the study was to design one explainable decision support software per area, both based on imaging screening, to be used by medical experts and physicians in training. Cervical cancer screening will rely mainly on cytological microscopic images, while Glaucoma screening will rely on retinal images. The research team needed to go deep into each clinical practice to define the systems' requirements. Table 1 lists the main aspects that characterise the two health areas under study, considering the analysis that medical experts carry out per patient. This knowledge was built up throughout the contextualisation and elicitation phases. While both areas share common aspects, they also have some significant differences.

Table 1: Characteristics of Cervical cancer and Glaucoma screening

- Purpose. Cytology (Cervical cancer): screening for a reversible gynaecological disease. Ophthalmology (Glaucoma): screening for an irreversible eye disease.
- Main imaging data. Cervical cancer: cervical cytology specimen (liquid-based collection). Glaucoma: Color Fundus Photography.
- Complementary exams. Cervical cancer: HPV diagnosis. Glaucoma: ocular anatomy (e.g., narrow angle), intraocular pressure (IOP), corneal tomography for pachymetry, volume and depth, Retinal Nerve Fiber Layer Thickness (RNFLT) measured via OCT, and visual field tests.
- Patient data (not mandatory). Cervical cancer: age, last menstruation, contraceptive method, relevant therapeutics (e.g., hormonal, chemotherapy). Glaucoma: age, ethnicity, family history of the condition, associated pathologies (e.g., diabetes, cataracts), and risk medication (e.g., antidepressants).
- Digitalisation outcome - artefact of analysis. Cervical cancer: approximately 100 microscopic images per sample. Glaucoma: between 1 and 7 retinal fundus images per eye [16, 17].
- Variation. Cervical cancer: each image represents a small section of the entire sample. Glaucoma: each image can vary in eye laterality (left or right) or field of view.
- Image navigation. Cervical cancer: the expert browses, image by image, zooming in and out, to identify cells with abnormal aspect. Glaucoma: the expert checks an image individually, zooming in and out, to look for abnormalities in the main structures.
- Criteria of adequacy. Cervical cancer: representation of the Transformation Zone; minimum of 5000 squamous cells [18] (average of 3.8 cells/image); good image quality. Glaucoma: visibility and sharpness of optic nerve and macula; completeness of temporal arcade; clear visibility of small vessels; field of view well illuminated (min. 80%) [19].
- Case classification. Cervical cancer: grading by lesion level, according to the Bethesda System's convention [20]. Glaucoma: staging of Glaucoma [21].
- Comparative analysis. Cervical cancer: experts take intermediate squamous cells as a reference for comparison. Glaucoma: experts check the symmetry between the left eye and the right eye.
- End-users. Cervical cancer: cytopathologists and cytotechnicians (diagnose), and physicians in training. Glaucoma: glaucomatologists (diagnose), ophthalmologists, and physicians in training.
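As a rough illustration, the sample-adequacy thresholds summarised in Table 1 can be expressed as simple checks. This is a minimal sketch: the function names are ours, and only the two numeric thresholds come from the cited guidelines [18, 19].

```python
# Illustrative sketch of the adequacy thresholds in Table 1.
# Function names are hypothetical; thresholds follow [18] and [19].

def cytology_sample_adequate(squamous_cell_count: int) -> bool:
    """Bethesda adequacy: at least 5000 squamous cells per liquid-based sample."""
    return squamous_cell_count >= 5000

def retinal_image_adequate(illuminated_fraction: float) -> bool:
    """Field of view must be well illuminated on at least 80% of the image."""
    return illuminated_fraction >= 0.80

print(cytology_sample_adequate(5200))  # True
print(retinal_image_adequate(0.5))     # False
```

In practice, of course, these counts and fractions would themselves be estimated from the digitised images rather than given directly.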
4. Operationalisation of research activities

In this section, we describe how we operationalised the research activities for the contextualisation, elicitation, and validation phases. At the beginning of the study, all participants received a general informed consent form that gave an overview of the user research agenda, with a detailed informed consent form further provided per activity. The study involved up to 5 participants per health area: in Cervical cancer, 3 cytopathologists and 2 cytotechnologists; in Glaucoma, 4 ophthalmologists specialised in Glaucoma (glaucomatologists). Because of COVID-19, in particular the restrictions on in-person group meetings and on normal access to hospitals and clinical settings, most user research activities took place remotely through digital and online platforms. Through these, participants were able to access anonymised screening images, as well as other clinical data, to demonstrate their decision-making process and reasoning while being observed and questioned by the research team.

4.1. Contextualisation interviews

After some basic research through online medical articles, the research team drew up the interview script addressed to the medical experts. The interview script aimed at understanding clinical procedures, i.e., from the first consultation up to and after diagnosis, and at eliciting medical experts' values, i.e., their motivation for the medical field, examples of impactful cases, and, very importantly, their expectations regarding the introduction of AI systems in clinical practice. The semi-structured interviews were carried out remotely through video calls using Microsoft Teams. Of note, in Cervical cancer the research team took advantage of the results of a previous and related study with cytopathologists and cytotechnicians [22, 20] that had conducted in-person semi-structured interviews with the same participants. These interviews enabled us to understand the processes involved in cytological analysis, from the reception of the sample to the diagnosis.

4.1.1. Interviews analysis

Once the interviews were completed, we transcribed them using the oTranscribe software [23]. We then organised the participants' insights into the main themes raised during the interviews.

4.2. Workshops for eliciting diagnostic processes

Familiarised with both medical areas, we drew inspiration from the contextual inquiry method to design the workshops that would enable us to elicit the experts' diagnostic assessment process. Our goal was to understand what experts look at when they analyse a clinical case, specifically an imaging examination, and what criteria they use to assess whether it shows a pathological change.

4.2.1. Designing remote workshops

The analysis of imaging examinations was a requirement for the diagnostic assessment; thus, we needed to observe experts analysing such images. Usually, we would visit the experts' workplace and observe them in a real clinical setting. However, due to COVID-19, the workshops had to be remote, and so we mimicked this observation remotely.

For Cervical cancer, we asked a cytologist to provide us with images of liquid-based cytological samples. For Glaucoma, we asked a glaucomatologist to provide us with retinal images. However, this was not all. From the interviews, we had learned that both medical fields complemented the interpretation of images with clinical data, which we were attending to. But we had also learned that Glaucomatous pathology was more complex to diagnose, because glaucomatologists often had to integrate complementary diagnostic exams to reach a diagnosis. With this in mind, we asked the glaucomatologist to provide us with a set of anonymised complementary diagnostic exams with diverse diagnoses: unconfirmed, borderline, early stage Glaucoma, and advanced stage Glaucoma.

To ensure participants' unbiased decisions, we first conducted individual workshops. Afterwards, we ran a group workshop for both medical fields to help identify consensual criteria and foster discussion around the least consensual ones.

4.2.2. Conducting individual workshops

Each workshop consisted of one main task: the participant, as a medical expert, would assess imaging examinations in real time and think aloud about their analysis. This way, we could follow the assessment process and ask questions whenever needed to better understand it. Moreover, we asked participants to annotate relevant findings whose appearance suggested a pathological change and to provide the respective diagnostic classification. Experts in Cervical cancer classified cytological images according to the Bethesda System's convention. Experts in Glaucoma classified retinal images according to the four stages mentioned in section 4.2.1. Each participant analysed from three to seven images of liquid-based cytological samples (in Cervical cancer) or from four to sixteen images consisting of eight pairs of retinal images (in Glaucoma). Figure 1 shows a visual field of a cytological sample with two cells annotated by a participant for their abnormality, and Figure 2 a retinal image being analysed by a participant.

Given the interdependence with other examinations in Glaucoma diagnosis, and the wider range of diagnostic factors outside imaging data, the Glaucoma workshops comprised an additional task. Each participant was asked to list the steps of a usual medical procedure, from the first consultation to the diagnosis, describing other relevant examinations beyond the retinal image. Figure 3 shows the timeline filled in by 1 of the 4 participants, considering the examinations performed throughout the analysis of a given clinical case (for example, José, 62 years old with high intraocular pressure) - from the first consultation to diagnosis and, where necessary, in the patient follow-up. In the second part of the workshop, the participant accessed anonymised eye examinations corresponding to different diagnoses, from non-Glaucoma to advanced stage Glaucoma, to then select and analyse the most representative of a specific clinical case. Figure 2 is one of the retinal images that a participant zoomed in and centred on the optic disc to show which image features reflect the state of the eye's structures and should therefore be considered as criteria for decision-making.

Figure 1: Screenshot from the individual workshop for elicitation, conducted remotely through a digital platform, showing a digital liquid-based cervical sample with two of several cells noted by the participant for their abnormality.

Figure 2: Screenshot from the individual workshop for elicitation, conducted remotely through a digital platform, showing a retinal image being analysed with the participant pointing to the optic nerve to explain what a pathologic optic disc cupping (excavation) is.

4.2.3. Transcribing and analysing

As we transcribed the workshops, it became evident that we should assign textual excerpts to image cut-outs, as most of the experts' explanations consisted of descriptions of characteristics visible in the analysed images. Thus, mapping the object of analysis to the respective transcription enabled us to keep a correspondence between what was said and what was being observed in the image (Figure 4). We did this for each participant.

Almost all participants, from both medical fields, mentioned how the analysis and conclusions of some clinical cases were subjective. For instance, a glaucomatologist said: "Sometimes it's not black and white, it's grey", meaning that the same examinations and clinical data may lead experts to different decisions. This happens when the available elements for diagnosis are unclear, due either to image characteristics that hinder the experts' analysis (e.g., a blurry image) or to characteristics of the anatomical structures, which can themselves be confusing (when the same visual appearance can be the result of different possible causes), requiring more tests and more time.

Moreover, Cervical cancer experts highlighted intra- and inter-observer subjectivity, explaining that not only may the decision vary between experts, but the same expert could give a different classification to the same sample at different moments in time. Therefore, we probed this subjective dimension by comparing each participant's analysis of the same object of analysis, and, in fact, we were able to verify it. Figure 5 shows an example of the same cytological field analysed by the five Cervical cancer experts. Both the annotations of suspicious or abnormal cells and the final classifications varied across analysts. While three experts classified the cytological field as ASC-US - an official classification for uncertainty regarding Abnormal Squamous Cell(s) - two of the five experts classified the sample as LSIL - an official classification comparable to ASC-US, but one that assigns a Low grade of Intraepithelial Lesion to the Squamous cell(s).

4.2.4. Coding and systematisation

Once we completed the transcripts, we created a categorisation matrix in Excel to code the data into a set of categories that constitute the building blocks of the explanations, which allowed us to uncover a generic explanation structure suitable for both use cases. We used the columns' headings for the categories, and the rows to list the image that triggered the explanation together with the textual explanation (quote) and the set of categorisable criteria (Figure 6). As we went on with the codification, we iteratively refined the categories into: Key structure(s) examined, Key feature(s) concerned, Risk factor, Not Cervical cancer/Glaucoma factor, Doubt factor, Result attributed, and, finally, Key expression used by the expert. As the content of the spreadsheet grew, we noticed that the criteria filled in under each category would repeat.
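To make the matrix concrete, one coded explanation can be sketched as a small record. This is a hypothetical illustration: the field names mirror the matrix categories described above, while the class, file name, and exact values are ours.

```python
# Illustrative sketch: one row of the categorisation matrix as a record.
# Field names mirror the matrix categories; everything else is hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CodedExplanation:
    explanation_object: str                  # image crop that triggered the explanation
    quote: str                               # transcribed textual explanation
    key_structures_examined: List[str]
    key_features_concerned: List[str]
    risk_factors: List[str] = field(default_factory=list)
    not_disease_factors: List[str] = field(default_factory=list)  # "Not Cervical cancer/Glaucoma factor"
    doubt_factors: List[str] = field(default_factory=list)
    result_attributed: Optional[str] = None

# The Cervical cancer quote used as an example in this section:
row = CodedExplanation(
    explanation_object="cell_crop.png",
    quote="It has a darker nucleus, but with this resolution, "
          "when I try to zoom in, I can't see the characteristics.",
    key_structures_examined=["nucleus"],
    key_features_concerned=["colour intensity"],
    risk_factors=["hyperchromasia"],
    doubt_factors=["image quality - blurred"],
    result_attributed="insufficient / no classification",
)
print(row.risk_factors)  # ['hyperchromasia']
```

In the spreadsheet, each such record corresponds to one row, with the same criteria recurring across rows as coding proceeds.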
So, we created an Excel tab to list the criteria for each category as they emerged throughout the process. We ended up gathering a list of options that enabled us to streamline the filling-in process. To avoid subjectivity and/or interpretation errors in the codification process, we organised an internal panel of three coders composed of researchers involved in these activities. All transcriptions were assigned to this panel, varying who would be the first coder. While the first coder would codify the transcription from scratch, the following two would validate the first codification. Taking the following quote from Cervical cancer as an example, we would describe it as Table 2 shows:

"It has a darker nucleus, but with this resolution, when I try to zoom in, I can't see the characteristics."

Table 2: Explanation categorisation example
- Key area of image examined: "It has a darker nucleus" (part of a cell)
- Key structure(s) examined: "nucleus" (nucleus)
- Key feature(s) concerned: "a darker nucleus" (colour intensity)
- Risk factor: "a darker nucleus" (hyperchromasia)
- Not Cervical cancer/Glaucoma factor: not applicable
- Doubt factor: "but with this resolution, ... I can't see the characteristics" (image quality - blurred)
- Assigned result: "... I can't see the characteristics" (insufficient / no classification)

Figure 6 shows the variability of decision criteria by category raised throughout the analysis. By the end of the analysis, we had uncovered the most relevant criteria used by experts in each medical field to analyse and explain their decisions. And we could standardise that most of the explanations followed this structure:

The [Key feature concerned] of the [Key structure(s) examined] is [Risk factor] OR [Not Cervical cancer/Glaucoma factor].

e.g., Cervical cancer: The [colour] of the [nucleus] is [hyperchromatic]. Glaucoma: The [optic disc] has an [excavation greater than 0.4].

Moreover, we found that sentences stating a "Not Cervical cancer/Glaucoma factor" or "Doubt factor" could follow the Key feature concerned. Experts used them to suggest a plausible contradiction that prevented them from providing a classification of which they were confident.

e.g., Cervical cancer: The [colour] of the [nucleus] is [hyperchromatic], however, [there are overlapping cells]. Glaucoma: The [optic disc] has an [excavation greater than 0.4], however, [it is symmetric].

In these explanations, the experts point out a structure that they observed and characterise its aspect, reflecting a well-known and established risk factor in the domain knowledge, i.e., Cervical cytology: [hyperchromatic]; Glaucoma: [excavation greater than 0.4]. Nevertheless, the explanations also stress - through the contrastive expression "however" - other characteristics that complement and contrast with the first ones, i.e., Cervical cytology: [there are overlapping cells]; Glaucoma: [is symmetric]. And this prevents the experts from discerning with certainty whether the first observed characteristic is an anomaly or not.

Figure 3: Screenshot from the individual workshops, conducted through the Mural platform, listing a sequence of steps (required eye examinations) until reaching a diagnosis for a given hypothetical patient.

Figure 4: Word template set up for transcription, aiming at mapping visual content - cells or other structures - with textual excerpts while keeping the order of analysis. The process included annotating on the image the object/area analysed and associating the number that corresponds to its order of analysis by the participant.

Figure 5: An example of the inter-observer comparative analysis carried out in the Cervical cancer study. The same microscopic field of view was analysed by 5 medical experts, resulting in different annotations and classifications.

Figure 6: Screenshot of the explanations' systematisation in Excel for Glaucoma. In the columns' headings, we may read the categories, left to right: Explanation object, Explanation, Key structure examined, Key feature examined, Glaucoma potential factor, Not Glaucoma factor, Unsatisfactory for analysis, and Confounding factors. Each category was fed according to the criteria identified in the explanations given by the medical experts.

4.3. Workshops for validation

Based on the results of the previous user research activities, researchers designed validation workshops to: (i) ensure no conflicting information among the knowledge shared by each participant, (ii) remove possible imprecision from the researchers' interpretation and the consequent analysis outcomes, and (iii) get insights on a first version of the graphical user interface (GUI) designed from scratch to attend to the elicited diagnostic processes.

4.3.1. Conducting group workshops

The first validation session was carried out through the Mural platform, from where participants accessed and interacted (by editing, deleting, or adding content) with the list of decision-making criteria raised so far, in order to ensure their correctness and completeness. In the Glaucoma workshops, participants were also asked to analyse several examinations, mainly retinal images, and to choose the applicable criteria for each one from the list elicited by the researchers. We asked participants to position the selected criteria in one of three possibilities: non-Glaucoma, Glaucoma, or borderline (Figure 7).

4.3.2. Validating content and container - an informed GUI prototype

In the second validation session, researchers presented the validated decision-making criteria integrated into a Graphical User Interface (GUI) prototype. The aim was to get feedback on the criteria and on the UI components presenting them. According to participants' availability, the Cervical cancer session took place in person (Figure 8), and the Glaucoma session took place remotely (Figure 9). Some categories and criteria seemed to have more than one possible way of being named or presented in the interface. Thus, to assess the correctness and completeness of the data, as well as the system's components and related features, we applied A/B testing for participants to choose the best options.

In the Glaucoma study, we conducted a remote session in which we shared a PowerPoint presentation with images of the GUI prototype together with its content (the elicited decision-making criteria) listed in an editable text box, as shown in Figure 9. The content was discussed in real time and, whenever necessary, easily edited.

Figure 7: Screenshot of the first validation of the clinical decision criteria by the glaucomatologists. At the top, a magnification of a retinal image (right and left eye) centred on the optic disc. Below, 3 tables, one per eye structure: Discs, Neuroretinal ring, and Vessels. To the left of each table is the respective list of criteria, from which participants were asked to select those observed in the retinal image and associate them with a non-Glaucoma, Glaucoma, or borderline diagnosis. Criteria positioned between the two columns would be considered borderline case criteria.

Figure 8: In-person workshop with three cytopathologists and one cytotechnologist to validate the decision-making criteria integrated into a GUI prototype.

5. Lessons Learned

L1. Multidisciplinary team. The design of XAI-based clinical decision support tools requires extensive knowledge from various domains. It is paramount that teams ensure iterative communication that keeps everyone in the loop, i.e., design researchers, medical experts, ML engineers, etc. Let us highlight the ML engineers' guidance on the feasibility of the required functionalities, and their support in defining the needed data, i.e., quantity and quality, and the infrastructure for implementation. Many systems based on supervised learning require annotated data, analysed by experts in terms of the elements needed to guide the models' learning process. In the case of medical XAI systems, this requires close cooperation with clinical experts to ensure that data instances are annotated objectively and uniformly. This way, ML engineers guarantee that the final data set comprises cases sufficiently representative of the different data properties that may arise in practical scenarios.

L2. Contextual inquiry method as a basis for elicitation. The contextual inquiry method inspired the study to observe experts performing a task as close to reality as possible by having them verbalise their thoughts while analysing imaging examinations and providing diagnostic classifications for them. We conclude that, when in-loco sessions are not possible, researchers can simulate the method remotely using digital and online platforms that enable video calls, screen sharing, the display of relevant data for analysis and discussion, and free writing. We asked the experts for analysis materials from their daily work, e.g., anonymised imaging examinations, and then used the online platform Mural to present analysis tasks using these materials. While sharing the screen, experts analysed, selected, and annotated the digital images, and researchers asked timely questions that arose from observing what participants were doing and saying (think-aloud).

L3. Mapping text with images helped associate features to structures. From the elicitation to the content analysis, we found it fundamental to map the textual transcripts to the images the experts were analysing. We cropped, framed, and sketched over the images to correlate what experts were saying with what they were seeing. In doing this, some categories emerged transversally across both experts and the images analysed, so this mapping led to the discovery of a standard structure for the explanations.
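That standard structure can be sketched as a simple sentence template. This is an illustrative sketch with hypothetical names, following the Cervical cytology phrasing; the Glaucoma variant words the slots slightly differently.

```python
# Illustrative sketch of the elicited explanation structure:
# "The [Key feature] of the [Key structure] is [factor]",
# optionally followed by a contrastive "however" clause
# (a not-disease or doubt factor). Names are hypothetical.
from typing import Optional

def render_explanation(key_feature: str, key_structure: str,
                       factor: str, contrast: Optional[str] = None) -> str:
    sentence = f"The {key_feature} of the {key_structure} is {factor}"
    if contrast:
        sentence += f", however, {contrast}"
    return sentence + "."

print(render_explanation("colour", "nucleus", "hyperchromatic"))
# The colour of the nucleus is hyperchromatic.
print(render_explanation("colour", "nucleus", "hyperchromatic",
                         "there are overlapping cells"))
# The colour of the nucleus is hyperchromatic, however, there are overlapping cells.
```

A template of this kind is one straightforward way to turn the coded criteria into the AI-generated explanations the systems aim to present.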
L4. Categorisation matrix for multidisciplinary analysis
As the categories emerged, we used Excel's functionalities, such as drop-down lists, to streamline the process of matching features to structures, facilitating the systematisation of the analysis across more team members, i.e., design researchers and ML engineers.

6. Conclusions and Future Work

This paper describes the user research activities carried out by a multidisciplinary team to inform the design of Machine Learning algorithms and user interfaces for two XAI-based computer-aided diagnostic systems for Cervical cancer and Glaucoma. We shared what we think might be useful for other teams involved in the design of Explainable AI systems, namely, ways to operationalise human-centred design methods considering the objectives of Contextualisation, Elicitation, and Validation of such systems. In that scope, we demonstrate transcription, coding, and systematisation strategies that facilitated our content analysis, in particular, a categorisation matrix that helped uncover decision-making criteria and the respective explanations' structure to inform the design of AI-generated explanations. Future work will focus on further developing the graphical user interface (GUI) to adapt it to an AI-based classification system to support experts' decision-making process.

Figure 9: A PowerPoint slide showing on the right the decision-making criteria list regarding the optical disc, and on the left side a GUI prototype showing an annotated retinal image with an open dropdown menu component showing part of the decision-making criteria list.

Acknowledgments

We would like to thank the medical experts from the Anatomical Pathology Service of the Portuguese Oncology Institute - Porto (IPO-Porto) and from the University Hospital Centre of Porto (CHPorto), who participated in the user research sessions. A special thanks to our senior colleagues at Fraunhofer Portugal AICOS, Ana Barros and Francisco Nunes, who mentored us during the writing of the article. Finally, this work was financially supported by the project Transparent Artificial Medical Intelligence (TAMI), co-funded by Portugal 2020, framed under the Operational Programme for Competitiveness and Internationalization (COMPETE 2020), Fundação para a Ciência e a Tecnologia (FCT), Carnegie Mellon University, and the European Regional Development Fund under Grant 45905.

References

[1] F. K. Došilović, M. Brčić, N. Hlupić, Explainable artificial intelligence: A survey, in: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018, pp. 0210–0215. doi:10.23919/MIPRO.2018.8400040.
[2] J.-M. Fellous, G. Sapiro, A. Rossi, H. Mayberg, M. Ferrante, Explainable artificial intelligence for neuroscience: Behavioral neurostimulation, Frontiers in Neuroscience 13 (2019). URL: https://www.frontiersin.org/articles/10.3389/fnins.2019.01346. doi:10.3389/fnins.2019.01346.
[3] European Commission, Ethics Guidelines for Trustworthy AI - FUTURIUM - European Commission, 2021. URL: https://ec.europa.eu/futurium/en/ai-alliance-consultation.1.html, [Online; accessed 13. Oct. 2022].
[4] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160. doi:10.1109/ACCESS.2018.2870052.
[5] N. Burkart, M. F. Huber, A Survey on the Explainability of Supervised Machine Learning, J. Artif. Intell. Res. 70 (2021) 245–317. doi:10.1613/jair.1.12228.
[6] Q. V. Liao, M. Pribić, J. Han, S. Miller, D. Sow, Question-driven design process for explainable AI user experiences, 2021. URL: https://arxiv.org/abs/2104.03483. doi:10.48550/ARXIV.2104.03483.
[7] P. Lopes, E. Silva, C. Braga, T. Oliveira, L. Rosado, XAI systems evaluation: A review of human and computer-centred methods, Applied Sciences 12 (2022). URL: https://www.mdpi.com/2076-3417/12/19/9423. doi:10.3390/app12199423.
[8] P. N. Johnson-Laird, Mental models and human reasoning, Proc. Natl. Acad. Sci. U.S.A. 107 (2010) 18243–18250. doi:10.1073/pnas.1012933107.
[9] G. Rickheit, C. Habel (Eds.), Mental Models in Discourse Processing and Reasoning, Elsevier Science B.V., Amsterdam, 1999. URL: https://books.google.pt/books?hl=pt-PT&lr=&id=96jBqz_ar8AC&oi=fnd&pg=PP1&dq=mental+models+versus+reasoning+process&ots=Ou3b1SOv77&sig=r5NouxMzR56klQTrokyvHSclJuQ&redir_esc=y#v=onepage&q=mental%20models%20versus%20reasoning%20process&f=false.
[10] Z. Liu, J. Stasko, Mental models, visual reasoning and interaction in information visualization: A top-down perspective, IEEE Transactions on Visualization and Computer Graphics 16 (2010) 999–1008. doi:10.1109/TVCG.2010.177.
[11] J. S. Holtrop, L. D. Scherer, D. D. Matlock, R. E. Glasgow, L. A. Green, The Importance of Mental Models in Implementation Science, Front. Public Health 9 (2021). doi:10.3389/fpubh.2021.680316.
[12] R. Binns, M. Van Kleek, M. Veale, U. Lyngs, J. Zhao, N. Shadbolt, 'It's reducing a human being to a percentage': Perceptions of justice in algorithmic decisions, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1–14. URL: https://doi.org/10.1145/3173574.3173951. doi:10.1145/3173574.3173951.
[13] T. Kulesza, S. Stumpf, M. Burnett, W.-K. Wong, Y. Riche, T. Moore, I. Oberst, A. Shinsel, K. McIntosh, Explanatory debugging: Supporting end-user debugging of machine-learned programs, in: 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, 2010, pp. 41–48. doi:10.1109/VLHCC.2010.15.
[14] S. Jalil, T. Myers, I. Atkinson, M. Soden, Complementing a Clinical Trial With Human-Computer Interaction: Patients' User Experience With Telehealth, JMIR Human Factors 6 (2019) e9481. doi:10.2196/humanfactors.9481.
[15] T. Dagdelen, Modernizing the User Interface of a Legacy System at the Swedish Police Authority: Collaborative Mental Model: A New Participatory Design Method, 2019. URL: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1366483&dswid=8599.
[16] D.-G. of Health of Portugal, Rastreio da retinopatia diabética - Portal das Normas Clínicas, 2018. URL: https://normas.dgs.min-saude.pt/2018/09/13/rastreio-da-retinopatia-diabetica/.
[17] L. P. Aiello, I. Odia, A. R. Glassman, M. Melia, L. M. Jampol, N. M. Bressler, S. Kiss, P. S. Silva, C. C. Wykoff, J. K. Sun, D. R. C. R. Network, Comparison of Early Treatment Diabetic Retinopathy Study Standard 7-Field Imaging With Ultrawide-Field Imaging for Determining Severity of Diabetic Retinopathy, JAMA Ophthalmol. 137 (2019) 65–73. doi:10.1001/jamaophthalmol.2018.4982.
[18] Eurocytology, Criteria for adequacy of a cervical cytology sample | Eurocytology, 2022. URL: https://www.eurocytology.eu/en/course/1142, [Online; accessed 13. Oct. 2022].
[19] S. Rêgo, M. Monteiro-Soares, M. Dutra-Medeiros, F. Soares, C. C. Dias, F. Nunes, Implementation and evaluation of a mobile retinal image acquisition system for screening diabetic retinopathy: Study protocol, Diabetology 3 (2022) 1–16. URL: https://www.mdpi.com/2673-4540/3/1/1. doi:10.3390/diabetology3010001.
[20] T. Conceição, C. Braga, L. Rosado, M. J. M. Vasconcelos, A Review of Computational Methods for Cervical Cells Segmentation and Abnormality Classification, Int. J. Mol. Sci. 20 (2019). doi:10.3390/ijms20205114.
[21] D. A. De Jesus, L. S. Brea, J. B. Breda, E. Fokkinga, V. Ederveen, N. Borren, A. Bekkers, M. Pircher, I. Stalmans, S. Klein, T. van Walsum, OCTA Multilayer and Multisector Peripapillary Microvascular Modeling for Diagnosing and Staging of Glaucoma, Trans. Vis. Sci. Tech. 9 (2020) 58. doi:10.1167/tvst.9.2.58.
[22] CLARE: Computer-aided cervical cancer screening, 2023. URL: https://www.aicos.fraunhofer.pt/en/our_work/projects/clare.html, [Online; accessed 15. Feb. 2023].
[23] E. Bentley, oTranscribe: A free web app to take the pain out of transcribing recorded interviews, 2023. URL: https://otranscribe.com/, [Online; accessed 15. Feb. 2023].