Designing XAI-based Computer-aided Diagnostic Systems: Operationalising User Research Methods

Elsa Oliveira1,†, Cristiana Braga1,†, Ana Sampaio1, Tiago Oliveira2, Filipe Soares1,* and Luís Rosado1,*

1 Fraunhofer Portugal AICOS, Rua Alfredo Allen 455/461, 4200-135, Porto, Portugal
2 First Solutions - Sistemas de Informação, S.A., Rua Conselheiro Costa Braga, Matosinhos, Portugal

Abstract
AI technology has the potential to support human processes and tasks by augmenting human capabilities and effectiveness. Computer-aided systems have been implemented in healthcare mainly to support clinical decisions. As in other areas, the impact, complexity, and opacity of AI operations have led to the establishment of guidelines for trustworthy AI, which implies being understandable. This study describes the user research work carried out by a multidisciplinary team composed of ML engineers, design researchers, and medical experts to inform the design of algorithms and user interfaces for two XAI-based clinical decision support tools targeted at Cervical cancer and Glaucoma screening. In particular, we sought to leverage and bridge individual and collective expertise to understand the context, decision-making processes and criteria, and values that frame the respective clinical decisions. The article describes how we operationalised the research activities with expert users and what strategies we followed for the subsequent content analysis, ending with lessons learned as valuable insights for other research teams interested in designing computer-aided diagnostic systems based on human-centred XAI approaches.

Keywords
Explainable AI, Computer-aided detection, Decision Support System, Ophthalmology, Glaucoma, Cytology, Cervical cancer, Retinal Imaging, Microscopy

Elsa Oliveira, Cristiana Braga, Ana Sampaio, Tiago Oliveira, Filipe Soares and Luís Rosado. 2023. Designing XAI-based Computer-aided Diagnostic Systems: Operationalising User Research Methods. Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia, 11 pages.
* Corresponding author.
† These authors contributed equally.
elsa.oliveira@aicos.fraunhofer.pt (E. Oliveira); cristiana.braga@aicos.fraunhofer.pt (C. Braga); ana.sampaio@aicos.fraunhofer.pt (A. Sampaio); tiago.oliveira@first-global.com (T. Oliveira); filipe.soares@aicos.fraunhofer.pt (F. Soares); luis.rosado@aicos.fraunhofer.pt (L. Rosado)
ORCID: 0000-0002-7105-9654 (E. Oliveira); 0000-0002-9384-2252 (C. Braga); 0000-0003-1770-4429 (A. Sampaio); 0000-0002-2881-313X (F. Soares); 0000-0002-8060-831X (L. Rosado)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Despite its potential, AI has struggled to be understandable. This requirement has been critical in several areas, mainly in healthcare [1, 2], where AI can support clinical decisions. There has been consensus on the need to promote accountable and trustworthy AI. The European Commission's High-Level Expert Group on Artificial Intelligence (AI HLEG) states that whenever an AI system has a significant impact on people's lives, it should be possible to demand a suitable explanation of the AI system's decision-making process [3]. These considerations have led AI towards Explainable AI (XAI), which in turn has leveraged human-centred design methods to uncover what to explain, why, how, and for whom [4, 5, 6, 7].

We share a study of how we operationalised Human-Centred Design (HCD) methods to inform the design of algorithms and user interfaces for two XAI-based clinical decision support tools for Cervical cancer and Glaucoma screening. We were concerned with grasping medical experts' mental models and reasoning processes. While mental models are mental constructs that represent a distinct possibility and derive a conclusion from it, reasoning implies a process of deriving a conclusion and depends on envisaging the possibilities (mental models) consistent with a starting point [8]. So, to access the reasoning behind a diagnosis and identify the decision-making data and the explanation structures to apply in the design of XAI-based clinical decision support tools, we needed to get inside the diagnostic process with those who practise it - the medical experts [9, 10].

This paper is structured into 6 sections. Section 1 introduces the demand for XAI systems. Section 2 identifies the objectives and design of the study, subdivided into three phases: contextualisation, elicitation, and validation. Section 3 briefly introduces the medical context of Cervical cancer and Glaucoma, on which the work was focused. Section 4 describes how we operationalised the research work, focusing on the research activities with the users and the analysis of the collected content. Finally, in section 5 we share lessons learned from this study, and section 6 indicates the main conclusions and future work.

2. Goals and Study Design

As a multidisciplinary team composed of Machine Learning (ML) engineers, design researchers, and medical experts, we sought to leverage and bridge individual and collective expertise, especially from the medical area for which the systems were conceived, to inform the design of algorithms and user interfaces for explainable decision support software targeted at Cervical cancer and Glaucoma screening. We based our study on the results of the user research activities, which aimed to understand the context, processes, and values that frame clinical decisions in the above-mentioned health areas.

The research process was guided by three phases: contextualisation, elicitation, and validation. The user research methods applied in each phase (described below) returned a considerable amount of fieldwork material, i.e., written, verbal, and visual content, that researchers needed to analyse to enhance understanding of the data. In analysing these data, we initially focused on codifying what the clinicians said (written transcription) about their decision-making process. However, most of their explanations evoked visual aspects of the images. As non-experts, we quickly realised that we needed to match what the doctors were saying with the respective visual elements they were characterising in their explanations. For example, when clinicians explained that a cell was abnormal "because it had a halo around the nucleus", HCD and ML researchers could not understand what a halo was without a visual reference to a cell containing one. The content analysis process, based on Transcription, Coding, and Systematisation, paved the way to the decision-making data and the inherent reasoning structure.

2.1. Contextualisation

As a first step, the researchers sought to become familiarised with the jargon, clinical practices, and decision-making processes used by health professionals. Initially, the researchers carried out preliminary research in online medical articles to acquire the basic knowledge needed to prepare the interviews with medical experts. The contextualisation was accomplished mainly through semi-structured interviews whose script applied Task Reflection and Retrospection methods to prompt participants to reflect on and describe their daily clinical tasks and diagnostic practices. The interviews gave us an overview of clinical practices, decision-making processes, and values, and a quick window into participants' mental models as they gave examples of clinical cases and how they decided on them.

2.2. Elicitation

The elicitation phase asked for more detail on the decision-making process, the decision-making data, and the explanation structures that support it. To this end, the research team relied on referenced methods for mental model elicitation [11], such as semi-structured interviews, Observation, and Think-Aloud [12, 13], together with co-creation practices that made use of imaging data and other design materials to help participants demonstrate their processes of analysis and decision-making. Nielsen describes the Think-Aloud method as effective in giving insights into users' mental models of a given task. The study also drew on the procedures of a field study method based on observation and interviews to understand work practices and behaviours - Contextual inquiry [14, 15]. Kim Salazar, on the Nielsen Norman Group website, highlights the value of the contextual inquiry method - inquiring in context - which results in a collaborative interpretation between researchers and expert users about work practices and behaviours, with a more in-depth understanding of experts' reasoning. With these references in mind, the research team set up workshops to observe and question medical experts analysing and deciding on clinical cases from clinical data.

2.3. Validation

The validation stage allowed us to discuss, correct, complete, and refine the research findings with the medical experts. Through co-creation design practices, researchers designed group and individual workshops, in both remote and in-person versions, to display the decision-making criteria within the respective structures, to be discussed and easily edited and iterated in real time. For some questions, we used the A/B testing method for participants to select the best option.

3. Cervical cancer and Glaucoma

As mentioned in the introduction, this study addresses two distinct health areas, Anatomical Pathology and Ophthalmology - more specifically, Cervical cancer and Glaucoma.
The main goal of the study was to design one explainable decision support software per area, both based on imaging screening, to be used by medical experts and physicians in training. Cervical cancer screening will rely mainly on cytological microscopic images, while Glaucoma screening will rely on retinal images. The research team needed to go deep into each clinical practice to define the systems' requirements. Table 1 lists the main aspects that characterise the two health areas under study, considering the analysis that medical experts carry out per patient. This knowledge was built up throughout the contextualisation and elicitation phases. While both areas share common aspects, they also have some significant differences.

Table 1: Characteristics of Cervical cancer and Glaucoma screening

- Purpose. Cytology (Cervical cancer): screening for a reversible gynaecological disease. Ophthalmology (Glaucoma): screening for an irreversible eye disease.
- Main imaging data. Cervical cancer: cervical cytology specimen (liquid-based collection). Glaucoma: Color Fundus Photography.
- Complementary exams. Cervical cancer: HPV diagnosis. Glaucoma: ocular anatomy (e.g., narrow angle), intraocular pressure (IOP), corneal tomography for pachymetry, volume and depth, Retinal Nerve Fiber Layer Thickness (RNFLT) measured via OCT, and visual field tests.
- Patient data (not mandatory). Cervical cancer: age, last menstruation, contraceptive method, relevant therapeutics (e.g., hormonal, chemotherapy). Glaucoma: age, ethnicity, family history of the condition, associated pathologies (e.g., diabetes, cataracts), and risk medication (e.g., antidepressants).
- Digitalisation outcome - artefact of analysis. Cervical cancer: approximately 100 microscopic images per sample. Glaucoma: between 1 and 7 retinal fundus images per eye [16, 17].
- Variation. Cervical cancer: each image represents a small section of the entire sample. Glaucoma: each image can vary in eye laterality (left or right) or field of view.
- Image navigation. Cervical cancer: the expert browses, image by image, zooming in and out, to identify cells with abnormal aspect. Glaucoma: the expert checks an image individually, zooming in and out, to look for abnormalities in the main structures.
- Criteria of adequacy. Cervical cancer: representation of the Transformation Zone; minimum of 5000 squamous cells [18] (average of 3.8 cells/image); good image quality. Glaucoma: visibility and sharpness of optic nerve and macula; completeness of temporal arcade; clear visibility of small vessels; field of view well illuminated (min. 80%) [19].
- Case classification. Cervical cancer: grading by lesion level, according to the Bethesda System's convention [20]. Glaucoma: staging of Glaucoma [21].
- Comparative analysis. Cervical cancer: experts take intermediate squamous cells as a reference for comparison. Glaucoma: experts check the symmetry between the left eye and the right eye.
- End-users. Cervical cancer: cytopathologists and cytotechnicians (diagnose), and physicians in training. Glaucoma: glaucomatologists (diagnose), ophthalmologists, and physicians in training.
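As a rough illustration, the sample-adequacy thresholds summarised in Table 1 can be expressed as simple checks. This is a minimal sketch: the function names are ours, and only the two numeric thresholds come from the cited guidelines [18, 19].

```python
# Illustrative sketch of the adequacy thresholds in Table 1.
# Function names are hypothetical; thresholds follow [18] and [19].

def cytology_sample_adequate(squamous_cell_count: int) -> bool:
    """Bethesda adequacy: at least 5000 squamous cells per liquid-based sample."""
    return squamous_cell_count >= 5000

def retinal_image_adequate(illuminated_fraction: float) -> bool:
    """Field of view must be well illuminated on at least 80% of the image."""
    return illuminated_fraction >= 0.80

print(cytology_sample_adequate(5200))  # True
print(retinal_image_adequate(0.5))     # False
```

In practice, of course, these counts and fractions would themselves be estimated from the digitised images rather than given directly.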
4. Operationalisation of research activities

In this section, we describe how we operationalised the research activities for the contextualisation, elicitation, and validation phases. At the beginning of the study, all participants received a general informed consent form that gave an overview of the user research agenda, with a detailed informed consent form further provided per activity. The study involved up to 5 participants per health area: in Cervical cancer, 3 cytopathologists and 2 cytotechnologists; in Glaucoma, 4 ophthalmologists specialised in Glaucoma (glaucomatologists). Because of COVID-19, in particular the restrictions on in-person group meetings and on normal access to hospitals and clinical settings, most user research activities took place remotely through digital and online platforms. Through these, participants were able to access anonymised screening images, as well as other clinical data, to demonstrate their decision-making process and reasoning while being observed and questioned by the research team.

4.1. Contextualisation interviews

After some basic research through online medical articles, the research team drew up the interview script addressed to the medical experts. The interview script aimed at understanding clinical procedures, i.e., from the first consultation up to and after diagnosis, and at eliciting medical experts' values, i.e., their motivation for the medical field, examples of impactful cases, and, very importantly, their expectations regarding the introduction of AI systems in clinical practice. The semi-structured interviews were carried out remotely through video calls using Microsoft Teams. Of note, in Cervical cancer the research team took advantage of the results of a previous and related study with cytopathologists and cytotechnicians [22, 20] that had conducted in-person semi-structured interviews with the same participants. These interviews enabled us to understand the processes involved in cytological analysis, from the reception of the sample to the diagnosis.

4.1.1. Interviews analysis

Once the interviews were completed, we transcribed them using the oTranscribe software [23]. We then organised the participants' insights into the main themes raised during the interviews.

4.2. Workshops for eliciting diagnostic processes

Familiarised with both medical areas, we drew inspiration from the contextual inquiry method to design the workshops that would enable us to elicit the experts' diagnostic assessment process. Our goal was to understand what experts look at when they analyse a clinical case, specifically an imaging examination, and what criteria they use to assess whether it shows a pathological change.

4.2.1. Designing remote workshops

The analysis of imaging examinations was a requirement for the diagnostic assessment; thus, we needed to observe experts analysing such images. Usually, we would visit the experts' workplace and observe them in a real clinical setting. However, due to COVID-19, the workshops had to be remote, and so we mimicked this observation remotely.

For Cervical cancer, we asked a cytologist to provide us with images of liquid-based cytological samples. For Glaucoma, we asked a glaucomatologist to provide us with retinal images. However, this was not all. From the interviews, we had learned that both medical fields complemented the interpretation of images with clinical data, which we were attending to. But we had also learned that Glaucomatous pathology was more complex to diagnose, because glaucomatologists often had to integrate complementary diagnostic exams to reach a diagnosis. With this in mind, we asked the glaucomatologist to provide us with a set of anonymised complementary diagnostic exams with diverse diagnoses: unconfirmed, borderline, early stage Glaucoma, and advanced stage Glaucoma.

To ensure participants' unbiased decisions, we first conducted individual workshops. Afterwards, we ran a group workshop for both medical fields to help identify consensual criteria and foster discussion around the least consensual ones.

4.2.2. Conducting individual workshops

Each workshop consisted of one main task: the participant, as a medical expert, would assess imaging examinations in real time and think aloud about their analysis. This way, we could follow the assessment process and ask questions whenever needed to better understand it. Moreover, we asked participants to annotate relevant findings whose appearance suggested a pathological change and to provide the respective diagnostic classification. Experts in Cervical cancer classified cytological images according to the Bethesda System's convention. Experts in Glaucoma classified retinal images according to the four stages mentioned in section 4.2.1. Each participant analysed from three to seven images of liquid-based cytological samples (in Cervical cancer) or from four to sixteen images consisting of eight pairs of retinal images (in Glaucoma). Figure 1 shows a visual field of a cytological sample with two cells annotated by a participant for their abnormality, and Figure 2 a retinal image being analysed by a participant.

Given the interdependence with other examinations in Glaucoma diagnosis, and the wider range of diagnostic factors outside imaging data, the Glaucoma workshops comprised an additional task. Each participant was asked to list the steps of a usual medical procedure, from the first consultation to the diagnosis, describing other relevant examinations beyond the retinal image. Figure 3 shows the timeline filled in by 1 of the 4 participants, considering the examinations performed throughout the analysis of a given clinical case (for example, José, 62 years old with high intraocular pressure) - from the first consultation to diagnosis and, where necessary, in the patient follow-up. In the second part of the workshop, the participant accessed anonymised eye examinations corresponding to different diagnoses, from non-Glaucoma to advanced stage Glaucoma, to then select and analyse the most representative of a specific clinical case. Figure 2 is one of the retinal images that a participant zoomed in and centred on the optic disc to show which image features reflect the state of the eye's structures and should therefore be considered as criteria for decision-making.

Figure 1: Screenshot from the individual workshop for elicitation, conducted remotely through a digital platform, showing a digital liquid-based cervical sample with two of several cells noted by the participant for their abnormality.

Figure 2: Screenshot from the individual workshop for elicitation, conducted remotely through a digital platform, showing a retinal image being analysed with the participant pointing to the optic nerve to explain what a pathologic optic disc cupping (excavation) is.

4.2.3. Transcribing and analysing

As we transcribed the workshops, it became evident that we should assign textual excerpts to image cut-outs, as most of the experts' explanations consisted of descriptions of characteristics visible in the analysed images. Thus, mapping the object of analysis to the respective transcription enabled us to keep a correspondence between what was said and what was being observed in the image (Figure 4). We did this for each participant.

Almost all participants, from both medical fields, mentioned how the analysis and conclusions of some clinical cases were subjective. For instance, a glaucomatologist said: "Sometimes it's not black and white, it's grey", meaning that the same examinations and clinical data may lead experts to different decisions. This happens when the available elements for diagnosis are unclear, due either to image characteristics that hinder the experts' analysis (e.g., a blurry image) or to characteristics of the anatomical structures, which can themselves be confusing (when the same visual appearance can be the result of different possible causes), requiring more tests and more time.

Moreover, Cervical cancer experts highlighted intra- and inter-observer subjectivity, explaining that not only may the decision vary between experts, but the same expert could give a different classification to the same sample at different moments in time. Therefore, we probed this subjective dimension by comparing each participant's analysis of the same object of analysis, and, in fact, we were able to verify it. Figure 5 shows an example of the same cytological field analysed by the five Cervical cancer experts. Both the annotations of suspicious or abnormal cells and the final classifications varied across analysts. While three experts classified the cytological field as ASC-US - an official classification for uncertainty regarding Abnormal Squamous Cell(s) - two of the five experts classified the sample as LSIL - an official classification comparable to ASC-US, but one that assigns a Low grade of Intraepithelial Lesion to the Squamous cell(s).

4.2.4. Coding and systematisation

Once we completed the transcripts, we created a categorisation matrix in Excel to code the data into a set of categories that constitute the building blocks of the explanations, which allowed us to uncover a generic explanation structure suitable for both use cases. We used the columns' headings for the categories, and the rows to list the image that triggered the explanation together with the textual explanation (quote) and the set of categorisable criteria (Figure 6). As we went on with the codification, we iteratively refined the categories into: Key structure(s) examined, Key feature(s) concerned, Risk factor, Not Cervical cancer/Glaucoma factor, Doubt factor, Result attributed, and, finally, Key expression used by the expert. As the content of the spreadsheet grew, we noticed that the criteria filled in under each category would repeat.
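To make the matrix concrete, one coded explanation can be sketched as a small record. This is a hypothetical illustration: the field names mirror the matrix categories described above, while the class, file name, and exact values are ours.

```python
# Illustrative sketch: one row of the categorisation matrix as a record.
# Field names mirror the matrix categories; everything else is hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CodedExplanation:
    explanation_object: str                  # image crop that triggered the explanation
    quote: str                               # transcribed textual explanation
    key_structures_examined: List[str]
    key_features_concerned: List[str]
    risk_factors: List[str] = field(default_factory=list)
    not_disease_factors: List[str] = field(default_factory=list)  # "Not Cervical cancer/Glaucoma factor"
    doubt_factors: List[str] = field(default_factory=list)
    result_attributed: Optional[str] = None

# The Cervical cancer quote used as an example in this section:
row = CodedExplanation(
    explanation_object="cell_crop.png",
    quote="It has a darker nucleus, but with this resolution, "
          "when I try to zoom in, I can't see the characteristics.",
    key_structures_examined=["nucleus"],
    key_features_concerned=["colour intensity"],
    risk_factors=["hyperchromasia"],
    doubt_factors=["image quality - blurred"],
    result_attributed="insufficient / no classification",
)
print(row.risk_factors)  # ['hyperchromasia']
```

In the spreadsheet, each such record corresponds to one row, with the same criteria recurring across rows as coding proceeds.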
So, we created an Excel tab to list the criteria for each category as they emerged throughout the process. We ended up gathering a list of options that enabled us to streamline the filling-in process. To avoid subjectivity and/or interpretation errors in the codification process, we organised an internal panel of three coders composed of researchers involved in these activities. All transcriptions were assigned to this panel, varying who would be the first coder. While the first coder would codify the transcription from scratch, the following two would validate the first codification. Taking the following quote from Cervical cancer as an example, we would describe it as Table 2 shows:

"It has a darker nucleus, but with this resolution, when I try to zoom in, I can't see the characteristics."

Table 2: Explanation categorisation example
- Key area of image examined: "It has a darker nucleus" (part of a cell)
- Key structure(s) examined: "nucleus" (nucleus)
- Key feature(s) concerned: "a darker nucleus" (colour intensity)
- Risk factor: "a darker nucleus" (hyperchromasia)
- Not Cervical cancer/Glaucoma factor: not applicable
- Doubt factor: "but with this resolution, ... I can't see the characteristics" (image quality - blurred)
- Assigned result: "... I can't see the characteristics" (insufficient / no classification)

Figure 6 shows the variability of decision criteria by category raised throughout the analysis. By the end of the analysis, we had uncovered the most relevant criteria used by experts in each medical field to analyse and explain their decisions. And we could standardise that most of the explanations followed this structure:

The [Key feature concerned] of the [Key structure(s) examined] is [Risk factor] OR [Not Cervical cancer/Glaucoma factor].

e.g., Cervical cancer: The [colour] of the [nucleus] is [hyperchromatic]. Glaucoma: The [optic disc] has an [excavation greater than 0.4].

Moreover, we found that sentences stating a "Not Cervical cancer/Glaucoma factor" or "Doubt factor" could follow the Key feature concerned. Experts used them to suggest a plausible contradiction that prevented them from providing a classification of which they were confident.

e.g., Cervical cancer: The [colour] of the [nucleus] is [hyperchromatic], however, [there are overlapping cells]. Glaucoma: The [optic disc] has an [excavation greater than 0.4], however, [it is symmetric].

In these explanations, the experts point out a structure that they observed and characterise its aspect, reflecting a well-known and established risk factor in the domain knowledge, i.e., Cervical cytology: [hyperchromatic]; Glaucoma: [excavation greater than 0.4]. Nevertheless, the explanations also stress - through the contrastive expression "however" - other characteristics that complement and contrast with the first ones, i.e., Cervical cytology: [there are overlapping cells]; Glaucoma: [is symmetric]. And this prevents the experts from discerning with certainty whether the first observed characteristic is an anomaly or not.

Figure 3: Screenshot from the individual workshops, conducted through the Mural platform, listing a sequence of steps (required eye examinations) until reaching a diagnosis for a given hypothetical patient.

Figure 4: Word template set up for transcription, aiming at mapping visual content - cells or other structures - with textual excerpts while keeping the order of analysis. The process included annotating on the image the object/area analysed and associating the number that corresponds to its order of analysis by the participant.

Figure 5: An example of the inter-observer comparative analysis carried out in the Cervical cancer study. The same microscopic field of view was analysed by 5 medical experts, resulting in different annotations and classifications.

Figure 6: Screenshot of the explanations' systematisation in Excel for Glaucoma. In the columns' headings, we may read the categories, left to right: Explanation object, Explanation, Key structure examined, Key feature examined, Glaucoma potential factor, Not Glaucoma factor, Unsatisfactory for analysis, and Confounding factors. Each category was fed according to the criteria identified in the explanations given by the medical experts.

4.3. Workshops for validation

Based on the results of the previous user research activities, researchers designed validation workshops to: (i) ensure no conflicting information among the knowledge shared by each participant, (ii) remove possible imprecision from the researchers' interpretation and the consequent analysis outcomes, and (iii) get insights on a first version of the graphical user interface (GUI) designed from scratch to attend to the elicited diagnostic processes.

4.3.1. Conducting group workshops

The first validation session was carried out through the Mural platform, from where participants accessed and interacted (by editing, deleting, or adding content) with the list of decision-making criteria raised so far, in order to ensure their correctness and completeness. In the Glaucoma workshops, participants were also asked to analyse several examinations, mainly retinal images, and to choose the applicable criteria for each one from the list elicited by the researchers. We asked participants to position the selected criteria in one of three possibilities: non-Glaucoma, Glaucoma, or borderline (Figure 7).

4.3.2. Validating content and container - an informed GUI prototype

In the second validation session, researchers presented the validated decision-making criteria integrated into a Graphical User Interface (GUI) prototype. The aim was to get feedback on the criteria and on the UI components presenting them. According to participants' availability, the Cervical cancer session took place in person (Figure 8), and the Glaucoma session took place remotely (Figure 9). Some categories and criteria seemed to have more than one possible way of being named or presented in the interface. Thus, to assess the correctness and completeness of the data, as well as the system's components and related features, we applied A/B testing for participants to choose the best options.

In the Glaucoma study, we conducted a remote session in which we shared a PowerPoint presentation with images of the GUI prototype together with its content (the elicited decision-making criteria) listed in an editable text box, as shown in Figure 9. The content was discussed in real time and, whenever necessary, easily edited.

Figure 7: Screenshot of the first validation of the clinical decision criteria by the glaucomatologists. At the top, a magnification of a retinal image (right and left eye) centred on the optic disc. Below, 3 tables, one per eye structure: Discs, Neuroretinal ring, and Vessels. To the left of each table is the respective list of criteria, from which participants were asked to select those observed in the retinal image and associate them with a non-Glaucoma, Glaucoma, or borderline diagnosis. Criteria positioned between the two columns would be considered borderline case criteria.

Figure 8: In-person workshop with three cytopathologists and one cytotechnologist to validate the decision-making criteria integrated into a GUI prototype.

5. Lessons Learned

L1. Multidisciplinary team. The design of XAI-based clinical decision support tools requires extensive knowledge from various domains. It is paramount that teams ensure iterative communication that keeps everyone in the loop, i.e., design researchers, medical experts, ML engineers, etc. Let us highlight the ML engineers' guidance on the feasibility of the required functionalities, and their support in defining the needed data, i.e., quantity and quality, and the infrastructure for implementation. Many systems based on supervised learning require annotated data, analysed by experts in terms of the elements needed to guide the models' learning process. In the case of medical XAI systems, this requires close cooperation with clinical experts to ensure that data instances are annotated objectively and uniformly. This way, ML engineers guarantee that the final data set comprises cases sufficiently representative of the different data properties that may arise in practical scenarios.

L2. Contextual inquiry method as a basis for elicitation. The contextual inquiry method inspired the study to observe experts performing a task as close to reality as possible by having them verbalise their thoughts while analysing imaging examinations and providing diagnostic classifications for them. We conclude that, when in-loco sessions are not possible, researchers can simulate the method remotely using digital and online platforms that enable video calls, screen sharing, the display of relevant data for analysis and discussion, and free writing. We asked the experts for analysis materials from their daily work, e.g., anonymised imaging examinations, and then used the online platform Mural to present analysis tasks using these materials. While sharing the screen, experts analysed, selected, and annotated the digital images, and researchers asked timely questions that arose from observing what participants were doing and saying (think-aloud).

L3. Mapping text with images helped associate features to structures. From the elicitation to the content analysis, we found it fundamental to map the textual transcripts to the images the experts were analysing. We cropped, framed, and sketched over the images to correlate what experts were saying with what they were seeing. In doing this, some categories emerged transversally across both experts and the images analysed, so this mapping led to the discovery of a standard structure for the explanations.
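That standard structure can be sketched as a simple sentence template. This is an illustrative sketch with hypothetical names, following the Cervical cytology phrasing; the Glaucoma variant words the slots slightly differently.

```python
# Illustrative sketch of the elicited explanation structure:
# "The [Key feature] of the [Key structure] is [factor]",
# optionally followed by a contrastive "however" clause
# (a not-disease or doubt factor). Names are hypothetical.
from typing import Optional

def render_explanation(key_feature: str, key_structure: str,
                       factor: str, contrast: Optional[str] = None) -> str:
    sentence = f"The {key_feature} of the {key_structure} is {factor}"
    if contrast:
        sentence += f", however, {contrast}"
    return sentence + "."

print(render_explanation("colour", "nucleus", "hyperchromatic"))
# The colour of the nucleus is hyperchromatic.
print(render_explanation("colour", "nucleus", "hyperchromatic",
                         "there are overlapping cells"))
# The colour of the nucleus is hyperchromatic, however, there are overlapping cells.
```

A template of this kind is one straightforward way to turn the coded criteria into the AI-generated explanations the systems aim to present.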
L4. Categorisation matrix for multidisciplinary analysis
As the categories emerged, we used Excel's functionalities, such as drop-down lists, to streamline the process of matching features to structures, facilitating the systematisation of the analysis across more team members, i.e., design researchers and ML engineers.

6. Conclusions and Future Work

This paper describes the user research activities carried out by a multidisciplinary team to inform the design of Machine Learning algorithms and user interfaces for two XAI-based computer-aided diagnostic systems for Cervical cancer and Glaucoma. We shared what we think might be useful for other teams involved in the design of Explainable AI systems, namely, ways to operationalise human-centred design methods considering the objectives of Contextualisation, Elicitation, and Validation of such systems. In that scope, we demonstrate transcription, coding, and systematisation strategies that facilitated our content analysis, in particular, a categorisation matrix that helped uncover decision-making criteria and the respective explanations' structure to inform the design of AI-generated explanations. Future work will focus on further developing the graphical user interface (GUI) to adapt it to an AI-based classification system to support experts' decision-making process.

Figure 9: A PowerPoint slide showing on the right the decision-making criteria list regarding the optical disc, and on the left side a GUI prototype showing an annotated retinal image with an open dropdown menu component showing part of the decision-making criteria list.

Acknowledgments

We would like to thank the medical experts from the Anatomical Pathology Service of the Portuguese Oncology Institute - Porto (IPO-Porto) and from the University Hospital Centre of Porto (CHPorto), who participated in the user research sessions. A special thanks to our senior colleagues at Fraunhofer Portugal AICOS, Ana Barros and Francisco Nunes, who mentored us during the writing of the article. Finally, this work was financially supported by the project Transparent Artificial Medical Intelligence (TAMI), co-funded by Portugal 2020, framed under the Operational Programme for Competitiveness and Internationalization (COMPETE 2020), Fundação para a Ciência e a Tecnologia (FCT), Carnegie Mellon University, and the European Regional Development Fund under Grant 45905.

References

[1] F. K. Došilović, M. Brčić, N. Hlupić, Explainable artificial intelligence: A survey, in: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018, pp. 0210–0215. doi:10.23919/MIPRO.2018.8400040.
[2] J.-M. Fellous, G. Sapiro, A. Rossi, H. Mayberg, M. Ferrante, Explainable artificial intelligence for neuroscience: Behavioral neurostimulation, Frontiers in Neuroscience 13 (2019). URL: https://www.frontiersin.org/articles/10.3389/fnins.2019.01346. doi:10.3389/fnins.2019.01346.
[3] European Commission, Ethics Guidelines for Trustworthy AI - FUTURIUM - European Commission, 2021. URL: https://ec.europa.eu/futurium/en/ai-alliance-consultation.1.html, [Online; accessed 13. Oct. 2022].
[4] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160. doi:10.1109/ACCESS.2018.2870052.
[5] N. Burkart, M. F. Huber, A Survey on the Explainability of Supervised Machine Learning, J. Artif. Intell. Res. 70 (2021) 245–317. doi:10.1613/jair.1.12228.
[6] Q. V. Liao, M. Pribić, J. Han, S. Miller, D. Sow, Question-driven design process for explainable AI user experiences, 2021. URL: https://arxiv.org/abs/2104.03483. doi:10.48550/ARXIV.2104.03483.
[7] P. Lopes, E. Silva, C. Braga, T. Oliveira, L. Rosado, XAI systems evaluation: A review of human and computer-centred methods, Applied Sciences 12 (2022). URL: https://www.mdpi.com/2076-3417/12/19/9423. doi:10.3390/app12199423.
[8] P. N. Johnson-Laird, Mental models and human reasoning, Proc. Natl. Acad. Sci. U.S.A. 107 (2010) 18243–18250. doi:10.1073/pnas.1012933107.
[9] G. Rickheit, C. Habel (Eds.), Mental Models in Discourse Processing and Reasoning, Elsevier Science B.V., Amsterdam, 1999. URL: https://books.google.pt/books?hl=pt-PT&lr=&id=96jBqz_ar8AC&oi=fnd&pg=PP1&dq=mental+models+versus+reasoning+process&ots=Ou3b1SOv77&sig=r5NouxMzR56klQTrokyvHSclJuQ&redir_esc=y#v=onepage&q=mental%20models%20versus%20reasoning%20process&f=false.
[10] Z. Liu, J. Stasko, Mental models, visual reasoning and interaction in information visualization: A top-down perspective, IEEE Transactions on Visualization and Computer Graphics 16 (2010) 999–1008. doi:10.1109/TVCG.2010.177.
[11] J. S. Holtrop, L. D. Scherer, D. D. Matlock, R. E. Glasgow, L. A. Green, The Importance of Mental Models in Implementation Science, Front. Public Health 9 (2021). doi:10.3389/fpubh.2021.680316.
[12] R. Binns, M. Van Kleek, M. Veale, U. Lyngs, J. Zhao, N. Shadbolt, 'It's reducing a human being to a percentage': Perceptions of justice in algorithmic decisions, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1–14. URL: https://doi.org/10.1145/3173574.3173951. doi:10.1145/3173574.3173951.
[13] T. Kulesza, S. Stumpf, M. Burnett, W.-K. Wong, Y. Riche, T. Moore, I. Oberst, A. Shinsel, K. McIntosh, Explanatory debugging: Supporting end-user debugging of machine-learned programs, in: 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, 2010, pp. 41–48. doi:10.1109/VLHCC.2010.15.
[14] S. Jalil, T. Myers, I. Atkinson, M. Soden, Complementing a Clinical Trial With Human-Computer Interaction: Patients' User Experience With Telehealth, JMIR Human Factors 6 (2019) e9481. doi:10.2196/humanfactors.9481.
[15] T. Dagdelen, Modernizing the User Interface of a Legacy System at the Swedish Police Authority: Collaborative Mental Model: A New Participatory Design Method, 2019. URL: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1366483&dswid=8599.
[16] D.-G. of Health of Portugal, Rastreio da retinopatia diabética - Portal das Normas Clínicas, 2018. URL: https://normas.dgs.min-saude.pt/2018/09/13/rastreio-da-retinopatia-diabetica/.
[17] L. P. Aiello, I. Odia, A. R. Glassman, M. Melia, L. M. Jampol, N. M. Bressler, S. Kiss, P. S. Silva, C. C. Wykoff, J. K. Sun, D. R. C. R. Network, Comparison of Early Treatment Diabetic Retinopathy Study Standard 7-Field Imaging With Ultrawide-Field Imaging for Determining Severity of Diabetic Retinopathy, JAMA Ophthalmol. 137 (2019) 65–73. doi:10.1001/jamaophthalmol.2018.4982.
[18] Eurocytology, Criteria for adequacy of a cervical cytology sample | Eurocytology, 2022. URL: https://www.eurocytology.eu/en/course/1142, [Online; accessed 13. Oct. 2022].
[19] S. Rêgo, M. Monteiro-Soares, M. Dutra-Medeiros, F. Soares, C. C. Dias, F. Nunes, Implementation and evaluation of a mobile retinal image acquisition system for screening diabetic retinopathy: Study protocol, Diabetology 3 (2022) 1–16. URL: https://www.mdpi.com/2673-4540/3/1/1. doi:10.3390/diabetology3010001.
[20] T. Conceição, C. Braga, L. Rosado, M. J. M. Vasconcelos, A Review of Computational Methods for Cervical Cells Segmentation and Abnormality Classification, Int. J. Mol. Sci. 20 (2019). doi:10.3390/ijms20205114.
[21] D. A. De Jesus, L. S. Brea, J. B. Breda, E. Fokkinga, V. Ederveen, N. Borren, A. Bekkers, M. Pircher, I. Stalmans, S. Klein, T. van Walsum, OCTA Multilayer and Multisector Peripapillary Microvascular Modeling for Diagnosing and Staging of Glaucoma, Trans. Vis. Sci. Tech. 9 (2020) 58. doi:10.1167/tvst.9.2.58.
[22] CLARE: Computer-aided cervical cancer screening, 2023. URL: https://www.aicos.fraunhofer.pt/en/our_work/projects/clare.html, [Online; accessed 15. Feb. 2023].
[23] E. Bentley, oTranscribe: A free web app to take the pain out of transcribing recorded interviews, 2023. URL: https://otranscribe.com/, [Online; accessed 15. Feb. 2023].